Analysing large bodies of text can be challenging, especially with limited prior knowledge of the topics they cover. For one of my ongoing projects, I tackled this problem by applying Latent Dirichlet Allocation (LDA). LDA is a machine learning algorithm that treats each text in a collection as a “bag of words” – a set of words with their frequencies, which together describe the text’s content. It then tries to find topics, each defined by a set of weighted words, that best group all texts while keeping the overlap between topics as small as possible. These topics have no inherent meaning, but plotting their words and weights makes it possible to visualize what each topic captures. As an example, the following wordcloud represents one of the topics:

Clearly, this is a topic about racing! Now that the LDA model is trained, we can use it to categorize all the texts we have.

While this kind of analysis may seem intimidating, it is not hard to get started in Python, thanks to great libraries and documentation. Here are my favorites:

Natural Language Toolkit: https://www.nltk.org/ – A great library for working with human language

Gensim: https://radimrehurek.com/gensim/ – A topic modelling library (which also has an LDA implementation)


I can only recommend this method when working with large text collections – give it a try yourself.
