Clustering texts using Latent Dirichlet Allocation

Analysing large bodies of text can be challenging, especially with limited prior knowledge of the topics they cover. For one of my ongoing projects, I tackled this problem by applying Latent Dirichlet Allocation (LDA). LDA is a machine learning algorithm that treats each text in the set as a “bag of words” – a collection of words with different frequencies that describe the text’s content. It then tries to find a set of topics, each defined by words and a weight for each word, that best groups all texts with the least overlap between topics. These topics do not inherently have meaning, but it is possible to plot the words and weights to visualize what each topic is about. As an example, the following wordcloud represents one of the topics:

Clearly, this is a topic about racing! Once the LDA model is trained, we can use it to categorize all the texts that we have.
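To make the “bag of words” idea more concrete, here is a minimal sketch using only the Python standard library. The example sentences and the helper name `bag_of_words` are made up for illustration, and the tokenization is deliberately simplistic; a real project would probably also remove stop words and normalize word forms.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split it into word tokens, and count each word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Hypothetical example texts, not taken from the project data
texts = [
    "The driver won the race after a fast final lap.",
    "The team changed tyres before the last lap of the race.",
]

# Word order is discarded; only the word frequencies remain
bags = [bag_of_words(t) for t in texts]

for bag in bags:
    print(bag.most_common(3))
```

Each text is reduced to its word counts, which is the only information LDA sees about it.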

While this kind of analysis seems intimidating, it is not hard to get started using Python, thanks to great libraries and documentation. Here are my favorite examples:

Natural Language Toolkit: https://www.nltk.org/ – A great library for working with human language

Gensim: https://radimrehurek.com/gensim/ – Topic modelling library (which also has an LDA implementation)

I can only recommend trying out this method when working with texts, so maybe give it a try yourself.

Written on April 29, 2024 by Finn Jonas Tryggvason

EINST4INE: The European Training Network for InduStry Digital Transformation across Innovation Ecosystems

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 956745. Results reflect the author’s view only. The European Commission is not responsible for any use that may be made of the information it contains.
