Latent Dirichlet Allocation is a form of unsupervised Machine Learning that is usually used for topic modelling in Natural Language Processing tasks. It is a very popular model for these types of tasks, and the algorithm behind it is quite easy to understand and use. The scikit-learn library also has a very good implementation of the algorithm, so in this article we are going to focus on topic modelling using Latent Dirichlet Allocation.
Interested in more stories like this? Follow me on Twitter at @b_dmarius and I'll post there every new article.
Latent Dirichlet Allocation Overview
- What is Topic Modelling
- Supervised and Unsupervised Machine Learning - What is Unsupervised Machine Learning
- Latent Dirichlet Allocation applications
- Latent Dirichlet Allocation algorithm
- Building a Latent Dirichlet Allocation dataset
- Latent Dirichlet Allocation implementation using Scikit-Learn
What is Topic Modelling
Topic Modelling is an unsupervised Machine Learning task where we try to discover "abstract topics" that can describe a collection of documents. This means we have a collection of texts and we try to find patterns of words and phrases that can help us cluster the documents and group them by "topics".
I put topics into quotes and I call them abstract topics because these are not obvious topics and we don't need them to be. We work on the assumption that similar documents will have similar patterns of words and phrases.
For example, let's say we have a collection of 100 texts. We go through each text and discover that ten of them contain words like "machine learning", "training", "supervised", "unsupervised", "dataset" and so on. We may not know what these words mean and we really don't care.
We only see a pattern here: ten of our 100 articles contain these words, so we conclude that they should be grouped under the same topic. We can't actually name the topic and, again, this is not needed; we are simply able to cluster these 10 articles together. And when we get a new text which we have never seen before, we can look into it, find that it contains some of these words, and say "hey, this goes into the same category as the other 10 articles!"
Supervised and Unsupervised Machine Learning - What is Unsupervised Machine Learning
Unsupervised Machine Learning is a type of Machine Learning where we try to infer patterns from data without any prior knowledge and without knowing a priori whether we are right or wrong. With this type of model, we try to find patterns in the data and then use them to cluster, classify or describe our data.
Latent Dirichlet Allocation is a type of Unsupervised Machine Learning. We don't know the topics of the documents before we begin; we can only specify how many topics we want to find. After the algorithm has run, we can look at the results and figure out whether they are helpful or not.
Latent Dirichlet Allocation applications
Latent Dirichlet Allocation is mostly used for topic modelling, so let's think about why we would need topic modelling in the first place.
With topic modelling we can cluster a collection of documents so that more similar documents are grouped together and less similar documents are put into different categories. This can be used to analyse and understand a dataset.
We can also automatically organise our documents based on this algorithm and then, when a new document appears into a dataset, we can automatically put it in the correct category.
Moreover, this can be used to improve text search and text similarities features in applications that deal with text documents.
Latent Dirichlet Allocation algorithm
The Latent Dirichlet Allocation algorithm works in a few simple steps. The only preprocessing we need is the one we do in almost all text processing tasks: removing the stopwords (words that, with high probability, are found in most of the documents and don't bring any value) from all of our documents.
- Choose a number n of topics that will be identified by the LDA algorithm. How can we find the perfect number of topics? It's not very easy, and it's usually a trial-and-error process: we try different values for n until we are satisfied with the results. Or maybe we are lucky and have other information about the dataset that lets us establish the right number of topics.
- Assign every word in every document to a temporary topic. This temporary topic will be random at first, but will be updated in the next step.
- For this step we go through every document, and for every word in that document we compute 2 values:
- the probability that this document belongs to a certain topic; this is based on how many words (except the current word) from this document belong to the topic of the current word
- the proportion of documents that are assigned to the topic of the current word because of the current word.
We repeat the previous step a certain number of times (established before running the algorithm). At the end, we look at each document, find the topic that is most prevalent based on its words, and assign the document to that topic.
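The steps above can be sketched in a few dozen lines of Python. The corpus, the hyperparameters alpha and beta, and the topic count below are all made up for illustration, and this is a plain collapsed Gibbs sampler rather than what scikit-learn actually ships (its LatentDirichletAllocation is based on variational inference), but it shows the counting logic behind the resampling step:

```python
# Toy collapsed Gibbs sampler for LDA on a tiny, made-up corpus.
import random

random.seed(0)

docs = [
    "machine learning dataset training".split(),
    "supervised learning training model".split(),
    "city capital population area".split(),
    "city area population district".split(),
]

n_topics = 2
alpha, beta = 0.1, 0.01            # Dirichlet smoothing hyperparameters (illustrative values)
vocab = sorted({w for d in docs for w in d})
V = len(vocab)

# Step 2: assign every word a random temporary topic and build count tables.
doc_topic = [[0] * n_topics for _ in docs]       # words per (document, topic)
topic_word = [[0] * V for _ in range(n_topics)]  # counts per (topic, word)
topic_total = [0] * n_topics                     # words per topic
assignments = []
for d, doc in enumerate(docs):
    z_doc = []
    for w in doc:
        z = random.randrange(n_topics)
        z_doc.append(z)
        doc_topic[d][z] += 1
        topic_word[z][vocab.index(w)] += 1
        topic_total[z] += 1
    assignments.append(z_doc)

# Step 3, repeated a fixed number of times: resample each word's topic
# from the two quantities described above.
for _ in range(50):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            z = assignments[d][i]
            v = vocab.index(w)
            # remove the current word from the counts
            doc_topic[d][z] -= 1
            topic_word[z][v] -= 1
            topic_total[z] -= 1
            # P(topic) is proportional to (how much this document likes the
            # topic) times (how much the topic likes this word)
            weights = [
                (doc_topic[d][t] + alpha)
                * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                for t in range(n_topics)
            ]
            z = random.choices(range(n_topics), weights=weights)[0]
            assignments[d][i] = z
            doc_topic[d][z] += 1
            topic_word[z][v] += 1
            topic_total[z] += 1

# Final step: each document gets its most prevalent topic.
doc_topics = [max(range(n_topics), key=lambda t: doc_topic[d][t])
              for d in range(len(docs))]
print(doc_topics)
```

On this tiny corpus the sampler typically ends up grouping the two "machine learning" documents under one topic and the two "city" documents under the other.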
Building a Latent Dirichlet Allocation dataset
Sometimes, when I learn about a new concept, I like to build my own small dataset that I can use to learn faster. I prefer this for 2 reasons:
- no time wasted on cleaning up the data. I know that this is a very important skill for a Machine Learning Engineer or a Data Scientist, but this topic is not the focus here. If I want to learn about an algorithm, I'll build my own small, clean dataset that will allow me to play with it.
- faster trial and error process: building a dataset of my own will allow me to make it big enough to offer results, but small enough to run fast.
For the LDA algorithm, I'm going to get the summary section of 6 Wikipedia pages (2 about cities, 2 about technology, 2 about books) and use them as documents to be clustered by the LDA algorithm. Then I'll feed it a 7th summary from another page and observe that it's placed in the correct category.
For the purpose of extracting the text from Wikipedia, I'll use the wikipedia python package.
pip3 install wikipedia
And I'm going to use a small class to download the summary.
import wikipedia


class TextFetcher:

    def __init__(self, title):
        self.title = title
        page = wikipedia.page(title)
        self.text = page.summary

    def getText(self):
        return self.text
Then I can use this class to extract the data. As I've mentioned earlier, the only preprocessing that needs to be done is removing the stopwords. For that I'll use the nltk package.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the resources we need once, up front
nltk.download('stopwords')
nltk.download('punkt')


def preprocessor(text):
    tokens = word_tokenize(text)
    return " ".join([word for word in tokens if word not in stopwords.words('english')])


if __name__ == "__main__":
    textFetcher = TextFetcher("London")
    text1 = preprocessor(textFetcher.getText())
    textFetcher = TextFetcher("Natural Language Processing")
    text2 = preprocessor(textFetcher.getText())
    textFetcher = TextFetcher("The Great Gatsby")
    text3 = preprocessor(textFetcher.getText())
    textFetcher = TextFetcher("Machine Learning")
    text4 = preprocessor(textFetcher.getText())
    textFetcher = TextFetcher("Berlin")
    text5 = preprocessor(textFetcher.getText())
    textFetcher = TextFetcher("For Whom the Bell Tolls")
    text6 = preprocessor(textFetcher.getText())
    docs = [text1, text2, text3, text4, text5, text6]
And since we have our dataset ready, we can move on to the algorithm implementation.
Latent Dirichlet Allocation implementation using Scikit-Learn
The scikit-learn package has an excellent implementation of the LDA Algorithm. We are going to use this for today's purpose.
pip3 install scikit-learn
The first step is to convert our words into numbers. Although we are working with words, most text processing tasks are done on numbers, because they are easier for computers to work with.
The CountVectorizer class from the scikit-learn package can convert our documents into vectors of word counts. So let's do that with our dataset.
from sklearn.feature_extraction.text import CountVectorizer

countVectorizer = CountVectorizer(stop_words='english')
termFrequency = countVectorizer.fit_transform(docs)
# On scikit-learn versions before 1.0, this method is called get_feature_names()
featureNames = countVectorizer.get_feature_names_out()
Now let's apply the Latent Dirichlet Allocation algorithm on our word vectors and let's print out our results. For each topic we'll print the first 10 words.
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=3)
lda.fit(termFrequency)

for idx, topic in enumerate(lda.components_):
    print("Topic ", idx, " ".join(featureNames[i] for i in topic.argsort()[:-11:-1]))
And the result looks like this:
Topic 0 berlin city capital german learning machine natural germany world data
Topic 1 novel fitzgerald gatsby great american published war book following considered
Topic 2 london city largest world europe populous area college westminster square
As we've discussed earlier, this info might not tell you much, but it's enough for us to correctly classify a new text about Paris (after we vectorize it the same way).
text7 = preprocessor(TextFetcher("Paris").getText())
print(lda.transform(countVectorizer.transform([text7])))
And the result is this:
[[0.17424998 0.10191793 0.72383209]]
These 3 values are the probabilities that our text belongs to each of the 3 topics generated by the LDA algorithm. The highest probability (72%) tells us that this text belongs to the 3rd topic, the one that talks about cities. This is a very good result obtained from a very small dataset.
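Picking the winning topic from such a row of probabilities is just an argmax; the numbers below are copied from the output above.

```python
# Probability row produced by lda.transform for the Paris summary
probs = [0.17424998, 0.10191793, 0.72383209]

# Index of the highest probability = the assigned topic
best = probs.index(max(probs))
print(best)  # 2, i.e. the "cities" topic
```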
In this article we've gone through a general overview of the Latent Dirichlet Allocation algorithm, then built a small dataset of our own and tested the algorithm on it. I am very satisfied with the result and I hope you are too.
Thank you so much for reading this!