Lemmatization and Stemming are two words you hear all the time when reading about NLP projects. The reason is that they are very important tools that can help us build more impressive and efficient NLP projects. More importantly, we use them every day: if you've come to this blog through Google, chances are you've already used one of these two techniques.
Interested in more stories like this? Follow me on Twitter at @b_dmarius and I'll post there every new article.
- What is lemmatization in NLP
- Lemmatization algorithms
- What is stemming in NLP
- Best libraries for stemming and lemmatization
What is Lemmatization in NLP
Lemmatization in NLP is the process through which several different forms of the same word are mapped to one single form, which we can call the root form or the base form. In more technical terms, the root form is called a lemma. By reducing the number of forms a word can take, we make sure that we reduce our data space and that we don’t have to check every single form of a word. It helps us ignore morphological variations on a single word.
For example, let’s say you have a collection of documents and you want to retrieve every document in which anything related to singing is mentioned. You’d have to search for “sing”, “sings”, “singing”, “sang”, “sung” and so on. Or, as a programmer building a search engine, you’d have to write a lot of tedious and ugly code to handle all those cases. And English is a relatively simple, well-structured language; many other languages have far more complex variations.
This is where lemmatization can help us. We can reduce all of these forms to simply “sing” and use this lemma for our search.
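To make this concrete, here is a minimal sketch of lemma-based matching. The `LEMMAS` dictionary and `matches_query` function are hypothetical, hand-built for illustration — a real system would use a full lemmatizer instead of a tiny lookup table:

```python
# Toy lemma dictionary (hypothetical, hand-built for illustration)
LEMMAS = {
    "sing": "sing", "sings": "sing", "singing": "sing",
    "sang": "sing", "sung": "sing",
}

def matches_query(document_words, query_lemma):
    """Return True if any word in the document maps to the query lemma."""
    return any(LEMMAS.get(word, word) == query_lemma for word in document_words)

print(matches_query(["she", "sang", "beautifully"], "sing"))  # True
print(matches_query(["he", "walked", "home"], "sing"))        # False
```

Instead of searching for five different word forms, we search for one lemma.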
Lemmatization can be done while storing the data or while doing the search. Ideally, you’d have to store both the original form and the lemma and use them as appropriate.
Lemmatization can be used in most of NLP projects or any type of Machine Learning project which deals with words. It is useful because it helps you normalize words and reduce the dimensions of your space. I've used it for building knowledge graphs and it can be pretty much used for any information extraction tasks.
So how are lemmatizers built then? It can be and it usually is a combination of different techniques. With language being such a complex space with many different rules and exceptions, there’s no way to guarantee that a simple technique or algorithm would solve all cases.
Lemmatization approach #1
For basic words, lemmatization is as simple as looking in a dictionary and fetching the right lemma from there.
Let's say you have the words "walked" and "walking" and you want to extract the lemma: "walk". Then you'd have to write some basic rules for this:
- Is the word ending in "ed"? Remove the suffix and keep the rest of the word as lemma
- Is the word ending in "ing"? Remove the suffix and keep the lemma.
Cool, it works! But what if you encounter the word "bed"? Is it already a lemma? Do I have to remove the "ed" and keep only "b"? 🙃
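The two rules above can be sketched in a few lines — and the sketch fails exactly where the text says it does. This is a deliberately naive illustration, not a real lemmatizer:

```python
def naive_lemma(word):
    """Naive suffix-stripping rules with no part-of-speech awareness."""
    if word.endswith("ing"):
        return word[:-3]
    if word.endswith("ed"):
        return word[:-2]
    return word

print(naive_lemma("walked"))   # "walk"
print(naive_lemma("walking"))  # "walk"
print(naive_lemma("bed"))      # "b" -- "bed" was already a lemma!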
Lemmatization approach #2
Let's get back to the drawing board and see what happened here. We had a rule that was working well and then we hit a problem. The problem was that we had different Parts of Speech and we tried to apply the same rule.
That's why we need an improved approach that will also take into account parts of speech. There are lots of NLP libraries out there that can help us figure out parts of speech for given words.
- Is it a verb -> use one of these rules
- Is it a noun -> use one of the other rules
But this approach can also become very complicated, because even inside a PoS category we may encounter different exceptions.
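A sketch of approach #2 might look like the following. Here the part-of-speech tag is assumed to come from a separate tagger (which most NLP libraries provide), and the rules themselves are still toy ones:

```python
def pos_aware_lemma(word, pos):
    """Apply suffix rules only to parts of speech where they make sense.

    The `pos` tag is assumed to be supplied by an external PoS tagger.
    """
    if pos == "VERB":
        if word.endswith("ing"):
            return word[:-3]
        if word.endswith("ed"):
            return word[:-2]
    # Nouns (and everything else) are left untouched by these rules
    return word

print(pos_aware_lemma("walked", "VERB"))  # "walk"
print(pos_aware_lemma("bed", "NOUN"))     # "bed" -- no longer mangled
```

Note that even this version still needs exception lists: an irregular verb like "sang" won't be handled by any suffix rule.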
Lemmatization approach #3
Use approach #2 and improve it with a Machine Learning model trained on an annotated dataset. I know, datasets are difficult to build or expensive to acquire, and the perfect, complete dataset may not exist yet (or hasn't been published). But NLP problems are genuinely hard, and this approach, although the most difficult one, can get you the best results on the hardest cases.
What is Stemming in NLP
Stemming in NLP is the process of removing prefixes and suffixes from words so that they are reduced to simpler forms which are called stems. The purpose of stemming is the same as with lemmatization: to reduce our vocabulary and dimensionality for NLP tasks and to improve speed and efficiency in information retrieval and information processing tasks.
Stemming is a simpler, faster process than lemmatization, and for simpler use cases it can have the same effect. The difference is that stemming is usually only a rule-based approach. And, as we've shown with our earlier example, rule-based approaches can fail very quickly on more complex examples. But for many problems it works well enough. I've used stemming very successfully for the TextRank algorithm while performing keyword extraction.
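To see what "rule-based" means in practice, here is a tiny stemmer sketch — far cruder than a real algorithm like Porter's, and the suffix list is entirely made up for illustration. Notice that a stem doesn't have to be a real word, which is the key difference from a lemma:

```python
def toy_stem(word):
    """A tiny rule-based stemmer sketch (much cruder than Porter's algorithm)."""
    for suffix in ("sses", "ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            if suffix == "sses":
                return word[:-2]        # "caresses" -> "caress"
            if suffix == "ies":
                return word[:-3] + "i"  # "ponies" -> "poni"
            return word[:-len(suffix)]
    return word

print(toy_stem("caresses"))  # "caress"
print(toy_stem("ponies"))    # "poni" -- not a real word, but a valid stem
print(toy_stem("walking"))   # "walk"
```

"poni" is not an English word, and a stemmer doesn't care: it only needs all forms of "pony" to collapse to the same string.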
Many search engines use stemming to improve their search results. In fact, a fun fact I found on Wikipedia is that Google introduced stemming in 2003; prior to that, searching for "fish" would not have returned "fishing" in the results.
Stemming algorithms have been around for a few decades now, but only a handful are popular, and most companies that use stemming in their products use their own flavours of these algorithms.
Best libraries for stemming and lemmatization
Now let's see how we can play with stemming and lemmatization with some of the more popular free libraries around the internet.
First let's install nltk.
```shell
pip3 install nltk
```
We are going to use the Porter Stemmer from the NLTK library.
```python
from nltk import PorterStemmer

sentence = ["This", "sentence", "was", "transformed", "using", "Porter", "Stemmer"]
porterStemmer = PorterStemmer()
print(" ".join([porterStemmer.stem(word) for word in sentence]))
# Prints "thi sentenc wa transform use porter stemmer"
```
Now let's compare the results with those obtained through lemmatization. The NLTK lemmatizer is backed by WordNet, so we first need to download the WordNet corpus. Lucky for us, NLTK already provides a utility function to do that.
```python
import nltk
nltk.download('wordnet')
```
Then we can use WordNetLemmatizer.
```python
from nltk.stem import WordNetLemmatizer

sentence = ["This", "sentence", "was", "transformed", "using", "WordNet", "Lemmatizer"]
lemmatizer = WordNetLemmatizer()
print(" ".join([lemmatizer.lemmatize(word) for word in sentence]))
# Prints "This sentence wa transformed using WordNet Lemmatizer"
```
We can see the words obtained through the lemmatizer seem closer to the truth than those obtained with the stemmer. But depending on the needs of your project, stemming might just be enough for you.
spaCy is another popular open-source NLP library that is very powerful and well suited to both small and complex projects. spaCy does not provide any stemming functionality, but lemmatization is done by default when processing sentences with spaCy.
First let's install spacy and download the spacy model for English.
```shell
pip3 install spacy
python3 -m spacy download en_core_web_sm
```
Now let's use spacy for lemmatization.
```python
import spacy

nlp_model = spacy.load('en_core_web_sm')
tokens = nlp_model("This sentence was transformed using Spacy Lemmatization")
print(" ".join(token.lemma_ for token in tokens))
# Prints "this sentence be transform use Spacy Lemmatization"
```
For me this looks better than the results from nltk, and moreover, it's better than I even expected.
Another wonderful library for NLP is Gensim. This library also includes the Porter Stemmer and it's as easy to use as with NLTK.
Let's install gensim.
```shell
pip3 install gensim
```
With only a few lines of code, we are able to replicate the same example.
```python
from gensim.parsing.porter import PorterStemmer

sentence = ["This", "sentence", "was", "transformed", "using", "Porter", "Stemmer"]
porterStemmer = PorterStemmer()
print(" ".join([porterStemmer.stem(word) for word in sentence]))
# Prints "thi sentenc wa transform us porter stemmer"
```
This is almost identical to the result from NLTK, except that NLTK returned "use" while Gensim returned "us". Depending on your preferred stack for NLP projects, either result might be good enough for you.
The last library on our list today is TextBlob so let's quickly install it.
```shell
pip3 install textblob
```
Lemmatization with TextBlob is also very easy, just as with the other libraries.
```python
from textblob import TextBlob

sentence = TextBlob('This sentence was transformed using TextBlob Lemmatization.')
print(" ".join([word.lemmatize() for word in sentence.words]))
# Prints "This sentence wa transformed using TextBlob Lemmatization"
```
Personally, I am more satisfied with the results obtained from spaCy. But again, it's a matter of your preferred stack of tools.
So we've come to the end of our complete guide on stemming and lemmatization. We've seen an introductory overview of the two techniques and then tried out stemming and lemmatization in NLTK, spaCy, Gensim and TextBlob, four of the more popular open-source NLP libraries in the Python ecosystem. You can see here how I've used lemmatization to build a knowledge graph. You can also check all the code we've used for this article in this Github repository (don't forget to star the repo while you're there).
Thank you so much for reading this! Let me know on Twitter at @b_dmarius how you've used lemmatization and stemming in your projects.