On Friday, March 4th, 2022, I ran a poll on LinkedIn to get an idea of which text normalization technique people tend to use. 75% of voters went for Lemmatization and the remaining 25% for Stemming. Does that mean the 25% are all wrong, or that the 75% are all right? I don’t think so, because both approaches serve a similar purpose and have their own pros and cons.
In this article, we will cover Stemming and Lemmatization, two widely used concepts in NLP for text data normalization. We will start by defining both concepts, explain how they differ, and provide their Python implementation using the NLTK library.
Understand and implement each concept
In this section, we will understand how each one works through different examples.
Stemming
Stemming is derived from stem, and the stem of a word is the unit to which affixes are attached. Different stemming approaches exist, but we will focus on the most commonly known one for English: the Porter Stemmer, developed in 1980 by Martin Porter. It works by progressively applying a set of suffix-stripping rules until the normalized form is obtained. Let’s see the effect of stemming on the following examples.
# Examples of stemming
sentence1 = "I love to run because running helps runners stay in good shape"
sentence2 = "He is busy doing his business"
sentence3 = "I am there and he will be here soon"
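As a minimal sketch, these sentences can be stemmed with NLTK’s PorterStemmer as shown below (assuming NLTK is installed; a simple whitespace split is used here instead of a full tokenizer to keep the example short):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

for sentence in (sentence1, sentence2, sentence3):
    # Lowercase, split on whitespace, then map every token to its stem
    print({token: stemmer.stem(token) for token in sentence.lower().split()})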
We can find below the results of stemming each of the three sentences. On the right of the arrow (→) is the corresponding stem of each word on the left.
Stemming results for sentences 1, 2 & 3 (Image by Author)
Let’s consider the result for sentence 1. Imagine searching for documents about “running”: we might also want the tool to return those containing “run”, “runs”, or “runners”, which in practice can be helpful. That is exactly the idea behind stemming: it allows different variations of the same word to map to the same stem.
With stemming, the generated base form of a term is not always a meaningful word. For instance, uncle is transformed into uncl.
Looking at the result for sentence 2, we notice that words with different meanings, such as business and busy, got mapped to the same stem (busi), which is problematic.
Lemmatization
Lemmatization is derived from lemma, and the lemma of a word corresponds to its dictionary form. The lemma is determined by the word’s part of speech (whether it is used as an adjective, a noun, or a verb) in the text. Let’s perform lemmatization on the same examples.
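One way to do this with NLTK is the WordNetLemmatizer, passing each token’s part of speech so that the right dictionary form is returned. The sketch below assumes the wordnet and averaged_perceptron_tagger NLTK data packages have been downloaded; the helper function to_wordnet_pos is illustrative, not part of NLTK:

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

def to_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag (from nltk.pos_tag) to the WordNet tag set
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN  # the lemmatizer's default

for sentence in (sentence1, sentence2, sentence3):
    tagged = nltk.pos_tag(sentence.split())
    print({token: lemmatizer.lemmatize(token.lower(), to_wordnet_pos(tag))
           for token, tag in tagged})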
Below are the results for the three sentences with Lemmatization.
Lemmatization results for sentences 1, 2 & 3 (Image by Author)
Looking at the lemmatization results, we can notice that the context is preserved and all the terms are meaningful, which was not the case with stemming. In my opinion, lemmatization is the right approach to adopt for use cases where the context of a word is important, such as chatbots or word-sense disambiguation, just to name a few.
Some benefits and drawbacks
You can find below some benefits ( ✅ ) and drawbacks ( ❌ ) of Stemming and Lemmatization. The list is not exhaustive.
Stemming
✅ A simple rule-based algorithm, straightforward and fast.
✅ Reduces the dimensionality of the word vector space, which can lead to better machine learning performance.
❌ Unable to map terms that take different forms depending on their grammatical construction (e.g. in sentence 3, is, am, and be all represent the same root verb, be).
❌ Words with different meanings can be mapped to the same stem (e.g. busi for both business and busy; see the short check after these lists).
Lemmatization
✅ Generates better results by providing meaningful and actual dictionary words.
❌ More complex process compared to stemming, and much longer computation time.
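To make these points concrete, here is a small check (again a sketch using NLTK’s PorterStemmer and WordNetLemmatizer) of how the two approaches treat business/busy and the inflected forms of the verb to be:

from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Different meanings collapsed to the same stem by the Porter rules
print(stemmer.stem("business"), stemmer.stem("busy"))  # both reduce to "busi"

# Stemming cannot relate these forms, while lemmatization with a verb POS
# should map them to the dictionary form "be", as in the results above
for word in ("is", "am", "be"):
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, wordnet.VERB))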
Final Thoughts
Even though Stemming has some benefits, it should not be blindly adopted, and the user should be aware of its drawbacks. Lemmatization’s longer processing time is also less of an issue nowadays, given modern computation power. One should always clearly understand the business requirements in order to wisely choose the right approach.
Conclusion
Congratulations! 🎉 🍾 You have just learned what Stemming and Lemmatization are and when to use them. I hope you have enjoyed reading this article, and that it will be helpful for your future use cases. Hit the follow button for updates. Also, please find below additional resources to further your learning.
Feel free to add me on LinkedIn or follow me on Twitter and YouTube. It is always a pleasure to discuss AI, ML, Data Science, and NLP!