Introduction
In this tutorial, we will go through an in-depth understanding of lemmatization in the NLTK library. We will first understand what lemmatization is in general and why it is used, then see how to use the NLTK lemmatizer with the WordNetLemmatizer module through a few examples. We will also see the most common scenario in which NLTK lemmatization does not work and how to resolve it. Finally, we will discuss the difference between stemming and lemmatization.
What is Lemmatization in NLP
Lemmatization is the process of grouping together the different inflected forms of a word under its root, or lemma, for better NLP analysis and operations. The lemmatization algorithm removes affixes from inflected words to convert them into their base (lemma) form. For example, "running" and "runs" are both converted to their lemma "run".
Lemmatization initially looks similar to stemming, but unlike stemming, lemmatization first understands the context of a word by analyzing the surrounding words and then converts it into its lemma form. For example, the lemmatization of the word "leaves" can be either "leaf" (noun) or "leave" (verb), depending on how the word is used in the sentence.
Why is Lemmatization Used?
Since lemmatization converts words to their base form, or lemma, by removing affixes from inflected words, it helps to create better NLP models like Bag of Words and TF-IDF that depend on the frequency of words. By shrinking the vocabulary, it also improves computational efficiency.
NLTK Lemmatization with WordNetLemmatizer
Wordnet is a popular lexical database of the English language that is used by NLTK internally. WordNetLemmatizer is an NLTK lemmatizer built using the Wordnet database and is quite widely used.
There is, however, one catch that troubles beginners a lot and makes NLTK lemmatization appear not to work.
The NLTK lemmatizer requires POS tag information to be provided explicitly; otherwise, it assumes the POS to be a noun by default, and the lemmatization will not give the right results.
Let us understand all this with more examples.
Examples of NLTK Lemmatization
i) WordNetLemmatizer() without POS tag
In the two examples below, we first tokenize the text and then lemmatize each token in a for loop using WordNetLemmatizer.lemmatize().
However, we use the default settings of WordNetLemmatizer.lemmatize() and do not provide a POS tag. Due to this, it internally assumes the default tag of noun ('n'), and hence lemmatization does not work properly.
In the 1st example, the lemma returned for "jumped" is "jumped" and for "breathed" it is "breathed". Similarly, in the 2nd example, the lemma for "running" is returned as "running" only.
Clearly, lemmatization does not work when we do not pass POS tags to the NLTK lemmatizer.
Example 1
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
text = "She jumped into the river and breathed heavily"
wordnet = WordNetLemmatizer()
tokens = word_tokenize(text)
for token in tokens:
    print(token, "--->", wordnet.lemmatize(token))

She ---> She
jumped ---> jumped
into ---> into
the ---> the
river ---> river
and ---> and
breathed ---> breathed
heavily ---> heavily
Example 2
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
text = "I am running and I usually use to runs"
wordnet = WordNetLemmatizer()
tokens = word_tokenize(text)
for token in tokens:
    print(token, "--->", wordnet.lemmatize(token))

I ---> I
am ---> am
running ---> running
and ---> and
I ---> I
usually ---> usually
use ---> use
to ---> to
runs ---> run
ii) WordNetLemmatizer() with POS tags
In the below examples, after tokenization, we first derive the POS tag of each token and then pass both the token and the tag to WordNetLemmatizer.lemmatize().
This time we can see that lemmatization is done properly: "jumped" becomes "jump", "breathed" becomes "breathe", and "running" becomes "run".
Example 1
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize,pos_tag
text = "She jumped into the river and breathed heavily"
wordnet = WordNetLemmatizer()
for token, tag in pos_tag(word_tokenize(text)):
    pos = tag[0].lower()
    if pos not in ['a', 'r', 'n', 'v']:
        pos = 'n'
    print(token, "--->", wordnet.lemmatize(token, pos))

She ---> She
jumped ---> jump
into ---> into
the ---> the
river ---> river
and ---> and
breathed ---> breathe
heavily ---> heavily
Example 2
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize,pos_tag
text = "I am running and I usually use to runs"
wordnet = WordNetLemmatizer()
for token, tag in pos_tag(word_tokenize(text)):
    pos = tag[0].lower()
    if pos not in ['a', 'r', 'n', 'v']:
        pos = 'n'
    print(token, "--->", wordnet.lemmatize(token, pos))

I ---> I
am ---> be
running ---> run
and ---> and
I ---> I
usually ---> usually
use ---> use
to ---> to
runs ---> run
Application of Lemmatization
- Lemmatization is used to reduce text redundancy by converting words having the same meaning but different inflected forms to their base form.
- The reduced vocabulary of the lemmatized text helps to create NLP models that are both more accurate and computationally faster.
Stemming vs Lemmatization
Although the two look quite similar, there are key differences between stemming and lemmatization:
- The output of lemmatization is an actual word like Changing -> Change but stemming may not produce an actual English word like Changing -> Chang.
- The stemming process just follows a step-by-step algorithm such as Snowball or Porter to derive the stem, whereas lemmatization makes use of a lookup database like WordNet to derive the lemma. For example, the lemmatization of "better" (as an adverb) is "well"; this different word is produced as the lemma because of the dictionary lookup, whereas the stemming result remains "better" since no lookup is done. However, this lookup can at times slow down the lemmatization process.
- Stemming does not take the context of the word into account, for example, “meeting” can be a verb or noun based on the context. But lemmatization does consider the context of the word before generating its lemma.
Stemming vs Lemmatization Example
In the example code below, we POS-tag a list of words and then, in a for loop, stem each token with the Snowball Stemmer and, at the same time, lemmatize it with the WordNet Lemmatizer so the two outputs can be compared side by side.
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk import word_tokenize, pos_tag

snowball = SnowballStemmer(language='english')
wordnet = WordNetLemmatizer()
text = ["better", "Caring", "are", "am", "worse", "struggling", "meeting"]
print("{0:12}{1:20}{2:30}".format("Word", "Snowball Stemmer", "Wordnet Lemmatizer"))
for token, tag in pos_tag(text):
    pos = tag[0].lower()
    if pos not in ['a', 'r', 'n', 'v']:
        pos = 'n'
    print("{0:12}{1:20}{2:30}".format(token, snowball.stem(token), wordnet.lemmatize(token, pos)))

Word        Snowball Stemmer    Wordnet Lemmatizer
better      better              well
Caring      care                Caring
are         are                 be
am          am                  be
worse       wors                worse
struggling  struggl             struggle
meeting     meet                meeting
Conclusion
So, coming to the end of the article, I hope you now understand the NLTK lemmatizer module WordNetLemmatizer and how to use it properly without running into the issue of it not working due to missing POS tags. We also discussed the applications of lemmatization and the difference between stemming and lemmatization.
Reference – NLTK Documentation
This is Afham Fardeen, who loves the field of Machine Learning and enjoys reading and writing on it. The idea of enabling a machine to learn fascinates me.