Learn Lemmatization in NLTK with Examples

Introduction

In this tutorial, we will take an in-depth look at lemmatization in the NLTK library. We will first understand what lemmatization is in general and why it is used, and then see how to use the NLTK lemmatizer through the WordNetLemmatizer module with a few examples. We will also look at the most common scenario in which NLTK lemmatization does not work and how to resolve it. Finally, we will discuss the difference between stemming and lemmatization.

What is Lemmatization in NLP

Lemmatization is the process of grouping together the different inflected forms of a word that share the same root, or lemma, so that they can be treated as a single item in NLP analysis and operations. The lemmatization algorithm removes affixes from the inflected words to convert them into their base (lemma) form. For example, “running” and “runs” are both converted to the lemma “run”.

Lemmatization looks similar to stemming at first, but unlike stemming, it takes the role of the word in the sentence (its part of speech) into account before converting it into its lemma. For example, the lemma of the word “meeting” can be either “meeting” (as a noun) or “meet” (as a verb), depending upon the use of the word in the sentence.

Why is Lemmatization Used?

Since lemmatization converts inflected words to a single base form, it helps frequency-based representations such as Bag of Words and TF-IDF: the counts for “run”, “runs” and “running” are merged instead of being spread over separate features. The smaller vocabulary also improves computational efficiency.
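
To make the effect on word frequencies concrete, here is a minimal sketch in plain Python. It does not call NLTK: the TOY_LEMMAS table and the bag_of_words helper are hypothetical names written by hand for illustration, standing in for whatever lemmatizer a real pipeline would use.

```python
from collections import Counter

# Hypothetical hand-written lemma table, for illustration only;
# a real pipeline would obtain lemmas from a lemmatizer instead.
TOY_LEMMAS = {"running": "run", "runs": "run", "ran": "run"}


def bag_of_words(tokens, lemmas=None):
    """Count tokens, optionally mapping each one to its lemma first."""
    lemmas = lemmas or {}
    return Counter(lemmas.get(tok, tok) for tok in tokens)


tokens = ["she", "runs", "he", "ran", "they", "are", "running"]

raw_counts = bag_of_words(tokens)                # every surface form is its own feature
lemma_counts = bag_of_words(tokens, TOY_LEMMAS)  # inflected forms share one feature

print(len(raw_counts), "distinct features without lemmatization")
print(len(lemma_counts), "distinct features with lemmatization")
print("count for 'run':", lemma_counts["run"])
```

The three forms of “run” collapse into one feature with a combined count, which is what frequency-based models benefit from.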

NLTK Lemmatizer

NLTK Lemmatization with WordNetLemmatizer

WordNet is a popular lexical database of the English language that is used by NLTK internally. WordNetLemmatizer is an NLTK lemmatizer built on the WordNet database and is quite widely used.

There is, however, one catch that often makes NLTK lemmatization appear not to work, and it troubles beginners a lot.

The NLTK lemmatizer requires the POS tag to be provided explicitly. Otherwise, it assumes that the word is a noun, and the results are often wrong for verbs, adjectives and adverbs.

Let us understand all this with more examples.

Examples of NLTK Lemmatization

Here we will show you two sets of examples of lemmatization using WordNetLemmatizer: without POS tags and with POS tags.

i) WordNetLemmatizer() without POS tag

In the two examples below, we first tokenize the text and then lemmatize each token in a for loop using WordNetLemmatizer.lemmatize().

However, we use the default settings of WordNetLemmatizer.lemmatize() and do not provide a POS tag. As a result, it treats every token as a noun (‘n’), and verbs are not lemmatized properly.

In the 1st example, the lemma returned for “jumped” is “jumped” and for “breathed” it is “breathed”. Similarly, in the 2nd example the lemma for “running” is returned as “running”. Only “runs” becomes “run”, because it is also a valid plural noun.

Clearly, lemmatization does not work as expected when we do not pass POS tags to the NLTK lemmatizer.

Example 1

In [1]:
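from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time setup if the data is missing: nltk.download('punkt') and
# nltk.download('wordnet'). Resource names can differ between NLTK versions.

text = "She jumped into the river and breathed heavily"
wordnet = WordNetLemmatizer()
tokens = word_tokenize(text)

# No POS tag is passed, so every token is treated as a noun
for token in tokens:
    print(token, "--->", wordnet.lemmatize(token))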
[Out] :
She ---> She
jumped ---> jumped
into ---> into
the ---> the
river ---> river
and ---> and
breathed ---> breathed
heavily ---> heavily

Example 2

In [2]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

text = "I am running and I usually use to runs"

wordnet = WordNetLemmatizer()
tokens = word_tokenize(text)

# No POS tag is passed, so every token is treated as a noun
for token in tokens:
    print(token, "--->", wordnet.lemmatize(token))
[Out] :
I ---> I
am ---> am
running ---> running
and ---> and
I ---> I
usually ---> usually
use ---> use
to ---> to
runs ---> run

ii) WordNetLemmatizer() with POS tags

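In the examples below, after tokenization we first derive the POS tag of each token with pos_tag() and then pass both the token and its POS to WordNetLemmatizer.lemmatize(). pos_tag() returns Penn Treebank tags such as VBD or NN, so the code keeps only the first letter, lowercased, and falls back to ‘n’ when that letter is not one of the WordNet POS values ‘a’, ‘r’, ‘n’ or ‘v’. Note that with this shortcut adjective tags, which start with “J”, also fall back to ‘n’.

This time we can see that lemmatization is done properly: “jumped” becomes “jump”, “breathed” becomes “breathe” and “running” becomes “run”.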

Example 1

In [3]:
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize, pos_tag

text = "She jumped into the river and breathed heavily"
wordnet = WordNetLemmatizer()

for token, tag in pos_tag(word_tokenize(text)):
    # First letter of the Penn Treebank tag, e.g. 'VBD' -> 'v'
    pos = tag[0].lower()

    # Anything that is not a WordNet POS letter falls back to noun
    if pos not in ['a', 'r', 'n', 'v']:
        pos = 'n'

    print(token, "--->", wordnet.lemmatize(token, pos))
[Out] :
She ---> She
jumped ---> jump
into ---> into
the ---> the
river ---> river
and ---> and
breathed ---> breathe
heavily ---> heavily

Example 2

In [4]:
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize, pos_tag

text = "I am running and I usually use to runs"
wordnet = WordNetLemmatizer()

for token, tag in pos_tag(word_tokenize(text)):
    # First letter of the Penn Treebank tag, e.g. 'VBG' -> 'v'
    pos = tag[0].lower()

    # Anything that is not a WordNet POS letter falls back to noun
    if pos not in ['a', 'r', 'n', 'v']:
        pos = 'n'

    print(token, "--->", wordnet.lemmatize(token, pos))
[Out] :
I ---> I
am ---> be
running ---> run
and ---> and
I ---> I
usually ---> usually
use ---> use
to ---> to
runs ---> run
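
The first-letter shortcut used in the loops above sends adjective tags to the noun default, because Penn Treebank adjective tags (JJ, JJR, JJS) start with “J”, not “a”. One possible refinement is a small mapping helper. The sketch below is only an illustration: the name penn_to_wordnet is hypothetical and not part of NLTK, and the prefix rules are an assumption about which tags you want to treat as adjectives, verbs, adverbs and nouns.

```python
def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS letter.

    Hypothetical helper for illustration; it is not an NLTK function.
    """
    if tag.startswith("J"):    # JJ, JJR, JJS -> adjective
        return "a"
    if tag.startswith("V"):    # VB, VBD, VBG, VBN, VBP, VBZ -> verb
        return "v"
    if tag.startswith("RB"):   # RB, RBR, RBS -> adverb (RP, a particle, is excluded)
        return "r"
    if tag.startswith("N"):    # NN, NNS, NNP, NNPS -> noun
        return "n"
    return default             # everything else: the lemmatizer's default, noun


# Show the mapping for a few common tags
for tag in ["JJ", "VBD", "RB", "RP", "NNS", "DT"]:
    print(tag, "--->", penn_to_wordnet(tag))
```

With such a helper, the call in the loop body would become wordnet.lemmatize(token, penn_to_wordnet(tag)).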

Application of Lemmatization

  • Lemmatization is used to reduce redundancy in text by converting words that have the same meaning but different inflected forms to their base form.
  • The resulting smaller vocabulary helps to create NLP models that are more efficient and computationally faster.

Stemming vs Lemmatization

Although both look quite similar, there are key differences between stemming and lemmatization:

  • The output of lemmatization is an actual word, like Changing -> Change, but stemming may not produce an actual English word, like Changing -> Chang.
  • Stemming simply applies the rules of an algorithm such as Snowball or Porter step by step to derive the stem, whereas lemmatization makes use of a lookup database like WordNet to derive the lemma. For example, the lemma of “better” is “well” when it is tagged as an adverb (and “good” when it is tagged as an adjective). This different word can be returned only because it is looked up in the dictionary, whereas stemming, with no lookup, returns “better”. However, this lookup can at times slow down the lemmatization process.
  • Stemming does not take the context of the word into account. For example, “meeting” can be a verb or a noun depending on the context, yet a stemmer always returns the same stem. Lemmatization does consider the context, through the POS tag, before generating the lemma.
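
The differences listed above can be illustrated with a deliberately tiny sketch in plain Python. Everything in it is hypothetical and made up for illustration: toy_stem strips a few suffixes by rule, and toy_lemmatize looks up a hand-written dictionary keyed by word and POS. Real stemmers such as Porter or Snowball, and the WordNet lookup, are far more elaborate.

```python
# Hypothetical toy rules and dictionary, for illustration only.
SUFFIXES = ("ing", "ed", "s")
TOY_DICTIONARY = {
    ("changing", "v"): "change",
    ("better", "r"): "well",
    ("better", "a"): "good",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
}


def toy_stem(word):
    """Rule-based: strip the first matching suffix, with no dictionary and no POS."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word, pos):
    """Lookup-based: return the dictionary entry for (word, pos), else the word itself."""
    word = word.lower()
    return TOY_DICTIONARY.get((word, pos), word)


for word, pos in [("Changing", "v"), ("better", "r"), ("meeting", "n"), ("meeting", "v")]:
    print(word, pos, "stem:", toy_stem(word), "lemma:", toy_lemmatize(word, pos))
```

The stem of “meeting” is the same for both readings, while the lemma changes with the POS passed in, which is the context sensitivity described in the last bullet.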

Stemming vs Lemmatization Example

In the example code below, we POS-tag a list of words and then, in a for loop, stem each token with the Snowball stemmer and lemmatize it with the WordNet lemmatizer, passing the POS derived from its tag.

In [5]:
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk import pos_tag

snowball = SnowballStemmer(language='english')
wordnet = WordNetLemmatizer()

# "strugling" is misspelled, so WordNet has no entry for it and the lemmatizer returns it unchanged
text = ["better", "Caring", "are", "am", "worse", "strugling", "meetings"]
print("{0:10}{1:20}{2:30}".format("Word", "Snowball Stemmer", "Wordnet Lemmatizer"))
for token, tag in pos_tag(text):
    # First letter of the Penn Treebank tag; fall back to noun otherwise
    pos = tag[0].lower()
    if pos not in ['a', 'r', 'n', 'v']:
        pos = 'n'

    print("{0:10}{1:20}{2:30}".format(token, snowball.stem(token), wordnet.lemmatize(token, pos)))
[Out] :
Word      Snowball Stemmer    Wordnet Lemmatizer            
better    better              well                          
Caring    care                Caring                        
are       are                 be                            
am        am                  be                            
worse     wors                worse                         
strugling strugl              strugling                     
meetings  meet                meeting                       

Conclusion

To conclude, I hope you now understand the NLTK lemmatizer module WordNetLemmatizer and how to use it properly without running into incorrect results caused by missing POS tags. We also discussed the applications of lemmatization in general and compared stemming with lemmatization.

Reference – NLTK Documentation
