Beginner’s Guide to Stemming in Python NLTK

Introduction

In this tutorial, we will learn how to perform stemming with the Python NLTK library for your NLP project. We will briefly cover what stemming is and trace its history. Finally, we will explore the different types of stemmers, along with various examples of stemming in NLTK.

What is Stemming?


Stemming is an NLP process that reduces inflected words to their root forms, which helps normalize text, words, and documents during preprocessing.

As per Wikipedia, inflection is the modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood. So a word may exist in various inflected forms, but having those forms in the same text adds redundancy for NLP operations.

Hence we use stemming to reduce words to their root form, or stem. However, such stems may not be valid words in the language.

For example, the stem of these three words connections, connected, connects, is “connect”. On the other hand, the stem of trouble, troubled, troubles is “troubl” which is not an actual word in itself.
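To make the idea concrete, here is a minimal sketch of suffix stripping in plain Python. It is a toy written for this explanation, not the Porter algorithm or any NLTK stemmer: the `SUFFIXES` list and the `naive_stem` function are hypothetical names, and the rule list is chosen only to cover the six words mentioned above.

```python
# Toy suffix stripper for illustration only; this is NOT the Porter algorithm.
# SUFFIXES is a hypothetical rule list, ordered so longer suffixes are tried first.
SUFFIXES = ("ions", "ion", "ing", "ed", "es", "s", "e")

def naive_stem(word):
    """Lowercase the word and strip the first suffix that matches."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["connections", "connected", "connects", "trouble", "troubled", "troubles"]:
    print(w, "--->", naive_stem(w))
```

Even this crude version maps the first three words to "connect" and the last three to "troubl", which shows why a stem does not have to be a dictionary word. Real stemmers use many more rules and conditions to avoid the mistakes a list like this would make on other words.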

History of Stemming

The first published stemmer was written by Julie Beth Lovins in 1968. The paper was remarkable for its time and greatly influenced later work in this area. It refers to three earlier major attempts at stemming algorithms: one by Professor John W. Tukey of Princeton University, one developed at Harvard University by Michael Lesk under the direction of Professor Gerard Salton, and a third developed by James L. Dolby of R and D Consultants, Los Altos, California.

A later stemmer was written by Martin Porter and was published in the July 1980 issue of the journal Program. This stemmer was very widely used and became the de facto standard algorithm used for English stemming. Dr. Porter received the Tony Kent Strix award in 2000 for his work on stemming and information retrieval.

Why is Stemming Important?

As we discussed above, the English language has many variations of a single word. Having these variations in a text corpus creates redundancy of data while creating NLP or machine learning models. Such models may not turn out to be effective.

To create a robust model, it’s vital to normalize text by eliminating redundancy in words and converting them to root form using stemming.

Application of Stemming

Stemming is used in information retrieval, text mining, SEO, web search, indexing, tagging systems, and vocabulary analysis. For example, searching for "prediction" and "predicted" shows similar results in Google.
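The retrieval use case can be sketched in a few lines. The example below is a simplified illustration, not how any real search engine works: the `documents` dictionary and the `stem_tokens` and `search` helpers are made up for this sketch, and tokens are split on whitespace only. It uses NLTK's PorterStemmer so that a query and a document match when they share a stem.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Hypothetical mini-corpus; the ids and texts are invented for illustration.
documents = {
    1: "the model predicted rain",
    2: "a prediction of rain",
    3: "sunny weather all week",
}

def stem_tokens(text):
    """Split on whitespace and return the set of stems."""
    return {porter.stem(token) for token in text.split()}

def search(query):
    """Return the ids of documents that share at least one stem with the query."""
    query_stems = stem_tokens(query)
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if query_stems & stem_tokens(text)
    )

print(search("prediction"))
print(search("predicted"))
```

Both queries return documents 1 and 2, because "prediction" and "predicted" reduce to the same stem, while document 3 is never matched.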

Types of Stemmer in NLTK

There are many stemming algorithms, and several of the most widely used ones are available in Python NLTK. Let us see them below.

1. Porter Stemmer – PorterStemmer()

Porter Stemmer or Porter algorithm was developed by Martin Porter in 1980. The algorithm employs five phases of word reduction, each with its own set of mapping rules. Porter Stemmer is one of the oldest stemmers and is known for its simplicity and speed. The resulting stem is often a shorter word having the same root meaning.

In NLTK, there is a class PorterStemmer() that implements the Porter stemming algorithm. Let’s explore it with the help of an example.
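To give a feel for what a "set of mapping rules" looks like, the sketch below implements only Step 1a of the published Porter algorithm, which handles plural endings (SSES becomes SS, IES becomes I, SS stays SS, and a final S is dropped). The names `STEP_1A_RULES` and `porter_step_1a` are chosen for this illustration, and the code is a reading of the published rules rather than NLTK's implementation, which may differ in detail.

```python
# Step 1a of the published Porter algorithm (plural handling):
#   SSES -> SS, IES -> I, SS -> SS, S -> (removed)
# Rules are listed longest suffix first; only the first match is applied.
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def porter_step_1a(word):
    """Apply the first Step 1a rule whose suffix matches the word."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "--->", porter_step_1a(w))
```

The remaining steps follow the same pattern of ordered suffix rules, with extra conditions on the part of the word that is left once the suffix is removed.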

Example of PorterStemmer()

In the example below, we create an instance of PorterStemmer() to stem the list of words using the Porter algorithm.

In [1]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()
words = ['Connects','Connecting','Connections','Connected','Connection','Connectings','Connect']

for word in words:
    print(word,"--->",porter.stem(word))
[Out] :
Connects ---> connect
Connecting ---> connect
Connections ---> connect
Connected ---> connect
Connection ---> connect
Connectings ---> connect
Connect ---> connect

2. Snowball Stemmer – SnowballStemmer()

Snowball Stemmer was also developed by Martin Porter. The algorithm used here is more accurate and is known as “English Stemmer” or “Porter2 Stemmer”. It offers a slight improvement over the original Porter Stemmer, both in logic and speed.

In NLTK, there is a class SnowballStemmer() that implements the Snowball stemming algorithm. Let’s explore this type of stemming with the help of an example.

Example of SnowballStemmer()

In the example below, we first create an instance of SnowballStemmer() to stem the list of words using the Snowball algorithm.

In [2]:
from nltk.stem import SnowballStemmer

snowball = SnowballStemmer(language='english')
words = ['generous','generate','generously','generation']

for word in words:
    print(word,"--->",snowball.stem(word))
[Out] :
generous ---> generous
generate ---> generat
generously ---> generous
generation ---> generat

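The Snowball stemmer supports the following languages. Note that the 'porter' entry in the list selects the Porter algorithm for English rather than a separate language.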

In [3]:
print(SnowballStemmer.languages)
[Out] :
('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
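As a small illustration of the language argument, the sketch below builds two stemmers from the list above, 'english' (Porter2) and 'porter', and runs the same words through both. It also shows that a name outside the list is rejected when the stemmer is constructed; the name "klingon" is only a placeholder for an unsupported value.

```python
from nltk.stem import SnowballStemmer

english = SnowballStemmer("english")   # the Porter2 algorithm
original = SnowballStemmer("porter")   # the Porter algorithm, kept for comparison

for word in ["fairly", "generously"]:
    print(word, "--->", english.stem(word), "|", original.stem(word))

# A language name that is not in SnowballStemmer.languages raises ValueError.
try:
    SnowballStemmer("klingon")
except ValueError as err:
    print("Error:", err)
```

For "fairly" the two algorithms disagree: the 'english' stemmer returns "fair", while the 'porter' stemmer returns "fairli".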

3. Lancaster Stemmer – LancasterStemmer()

Lancaster Stemmer is simple, but it tends to over-stem. Over-stemming yields stems that are not linguistic roots or that carry no meaning.

In NLTK, there is a class LancasterStemmer() that implements the Lancaster stemming algorithm. Let’s understand this with the help of an example.

Example of LancasterStemmer()

In the example below, we first create an instance of LancasterStemmer() and then stem the list of words using the Lancaster algorithm.

In [4]:
from nltk.stem import LancasterStemmer

lancaster = LancasterStemmer()
words = ['eating','eats','eaten','puts','putting']

for word in words:
    print(word,"--->",lancaster.stem(word))
[Out] :
eating ---> eat
eats ---> eat
eaten ---> eat
puts ---> put
putting ---> put

4. Regexp Stemmer – RegexpStemmer()

Regex stemmer uses regular expressions to identify morphological affixes. Any substrings that match the regular expressions will be removed.

In NLTK, there is a class RegexpStemmer() that implements this regex-based stemming. Let’s understand this with the help of an example.

Example of RegexpStemmer()

In this example, we first create an instance of RegexpStemmer() and then stem the list of words using the regular expression we pass in. The min=4 argument leaves words shorter than four characters unchanged.

In [5]:
from nltk.stem import RegexpStemmer

regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)
words = ['mass','was','bee','computer','advisable']

for word in words:
    print(word,"--->",regexp.stem(word))
[Out] :
mass ---> mas
was ---> was
bee ---> bee
computer ---> computer
advisable ---> advis
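The output above can be reproduced with Python's built-in re module, which helps explain why "was" and "bee" are left alone. The function below is a rough equivalent written for this explanation, not NLTK's source code; `regexp_stem` and its `min_length` parameter are names made up for the sketch.

```python
import re

def regexp_stem(word, pattern=r"ing$|s$|e$|able$", min_length=4):
    """Delete whatever the pattern matches, unless the word is shorter than min_length."""
    if len(word) < min_length:
        return word  # too short: returned unchanged, like min=4 above
    return re.sub(pattern, "", word)

for w in ["mass", "was", "bee", "computer", "advisable"]:
    print(w, "--->", regexp_stem(w))
```

"was" and "bee" have only three characters, so the length check returns them as they are, even though they end in "s" and "e". "mass" is long enough, so its final "s" is removed, which shows how blunt a purely pattern-based stemmer can be.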

Porter vs Snowball vs Lancaster vs Regex Stemming in NLTK

Let us compare the results of the different types of stemmers in NLTK with the help of the two examples below:

Example 1:

In [6]:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer(language='english')
regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)

word_list = ["friend", "friendship", "friends", "friendships"]
print("{0:20}{1:20}{2:20}{3:30}{4:40}".format("Word","Porter Stemmer","Snowball Stemmer","Lancaster Stemmer",'Regexp Stemmer'))
for word in word_list:
    print("{0:20}{1:20}{2:20}{3:30}{4:40}".format(word,porter.stem(word),snowball.stem(word),lancaster.stem(word),regexp.stem(word)))
[Out] :
Word                Porter Stemmer      Snowball Stemmer    Lancaster Stemmer             Regexp Stemmer                          
friend              friend              friend              friend                        friend                                  
friendship          friendship          friendship          friend                        friendship                              
friends             friend              friend              friend                        friend                                  
friendships         friendship          friendship          friend                        friendship                              

Example 2:

In [7]:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer(language='english')
regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)

word_list = ['run','runs','running','runner','ran','easily','fairly']
print("{0:20}{1:20}{2:20}{3:30}{4:40}".format("Word","Porter Stemmer","Snowball Stemmer","Lancaster Stemmer",'Regexp Stemmer'))
for word in word_list:
    print("{0:20}{1:20}{2:20}{3:30}{4:40}".format(word,porter.stem(word),snowball.stem(word),lancaster.stem(word),regexp.stem(word)))
[Out] :
Word                Porter Stemmer      Snowball Stemmer    Lancaster Stemmer             Regexp Stemmer                          
run                 run                 run                 run                           run                                     
runs                run                 run                 run                           run                                     
running             run                 run                 run                           runn                                    
runner              runner              runner              run                           runner                                  
ran                 ran                 ran                 ran                           ran                                     
easily              easili              easili              easy                          easily                                  
fairly              fairli              fair                fair                          fairly                                  
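One rough way to summarize a table like this is to count how many distinct stems each stemmer produces for the same word list: fewer distinct stems means more words were merged together. The sketch below does that for the three algorithmic stemmers; the `distinct_stems` helper is a name invented for this illustration, and the count is only an informal indicator, not a standard evaluation metric.

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

stemmers = {
    "Porter": PorterStemmer(),
    "Snowball": SnowballStemmer(language='english'),
    "Lancaster": LancasterStemmer(),
}

word_list = ['run', 'runs', 'running', 'runner', 'ran', 'easily', 'fairly']

def distinct_stems(stemmer, words):
    """Return the set of different stems the stemmer produces for the words."""
    return {stemmer.stem(word) for word in words}

for name, stemmer in stemmers.items():
    stems = distinct_stems(stemmer, word_list)
    print("{0:12}{1:3}  {2}".format(name, len(stems), sorted(stems)))
```

On this word list Lancaster ends up with the fewest distinct stems, which is consistent with its reputation for aggressive stemming.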

Stemming a Text File with NLTK

Till now we have shown small examples of stemming individual words, but what if you have a text file and want to perform stemming on the entire file? Let us understand how to do it.

In the example below, we create a function called stemming that first tokenizes the text using word_tokenize and then reduces each token to its base form using SnowballStemmer.

We append each stem to a list and, at the end, join the items in the list and return the result.

In [8]:
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer

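def stemming(text):
    # Create the Snowball stemmer for English
    snowball = SnowballStemmer(language='english')

    # Collect stems in a list (named stems to avoid shadowing the built-in list)
    stems = []
    # word_tokenize needs NLTK's Punkt tokenizer data to be downloaded beforehand
    for token in word_tokenize(text):
        stems.append(snowball.stem(token))

    return ' '.join(stems)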
In [9]:
with open('text_file.txt') as f:
    text=f.read()
print(stemming(text))
[Out] :
data scienc is an interdisciplinari field that use scientif method , process , algorithm and system to extract knowledg and insight from structur and unstructur data , and appli knowledg and action insight from data across a broad rang of applic domain . data scienc is relat to data mine , machin learn and big data . data scienc is a `` concept to unifi statist , data analysi , informat , and their relat method '' in order to `` understand and analyz actual phenomena '' with data . it use techniqu and theori drawn from mani field within the context of mathemat , statist , comput scienc , inform scienc , and domain knowledg . ture award winner jim gray imagin data scienc as a `` fourth paradigm '' of scienc ( empir , theoret , comput , and now data-driven ) and assert that `` everyth about scienc is chang becaus of the impact of inform technolog '' and the data delug .
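In the output above, punctuation marks such as "," and "``" come through as separate tokens. If you only need the words, one option is to tokenize with a simple regular expression instead of word_tokenize. The sketch below is an alternative under that assumption, not a replacement for NLTK's tokenizer: `stem_words_only` is a hypothetical helper, and the pattern keeps only runs of ASCII letters, so hyphenated words and numbers are split or dropped.

```python
import re
from nltk.stem import SnowballStemmer

def stem_words_only(text):
    """Stem alphabetic tokens only, so punctuation never reaches the stemmer."""
    snowball = SnowballStemmer(language='english')
    # Simple regex tokenizer (an assumption for this sketch, not word_tokenize)
    tokens = re.findall(r"[A-Za-z]+", text)
    return ' '.join(snowball.stem(token) for token in tokens)

print(stem_words_only("Data science is related to data mining."))
```

This version also avoids the need to download tokenizer data, at the cost of a much cruder notion of what counts as a word.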

 

Conclusion

In this tutorial, we explained how to perform stemming with the Python NLTK library for your NLP project. We explored the different types of stemmers in NLTK along with examples, then compared the results produced by the Porter, Snowball, Lancaster, and Regexp stemmers. Finally, we showed how to stem a text file using NLTK.

Reference – NLTK Documentation

  • Afham Fardeen

    This is Afham Fardeen, who loves the field of Machine Learning and enjoys reading and writing on it. The idea of enabling a machine to learn strikes me.
