Introduction
In this tutorial, we will learn how to perform stemming with the Python NLTK library for your NLP project. We will briefly cover what stemming is and trace its history. Finally, we will explore the different types of stemmers along with various examples of stemming in NLTK.
What is Stemming?
Stemming is an NLP process that reduces inflected words to their root forms, which in turn helps to preprocess text, words, and documents for text normalization.
As per Wikipedia, inflection is the modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood. So a word may exist in various inflected forms, but such inflected forms of words in the same text add redundancy for NLP operations.
Hence we use stemming to reduce words to their root form or stem; however, such stems may not themselves be valid words in the language.
For example, the stem of the three words connections, connected, and connects is “connect”. On the other hand, the stem of trouble, troubled, and troubles is “troubl”, which is not an actual word in itself.
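If you want to check these stems yourself, here is a minimal sketch using NLTK’s PorterStemmer, which is covered in detail later in this tutorial (it assumes NLTK is already installed):
from nltk.stem import PorterStemmer

porter = PorterStemmer()
# Both groups of inflected forms collapse to a single stem.
print([porter.stem(w) for w in ['connections', 'connected', 'connects']])   # ['connect', 'connect', 'connect']
print([porter.stem(w) for w in ['trouble', 'troubled', 'troubles']])        # ['troubl', 'troubl', 'troubl']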
- Also Read – Learn Lemmatization in NLTK with Examples
History of Stemming
The first published stemmer was written by Julie Beth Lovins in 1968. Her paper was remarkable for its time and greatly influenced later work in this area. It refers to three earlier major attempts at stemming algorithms: one by Professor John W. Tukey of Princeton University, one developed at Harvard University by Michael Lesk under the direction of Professor Gerard Salton, and a third developed by James L. Dolby of R and D Consultants, Los Altos, California.
A later stemmer was written by Martin Porter and was published in the July 1980 issue of the journal Program. This stemmer was very widely used and became the de facto standard algorithm for English stemming. Dr. Porter received the Tony Kent Strix award in 2000 for his work on stemming and information retrieval.
Why is Stemming Important?
As we discussed above, the English language has many variations of a single word. Having these variations in a text corpus creates redundancy in the data when building NLP or machine learning models, and such models may not turn out to be effective.
To create a robust model, it’s vital to normalize text by eliminating redundancy in words and converting them to root form using stemming.
Applications of Stemming
Stemming is used in information retrieval, text mining, SEO, web search results, indexing, tagging systems, and vocabulary analysis. For example, searching for prediction and predicted shows similar results on Google.
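As a rough sketch of how this helps search, the snippet below uses the Porter stemmer to match the query term prediction against a document containing predicted; the query and document terms here are made-up illustrations, not a real retrieval system:
from nltk.stem import PorterStemmer

porter = PorterStemmer()
query_term = 'prediction'
document_terms = ['predicted', 'weather', 'models']

# 'prediction' and 'predicted' both reduce to the stem 'predict', so they match.
matches = [term for term in document_terms if porter.stem(term) == porter.stem(query_term)]
print(matches)   # ['predicted']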
Types of Stemmer in NLTK
There are many types of stemming algorithms, and several of them are available in the Python NLTK library. Let us see them below.
1. Porter Stemmer – PorterStemmer()
The Porter Stemmer, or Porter algorithm, was developed by Martin Porter in 1980. The algorithm employs five phases of word reduction, each with its own set of mapping rules. The Porter Stemmer is the oldest stemmer and is known for its simplicity and speed. The resulting stem is often a shorter word having the same root meaning.
In NLTK, the PorterStemmer() class implements the Porter stemming algorithm. Let’s explore it with the help of an example.
Example of PorterStemmer()
In the example below, we create an instance of PorterStemmer() and use it to stem a list of words with the Porter algorithm.
from nltk.stem import PorterStemmer
porter = PorterStemmer()
words = ['Connects','Connecting','Connections','Connected','Connection','Connectings','Connect']
for word in words:
    print(word,"--->",porter.stem(word))
Connects ---> connect
Connecting ---> connect
Connections ---> connect
Connected ---> connect
Connection ---> connect
Connectings ---> connect
Connect ---> connect
2. Snowball Stemmer – SnowballStemmer()
The Snowball Stemmer was also developed by Martin Porter. The algorithm it uses is more accurate and is known as the “English Stemmer” or “Porter2 Stemmer”. It offers a slight improvement over the original Porter Stemmer, both in logic and speed.
In NLTK, the SnowballStemmer() class implements the Snowball stemming algorithm. Let’s explore this type of stemming with the help of an example.
Example of SnowballStemmer()
In the example below, we first create an instance of SnowballStemmer() to stem the list of words using the Snowball algorithm.
from nltk.stem import SnowballStemmer
snowball = SnowballStemmer(language='english')
words = ['generous','generate','generously','generation']
for word in words:
    print(word,"--->",snowball.stem(word))
generous ---> generous
generate ---> generat
generously ---> generous
generation ---> generat
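One extra option worth knowing about: SnowballStemmer accepts an ignore_stopwords parameter that leaves stopwords untouched instead of stemming them. The sketch below assumes the NLTK stopwords corpus has already been downloaded with nltk.download('stopwords'):
from nltk.stem import SnowballStemmer

# By default, stopwords are stemmed like any other word.
print(SnowballStemmer(language='english').stem('having'))                            # have
# With ignore_stopwords=True, stopwords are returned unchanged.
print(SnowballStemmer(language='english', ignore_stopwords=True).stem('having'))     # having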
The Snowball stemmer supports the following languages:
print(SnowballStemmer.languages)
('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
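Because Snowball supports all of these languages, you can stem non-English text by simply passing a different language name. A small sketch with the French stemmer (the words are just illustrative French forms of the verb manger):
from nltk.stem import SnowballStemmer

french = SnowballStemmer(language='french')
words = ['manger', 'mangeons', 'mangera']
for word in words:
    print(word, "--->", french.stem(word))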
3. Lancaster Stemmer – LancasterStemmer()
The Lancaster Stemmer is simple, but it tends to over-stem. Over-stemming produces stems that are not linguistic, or that may have no meaning.
In NLTK, the LancasterStemmer() class implements the Lancaster stemming algorithm. Let’s understand this with the help of an example.
Example of LancasterStemmer()
In the example below, we first create an instance of LancasterStemmer() and then stem the list of words using the Lancaster algorithm.
from nltk.stem import LancasterStemmer
lancaster = LancasterStemmer()
words = ['eating','eats','eaten','puts','putting']
for word in words:
    print(word,"--->",lancaster.stem(word))
eating ---> eat
eats ---> eat
eaten ---> eat
puts ---> put
putting ---> put
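To see the aggressive behaviour described above, we can compare Lancaster with Porter on a couple of words that also appear in the comparison tables later in this tutorial; Lancaster reduces friendship all the way to friend, and easily to easy:
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Lancaster strips more of the word than Porter does.
for word in ['friendship', 'easily']:
    print(word, "Porter --->", porter.stem(word), "| Lancaster --->", lancaster.stem(word))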
4. Regexp Stemmer – RegexpStemmer()
The Regexp Stemmer uses regular expressions to identify morphological affixes. Any substrings that match the regular expressions are removed.
In NLTK, the RegexpStemmer() class implements regular-expression-based stemming. Let’s understand this with the help of an example.
Example of RegexpStemmer()
In this example, we first create an instance of RegexpStemmer() and then stem the list of words using the Regex stemming algorithm.
from nltk.stem import RegexpStemmer
regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)
words = ['mass','was','bee','computer','advisable']
for word in words:
    print(word,"--->",regexp.stem(word))
mass ---> mas
was ---> was
bee ---> bee
computer ---> computer
advisable ---> advis
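The min=4 argument is why was and bee were left untouched above: RegexpStemmer only applies the regular expression to words that are at least min characters long, and shorter words are returned unchanged. A quick sketch of the effect:
from nltk.stem import RegexpStemmer

with_min = RegexpStemmer('ing$|s$|e$|able$', min=4)
without_min = RegexpStemmer('ing$|s$|e$|able$', min=0)

# 'was' has only 3 characters, so with min=4 it is not stemmed at all.
print(with_min.stem('was'), "|", without_min.stem('was'))   # was | wa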
Porter vs Snowball vs Lancaster vs Regex Stemming in NLTK
Let us compare the results of the different types of stemming in NLTK with the help of the two examples below.
Example 1:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer(language='english')
regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)
word_list = ["friend", "friendship", "friends", "friendships"]
print("{0:20}{1:20}{2:20}{3:30}{4:40}".format("Word","Porter Stemmer","Snowball Stemmer","Lancaster Stemmer",'Regexp Stemmer'))
for word in word_list:
    print("{0:20}{1:20}{2:20}{3:30}{4:40}".format(word,porter.stem(word),snowball.stem(word),lancaster.stem(word),regexp.stem(word)))
Word                Porter Stemmer      Snowball Stemmer    Lancaster Stemmer             Regexp Stemmer
friend              friend              friend              friend                        friend
friendship          friendship          friendship          friend                        friendship
friends             friend              friend              friend                        friend
friendships         friendship          friendship          friend                        friendship
Example 2:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer(language='english')
regexp = RegexpStemmer('ing$|s$|e$|able$', min=4)
word_list = ['run','runs','running','runner','ran','easily','fairly']
print("{0:20}{1:20}{2:20}{3:30}{4:40}".format("Word","Porter Stemmer","Snowball Stemmer","Lancaster Stemmer",'Regexp Stemmer'))
for word in word_list:
    print("{0:20}{1:20}{2:20}{3:30}{4:40}".format(word,porter.stem(word),snowball.stem(word),lancaster.stem(word),regexp.stem(word)))
Word                Porter Stemmer      Snowball Stemmer    Lancaster Stemmer             Regexp Stemmer
run                 run                 run                 run                           run
runs                run                 run                 run                           run
running             run                 run                 run                           runn
runner              runner              runner              run                           runner
ran                 ran                 ran                 ran                           ran
easily              easili              easili              easy                          easily
fairly              fairli              fair                fair                          fairly
Stemming a Text File with NLTK
Till now we have shown you small examples of stemming individual words, but what if you have a text file and want to perform stemming on the entire file? Let us understand how to do it.
In the example below, we create a function called stemming that first tokenizes the text using word_tokenize and then stems each token down to its base form using SnowballStemmer.
Each stem is appended to a list, and at the end the items of the list are joined back together and returned.
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
def stemming(text):
    snowball = SnowballStemmer(language='english')
    # Stem each token and collect the results in a list.
    stems = []
    for token in word_tokenize(text):
        stems.append(snowball.stem(token))
    # Join the stemmed tokens back into a single string.
    return ' '.join(stems)

with open('text_file.txt') as f:
    text = f.read()

print(stemming(text))
data scienc is an interdisciplinari field that use scientif method , process , algorithm and system to extract knowledg and insight from structur and unstructur data , and appli knowledg and action insight from data across a broad rang of applic domain . data scienc is relat to data mine , machin learn and big data . data scienc is a `` concept to unifi statist , data analysi , informat , and their relat method '' in order to `` understand and analyz actual phenomena '' with data . it use techniqu and theori drawn from mani field within the context of mathemat , statist , comput scienc , inform scienc , and domain knowledg . ture award winner jim gray imagin data scienc as a `` fourth paradigm '' of scienc ( empir , theoret , comput , and now data-driven ) and assert that `` everyth about scienc is chang becaus of the impact of inform technolog '' and the data delug .
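If you also want to save the result, here is a minimal sketch that reuses the stemming function defined above and writes the stemmed text to a new file (the output filename stemmed_file.txt is just an illustration):
# Reuses the stemming() function defined above; file names are illustrative.
with open('text_file.txt') as f:
    text = f.read()

with open('stemmed_file.txt', 'w') as out:
    out.write(stemming(text))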
- Also Read – NLTK Tokenize – Complete Tutorial for Beginners
- Also Read – Complete Tutorial for NLTK Stopwords
- Also Read – Ultimate Guide to Sentiment Analysis in Python with NLTK Vader, TextBlob and Pattern
Conclusion
In this tutorial, we explained how to perform stemming with the Python NLTK library for your NLP project. We explored the different types of stemmers in NLTK along with examples of each. Then we did a comparative study of the results produced by the Porter, Snowball, Lancaster, and Regexp stemmers. In the end, we also showed you how to perform stemming on a text file using NLTK.
Reference – NLTK Documentation