Complete Tutorial for NLTK Stopwords

In this tutorial, we will learn what stopwords are in NLP and how to use the NLTK stopwords corpus to remove them when cleaning text. We will also show how to add our own custom stopwords, which is useful when a dataset overuses a term, such as a domain-specific acronym, that we want to filter out as well.

What are Stopwords in NLP?

Stopwords are very frequently occurring words such as “a”, “the”, “to”, and “for” that add little value in many NLP operations. For example, words like “a” and “the” appear very frequently in ordinary text, but they rarely need the careful part-of-speech analysis that nouns, verbs, and modifiers do. For this reason, stopwords are often simply removed from the text.

Why Remove Stopwords in NLP?

As discussed, stopwords occur in abundance and add little information to the text. Removing them reduces the size of the data set, which often improves the performance of the NLP model.

Training an NLP model on a large corpus takes time, so having fewer tokens to process after removing stopwords also shortens training time.

Removing stopwords can also make NLP models more effective. TF-IDF already gives less weight to words that occur in many documents, so removing stopwords beforehand leaves fewer terms for the TF-IDF step to process. In text classification, stopwords can dilute the meaningful content of the text and make the classifier less accurate.
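
To make the TF-IDF remark concrete, here is a minimal sketch of the textbook inverse document frequency, log(N / df), written in plain Python. The function name and the three toy documents are made up for illustration, and libraries such as scikit-learn use smoothed variants of this formula, so exact values will differ there.

```python
import math

def inverse_document_frequency(term, documents):
    """Textbook IDF: log(N / df), where df is the number of documents containing the term."""
    n_containing = sum(1 for doc in documents if term in doc.lower().split())
    if n_containing == 0:
        return 0.0  # assumption for this sketch: unseen terms get no weight
    return math.log(len(documents) / n_containing)

# Hypothetical toy corpus
documents = ["the cat sat", "the dog barked", "the bird sang"]

print(inverse_document_frequency("the", documents))  # 0.0, "the" is in every document
print(inverse_document_frequency("cat", documents))  # log(3), "cat" is in one document
```

A word that appears in every document gets a weight of zero under this formula, which is why TF-IDF already plays down stopwords. Removing them up front simply keeps them out of the vocabulary altogether.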

Stopwords in NLTK

NLTK ships with a built-in list of English stopwords (179 words in the version used here; the exact count depends on the installed version of the corpus). The list can be loaded with the stopwords.words() function from nltk.corpus and modified as needed. If the corpus is not installed yet, download it once with nltk.download('stopwords').

A very common use of stopwords.words() is in the text preprocessing phase of a pipeline, before actual NLP techniques such as text classification are applied.

Remove Stopwords from String with NLTK

In the examples below, we show how to remove stopwords from a string with NLTK. We first call stopwords.words('english') and store the returned list of English stopwords in a variable. Then we create an empty list to hold the words that are not stopwords.

Using a for loop over the text, split on whitespace, we check whether the lowercased word is in the stopword list; if it is not, we append it to the new list.

Finally, we join the remaining words with the “join()” function, which gives the final output with all stopwords from the NLTK list removed.

Example 1

In [1]:
from nltk.corpus import stopwords

text = "Spread love everywhere you go. Let no one ever come to you without leaving happier"
en_stopwords = stopwords.words('english')

lst=[]
for token in text.split():
    # Keep the word only if it is not present in the stopword list
    if token.lower() not in en_stopwords:
        lst.append(token)

# Join the remaining words back into a string
print(' '.join(lst))
[Out] :
Spread love everywhere go. Let one ever come without leaving happier
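
Because split() only breaks on whitespace, punctuation stays attached to the neighbouring word (the input sentence in Example 1 contains the token "go."), so a stopword followed by a comma or period would not match the list. One possible workaround, sketched below, is to strip surrounding punctuation before the lookup. The function name and the tiny stopword list are hypothetical stand-ins so that the snippet runs without the NLTK corpus; in practice you would pass stopwords.words('english'). A tokenizer such as nltk.word_tokenize is another option.

```python
import string

def remove_stopwords_ignoring_punct(text, stop_words):
    """Drop tokens whose punctuation-stripped, lowercased form is a stopword."""
    stop_set = set(stop_words)
    kept = []
    for token in text.split():
        # Strip leading/trailing punctuation only; inner apostrophes ("you're") stay
        core = token.strip(string.punctuation).lower()
        # Tokens made only of punctuation have an empty core and are dropped too
        if core and core not in stop_set:
            kept.append(token)
    return " ".join(kept)

# Hypothetical stand-in for stopwords.words('english')
sample_stopwords = ["you", "to", "no"]

print(remove_stopwords_ignoring_punct("Wherever you go, take love with you.", sample_stopwords))
# Wherever go, take love with
```

Here the final "you." is removed even though it carries a trailing period, while the kept tokens retain their original punctuation.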

Example 2

In [2]:
from nltk.corpus import stopwords

text = "Life is what happens when you're busy making other plans"
en_stopwords = stopwords.words('english')

lst=[]
for token in text.split():
    if token.lower() not in en_stopwords:
        lst.append(token)
        
print(' '.join(lst))
[Out] :
Life happens busy making plans
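
The loop used in the two examples can be wrapped in a small reusable function. The sketch below is one way to do it, not the only one: it converts the stopwords to a set, since membership tests on a set are faster than on a list, and uses a generator expression instead of an explicit loop. The function name and the short stopword list are assumptions chosen so the snippet runs on its own; with NLTK installed you would pass stopwords.words('english') instead.

```python
def remove_stopwords(text, stop_words):
    """Return text with every whitespace-separated stopword removed."""
    stop_set = {w.lower() for w in stop_words}  # set gives fast membership tests
    return " ".join(t for t in text.split() if t.lower() not in stop_set)

# Hypothetical stand-in for stopwords.words('english')
sample_stopwords = ["is", "what", "when", "you're", "other"]

print(remove_stopwords("Life is what happens when you're busy making other plans", sample_stopwords))
# Life happens busy making plans
```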

Adding Stop Words to Default NLTK Stopwords List

The default list contains 179 English stopwords, but we can add our own. To add words to the NLTK stopwords list, we first store the list returned by stopwords.words('english') in a variable. Next, we use the list's extend() method to append our own words to it.

Example

The following script adds a list of words to the NLTK stopword list. Initially, stopwords.words('english') contains 179 words; after adding 3 more, the length of the list becomes 182. Note that extend() does not check for duplicates: "you're" is already in the default list, so it appears twice after the call.

In [3]:
from nltk.corpus import stopwords

en_stopwords = stopwords.words('english')
print(len(en_stopwords))
new_stopwords = ["you're","i'll","we'll"]
en_stopwords.extend(new_stopwords)
print(len(en_stopwords))
[Out] :
179
182
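
Since extend() appends blindly, a word that is already a stopword ends up in the list twice. If that matters, one option is to merge the two collections as sets, as in the sketch below. The function name and the three-word base list are hypothetical; the base list only stands in for stopwords.words('english') so that the snippet runs without the corpus.

```python
def add_stopwords(base_stopwords, extra_stopwords):
    """Return a set containing the base stopwords plus the lowercased extras."""
    return set(base_stopwords) | {w.lower() for w in extra_stopwords}

# Hypothetical stand-in for stopwords.words('english')
base = ["you", "you're", "the"]

custom = add_stopwords(base, ["you're", "i'll", "we'll"])
print(len(custom))  # 5, because "you're" is counted only once
```

The same function also covers the acronym case from the introduction: pass the overused acronym as an extra word and it is matched case-insensitively, provided tokens are lowercased before the lookup as in the examples above.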

NLTK Stopwords for other Languages

Besides English, NLTK provides stopword lists for several other languages. The list of supported languages can be obtained as shown below.

In [5]:
from nltk.corpus import stopwords
print(stopwords.fileids())
[Out] :
['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']

Filtering stopwords from the text file

In the code below, we remove stopwords using the same process as above. The only difference is that the text is read from a file using Python's “with open()” construct.

In [6]:
from nltk.corpus import stopwords 

en_stopwords = stopwords.words('english') 
with open("text_file.txt") as f:
    text=f.read()
    
lst=[]
for token in text.split():
    if token.lower() not in en_stopwords:
        lst.append(token)

print('Original Text')        
print(text,'\n\n')

print('Text after removing stop words')
print(' '.join(lst)) 
[Out] :
Original Text
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data. Data science is a "concept to unify statistics, data analysis, informatics, and their related methods" in order to "understand and analyze actual phenomena" with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge. Turing Award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational, and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge. 


Text after removing stop words
Data science interdisciplinary field uses scientific methods, processes, algorithms systems extract knowledge insights structured unstructured data, apply knowledge actionable insights data across broad range application domains. Data science related data mining, machine learning big data. Data science "concept unify statistics, data analysis, informatics, related methods" order "understand analyze actual phenomena" data. uses techniques theories drawn many fields within context mathematics, statistics, computer science, information science, domain knowledge. Turing Award winner Jim Gray imagined data science "fourth paradigm" science (empirical, theoretical, computational, data-driven) asserted "everything science changing impact information technology" data deluge.
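
For larger files it can be preferable to process the text line by line instead of reading everything at once with f.read(). The sketch below shows one way to do that while preserving line breaks. The function name, the temporary file, and the short stopword list are all made up for illustration; with NLTK installed you would pass stopwords.words('english') and the path to your own file.

```python
import os
import tempfile

def filter_file(path, stop_words, encoding="utf-8"):
    """Remove stopwords from a text file line by line, keeping the line structure."""
    stop_set = set(stop_words)
    cleaned_lines = []
    with open(path, encoding=encoding) as f:
        for line in f:
            kept = [t for t in line.split() if t.lower() not in stop_set]
            cleaned_lines.append(" ".join(kept))
    return "\n".join(cleaned_lines)

# Hypothetical stand-in for stopwords.words('english')
sample_stopwords = ["is", "an", "to", "and"]

# Write a small temporary file so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Data science is an interdisciplinary field\nIt is related to data mining and big data\n")
    tmp_path = tmp.name

print(filter_file(tmp_path, sample_stopwords))
os.remove(tmp_path)
```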

 

Conclusion

This brings us to the end of the tutorial, in which we learned what stopwords are in NLP and how to use them with NLTK. We worked through examples of removing NLTK stopwords from sample strings and text files, and also explained how to add custom stopwords to the default NLTK stopwords list.

Reference – NLTK Documentation

 
