In this tutorial, we will learn what stopwords are in NLP and how to clean text with the help of the NLTK stopwords list. We will also show how to add our own custom stopwords, which is useful when we are dealing with a unique dataset where, for example, an acronym appears so often that it adds no value and is better treated as a stopword.
What are Stopwords in NLP?
Stopwords are the most frequently occurring words like “a”, “the”, “to”, “for”, etc. that add little value to most NLP operations. For example, words like “a” and “the” appear very frequently in ordinary text but carry far less information than nouns, verbs, and modifiers, so they do not need the same attention during steps like part-of-speech tagging. Hence these stopwords can simply be removed from the text.
Why Remove Stopwords in NLP?
As we discussed, stopwords are words that occur in abundance and do not add any valuable information to the text. Removing them reduces the size of the dataset, which in turn improves the performance of the NLP model.
Training an NLP model on a big corpus takes time, so having fewer tokens to train on after removing stopwords also speeds up training.
Removing stopwords also increases the efficiency of NLP models. TF-IDF already gives less importance to frequently occurring words, so removing stopwords beforehand makes the TF-IDF step more efficient as well. And if we are doing text classification, the presence of stopwords can dilute the meaning of the text and make the classification model less effective.
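To make the TF-IDF point concrete, here is a minimal sketch (it assumes scikit-learn's TfidfVectorizer and a tiny made-up corpus, neither of which is part of this NLTK tutorial): a word like “the” that appears in every document receives the lowest IDF weight.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird flew over the house",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(docs)

# Terms that occur in every document (like "the") get the lowest IDF score,
# i.e. TF-IDF already down-weights them
for term, idf in sorted(zip(vectorizer.get_feature_names_out(), vectorizer.idf_), key=lambda x: x[1]):
    print(f"{term:>8}: {idf:.3f}")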
Stopwords in NLTK
NLTK holds a built-in list of around 179 English stopwords. The default list can be loaded by using the stopwords.words() function of NLTK, and it can be modified as per our needs.
A very common usage of stopwords.words() is in the text preprocessing phase or pipeline, before applying actual NLP techniques like text classification.
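As a quick check (the nltk.download call is included in case the stopwords corpus has not been downloaded yet), we can load the default English list and inspect its size and first few entries:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # fetches the corpus on first run; does nothing afterwards

en_stopwords = stopwords.words('english')
print(len(en_stopwords))   # around 179 entries
print(en_stopwords[:10])   # e.g. ['i', 'me', 'my', 'myself', 'we', ...]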

Remove Stopwords from Text with NLTK
In the examples below, we show how to remove stopwords from a string with NLTK. We first call stopwords.words('english') and store the resulting list of English stopwords in a variable. Then we create an empty list to collect the words that are not stopwords.
Using a for loop that iterates over the text (split on whitespace), we check whether each word is present in the stopword list; if it is not, we append it to the list.
At last, we join the remaining words using the join() function, giving us the final output where all stopwords are removed from the string using the NLTK stopwords list.
Example 1
from nltk.corpus import stopwords

text = "Spread love everywhere you go. Let no one ever come to you without leaving happier"
en_stopwords = stopwords.words('english')

lst = []
for token in text.split():
    if token.lower() not in en_stopwords:  # keep the word only if it is not in the stopword list
        lst.append(token)

# Join items in the list
print(' '.join(lst))
Example 2
from nltk.corpus import stopwords

text = "Life is what happens when you're busy making other plans"
en_stopwords = stopwords.words('english')

lst = []
for token in text.split():
    if token.lower() not in en_stopwords:
        lst.append(token)

print(' '.join(lst))
Adding Stop Words to Default NLTK Stopwords List
There are 179 English stopwords by default; however, we can add our own words to the list. Since stopwords.words('english') already returns a plain Python list, we simply call the list's extend() method with our own list of words to add them to the default stopwords list.
Example
The following script adds a list of words to the NLTK stopword collection. Initially, the list returned by stopwords.words('english') has 179 entries; after adding 3 more words, its length becomes 182.
from nltk.corpus import stopwords

en_stopwords = stopwords.words('english')
print(len(en_stopwords))   # 179

new_stopwords = ["you're", "i'll", "we'll"]
en_stopwords.extend(new_stopwords)
print(len(en_stopwords))   # 182
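As a small follow-up sketch (the sample sentence is made up for illustration), the extended list now removes the newly added contractions along with the default stopwords:
text = "We'll see what I'll do next"
print(' '.join(t for t in text.split() if t.lower() not in en_stopwords))
# Output: see next  ("we'll" and "i'll" are filtered thanks to the extended list)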
NLTK Stopwords for other Languages
Other than English, NLTK provides stopword lists for a number of languages. The list of supported languages can be printed as follows:
from nltk.corpus import stopwords
print(stopwords.fileids())
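The same filtering loop works with any of these lists. As a sketch (the German sentence below is just an illustrative example), here is the procedure with the German stopword list:
from nltk.corpus import stopwords

de_stopwords = stopwords.words('german')

text = "Das ist ein kleiner Beispielsatz mit einigen Stoppwörtern"
print(' '.join(t for t in text.split() if t.lower() not in de_stopwords))
# Output: kleiner Beispielsatz Stoppwörtern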
Removing Stopwords from a Text File
In the code below we remove stopwords using the same process as discussed above; the only difference is that the text is read from a file using Python's with open() file handling.
from nltk.corpus import stopwords

en_stopwords = stopwords.words('english')

with open("text_file.txt") as f:
    text = f.read()

lst = []
for token in text.split():
    if token.lower() not in en_stopwords:
        lst.append(token)

print('Original Text')
print(text, '\n\n')

print('Text after removing stop words')
print(' '.join(lst))
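As an optional follow-up (the output file name here is just an example), the cleaned text can be written back to a new file:
# Write the cleaned text to a new file ("clean_text_file.txt" is a hypothetical name)
with open("clean_text_file.txt", "w") as f:
    f.write(' '.join(lst))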
Conclusion
We have reached the end of this tutorial, where we learned what stopwords are in NLP and how to use them in NLTK. We showed examples of removing NLTK stopwords from sample text and from a text file, and also explained how to add custom stopwords to the default NLTK stopwords list.