Tutorial for Stopwords in Spacy Library

Introduction

In this tutorial, we will learn about stopwords in the Spacy library and how to use them in your NLP projects. Before going into implementation and examples, we will look at what stopwords are and why removing them is usually recommended. We will also show how to add your own custom stopwords to the default Spacy list, or remove some of the default ones.

What are Stopwords?

English has many words, such as “I”, “the” and “you”, that appear very frequently in text but add little information for NLP operations and modeling. These words are called stopwords, and removing them is a common text preprocessing step.

Removing stopwords reduces the size of the text corpus, which can speed up processing and let the NLP model focus on the more informative words. However, removal can backfire when it changes the meaning of a sentence. For example, “This is not a good way to talk” is a negative sentence, but after stopword removal it becomes “good way talk”, which reads as positive.
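One way to limit this effect is to keep negation words while filtering. The sketch below is only an illustration and not Spacy behavior: SAMPLE_STOPWORDS is a small hand-picked set standing in for the real list, and remove_stopwords with its keep parameter are hypothetical names.

```python
# Small illustrative stopword set (an assumption; the real Spacy list is much larger).
SAMPLE_STOPWORDS = {"this", "is", "not", "a", "to"}


def remove_stopwords(text, stopwords, keep=frozenset()):
    """Drop whitespace-separated words found in `stopwords`, except those in `keep`."""
    kept_words = []
    for word in text.split():
        lowered = word.lower()
        if lowered in keep or lowered not in stopwords:
            kept_words.append(word)
    return " ".join(kept_words)


sentence = "This is not a good way to talk"

# Plain removal drops the negation.
print(remove_stopwords(sentence, SAMPLE_STOPWORDS))
# Keeping "not" preserves the negative meaning.
print(remove_stopwords(sentence, SAMPLE_STOPWORDS, keep={"not"}))
```

Whether negations should be kept depends on the task; for sentiment analysis they usually matter, while for topic-style keyword extraction they may not.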

Stopwords in Spacy Library

i) Stopwords List in Spacy

The Spacy library ships with a default list of English stopwords; in the version used for this tutorial it contains 326 entries. The code below displays the list of stopwords in Spacy.

In [1]:
import spacy
#loading the english language small model of spacy
en = spacy.load('en_core_web_sm')
stopwords = en.Defaults.stop_words

print(len(stopwords))
print(stopwords)
[Out] :
326
{‘really’, ‘sometimes’, ‘go’, ‘since’, ‘whither’, ‘they’, ‘its’, ‘them’, ‘well’, ‘meanwhile’, ‘seems’, ‘and’, ‘latterly’, ‘regarding’, ‘somehow’, ‘sixty’, ‘whole’, ‘anyway’, ‘else’, ‘few’, ‘’m’, ‘beside’, ‘to’, ‘namely’, ‘someone’, ‘see’, ‘moreover’, ‘wherein’, ‘for’, ‘former’, ‘bottom’, ‘it’, ‘next’, ‘six’, ‘along’, ‘once’, ‘might’, ‘whenever’, ‘below’, ‘another’, ‘yourself’, ‘each’, ‘just’, ‘ourselves’, ‘everyone’, ‘any’, ‘across’, ‘get’, ‘that’, ‘eight’, ‘we’, ‘which’, ‘therefore’, ‘may’, “‘s”, ‘keep’, ‘among’, ‘give’, ‘such’, ‘are’, ‘indeed’, ‘everywhere’, ‘same’, ‘herself’, ‘yourselves’, ‘alone’, ‘were’, ‘was’, ‘take’, ‘seem’, ‘say’, ‘why’, ‘show’, ‘between’, ‘during’, ‘elsewhere’, ‘or’, ‘though’, ‘forty’, ‘made’, ‘used’, ‘others’, ‘whereafter’, ‘formerly’, ‘several’, ‘via’, ‘does’, ‘please’, ‘three’, ‘also’, ‘fifty’, ‘afterwards’, ‘’s’, ‘noone’, ‘do’, ‘perhaps’, ‘further’, ‘i’, ‘beforehand’, ‘myself’, ’empty’, ‘‘ll’, ‘yet’, ‘thereby’, ‘been’, ‘both’, ‘never’, ‘put’, ‘without’, ‘him’, ‘a’, ‘nothing’, ‘thereafter’, ‘make’, ‘then’, ‘whom’, ‘must’, ‘sometime’, ‘against’, ‘through’, ‘being’, ‘four’, ‘back’, ‘become’, ‘our’, ‘himself’, ‘because’, ‘anything’, ‘’re’, ‘nor’, ‘therein’, ‘due’, ‘until’, ‘own’, ‘ca’, ‘most’, ‘now’, ‘while’, ‘of’, ‘only’, ‘am’, ‘itself’, ‘too’, ‘‘m’, ‘nobody’, ‘if’, ‘one’, ‘whereas’, ‘twelve’, ‘together’, ‘can’, ‘who’, ‘even’, ‘be’, ‘she’, ‘besides’, ‘herein’, ‘off’, ‘‘d’, ‘last’, ‘no’, ‘whereupon’, ‘the’, “‘m”, ‘thru’, ‘out’, ‘hereupon’, ‘by’, ‘us’, ‘already’, ‘became’, ‘here’, ‘hers’, ‘onto’, ‘beyond’, ‘down’, ‘enough’, ‘did’, ‘some’, ‘over’, ‘serious’, ‘quite’, ‘move’, ‘around’, ‘nowhere’, ‘amongst’, ‘but’, ‘so’, ‘wherever’, ‘twenty’, ‘often’, ‘part’, ‘again’, ‘where’, ‘re’, ‘within’, ‘at’, “n’t”, ‘yours’, ‘front’, ‘unless’, ‘could’, ‘anyone’, ‘third’, ‘whatever’, ‘doing’, “‘d”, ‘nevertheless’, ‘before’, ‘rather’, ‘fifteen’, ‘her’, ‘me’, ‘thereupon’, ‘mostly’, ‘throughout’, ‘hence’, “‘re”, ‘mine’, ‘ten’, ‘hundred’, ‘nine’, ‘call’, 
‘when’, ‘about’, ‘will’, ‘whereby’, ‘this’, ‘upon’, ‘you’, ‘should’, ‘always’, ‘themselves’, ‘not’, ‘has’, ‘behind’, ‘on’, ‘anywhere’, ‘side’, ‘their’, ‘hereby’, ‘latter’, ‘after’, ‘‘ve’, ‘none’, ‘these’, ‘name’, ‘n’t’, ‘every’, ‘although’, ‘‘s’, ‘however’, ‘he’, ‘becoming’, ‘how’, ‘whose’, ‘still’, ‘hereafter’, ‘whether’, ‘towards’, ‘more’, ‘everything’, ‘whoever’, ‘seemed’, ‘cannot’, ‘up’, ‘otherwise’, ‘in’, ‘would’, ‘under’, ‘done’, ‘thence’, ‘whence’, ‘seeming’, ‘either’, ‘other’, ‘with’, ‘into’, ‘amount’, ‘five’, ‘much’, ‘‘re’, ‘except’, ‘his’, ‘thus’, “‘ll”, ‘what’, ‘almost’, ‘becomes’, ‘least’, ‘ever’, ‘above’, ‘is’, ‘first’, ‘there’, ‘somewhere’, ‘top’, ‘’ve’, “‘ve”, ‘than’, ‘n‘t’, ‘have’, ‘toward’, ‘per’, ‘all’, ‘ours’, ‘full’, ‘’d’, ‘anyhow’, ‘as’, ‘’ll’, ‘many’, ‘various’, ‘your’, ‘had’, ‘eleven’, ‘from’, ‘something’, ‘less’, ‘those’, ‘using’, ‘an’, ‘two’, ‘my’, ‘very’, ‘neither’}

ii) Checking for Stopwords

We can check whether a word is a stopword by using the is_stop attribute of a Spacy token.

In the example below, we check each token with is_stop. In the output, stopwords are reported as True and all other tokens as False.

In [2]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tomorrow will be too late, its now or never")

for token in doc:
    print(token.text, token.is_stop)
[Out] :
Tomorrow False
will True
be True
too True
late False
, False
its True
now True
or True
never True

iii) Remove Stopwords using Spacy

In the example below, we import the spacy library, load the English language model, and store the set of stopwords in a variable. Then we create an empty list to hold the words that are not stopwords.

Using a for loop over the text (split on whitespace), we check whether each word is present in the stopword list; if it is not, we append it to the list.

Finally, we join the remaining words using the “join()” function, which gives the final output with all stopwords removed from the string.

In [3]:
import spacy
#loading the english language small model of spacy
en = spacy.load('en_core_web_sm')
stopwords = en.Defaults.stop_words

text = " we will show how to remove stopwords using spacy library"

lst = []
for token in text.split():
    # keep the word only if it is not present in the stopword list
    if token.lower() not in stopwords:
        lst.append(token)

#Join items in the list
print("Original text  : ",text)
print("Text after removing stopwords  :   ",' '.join(lst))
[Out] :
Original text  :   we will show how to remove stopwords using spacy library
Text after removing stopwords  :    remove stopwords spacy library
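
Splitting on whitespace leaves punctuation attached to words, so a word such as “late,” would never match a stopword entry. A possible alternative is to let Spacy tokenize the text and filter on the is_stop and is_punct token attributes. The sketch below makes two assumptions: it uses spacy.blank("en") (a tokenizer-only English pipeline) so that no model download is needed, and filter_tokens is a hypothetical helper name.

```python
import spacy

# A blank English pipeline contains only the tokenizer, which is enough for
# the lexical attributes is_stop and is_punct used here.
nlp = spacy.blank("en")


def filter_tokens(text):
    """Return the token texts that are neither stopwords nor punctuation."""
    doc = nlp(text)
    return [token.text for token in doc if not token.is_stop and not token.is_punct]


print(filter_tokens("Tomorrow will be too late, its now or never"))
```

The same function should work unchanged with a pipeline loaded through spacy.load('en_core_web_sm'); the blank pipeline is used here only to keep the sketch lightweight.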

iv) Adding Stopwords to Default Spacy List

By default, Spacy has 326 English stopwords, but at times you may want to add your own custom stopwords to the default list. The example below shows how.

To add a custom stopword in Spacy, we first load its English language model and then call the add() method on the stopword set.

This code shows how to add a single stopword:

In [4]:
import spacy    

nlp = spacy.load("en_core_web_sm")
nlp.Defaults.stop_words.add("my_new_stopword")

To add several stopwords at once:

In [5]:
import spacy

nlp = spacy.load("en_core_web_sm")
# use lowercase entries: the removal loop in this tutorial compares token.lower() with the set
nlp.Defaults.stop_words |= {"afham", "farden"}
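
The |= operator above modifies the default set in place. If you would rather leave the defaults untouched, one option is to build a separate set and filter against that instead. The sketch below is a minimal illustration under stated assumptions: base_stopwords is a tiny stand-in for the Spacy set, and with_extra_stopwords is a hypothetical helper. It also lowercases the new entries, because the removal loop in this tutorial compares token.lower() with the set.

```python
def with_extra_stopwords(base_stopwords, extra_words):
    """Return a new set: the base stopwords plus lowercased, stripped extras."""
    normalized = {word.strip().lower() for word in extra_words if word.strip()}
    return set(base_stopwords) | normalized


# Stand-in for en.Defaults.stop_words (assumption: any set of strings works).
base_stopwords = {"the", "a", "is"}

custom_stopwords = with_extra_stopwords(base_stopwords, ["Afham", " Farden "])

print(sorted(custom_stopwords))
print(sorted(base_stopwords))  # the original set is unchanged
```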

v) Remove Stopwords from Default Spacy List

In some scenarios you may want to preserve certain stopwords in your text. In that case, you can remove them from the Spacy default list with the remove() method, as shown in the examples below.

To remove a single stopword:

In [6]:
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.Defaults.stop_words.remove("what")

To remove several stopwords at once:

In [7]:
import spacy    

nlp = spacy.load("en_core_web_sm")
nlp.Defaults.stop_words -= {"who", "when"}
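
Note that the set method remove() raises a KeyError when the word is not in the set, while discard() silently does nothing. The sketch below shows a tolerant variant that works on a copy; base_stopwords is a small stand-in for the Spacy set and without_stopwords is a hypothetical helper name.

```python
def without_stopwords(base_stopwords, words_to_keep):
    """Return a copy of the stopword set with the given words taken out."""
    trimmed = set(base_stopwords)
    for word in words_to_keep:
        # discard() is a no-op for missing words; remove() would raise KeyError.
        trimmed.discard(word.lower())
    return trimmed


# Stand-in for en.Defaults.stop_words (assumption: any set of strings works).
base_stopwords = {"who", "when", "what", "the"}

# "whichever" is not in the set, and that is fine with discard().
trimmed = without_stopwords(base_stopwords, ["who", "when", "whichever"])
print(sorted(trimmed))
```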

vi) Filtering Stopwords from Text File

The code below removes the stopwords from an entire text file using Spacy, as explained in the sections above. The only difference is that the text is read from a file using the Python file operation “with open()”.

In [8]:
import spacy

en = spacy.load('en_core_web_sm')
stopwords = en.Defaults.stop_words

with open("text_file.txt") as f:
    text=f.read()
    
lst=[]
for token in text.split():
    if token.lower() not in stopwords:
        lst.append(token)

print('Original Text')        
print(text,'\n\n')

print('Text after removing stop words')
print(' '.join(lst))
[Out] :
Original Text
Harry Potter is a series of seven fantasy novels written by British author, J. K. Rowling. The novels chronicle the lives of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley, all of whom are students at Hogwarts School of Witchcraft and Wizardry.

Text after removing stop words
Harry Potter series seven fantasy novels written British author, J. K. Rowling. novels chronicle lives young wizard, Harry Potter, friends Hermione Granger Ron Weasley, students Hogwarts School Witchcraft Wizardry.
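
For larger files, reading everything at once may not be practical. The sketch below processes the file line by line and strips surrounding punctuation before the lookup, so that a word like “author,” is compared as “author”. It is an illustration under stated assumptions: SAMPLE_STOPWORDS is a small stand-in for the Spacy set, filter_file is a hypothetical helper, and the demo writes its own temporary text_file.txt so that it runs on its own.

```python
import string
import tempfile
from pathlib import Path

# Small illustrative stopword set (an assumption; pass the Spacy set in practice).
SAMPLE_STOPWORDS = {"is", "a", "of", "by", "the", "and", "his"}


def filter_file(path, stopwords, encoding="utf-8"):
    """Yield each line of the file with stopwords removed."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            kept = [
                word
                for word in line.split()
                # Strip surrounding punctuation only for the lookup; the
                # original word is what gets kept.
                if word.strip(string.punctuation).lower() not in stopwords
            ]
            yield " ".join(kept)


# Demo with a temporary file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "text_file.txt"
    sample_path.write_text(
        "Harry Potter is a series of seven fantasy novels.\n"
        "The novels chronicle the lives of a young wizard.\n",
        encoding="utf-8",
    )
    filtered_lines = list(filter_file(sample_path, SAMPLE_STOPWORDS))

print(filtered_lines)
```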

 

Conclusion

In this tutorial, we saw various examples of removing stopwords from text and from a file using Spacy. We also explained how to add your own custom stopwords or delete some default stopwords in Spacy.

Reference – Spacy Documentation

 

  • Afham Fardeen

    This is Afham Fardeen, who loves the field of Machine Learning and enjoys reading and writing on it. The idea of enabling a machine to learn strikes me.
