Introduction
In this tutorial, we will learn about stopwords in the spaCy library and how to use them in your NLP projects. Before going into the implementation and examples, we will look at what stopwords are and why removing them is commonly recommended. We will also show how to add your own custom stopwords to spaCy or remove some of its default ones.
What are Stopwords?
English has many words such as “I”, “the” and “you” that appear very frequently in text but add little useful information for NLP operations and modeling. These words are called stopwords, and removing them is a common text preprocessing step.
Removing stopwords reduces the size of the text corpus, which can speed up processing and may improve the performance and robustness of an NLP model. However, removing stopwords can have an adverse effect when it changes the meaning of a sentence. Consider the negative sentence “This is not a good way to talk”. After removing stopwords, it reads as a positive one: “good way talk”.
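The effect described above can be reproduced with a minimal pure-Python sketch. The stopword set here is a small hand-picked subset chosen only for illustration (it is not spaCy's list), and the names remove_stopwords, STOPWORDS and NEGATIONS are hypothetical.

```python
# Hypothetical, hand-picked subset of stopwords used only for this illustration.
STOPWORDS = {"this", "is", "not", "a", "to"}
# Hypothetical set of negation words we may want to keep.
NEGATIONS = {"not", "no", "never"}

def remove_stopwords(text, stopwords):
    """Drop every whitespace-separated word whose lowercase form is a stopword."""
    return " ".join(word for word in text.split() if word.lower() not in stopwords)

sentence = "This is not a good way to talk"

# Removing every stopword also removes the negation.
naive = remove_stopwords(sentence, STOPWORDS)

# Excluding negation words from the stopword set keeps the meaning.
kept_negation = remove_stopwords(sentence, STOPWORDS - NEGATIONS)

print(naive)          # good way talk
print(kept_negation)  # not good way talk
```

For tasks such as sentiment analysis, keeping negation words like this is one possible way to avoid flipping the meaning of a sentence.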
Stopwords in spaCy Library
i) Stopwords List in spaCy
The spaCy library has a default list of 326 English stopwords (the exact count may vary between spaCy versions). The code below displays the number of stopwords and the list itself.
import spacy
#loading the english language small model of spacy
en = spacy.load('en_core_web_sm')
stopwords = en.Defaults.stop_words
print(len(stopwords))
print(stopwords)
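If you only need the list itself, a lighter alternative is to import it directly. This is a sketch that assumes spaCy is installed; to our knowledge, STOP_WORDS in spacy.lang.en.stop_words is the same set that en.Defaults.stop_words refers to, and importing it does not require loading a trained model.

```python
# Assumes spaCy is installed; no trained pipeline is loaded here.
from spacy.lang.en.stop_words import STOP_WORDS

# STOP_WORDS is a plain Python set of strings.
print(type(STOP_WORDS).__name__)
print(len(STOP_WORDS))
print(sorted(STOP_WORDS)[:10])  # a few entries in sorted order

# Membership tests work like on any other set.
is_the_a_stopword = "the" in STOP_WORDS
is_python_a_stopword = "python" in STOP_WORDS
print(is_the_a_stopword, is_python_a_stopword)
```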
ii) Checking for Stopwords
We can check whether a token is a stopword by using its is_stop attribute in spaCy.
In the example below, we check each token with is_stop. In the output, stopwords are shown as True and all other tokens as False.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Tomorrow will be too late, its now or never")
for token in doc:
    #print each token together with its stopword flag
    print(token.text,token.is_stop)
Tomorrow False
will True
be True
too True
late False
, False
its True
now True
or True
never True
iii) Remove Stopwords using spaCy
In the example below, we import the spaCy library, load its English language model, and store the list of stopwords in a variable. Then we create an empty list to hold the words that are not stopwords.
Using a for loop that iterates over the text (split on whitespace), we check whether each lowercased word is present in the stopword list; if it is not, we append it to the list.
Finally, we join the remaining words using the “join()” function, which gives us a final output where all stopwords are removed from the string.
import spacy
#loading the english language small model of spacy
en = spacy.load('en_core_web_sm')
stopwords = en.Defaults.stop_words
text = "we will show how to remove stopwords using spacy library"
lst=[]
for token in text.split():
    #keep the word only if it is not present in the stopword list
    if token.lower() not in stopwords:
        lst.append(token)
#Join items in the list
print("Original text : ",text)
print("Text after removing stopwords : ",' '.join(lst))
Original text :  we will show how to remove stopwords using spacy library
Text after removing stopwords :  remove stopwords spacy library
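Splitting on whitespace leaves punctuation attached to words, so a stopword followed by a comma or period is not recognized. One possible alternative, sketched below, is to let spaCy tokenize the text and filter on the is_stop and is_punct token attributes. As an assumption for this sketch, we use spacy.blank("en") (tokenizer plus default stopwords) instead of en_core_web_sm, so no model download is needed.

```python
import spacy

# Assumption: a blank English pipeline is enough for tokenizing and
# stopword checks, so en_core_web_sm is not loaded here.
nlp = spacy.blank("en")
stopwords = nlp.Defaults.stop_words

text = "Tomorrow will be too late, its now or never."

# Whitespace splitting keeps punctuation glued to the words,
# so "never." is not matched against the stopword "never".
split_result = [word for word in text.split() if word.lower() not in stopwords]

# spaCy's tokenizer separates punctuation into its own tokens,
# which lets us drop both stopwords and punctuation.
token_result = [token.text for token in nlp(text)
                if not token.is_stop and not token.is_punct]

print(split_result)  # ['Tomorrow', 'late,', 'never.']
print(token_result)  # ['Tomorrow', 'late']
```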
iv) Adding Stopwords to Default spaCy List
By default, spaCy has 326 English stopwords, but at times you may want to add your own custom stopwords to the default list, as shown in the examples below.
To add a custom stopword in spaCy, we first load its English language model and then use the add() method of the stopword set.
This code shows how to add a single stopword:
import spacy
nlp = spacy.load("en_core_web_sm")
nlp.Defaults.stop_words.add("my_new_stopword")
To add several stopwords at once:
import spacy
nlp = spacy.load("en_core_web_sm")
#use lowercase entries, since the examples in this tutorial lowercase each word before the lookup
nlp.Defaults.stop_words |= {"afham","farden"}
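One caveat: depending on the spaCy version and on whether a word is already stored in the vocabulary, token.is_stop may not always reflect words added to the set. A commonly used way to be explicit, sketched below, is to also set the is_stop flag on the vocabulary entry. The custom stopwords here are hypothetical, and spacy.blank("en") is used instead of en_core_web_sm as an assumption so that no model download is needed.

```python
import spacy

# Assumption: a blank English pipeline instead of en_core_web_sm.
nlp = spacy.blank("en")

# Hypothetical custom stopwords, written in lowercase.
custom_stopwords = {"hello", "thanks"}
nlp.Defaults.stop_words |= custom_stopwords

# Also set the flag on each vocabulary entry so that token.is_stop agrees.
for word in custom_stopwords:
    nlp.vocab[word].is_stop = True

doc = nlp("hello this library is great thanks")
flags = {token.text: token.is_stop for token in doc}
print(flags)
```

With the flag set on the vocabulary entries, "hello" and "thanks" are reported as stopwords alongside the default ones such as "this" and "is".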
v) Remove Stopwords from Default spaCy List
There may be scenarios where you want to preserve some stopwords in your text. In that case, you can remove those stopwords from spaCy's default list with the remove() method, as shown in the examples below.
To remove a single stopword:
import spacy
nlp = spacy.load("en_core_web_sm")
nlp.Defaults.stop_words.remove("what")
To remove several stopwords at once:
import spacy
nlp = spacy.load("en_core_web_sm")
nlp.Defaults.stop_words -= {"who", "when"}
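Note that remove() raises a KeyError if the word is not in the set, while discard() silently does nothing. The sketch below uses discard() to keep negation words, then clears the is_stop flag on the vocabulary entries so that token.is_stop agrees with the set. The choice of negation words is a hypothetical example, and spacy.blank("en") is an assumption made to avoid the model download.

```python
import spacy

# Assumption: a blank English pipeline instead of en_core_web_sm.
nlp = spacy.blank("en")

# Hypothetical choice of negation words to keep in the text.
negations = {"not", "no", "never"}

for word in negations:
    # discard() does nothing if the word is absent; remove() would raise KeyError.
    nlp.Defaults.stop_words.discard(word)
    # Clear the flag on the vocabulary entry so that token.is_stop agrees.
    nlp.vocab[word].is_stop = False

doc = nlp("This is not a good way to talk")
kept = [token.text for token in doc if not token.is_stop]
print(kept)  # ['not', 'good', 'way', 'talk']
```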
vi) Filtering Stopwords from Text File
In the code below, we remove the stopwords from an entire text file using spaCy, as explained in the sections above. The only difference is that we read the text using the Python file operation “with open()”.
import spacy
en = spacy.load('en_core_web_sm')
stopwords = en.Defaults.stop_words
#read the whole file into a single string
with open("text_file.txt") as f:
    text=f.read()
lst=[]
for token in text.split():
    #compare against the stopwords variable defined above
    if token.lower() not in stopwords:
        lst.append(token)
print('Original Text')
print(text,'\n\n')
print('Text after removing stop words')
print(' '.join(lst))
Original Text
Harry Potter is a series of seven fantasy novels written by British author, J. K. Rowling. The novels chronicle the lives of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley, all of whom are students at Hogwarts School of Witchcraft and Wizardry.
Text after removing stop words
Harry Potter series seven fantasy novels written British author, J. K. Rowling. novels chronicle lives young wizard, Harry Potter, friends Hermione Granger Ron Weasley, students Hogwarts School Witchcraft Wizardry.
Conclusion
In this tutorial, we saw several examples of removing stopwords from text and from a file using spaCy. We also explained how to add your own custom stopwords or remove some of the default stopwords in spaCy.
Reference – Spacy Documentation
This is Afham Fardeen, who loves the field of Machine Learning and enjoys reading and writing on it. The idea of enabling a machine to learn strikes me.