Introduction
NLTK is a great library for natural language processing with support for all the commonly used functions like stemming, lemmatization, stopwords, POS, etc. But this is just the tip of the iceberg, in fact, there are many useful functions that you probably did not know about in NLTK. In this article, we will go through such NLTK functions like Concordance, Similar, Generate, Dispersion Plot, etc.
So let us get started.
1. Concordance
NLTK concordance is a useful function to search every occurrence of a particular word in the context and also display the context around the search keyword. The function only prints the output, does not allows the user to store it for further preprocessing.
For example, in the below example of concordance function, we try to find all the occurrences of the word “magic”. You can notice that it is surrounded by the text between which it occurs. This is quite helpful not only to search the occurrence of a word or phrase but also to have a quick view of its context.
with open("concordance.txt") as f:
text = f.read()
import nltk
tokens = nltk.word_tokenize(text)
text1 = nltk.Text(tokens)
text1.concordance('magic')
Displaying 12 of 12 matches: erning body known as the Ministry of Magic and subjugate all wizards and Muggle ties are invited to attend exclusive magic schools that teach the necessary ski orating objects and wildlife such as magic wands , magic plants , potions , spe s and wildlife such as magic wands , magic plants , potions , spells , flying b f Witchcraft and Wizardry , a famous magic school in Scotland that educates you es which are well above the level of magic generally executed by people his age recent activities , the Ministry of Magic and many others in the magical world ys to defend themselves against dark magic . [ 18 ] Hermione and Ron form `` Du t a dark corridor in the Ministry of Magic , followed by a burning desire to le ortured and races to the Ministry of Magic . He and his friends face off agains ed Death Eaters ) at the Ministry of Magic . Although the timely arrival of mem nd gained control of the Ministry of Magic . Harry , Ron and Hermione drop out
2. Generate
Generate function generates some random text in the various styles. To do this, we type the name of the text followed by the term generate. For the first time, this command will take a longer time to run, as it gathers statistics about word sequences. We will get different output text each time we run the generate function.
import matplotlib.pyplot as plt
tokens = nltk.word_tokenize(text)
text1 = nltk.Text(tokens)
print(text1.generate())
in 2012 , a sport in the Ministry of Magic . , large , happy , but because of the battle , including fantasy , drama , coming of age , and finds an old potions textbook filled with many annotations and recommendations signed by a mysterious writer titled ; `` the Half- Blood Prince '' . a new Defence Against the Dark Arts . Harry and Ginny 's belongings . the books have sold more than one school year . The first book concludes with Harry 's history , and Voldemort to be one of which was the diary and
3. Count
Count function returns the total count of a given word in the text. In the below example, we can see that count of the word “Harry” is 85 in the sample text.
import nltk
tokens = nltk.word_tokenize(text)
text1 = nltk.Text(tokens)
text1.count('Harry')
85
4. Collocation_list
A collocation is a sequence of words that occur together unusually often. Collocation_list function returns a list of collocation words with the default size of 2.
In the below example, we can see that “Harry Potter” appears in the collocation list, which is easy to see why. Similarly, “Arts Teacher” is also a collocation example below.
import nltk
tokens = nltk.word_tokenize(text)
text1 = nltk.Text(tokens)
text1.collocation_list()
[('Harry', 'Potter'), ('Dark', 'Arts'), ('Wizarding', 'World'), ('wizarding', 'world'), ('Arts', 'teacher'), ('Lord', 'Voldemort'), ('Half-Blood', 'Prince'), ('new', 'Defence'), ('Death', 'Eaters'), ('Sirius', 'Black'), ('non-magical', 'people'), ('Deathly', 'Hallows'), ('Peter', 'Pettigrew'), ('device', 'called'), ('million', 'copies'), ('Hogwarts', 'School'), ('Marvolo', 'Riddle'), ('Tom', 'Marvolo'), ('United', 'States'), ('exists', 'parallel')]
5. Dispersion Plot
A dispersion plot in NLTK is used to visualize the position and number of occurrences of the words in a text corpus.
In the below example of dispersion plot, each stripe represents an instance of a word, and each row represents the entire text. Here we have visualized the occurrence of the words “Harry”, “Ron”, “Death”, and “magic”.
tokens = nltk.word_tokenize(text)
text1 = nltk.Text(tokens)
text1.dispersion_plot(['Harry','Ron','Death','magic'])
6. Similar
The Similar function in NLTK takes an input word and returns other words that appear in a similar range of contexts in the text.
For example, whenever the words Hogwarts, ron, and witchcraft occur, they will be surrounded by the same context that is “of Hogwarts and”, “of ron and” and “of witchcraft and”.
import nltk
tokens = nltk.word_tokenize(text)
text1 = nltk.Text(tokens)
text1.similar('magic')
hogwarts ron witchcraft age death adolescence keys terror parentage fire
7. Common_context
The function common_contexts allows us to examine just the contexts that are shared by two or more words. For example, the words like Hogwarts, magic, and ron are usually surrounded by context “of_and”.
import nltk
tokens = nltk.word_tokenize(text)
text1 = nltk.Text(tokens)
text1.common_contexts(['hogwarts', 'magic','ron'])
of_and
8. Index
Index function returns the first index of the word in the text. Remember that the first token starts from the 0th index.
In our below example, the word “Iron” occurs for the first time at the 37th index.
import nltk
tokens = nltk.word_tokenize(text)
text1 = nltk.Text(tokens)
text1.index('Ron')
37
9. Vocab
Vocab function in NLTK returns the total vocab of the text that is the count of all the unique words present in the text.
In the example below, the Vocab function returns the count of 1129 our text corpus.
import nltk
tokens = nltk.word_tokenize(text)
text1 = nltk.Text(tokens)
len(text1.vocab())
1129
- Also Read – Learn Lemmatization in NTLK with Examples
- Also Read – NLTK Tokenize – Complete Tutorial for Beginners
- Also Read – Complete Tutorial for NLTK Stopwords
- Also Read – Beginner’s Guide to Stemming in Python NLTK
- Also Read – Generating Unigram, Bigram, Trigram and Ngrams in NLTK
Reference – NLTK Documentation
-
This is Afham Fardeen, who loves the field of Machine Learning and enjoys reading and writing on it. The idea of enabling a machine to learn strikes me.
View all posts