Introduction
In this article, we walk through POS tagging and chunking with Python's NLTK library. We first explain what POS tagging is and why it is used, and then work through some examples in NLTK. After that, we turn to chunking: what it is, where it is applied, and how to do it in NLTK.
What is POS Tagging?
POS Tagging is the process of tagging words in a sentence with corresponding parts of speech like noun, pronoun, verb, adverb, preposition, etc.
Tagging the words of a text with parts of speech helps us understand how each word functions grammatically in its sentence. The same word can take different parts of speech depending on the context; for example, "book" is a noun in "a good book" but a verb in "book a flight".
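To make the point about context concrete, here is a minimal sketch that collects the tags a word receives across sentences. The (word, tag) pairs are hand-written for illustration and are an assumption, not the output of a real tagger.

```python
from collections import defaultdict

# Hand-annotated (word, tag) pairs for two sentences. These labels are
# written by hand for illustration; they are not tagger output.
tagged_sentences = [
    [("I", "PRP"), ("want", "VBP"), ("to", "TO"), ("book", "VB"),
     ("a", "DT"), ("flight", "NN")],
    [("She", "PRP"), ("read", "VBD"), ("a", "DT"), ("good", "JJ"),
     ("book", "NN")],
]

# Collect every tag seen for each (lower-cased) word.
tags_by_word = defaultdict(set)
for sentence in tagged_sentences:
    for word, tag in sentence:
        tags_by_word[word.lower()].add(tag)

# Keep only the words that appear with more than one tag.
ambiguous = {word: sorted(tags) for word, tags in tags_by_word.items()
             if len(tags) > 1}
print(ambiguous)  # {'book': ['NN', 'VB']}
```

Here "book" shows up once as a verb and once as a noun, which is exactly the ambiguity a POS tagger has to resolve from context.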
POS Tagging is useful in sentence parsing, information retrieval, sentiment analysis, etc. In fact, it is a prerequisite for the process of Chunking and Named Entity Recognition in NLP.
POS Tagging in NLTK Library
POS tagging in NLTK is done with the pos_tag() function, which takes the tokens of a sentence as input and returns a (word, tag) pair for each token.
List of POS Tags in NLTK
At school we are usually taught nine parts of speech: noun, verb, adverb, article, preposition, pronoun, adjective, conjunction, and interjection. NLTK, however, uses many more categories and sub-categories of tags than these traditional nine.
We can print all the available POS tags (the Penn Treebank tagset) with the nltk.help.upenn_tagset() function.
Below is the complete list of NLTK POS tags –
# | Pos_tag | tag_name | example |
---|---|---|---|
0 | $ | dollar | [$, -$, --$, A$, C$, HK$, M$, NZ$, S$, U.S.$, US$] |
1 | '' | closing quotation mark | [', ''] |
2 | ( | opening parenthesis | [(, [, {] |
3 | ) | closing parenthesis | [), ], }] |
4 | , | comma | [,] |
5 | -- | dash | [--] |
6 | . | sentence terminator | [., !, ?] |
7 | : | colon or ellipsis | [:, ;, …] |
8 | CC | conjunction, coordinating | [&, 'n, and, both, but, either, et, for, less,… |
9 | CD | numeral, cardinal | [mid-1890, nine-thirty, forty-two, one-tenth, … |
10 | DT | determiner | [all, an, another, any, both, del, each, eithe… |
11 | EX | existential there | [there] |
12 | FW | foreign word | [gemeinschaft, hund, ich, jeux, habeas, Haemen… |
13 | IN | preposition or conjunction, subordinating | [astride, among, uppon, whether, out, inside, … |
14 | JJ | adjective or numeral, ordinal | [third, ill-mannered, pre-war, regrettable, oi… |
15 | JJR | adjective, comparative | [bleaker, braver, breezier, briefer, brighter,… |
16 | JJS | adjective, superlative | [calmest, cheapest, choicest, classiest, clean… |
17 | LS | list item marker | [A, A., B, B., C, C., D, E, F, First, G, H, I,… |
18 | MD | modal auxiliary | [can, cannot, could, couldn’t, dare, may, migh… |
19 | NN | noun, common, singular or mass | [common-carrier, cabbage, knuckle-duster, Casi… |
20 | NNP | noun, proper, singular | [Motown, Venneboerger, Czestochwa, Ranzer, Con… |
21 | NNPS | noun, proper, plural | [Americans, Americas, Amharas, Amityvilles, Am… |
22 | NNS | noun, common, plural | [undergraduates, scotches, bric-a-brac, produc… |
23 | PDT | pre-determiner | [all, both, half, many, quite, such, sure, this] |
24 | POS | genitive marker | [', 's] |
25 | PRP | pronoun, personal | [hers, herself, him, himself, hisself, it, its… |
26 | PRP$ | pronoun, possessive | [her, his, mine, my, our, ours, their, thy, your] |
27 | RB | adverb | [occasionally, unabatingly, maddeningly, adven… |
28 | RBR | adverb, comparative | [further, gloomier, grander, graver, greater, … |
29 | RBS | adverb, superlative | [best, biggest, bluntest, earliest, farthest, … |
30 | RP | particle | [aboard, about, across, along, apart, around, … |
31 | SYM | symbol | [%, &, ', '', ''., ), )., *, +, ,., <, =, >, @… |
32 | TO | “to” as preposition or infinitive marker | [to] |
33 | UH | interjection | [Goodbye, Goody, Gosh, Wow, Jeepers, Jee-sus, … |
34 | VB | verb, base form | [ask, assemble, assess, assign, assume, atone,… |
35 | VBD | verb, past tense | [dipped, pleaded, swiped, regummed, soaked, ti… |
36 | VBG | verb, present participle or gerund | [telegraphing, stirring, focusing, angering, j… |
37 | VBN | verb, past participle | [multihulled, dilapidated, aerosolized, chaire… |
38 | VBP | verb, present tense, not 3rd person singular | [predominate, wrap, resort, sue, twist, spill,… |
39 | VBZ | verb, present tense, 3rd person singular | [bases, reconstructs, marks, mixes, displeases… |
40 | WDT | WH-determiner | [that, what, whatever, which, whichever] |
41 | WP | WH-pronoun | [that, what, whatever, whatsoever, which, who,… |
42 | WP$ | WH-pronoun, possessive | [whose] |
43 | WRB | Wh-adverb | [how, however, whence, whenever, where, whereby… |
44 | `` | opening quotation mark | [`, ``] |
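Many of the tags above are sub-categories of the same part of speech (NN, NNS, NNP and NNPS are all nouns, for instance). As a rough sketch, we can collapse them into coarse classes by tag prefix. The helper name coarse_class and the prefix table are our own assumptions for illustration and are not part of NLTK.

```python
# Hypothetical prefix table (not part of NLTK): maps the start of a
# Penn Treebank tag to a coarse part-of-speech class.
COARSE_PREFIXES = [
    ("NN", "noun"),
    ("VB", "verb"),
    ("JJ", "adjective"),
    ("RB", "adverb"),
]


def coarse_class(tag):
    """Return a coarse class for a Penn Treebank tag, or 'other'."""
    for prefix, name in COARSE_PREFIXES:
        if tag.startswith(prefix):
            return name
    return "other"


for tag in ["NN", "NNPS", "VBZ", "JJR", "RBS", "DT"]:
    print(tag, "->", coarse_class(tag))
```

This is deliberately crude: tags such as WRB (Wh-adverb) or PRP (pronoun) fall into "other" because they do not share one of the listed prefixes.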
Example of POS Tagging in NLTK
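In the example below, we first tokenize the text and then pass the tokens to NLTK's pos_tag() function.
from nltk import pos_tag
from nltk import word_tokenize

# Requires the NLTK tokenizer and tagger data (install once with nltk.download()).
text = "The way to get started is to quit talking and begin doing."
tokens = word_tokenize(text)   # split the sentence into word tokens
print(pos_tag(tokens))         # list of (word, tag) pairs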
[('The', 'DT'), ('way', 'NN'), ('to', 'TO'), ('get', 'VB'), ('started', 'VBN'), ('is', 'VBZ'), ('to', 'TO'), ('quit', 'VB'), ('talking', 'VBG'), ('and', 'CC'), ('begin', 'VB'), ('doing', 'VBG'), ('.', '.')]
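Tagged tokens are often written in a compact word/TAG text form. As a small sketch, NLTK's tuple2str() and str2tuple() helpers convert between that form and (word, tag) tuples. The tagged list below is copied by hand from the start of the output above, so nothing needs to be downloaded to run it.

```python
from nltk.tag import str2tuple, tuple2str

# First few (word, tag) pairs, copied by hand from the output above.
tagged = [("The", "DT"), ("way", "NN"), ("to", "TO"), ("get", "VB")]

# Join each pair as word/TAG to get a compact text form.
as_text = " ".join(tuple2str(pair) for pair in tagged)
print(as_text)  # The/DT way/NN to/TO get/VB

# Parse the text form back into (word, tag) tuples.
round_trip = [str2tuple(item) for item in as_text.split()]
print(round_trip == tagged)  # True
```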
Default Tagger in NLTK
NLTK provides the DefaultTagger class, which assigns the same default tag to every token. Let us see this with the help of an example.
Below, we first tokenize the text and then create an instance of DefaultTagger with the desired default tag 'Ad'. Finally, we pass the tokens to the tag() method of the DefaultTagger instance.
from nltk import word_tokenize
from nltk.tag import DefaultTagger

text = "The way to get started is to quit talking and begin doing."
tokens = word_tokenize(text)
tagging = DefaultTagger('Ad')   # every token will receive the tag 'Ad'
print(tagging.tag(tokens))
[('The', 'Ad'), ('way', 'Ad'), ('to', 'Ad'), ('get', 'Ad'), ('started', 'Ad'), ('is', 'Ad'), ('to', 'Ad'), ('quit', 'Ad'), ('talking', 'Ad'), ('and', 'Ad'), ('begin', 'Ad'), ('doing', 'Ad'), ('.', 'Ad')]
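On its own, a default tagger is not very informative. One common use, sketched below, is as a fallback for another tagger: here a RegexpTagger tries a few patterns first and hands any unmatched token to a DefaultTagger through its backoff argument. The patterns and the choice of 'NN' as the fallback tag are our own assumptions for illustration, and the sentence is split with str.split() to keep the example free of downloads.

```python
from nltk.tag import DefaultTagger, RegexpTagger

# Illustrative patterns only; a real pattern list would be much longer.
patterns = [
    (r"^(the|a|an)$", "DT"),   # articles
    (r".*ing$", "VBG"),        # words ending in -ing
    (r".*ed$", "VBD"),         # words ending in -ed
]

# Any token that matches no pattern falls back to the default tag 'NN'.
fallback = DefaultTagger("NN")
tagger = RegexpTagger(patterns, backoff=fallback)

tokens = "the dog barked while running".split()
tagged = tagger.tag(tokens)
print(tagged)
```

Note that "while" also ends up tagged as NN, which is wrong: the fallback only guarantees that every token gets some tag, not that the tag is correct.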
What is Chunking in NLP?
We have seen that we can break a sentence into word tokens and then use POS tagging to identify the part of speech of each word. On its own, however, this does not tell us much about the structure of the sentence. Chunking takes us to the next level.
In NLP, chunking is the process of breaking down a text into phrases such as noun phrases, verb phrases, adjective phrases, adverb phrases, and prepositional phrases.
Chunking is commonly used to extract Noun Phrases (NP) from the sentence. It should be noted that POS tagging is the prerequisite for the chunking process and the chunks do not overlap with each other.
Chunking helps in understanding the meaning of a text and is useful in tasks such as information retrieval.
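Because chunks do not overlap, a chunked sentence can be written as one label per token. The sketch below builds a small chunk tree by hand (the sentence and its tags are our own illustration) and converts it to IOB labels with NLTK's tree2conlltags(): B-NP marks the first token of a noun phrase, I-NP a token inside one, and O a token outside any chunk.

```python
from nltk import Tree
from nltk.chunk import tree2conlltags

# A hand-built chunk tree: two noun phrases (NP) inside a sentence (S).
chunked = Tree("S", [
    Tree("NP", [("the", "DT"), ("cat", "NN")]),
    ("sat", "VBD"),
    ("on", "IN"),
    Tree("NP", [("the", "DT"), ("mat", "NN")]),
])

# Flatten the tree into (word, POS tag, IOB chunk label) triples.
iob = tree2conlltags(chunked)
for word, pos, chunk_label in iob:
    print(word, pos, chunk_label)
```

Every token receives exactly one label, which is only possible because no token belongs to two chunks at the same level.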
Chunking in NLTK Library
Chunking in NLTK is a multi-step process, as explained below:
Step 1:
Tokenize the sentence and perform POS tagging.
Step 2:
Define the grammar for chunking. This is a very important step because the grammar lays down the rules for which sequences of tags form a chunk.
Step 3:
Using this grammar, create a chunk parser with RegexpParser and apply it to the tagged sentence.
Step 4:
The parser returns a tree, which can either be printed as it is or drawn as a graph for better visualization.
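The steps above can be sketched as follows. To keep the sketch free of downloads, Step 1 is assumed to be already done: the tagged list is hand-written for illustration rather than produced by pos_tag(). Instead of drawing the result, we walk the tree and collect the text of each noun phrase.

```python
from nltk import RegexpParser

# Step 1 (assumed done): a hand-tagged sentence, not real tagger output.
tagged = [("a", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumped", "VBD"),
          ("over", "IN"), ("the", "DT"), ("fence", "NN")]

# Step 2: a noun phrase is an optional determiner, any number of
# adjectives, and then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"

# Step 3: build the chunk parser and apply it to the tagged sentence.
parser = RegexpParser(grammar)
tree = parser.parse(tagged)

# Step 4: collect the words of every NP subtree.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)  # ['a quick fox', 'the fence']
```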
Example of Chunking in NLTK
Following the steps explained above, in the example below we first tokenize the sample sentence and perform POS tagging on it. We then define the grammar for a noun phrase as NP: {<DT>?<JJ>*<NN>}, which means that a chunk is formed by an optional determiner (DT), followed by any number of adjectives (JJ), followed by a noun (NN).
We then initialize an instance of nltk.RegexpParser() with this grammar and use it to parse the POS-tagged sentence. This produces the chunking result, which we both print and draw as a tree.
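import nltk

sentence = "the little yellow dog barked at the cat"
# Step 1: tokenize and POS-tag the sentence
tokens = nltk.word_tokenize(sentence)
print(tokens)
tagged = nltk.pos_tag(tokens)
print(tagged)
# Step 2: noun phrase = optional determiner, any adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
# Step 3: build the chunk parser and apply it to the tagged tokens
cp = nltk.RegexpParser(grammar)
result = cp.parse(tagged)
# Step 4: print the chunk tree and draw it (opens a GUI window)
print(result)
result.draw()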
['the', 'little', 'yellow', 'dog', 'barked', 'at', 'the', 'cat']
[('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
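A grammar can contain more than one rule. As a sketch, the version below adds a second rule that builds a prepositional phrase (PP) from a preposition followed by a noun-phrase chunk. The PP rule is our own addition for illustration, and the tagged list is copied by hand from the output above so that the sketch needs no tagger data.

```python
from nltk import RegexpParser

# (word, tag) pairs copied by hand from the tagger output shown above.
tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# Rules are applied in order: NP chunks are formed first, and the PP
# rule then matches a preposition followed by an NP chunk.
grammar = r"""
  NP: {<DT>?<JJ>*<NN>}   # optional determiner, adjectives, then a noun
  PP: {<IN><NP>}         # preposition followed by a noun phrase
"""

parser = RegexpParser(grammar)
tree = parser.parse(tagged)
print(tree)

# Labels of all subtrees, listed from the root downwards.
labels = [subtree.label() for subtree in tree.subtrees()]
print(labels)  # ['S', 'NP', 'PP', 'NP']
```

The second noun phrase, "the cat", is now nested inside the PP chunk "at the cat" rather than sitting directly under the sentence node.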
Conclusion
We hope you found this article helpful in understanding POS tagging and chunking in the NLTK library. We explained the basic concept of each, along with a couple of examples using NLTK.
Reference – NLTK Documentation
This is Afham Fardeen, who loves the field of Machine Learning and enjoys reading and writing on it. The idea of enabling a machine to learn strikes me.