Tutorial on POS Tagging and Chunking in NLTK Python

In this article, we will walk through POS tagging and chunking with Python's NLTK library. We first look at what POS tagging is and why it is used, and then work through some examples in NLTK. We then turn to chunking: what it is, where it is applied, and how to do it in NLTK.

What is POS Tagging?

POS Tagging is the process of tagging words in a sentence with corresponding parts of speech like noun, pronoun, verb, adverb, preposition, etc.

Tagging the words of a text with parts of speech helps us understand how each word functions grammatically within its sentence. The same word can take on different parts of speech depending on the context in which it appears.
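
As a small illustration of this point, the sketch below uses a hand-tagged sentence in which "refuse" and "permit" each appear once as a verb and once as a noun. The tags are assigned by hand in the Penn Treebank style purely for illustration. They are not the output of a tagger, and a real tagger may label some words differently.

```python
from collections import defaultdict

# Hand-assigned (word, tag) pairs for illustration; not produced by a tagger.
tagged = [
    ("They", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("permit", "VB"),
    ("us", "PRP"), ("to", "TO"), ("obtain", "VB"),
    ("the", "DT"), ("refuse", "NN"), ("permit", "NN"),
]

# Collect every tag that each word receives in this sentence.
tags_by_word = defaultdict(set)
for word, tag in tagged:
    tags_by_word[word].add(tag)

# Keep only the words that were tagged in more than one way.
ambiguous = {
    word: sorted(tags) for word, tags in tags_by_word.items() if len(tags) > 1
}
print(ambiguous)
```

Only "refuse" and "permit" end up in the result, each with one verb tag and one noun tag.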

POS Tagging is useful in sentence parsing, information retrieval, sentiment analysis, etc. In fact, it is a prerequisite for the process of Chunking and Named Entity Recognition in NLP.

POS Tagging in NLTK Library

POS tagging in NLTK is done with the pos_tag() function, which takes the tokens of a sentence as input and returns a (word, tag) pair for each token.

List of POS Tags in NLTK

In school, we are usually taught nine parts of speech: noun, verb, adverb, article, preposition, pronoun, adjective, conjunction, and interjection. NLTK, however, provides many more categories and sub-categories of tags than the traditional nine.

We can print all the available POS tags, with their descriptions and examples, by calling the nltk.help.upenn_tagset() function.

Below is the complete list of NLTK POS tags –

Pos_tag tag_name example
0 $ dollar [, , , —, A, A, C, HK, HK, M, NZ, NZ, S, U.S., U.S, …
1 '' closing quotation mark [', '']
2 ( opening parenthesis [(, [, {]
3 ) closing parenthesis [), ], }]
4 , comma [,]
5 -- dash [--]
6 . sentence terminator [., !, ?]
7 : colon or ellipsis [:, ;, …]
8 CC conjunction, coordinating [&, ‘n, and, both, but, either, et, for, less,…
9 CD numeral, cardinal [mid-1890, nine-thirty, forty-two, one-tenth, …
10 DT determiner [all, an, another, any, both, del, each, eithe…
11 EX existential there [there]
12 FW foreign word [gemeinschaft, hund, ich, jeux, habeas, Haemen…
13 IN preposition or conjunction, subordinating [astride, among, uppon, whether, out, inside, …
14 JJ adjective or numeral, ordinal [third, ill-mannered, pre-war, regrettable, oi…
15 JJR adjective, comparative [bleaker, braver, breezier, briefer, brighter,…
16 JJS adjective, superlative [calmest, cheapest, choicest, classiest, clean…
17 LS list item marker [A, A., B, B., C, C., D, E, F, First, G, H, I,…
18 MD modal auxiliary [can, cannot, could, couldn’t, dare, may, migh…
19 NN noun, common, singular or mass [common-carrier, cabbage, knuckle-duster, Casi…
20 NNP noun, proper, singular [Motown, Venneboerger, Czestochwa, Ranzer, Con…
21 NNPS noun, proper, plural [Americans, Americas, Amharas, Amityvilles, Am…
22 NNS noun, common, plural [undergraduates, scotches, bric-a-brac, produc…
23 PDT pre-determiner [all, both, half, many, quite, such, sure, this]
24 POS genitive marker [', 's]
25 PRP pronoun, personal [hers, herself, him, himself, hisself, it, its…
26 PRP$ pronoun, possessive [her, his, mine, my, our, ours, their, thy, your]
27 RB adverb [occasionally, unabatingly, maddeningly, adven…
28 RBR adverb, comparative [further, gloomier, grander, graver, greater, …
29 RBS adverb, superlative [best, biggest, bluntest, earliest, farthest, …
30 RP particle [aboard, about, across, along, apart, around, …
31 SYM symbol [%, &, ‘, ”, ”., ), )., *, +, ,., <, =, >, @…
32 TO “to” as preposition or infinitive marker [to]
33 UH interjection [Goodbye, Goody, Gosh, Wow, Jeepers, Jee-sus, …
34 VB verb, base form [ask, assemble, assess, assign, assume, atone,…
35 VBD verb, past tense [dipped, pleaded, swiped, regummed, soaked, ti…
36 VBG verb, present participle or gerund [telegraphing, stirring, focusing, angering, j…
37 VBN verb, past participle [multihulled, dilapidated, aerosolized, chaire…
38 VBP verb, present tense, not 3rd person singular [predominate, wrap, resort, sue, twist, spill,…
39 VBZ verb, present tense, 3rd person singular [bases, reconstructs, marks, mixes, displeases…
40 WDT WH-determiner [that, what, whatever, which, whichever]
41 WP WH-pronoun [that, what, whatever, whatsoever, which, who,…
42 WP$ WH-pronoun, possessive [whose]
43 WRB Wh-adverb [how, however, whence, whenever, where, whereby…
44 `` opening quotation mark [`, ``]

Example of POS Tagging in NLTK

In the example below, we first tokenize the text and then pass the tokens to NLTK's pos_tag() function.

In [1]:
from nltk import pos_tag
from nltk import word_tokenize

text = "The way to get started is to quit talking and begin doing."
tokens = word_tokenize(text)  # split the sentence into word tokens
pos_tag(tokens)               # tag each token with its part of speech
[Out] :
[('The', 'DT'),
 ('way', 'NN'),
 ('to', 'TO'),
 ('get', 'VB'),
 ('started', 'VBN'),
 ('is', 'VBZ'),
 ('to', 'TO'),
 ('quit', 'VB'),
 ('talking', 'VBG'),
 ('and', 'CC'),
 ('begin', 'VB'),
 ('doing', 'VBG'),
 ('.', '.')]
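
Because the result is a plain list of (word, tag) tuples, it can be filtered with ordinary Python. As a minimal sketch, the snippet below copies the output shown above into a variable (called tagged here, a name chosen only for illustration) and keeps the words whose tag starts with VB, that is, the verbs.

```python
# Output of pos_tag() copied from the example above.
tagged = [('The', 'DT'), ('way', 'NN'), ('to', 'TO'), ('get', 'VB'),
          ('started', 'VBN'), ('is', 'VBZ'), ('to', 'TO'), ('quit', 'VB'),
          ('talking', 'VBG'), ('and', 'CC'), ('begin', 'VB'),
          ('doing', 'VBG'), ('.', '.')]

# All verb tags in the Penn Treebank tagset begin with "VB".
verbs = [word for word, tag in tagged if tag.startswith('VB')]
print(verbs)
```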

Default Tagger in NLTK

NLTK has a DefaultTagger class that assigns the same default tag to every token. Let us see this with the help of an example.

Below, we first tokenize the text and then create an instance of DefaultTagger, passing it the desired default tag 'Ad'. Finally, we pass the tokenized text to the tag() method of the DefaultTagger instance.

In [2]:
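from nltk import word_tokenize
from nltk.tag import DefaultTagger

text = "The way to get started is to quit talking and begin doing."
tokens = word_tokenize(text)   # split the sentence into word tokens
tagging = DefaultTagger('Ad')  # every token will receive the tag 'Ad'
tagging.tag(tokens)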

[Out] :
[('The', 'Ad'), ('way', 'Ad'), ('to', 'Ad'), ('get', 'Ad'), ('started', 'Ad'), ('is', 'Ad'), ('to', 'Ad'), ('quit', 'Ad'), ('talking', 'Ad'), ('and', 'Ad'), ('begin', 'Ad'), ('doing', 'Ad'), ('.', 'Ad')]
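
On its own, a default tagger is not very informative. One common use, sketched below, is as a fallback for another tagger. The example pairs it with NLTK's RegexpTagger through the backoff parameter. The three patterns and the choice of 'NN' as the fallback tag are assumptions made up for this illustration, not a recommended rule set.

```python
from nltk.tag import DefaultTagger, RegexpTagger

# Hypothetical, deliberately tiny rule set chosen for illustration only.
patterns = [
    (r'^(the|a|an)$', 'DT'),  # a few determiners
    (r'.*ing$', 'VBG'),       # words ending in -ing
    (r'.*ed$', 'VBD'),        # words ending in -ed
]

# Tokens that match no pattern fall through to the default tagger.
fallback = DefaultTagger('NN')
tagger = RegexpTagger(patterns, backoff=fallback)

tokens = ['the', 'dog', 'barked']
tagged = tagger.tag(tokens)
print(tagged)
```

Here "dog" matches none of the patterns, so it receives the default tag from the fallback tagger.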

What is Chunking in NLP?

We have seen that we can break a sentence down into word tokens and then use POS tagging to identify the part of speech of each word. On its own, however, this does not tell us much about the structure of the sentence. Chunking takes us to the next level.

In NLP, chunking is the process of breaking a text down into phrases such as noun phrases, verb phrases, adjective phrases, adverb phrases, and prepositional phrases.

Chunking is commonly used to extract Noun Phrases (NP) from the sentence. It should be noted that POS tagging is the prerequisite for the chunking process and the chunks do not overlap with each other.

Chunking is essential for understanding the semantics of the text and helps in information retrieval.

Chunking in NLTK Library

Chunking in NLTK is a multi-step process, as explained below:

Step 1:

Tokenize the sentence and perform POS Tagging.

Step 2:

Define the grammar to use for chunking. This is a very important step because the grammar lays down the rules that decide which sequences of tags form a chunk.

Step 3:

Using this grammar, we create a chunk parser with the help of RegexpParser and apply it to our sentence.

Step 4:

The previous step produces a tree, which we can either print as it is or draw as a graph for better visualization.

Example of Chunking in NLTK

Following the steps explained above, in the example below we first tokenize the sample sentence and perform POS tagging on it. Then we define the grammar for a noun phrase as NP: {<DT>?<JJ>*<NN>}, which means that a chunk is formed by an optional determiner (DT), followed by any number of adjectives (JJ), followed by a noun (NN).

We then initialize an instance of nltk.RegexpParser() with this grammar and use it to parse the POS-tagged sample sentence. This produces the chunking result, which we both print and draw as a tree.

In [4]:
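import nltk

sentence = "the little yellow dog barked at the cat"
tokens = nltk.word_tokenize(sentence)  # Step 1: tokenize the sentence...
print(tokens)

tag = nltk.pos_tag(tokens)             # ...and POS-tag each token
print(tag)

grammar = "NP: {<DT>?<JJ>*<NN>}"       # Step 2: noun-phrase chunk rule
cp = nltk.RegexpParser(grammar)        # Step 3: build the chunk parser
result = cp.parse(tag)                 # and apply it to the tagged tokens
print(result)                          # Step 4: print the chunk tree
result.draw()                          # and draw it (opens a GUI window)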
[Out] :
['the', 'little', 'yellow', 'dog', 'barked', 'at', 'the', 'cat']
[('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
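
To pull the noun phrases out of the tree as plain strings, one possible approach is sketched below. The helper name noun_phrases is our own and is not part of NLTK. The first call uses the tagged tokens copied from the output above, so this sketch does not need to run the tokenizer or tagger. The other two calls use hand-tagged inputs to show how the ? and * quantifiers in the grammar behave.

```python
import nltk

def noun_phrases(tagged_tokens, grammar="NP: {<DT>?<JJ>*<NN>}"):
    """Return the text of every NP chunk found in a list of (word, tag) pairs."""
    tree = nltk.RegexpParser(grammar).parse(tagged_tokens)
    return [
        " ".join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
    ]

# Tagged tokens copied from the output above.
tagged = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'),
          ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]
print(noun_phrases(tagged))

# The determiner is optional, so an adjective plus a noun still forms a chunk.
print(noun_phrases([('yellow', 'JJ'), ('dog', 'NN')]))

# The noun is required, so a determiner plus an adjective forms no chunk.
print(noun_phrases([('the', 'DT'), ('yellow', 'JJ')]))
```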


We hope this article helped you understand POS tagging and chunking in the NLTK library. We covered the basic concept behind each of them and worked through a couple of examples in NLTK.

Reference – NLTK Documentation
