Tutorial on POS Tagging and Chunking in NLTK Python

Afham Fardeen
Last Updated On June 12, 2021
Natural Language Processing

Introduction

In this article, we will take you through POS tagging and chunking in the NLTK library of Python. We will first understand what POS tagging is and why it is used, and then see some examples of it in NLTK. We will then turn to chunking: what it is, where it is applied, and how to do it with the NLTK library.

What is POS Tagging?

POS Tagging is the process of tagging words in a sentence with corresponding parts of speech like noun, pronoun, verb, adverb, preposition, etc.

Tagging the words of a text with parts of speech helps us understand how each word functions grammatically in the context of the sentence. The same word can take different parts of speech depending on that context; for example, "book" is a noun in "read a book" but a verb in "book a flight".
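To make this context dependence concrete, here is a minimal sketch that groups a hand-tagged sentence by word. The sentence and its tags are an illustrative assumption written by hand (using Penn Treebank style tags such as NN for a noun and VB/VBP for a verb), not output captured from a tagger, and names such as tags_by_word are hypothetical.

```python
from collections import defaultdict

# Hand-tagged sentence (illustrative assumption, not tagger output):
# "refuse" and "permit" each occur once as a verb and once as a noun.
tagged_sentence = [
    ("They", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("permit", "VB"),
    ("us", "PRP"), ("to", "TO"), ("obtain", "VB"), ("the", "DT"),
    ("refuse", "NN"), ("permit", "NN"),
]

# Collect every distinct tag seen for each word.
tags_by_word = defaultdict(set)
for word, tag in tagged_sentence:
    tags_by_word[word].add(tag)

# Keep only the words that appear with more than one tag.
ambiguous = {
    word: sorted(tags) for word, tags in tags_by_word.items() if len(tags) > 1
}
print(ambiguous)
```

Only "refuse" and "permit" survive the filter, because they are the only words whose grammatical role changes within the sentence.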

POS Tagging is useful in sentence parsing, information retrieval, sentiment analysis, etc. In fact, it is a prerequisite for the process of Chunking and Named Entity Recognition in NLP.

POS Tagging in NLTK Library

POS tagging in NLTK is done with the pos_tag() function, which takes the tokens of a sentence as input and returns a list of (word, tag) pairs, one for each token.

Usually, in schools, we are taught about 9 different parts of speech – noun, verb, adverb, article, preposition, pronoun, adjective, conjunction, and interjection. NLTK's default tagger uses the Penn Treebank tagset, which provides many more categories and sub-categories of tags than just the traditional nine.

We can print all the available POS tags, along with their descriptions and examples, by using the nltk.help.upenn_tagset() function.

Below is the complete list of NLTK POS tags –

	Pos_tag	tag_name	example
0	$	dollar	[$, -$, --$, A$, C$, HK$, M$, NZ$, S$, U.S.$, …
1	''	closing quotation mark	[', '']
2	(	opening parenthesis	[(, [, {]
3	)	closing parenthesis	[), ], }]
4	,	comma	[,]
5	--	dash	[--]
6	.	sentence terminator	[., !, ?]
7	:	colon or ellipsis	[:, ;, …]
8	CC	conjunction, coordinating	[&, 'n, and, both, but, either, et, for, less,…
9	CD	numeral, cardinal	[mid-1890, nine-thirty, forty-two, one-tenth, …
10	DT	determiner	[all, an, another, any, both, del, each, eithe…
11	EX	existential there	[there]
12	FW	foreign word	[gemeinschaft, hund, ich, jeux, habeas, Haemen…
13	IN	preposition or conjunction, subordinating	[astride, among, uppon, whether, out, inside, …
14	JJ	adjective or numeral, ordinal	[third, ill-mannered, pre-war, regrettable, oi…
15	JJR	adjective, comparative	[bleaker, braver, breezier, briefer, brighter,…
16	JJS	adjective, superlative	[calmest, cheapest, choicest, classiest, clean…
17	LS	list item marker	[A, A., B, B., C, C., D, E, F, First, G, H, I,…
18	MD	modal auxiliary	[can, cannot, could, couldn’t, dare, may, migh…
19	NN	noun, common, singular or mass	[common-carrier, cabbage, knuckle-duster, Casi…
20	NNP	noun, proper, singular	[Motown, Venneboerger, Czestochwa, Ranzer, Con…
21	NNPS	noun, proper, plural	[Americans, Americas, Amharas, Amityvilles, Am…
22	NNS	noun, common, plural	[undergraduates, scotches, bric-a-brac, produc…
23	PDT	pre-determiner	[all, both, half, many, quite, such, sure, this]
24	POS	genitive marker	[', 's]
25	PRP	pronoun, personal	[hers, herself, him, himself, hisself, it, its…
26	PRP$	pronoun, possessive	[her, his, mine, my, our, ours, their, thy, your]
27	RB	adverb	[occasionally, unabatingly, maddeningly, adven…
28	RBR	adverb, comparative	[further, gloomier, grander, graver, greater, …
29	RBS	adverb, superlative	[best, biggest, bluntest, earliest, farthest, …
30	RP	particle	[aboard, about, across, along, apart, around, …
31	SYM	symbol	[%, &, ', '', ''., ), )., *, +, ,., <, =, >, @…
32	TO	“to” as preposition or infinitive marker	[to]
33	UH	interjection	[Goodbye, Goody, Gosh, Wow, Jeepers, Jee-sus, …
34	VB	verb, base form	[ask, assemble, assess, assign, assume, atone,…
35	VBD	verb, past tense	[dipped, pleaded, swiped, regummed, soaked, ti…
36	VBG	verb, present participle or gerund	[telegraphing, stirring, focusing, angering, j…
37	VBN	verb, past participle	[multihulled, dilapidated, aerosolized, chaire…
38	VBP	verb, present tense, not 3rd person singular	[predominate, wrap, resort, sue, twist, spill,…
39	VBZ	verb, present tense, 3rd person singular	[bases, reconstructs, marks, mixes, displeases…
40	WDT	WH-determiner	[that, what, whatever, which, whichever]
41	WP	WH-pronoun	[that, what, whatever, whatsoever, which, who,…
42	WP$	WH-pronoun, possessive	[whose]
43	WRB	Wh-adverb	[how, however, whence, whenever, where, whereby…
44	``	opening quotation mark	[`, ``]
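Many of the tags in this list are sub-categories that share a prefix (NN, NNP, NNPS and NNS are all nouns, for instance). The sketch below shows one way to collapse them back into coarse classes. The mapping and the coarse_class() helper are hypothetical and written only for illustration; they are not part of NLTK.

```python
# Hypothetical mapping from Penn Treebank tag prefixes to coarse classes.
PREFIX_TO_CLASS = {
    "NN": "noun",
    "VB": "verb",
    "JJ": "adjective",
    "RB": "adverb",
}


def coarse_class(tag):
    """Return a coarse class for a Penn Treebank tag, or 'other'."""
    for prefix, name in PREFIX_TO_CLASS.items():
        if tag.startswith(prefix):
            return name
    # Tags without one of the prefixes above (DT, PRP$, WRB, ...) end up here.
    return "other"


examples = ["NN", "NNPS", "VBD", "VBZ", "JJR", "RBS", "DT", "PRP$"]
classes = {tag: coarse_class(tag) for tag in examples}
print(classes)
```

A prefix check like this is deliberately rough: WRB, for example, is an adverb tag but does not start with RB, so it falls into "other" here.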

Example of POS Tagging in NLTK

In the example below, we first tokenize the text and then pass the tokens to NLTK's pos_tag() function.

In [1]:

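from nltk import pos_tag
from nltk import word_tokenize

# The tokenizer and tagger models must be fetched once with nltk.download()
text = "The way to get started is to quit talking and begin doing."
tokens = word_tokenize(text)  # split the sentence into word tokens
pos_tag(tokens)  # returns a list of (token, tag) pairs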

[Out] :

[('The', 'DT'),
 ('way', 'NN'),
 ('to', 'TO'),
 ('get', 'VB'),
 ('started', 'VBN'),
 ('is', 'VBZ'),
 ('to', 'TO'),
 ('quit', 'VB'),
 ('talking', 'VBG'),
 ('and', 'CC'),
 ('begin', 'VB'),
 ('doing', 'VBG'),
 ('.', '.')]
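Because the result is an ordinary list of (word, tag) pairs, it can be filtered and counted with plain Python. As a small sketch, the snippet below copies the tagged output shown above and pulls out the verbs (every tag starting with VB) and a tag frequency count. The variable names are only illustrative.

```python
from collections import Counter

# Tagged output copied from the example above.
tagged = [
    ("The", "DT"), ("way", "NN"), ("to", "TO"), ("get", "VB"),
    ("started", "VBN"), ("is", "VBZ"), ("to", "TO"), ("quit", "VB"),
    ("talking", "VBG"), ("and", "CC"), ("begin", "VB"), ("doing", "VBG"),
    (".", "."),
]

# All verb tags (VB, VBD, VBG, VBN, VBP, VBZ) share the "VB" prefix.
verbs = [word for word, tag in tagged if tag.startswith("VB")]
print(verbs)

# Count how often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common(3))
```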

Default Tagger in NLTK

NLTK has a DefaultTagger class that assigns the same default tag to every token. Let us see this with the help of an example.

Below, we first tokenize the text and then create an instance of DefaultTagger with the desired default tag 'Ad'. Finally, we pass the tokenized text to the tag() method of the DefaultTagger instance.

In [2]:

from nltk import word_tokenize
from nltk.tag import DefaultTagger

text = "The way to get started is to quit talking and begin doing."
tokens = word_tokenize(text)
tagging = DefaultTagger('Ad')  # every token will receive the tag 'Ad'

print(tagging.tag(tokens))

[Out] :

[('The', 'Ad'), ('way', 'Ad'), ('to', 'Ad'), ('get', 'Ad'), ('started', 'Ad'), ('is', 'Ad'), ('to', 'Ad'), ('quit', 'Ad'), ('talking', 'Ad'), ('and', 'Ad'), ('begin', 'Ad'), ('doing', 'Ad'), ('.', 'Ad')]
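On its own a default tagger is not very informative, but it is commonly used as a fallback for another tagger. The following is a minimal sketch of that idea using NLTK's RegexpTagger with a DefaultTagger as its backoff. The three patterns and the choice of 'NN' as the fallback tag are assumptions made up for this illustration, and they are far too crude for real text.

```python
from nltk.tag import DefaultTagger, RegexpTagger

# Toy patterns (illustrative assumption): the first matching pattern wins.
patterns = [
    (r".*ing$", "VBG"),        # words ending in -ing
    (r".*ed$", "VBD"),         # words ending in -ed
    (r"^(the|a|an)$", "DT"),   # a few determiners
]

# Tokens that match no pattern are passed on to the default tagger.
tagger = RegexpTagger(patterns, backoff=DefaultTagger("NN"))

tokens = ["the", "dog", "barked", "while", "running"]
tagged = tagger.tag(tokens)
print(tagged)
```

Here "dog" and "while" match none of the patterns, so they receive the fallback tag from the DefaultTagger. For "while" that tag is wrong, which shows the limits of such a simple setup.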

What is Chunking in NLP?

We have seen that we can break down a sentence into word tokens and then do POS tagging to identify the part of speech of each word. But this alone does not give us enough meaningful information about the sentence. Chunking takes us to the next level.

In NLP, chunking is the process of breaking down a text into phrases such as Noun Phrases, Verb Phrases, Adjective Phrases, Adverb Phrases, and Prepositional Phrases.

Chunking is commonly used to extract Noun Phrases (NP) from the sentence. It should be noted that POS tagging is the prerequisite for the chunking process and the chunks do not overlap with each other.
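Since chunks do not overlap, every token belongs to at most one chunk. One way to see this is to flatten a chunk tree into one label per token. The sketch below builds a small chunk tree by hand (an illustrative assumption, not parser output) and converts it with NLTK's tree2conlltags(), which marks the first token of a chunk with B-, the following tokens of the same chunk with I-, and tokens outside any chunk with O.

```python
from nltk.chunk import tree2conlltags
from nltk.tree import Tree

# Hand-built chunk tree (illustrative assumption): two NP chunks under S.
tree = Tree("S", [
    Tree("NP", [("the", "DT"), ("dog", "NN")]),
    ("chased", "VBD"),
    Tree("NP", [("a", "DT"), ("cat", "NN")]),
])

# Each token gets exactly one chunk label.
iob = tree2conlltags(tree)
for word, tag, chunk_label in iob:
    print(word, tag, chunk_label)
```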

Chunking groups words into meaningful units, which helps in understanding the semantics of the text and in information retrieval.

Chunking in NLTK Library

The process of chunking in NLTK is a multi-step process as explained below –

Step 1:

Tokenize the sentence and perform POS Tagging.

Step 2:

Define the grammar to perform chunking. This is a very important step because the grammar defines the rules by which chunks are formed.

Step 3:

Using this grammar, we create a chunk parser with the help of RegexpParser and apply it to our sentence.

Step 4:

The above step produces a result that can either be printed as it is or drawn as a tree for better visualization.

Example of Chunking in NLTK

Following the steps explained above, in the example below we first tokenize the sample sentence and perform POS tagging on it. Then we define the grammar for a Noun Phrase as NP: {<DT>?<JJ>*<NN>}, which means that a chunk is formed from an optional Determiner (DT), followed by any number of Adjectives (JJ), and then a Noun (NN).

We then initialize an instance of nltk.RegexpParser() with this grammar and use it to parse the POS-tagged sample sentence. This produces the chunking result, which we both print and draw as a tree.
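Before running the full example, the behaviour of this pattern can be checked on a few hand-tagged fragments. This is a small sketch: the fragments and the has_np() helper are hypothetical and exist only to probe the grammar.

```python
import nltk

cp = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")


def has_np(tagged):
    """Return True if the grammar finds at least one NP chunk."""
    tree = cp.parse(tagged)
    return any(subtree.label() == "NP" for subtree in tree.subtrees())


# A noun on its own is enough: the determiner and adjectives are optional.
noun_only = has_np([("cat", "NN")])
# Adjectives followed by a noun form a chunk even without a determiner.
adjectives_and_noun = has_np([("big", "JJ"), ("red", "JJ"), ("ball", "NN")])
# Without a noun at the end there is no chunk.
no_noun = has_np([("the", "DT"), ("big", "JJ")])
# <NN> matches the tag NN exactly, so a plural noun (NNS) is not chunked.
plural_noun = has_np([("dogs", "NNS")])

print(noun_only, adjectives_and_noun, no_noun, plural_noun)
```

The last case suggests why grammars for real text often use a broader pattern such as <NN.*>, which would also cover NNS, NNP and NNPS.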

In [4]:

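import nltk

sentence = "the little yellow dog barked at the cat"

# Step 1: tokenize the sentence and tag each token
tokens = nltk.word_tokenize(sentence)
print(tokens)
tagged = nltk.pos_tag(tokens)
print(tagged)

# Step 2: define the chunk grammar for noun phrases
grammar = "NP: {<DT>?<JJ>*<NN>}"

# Step 3: build the chunk parser and apply it to the tagged sentence
cp = nltk.RegexpParser(grammar)
result = cp.parse(tagged)

# Step 4: print the chunk tree, then draw it (opens a GUI window)
print(result)
result.draw()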

[Out] :

['the', 'little', 'yellow', 'dog', 'barked', 'at', 'the', 'cat']
[('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
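The printed tree can also be processed further in code. As a minimal sketch, the snippet below rebuilds the same chunk tree from the tagged tokens shown in the output above (typed in by hand so that no tagger is needed) and collects the text of every NP chunk. The variable names are only illustrative.

```python
import nltk

# Tagged tokens copied from the output above.
tagged = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]

cp = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = cp.parse(tagged)

# Walk over the NP subtrees and join the words of each chunk.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
```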

Figure: NLTK Chunking Example (the tree drawn by result.draw())

Conclusion

We hope you found this article helpful in understanding POS tagging and chunking in the NLTK library. We explained the basic concepts of both, along with a couple of examples using NLTK.

Reference – NLTK Documentation
