NLTK Natural Language Processing Notes
Unit-I Language Processing and Python: Computing with Language: Texts and Words, A Closer Look
at Python: Texts as Lists of Words, Computing with Language: Simple Statistics, Back to Python:
Making Decisions and Taking Control, Automatic Natural Language Understanding [Reference 1]
Accessing Text Corpora and Lexical Resources: Accessing Text Corpora, Conditional Frequency
Distributions, Lexical Resources, WordNet
UNIT I
Getting Started with NLTK
>>> import nltk
>>> nltk.download()
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>>
Any time we want to find out about these texts, we just have to enter their names at
the Python prompt:
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text2
<Text: Sense and Sensibility by Jane Austen 1811>
>>>
Explain texts and words in NLTK.
1) Searching Text
There are many ways to examine the context of a text apart from simply reading it.
A concordance view shows us every occurrence of a given word, together with some
context. Here we look up the word monstrous in Moby Dick by entering text1 followed
by a period, then the term concordance, and then placing "monstrous" in parentheses:
>>> text1.concordance("monstrous")
Building index...
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
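The idea behind a concordance can be sketched in a few lines of plain Python. This is only an illustration, not NLTK's implementation: the function name concordance_lines, the width parameter, and the toy sentence are assumptions made for the example, and matching ignores case by choice.

```python
def concordance_lines(tokens, word, width=3):
    """Return (left context, match, right context) for every occurrence of word.

    Hypothetical helper for illustration; width is the number of tokens
    of context kept on each side of the match.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Toy token list (made up for this example, not taken from an NLTK corpus)
tokens = "the monstrous whale swam past a most monstrous ship".split()

for left, tok, right in concordance_lines(tokens, "monstrous"):
    # Right-align the left context so the matched word lines up in a column
    print(f"{left:>20} {tok} {right}")
```

Aligning the matched word in one column is what makes a concordance easy to scan by eye.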
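A concordance lets us see words in context. To find other words that appear in a
similar range of contexts, we append the term similar to the name of the text, then
place the relevant word in parentheses:
>>> text1.similar("monstrous")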
Building word-context index...
subtly impalpable pitiable curious imperial perilous trustworthy
abundant untoward singular lamentable few maddens horrible loving lazy
mystifying christian exasperate puzzled
>>> text2.similar("monstrous")
Building word-context index...
very exceedingly so heartily a great good amazingly as sweet
remarkably extremely vast
>>>
Observe that we get different results for different texts. Austen uses this word quite
differently from Melville; for her, monstrous has positive connotations, and sometimes
functions as an intensifier like the word very.
The term common_contexts allows us to examine just the contexts that are shared by
two or more words, such as monstrous and very. We have to enclose these words in
square brackets as well as parentheses, and separate them with a comma:
>>> text2.common_contexts(["monstrous", "very"])
be_glad am_glad a_pretty is_pretty a_lucky
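Both similar and common_contexts rely on the notion of a word's context, shown above in the form previous-word_next-word. The sketch below illustrates that idea in plain Python under simplifying assumptions: the helper names (build_context_index, similar_words, shared_contexts) and the toy sentence are hypothetical, and words are ranked only by the number of contexts they share, which may differ from how NLTK scores them.

```python
from collections import defaultdict


def build_context_index(tokens):
    """Map each word to the set of (previous word, next word) pairs around it."""
    index = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        index[word.lower()].add((prev.lower(), nxt.lower()))
    return dict(index)


def similar_words(index, target):
    """Words sharing at least one context with target, most shared first."""
    target_ctx = index.get(target, set())
    scored = [(len(ctx & target_ctx), w) for w, ctx in index.items() if w != target]
    return [w for n, w in sorted(scored, key=lambda p: (-p[0], p[1])) if n > 0]


def shared_contexts(index, words):
    """Contexts common to all the given words, formatted as prev_next."""
    sets = [index.get(w, set()) for w in words]
    shared = set.intersection(*sets) if sets else set()
    return sorted(f"{prev}_{nxt}" for prev, nxt in shared)


# Toy token list (made up for this example)
tokens = "a monstrous big dog and a very big cat and a truly big fish".split()
index = build_context_index(tokens)

print(similar_words(index, "monstrous"))
print(shared_contexts(index, ["monstrous", "very"]))
```

In the toy sentence, monstrous, very, and truly all occur between a and big, so they count as similar and share the context a_big.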
2) Counting Vocabulary
Let us begin by finding out the length of a text from start to finish, in terms of the words
and punctuation symbols that appear. We use the function len to get the length of a text:
>>> len(text3)
44764
>>>
So Genesis (text3) has 44,764 words and punctuation symbols, or “tokens.”
A token is the technical name for a sequence of characters—such as hairy, his, or :)—
that we want to treat as a group. When we count the number of tokens in a text, say, the phrase to
be or not to be, we are counting occurrences of these sequences. Thus, in our example
phrase there are two occurrences of to, two of be, and one each of or and not. But there
are only four distinct vocabulary items in this phrase.
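The difference between counting tokens and counting distinct vocabulary items can be checked directly on the example phrase. This is a small sketch that splits the phrase on whitespace, which is an assumption that works here only because the phrase has no punctuation.

```python
from collections import Counter

phrase = "to be or not to be"
tokens = phrase.split()      # every occurrence counts as a token
vocabulary = set(tokens)     # duplicates collapse in a set
occurrences = Counter(tokens)

print(len(tokens))           # 6 tokens
print(len(vocabulary))       # 4 distinct items
print(occurrences["to"], occurrences["be"], occurrences["or"], occurrences["not"])
```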
How many distinct words does the book of Genesis contain?
The vocabulary of a text is just the set of tokens that it uses, since in a set, all duplicates are
collapsed together.
In Python we can obtain the vocabulary
items of text3 with the command: set(text3). When you do this, many screens of
words will fly past:
>>> sorted(set(text3))
['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)',
'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech',
'Abr', 'Abrah', 'Abraham', 'Abram', 'Accad', 'Achbor', 'Adah', ...]
>>> len(set(text3))
2789
>>>
By wrapping sorted() around the Python expression set(text3), we obtain a sorted
list of vocabulary items, beginning with various punctuation symbols and continuing
with words starting with A. All capitalized words precede lowercase words. We discover the size of
the vocabulary indirectly, by asking for the number of items in the set,
and again we can use len to obtain this number.
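The ordering described here follows from Python comparing strings by character code: the punctuation marks in this example sort before uppercase letters, which in turn sort before lowercase letters. A minimal sketch on a made-up token list (not taken from text3):

```python
# Toy token list (hypothetical); "the" appears twice on purpose
tokens = ["the", "Abel", ",", "the", "And", ".", "abel"]

vocab = sorted(set(tokens))
print(vocab)       # [',', '.', 'Abel', 'And', 'abel', 'the']
print(len(vocab))  # 6
```

Note that Abel and abel count as two different items, because the set compares strings exactly.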
Although it has 44,764 tokens, this book has only 2,789 distinct words, or “word types.”
A word type is the form or spelling of the word independently of its specific occurrences
in a text—that is, the word considered as a unique item of vocabulary.
Our count of 2,789 items will include punctuation symbols, so we will generally
call these unique items types instead of word types.
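Each word is used about 16 times on average. In Python 2 we need to make sure Python
uses floating-point division; in Python 3 the / operator already does, so the import below is harmless but unnecessary: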
>>> from __future__ import division
>>> len(text3) / len(set(text3))
16.050197203298673
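The arithmetic can be reproduced from the two counts reported above (44,764 tokens and 2,789 types). The sketch below assumes Python 3, where / is true division and // is floor division; the variable names are chosen for the example.

```python
num_tokens = 44764   # len(text3), as reported above
num_types = 2789     # len(set(text3)), as reported above

average_uses = num_tokens / num_types    # true division keeps the fraction
whole_part = num_tokens // num_types     # floor division discards it

print(average_uses)
print(whole_part)
```

Floor division is what Python 2 applies to two integers by default, which is why the original session imports division from __future__.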
We can count how often a word occurs in a text,
and compute what percentage of the text is taken up by a specific word:
>>> text3.count("smote")
5
>>> 100 * text4.count('a') / len(text4)
1.4643016433938312
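The same count and percentage calculations work on any Python list of tokens, since a list also provides count and supports len. A minimal sketch on a made-up sentence (the sentence and variable names are assumptions for the example):

```python
# Toy token list (hypothetical), split on whitespace
tokens = "a cat and a dog and a bird".split()

count_a = tokens.count("a")              # how often "a" occurs
share_a = 100 * count_a / len(tokens)    # percentage of all tokens

print(count_a)   # 3
print(share_a)   # 37.5
```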
You may want to repeat such calculations on several texts, but it is tedious to keep
retyping the formula. Instead, you can come up with your own name for a task, like
“lexical_diversity” or “percentage”, and associate it with a block of code. Now you
only have to type a short name instead of one or more complete lines of Python code,
and you can reuse it as often as you like.
The block of code that does a task for us is called a function, and we define a