Class notes

BSc Data Science III year V semester NLP class Notes

Pages
24
Uploaded on
19-09-2024
Written in
2024/2025

Language Processing and Python: Computing with Language: Texts and Words, A Closer Look at Python: Texts as Lists of Words, Computing with Language: Simple Statistics, Back to Python: Making Decisions and Taking Control, Automatic Natural Language Understanding [Reference 1] Accessing Text Corpora and Lexical Resources: Accessing Text Corpora, Conditional Frequency Distributions, Lexical Resources, WordNet


Dr. B.R. Ambedkar College


NLTK Natural Language Processing Notes
Unit-I Language Processing and Python: Computing with Language: Texts and Words, A Closer Look
at Python: Texts as Lists of Words, Computing with Language: Simple Statistics, Back to Python:
Making Decisions and Taking Control, Automatic Natural Language Understanding [Reference 1]
Accessing Text Corpora and Lexical Resources: Accessing Text Corpora, Conditional Frequency
Distributions, Lexical Resources, WordNet

UNIT I
Getting Started with NLTK

>>> import nltk
>>> nltk.download()


>>> from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>>
Any time we want to find out about these texts, we just have to enter their names at
the Python prompt:
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text2
<Text: Sense and Sensibility by Jane Austen 1811>
>>>


Explain Texts and Words in NLTK

1) Searching Text

There are many ways to examine the context of a text apart from simply reading it.
A concordance view shows us every occurrence of a given word, together with some
context. Here we look up the word monstrous in Moby Dick by entering text1 followed
by a period, then the term concordance, and then placing "monstrous" in parentheses:

>>> text1.concordance("monstrous")

Building index...

Displaying 11 of 11 matches:

ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
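
The idea behind a concordance can be sketched in plain Python without NLTK. This is a simplified illustration, not NLTK's implementation: the function name `concordance_lines`, the `width` parameter, the bracket formatting, and the sample sentence are all assumptions made for this example, and matching is assumed to be case-insensitive.

```python
def concordance_lines(tokens, word, width=3):
    """Return each occurrence of `word` with `width` tokens of context on each side."""
    lines = []
    target = word.lower()
    for i, token in enumerate(tokens):
        if token.lower() == target:
            # Slice the neighbours on either side, clamping at the start of the list.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Invented sample sentence, already split into tokens.
tokens = "the whale was of a most monstrous size and the monstrous bulk alarmed us".split()
for line in concordance_lines(tokens, "monstrous"):
    print(line)
# of a most [monstrous] size and the
# size and the [monstrous] bulk alarmed us
```

Each output line centres on one occurrence of the search word, which is the same layout the concordance view above uses.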


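A concordance lets us see words in context. To find out what other words appear in a similar
range of contexts, we append the term similar to the name of the text and place the word
in parentheses:

>>> text1.similar("monstrous")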

Building word-context index...
subtly impalpable pitiable curious imperial perilous trustworthy
abundant untoward singular lamentable few maddens horrible loving lazy
mystifying christian exasperate puzzled

>>> text2.similar("monstrous")
Building word-context index...
very exceedingly so heartily a great good amazingly as sweet
remarkably extremely vast
>>>

Observe that we get different results for different texts. Austen uses this word quite
differently from Melville; for her, monstrous has positive connotations, and sometimes
functions as an intensifier like the word very.
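
One way to picture what "similar" means here is to treat the words immediately before and after a word as its context, and to rank other words by how many contexts they share with it. The sketch below is a rough approximation under that assumption, not NLTK's actual scoring; `similar_words` and the sample sentence are hypothetical.

```python
from collections import defaultdict

def similar_words(tokens, word):
    """Rank other words by how many (previous, next) contexts they share with `word`."""
    contexts = defaultdict(set)  # word -> set of (previous word, next word) pairs
    for prev, current, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[current].add((prev, nxt))
    target = contexts[word]
    # Keep only words that share at least one context with the target word.
    scores = {
        other: len(ctx & target)
        for other, ctx in contexts.items()
        if other != word and ctx & target
    }
    # Most shared contexts first; ties broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))

# Invented sentence in which "very" and "monstrous" fill the same slots.
tokens = ("she is very glad and he is monstrous glad to see "
          "a very pretty girl and a monstrous pretty house").split()
print(similar_words(tokens, "monstrous"))  # ['very']
```

In this toy text "monstrous" and "very" both occur between "is" and "glad" and between "a" and "pretty", so each is reported as similar to the other.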


The term common_contexts allows us to examine just the contexts that are shared by
two or more words, such as monstrous and very. We have to enclose these words in
square brackets (making a list) inside the parentheses, and separate them with a comma:

>>> text2.common_contexts(["monstrous", "very"])
be_glad am_glad a_pretty is_pretty a_lucky
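
Each item in the output above has the form left_right: the word before and the word after the slot that both words can fill. A minimal sketch of that idea follows, assuming a context is just the immediately neighbouring pair of words; `contexts_of`, `shared_contexts`, and the sample sentence are hypothetical and do not reproduce how NLTK orders or limits its results.

```python
def contexts_of(tokens, word):
    """Collect the 'previous_next' context of every occurrence of `word`."""
    return {
        f"{tokens[i - 1]}_{tokens[i + 1]}"
        for i in range(1, len(tokens) - 1)
        if tokens[i] == word
    }

def shared_contexts(tokens, words):
    """Return the contexts in which every word in `words` occurs, sorted alphabetically."""
    shared = set.intersection(*(contexts_of(tokens, w) for w in words))
    return sorted(shared)

# Invented sentence: both words occur in "is _ pretty", but only one in "am _ glad".
tokens = ("i am very glad to be monstrous glad and she is monstrous pretty "
          "yet he is very pretty").split()
print(sorted(contexts_of(tokens, "very")))             # ['am_glad', 'is_pretty']
print(sorted(contexts_of(tokens, "monstrous")))        # ['be_glad', 'is_pretty']
print(shared_contexts(tokens, ["monstrous", "very"]))  # ['is_pretty']
```

Only the context that appears for both words survives the intersection.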




2) Counting Vocabulary
Let us begin by finding out the length of a text from start to finish, in terms of the words
and punctuation symbols that appear. We use the function len to get the length of a text:

>>> len(text3)
44764

>>>
So Genesis (text3) has 44,764 words and punctuation symbols, or “tokens.”
A token is the technical name for a sequence of characters—such as hairy, his, or :)—
that we want to treat as a group. When we count the number of tokens in a text, say, the phrase to
be or not to be, we are counting occurrences of these sequences. Thus, in our example
phrase there are two occurrences of to, two of be, and one each of or and not. But there
are only four distinct vocabulary items in this phrase.
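
The counts for this example phrase can be checked directly. The snippet below is a small illustration; splitting on whitespace and using `collections.Counter` are choices made for this sketch rather than something the notes prescribe.

```python
from collections import Counter

# Split the example phrase into tokens on whitespace.
tokens = "to be or not to be".split()
counts = Counter(tokens)  # maps each distinct token to its number of occurrences

print(len(tokens))   # 6 tokens in total
print(len(counts))   # 4 distinct vocabulary items
print(counts["to"], counts["be"], counts["or"], counts["not"])  # 2 2 1 1
```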

How many distinct words does the book of Genesis contain?

The vocabulary of a text is just the set of tokens that it uses, since in a set, all duplicates are
collapsed together.

In Python we can obtain the vocabulary items of text3 with the command: set(text3). When you do this, many screens of
words will fly past:

>>> sorted(set(text3))

['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)',
'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech',
'Abr', 'Abrah', 'Abraham', 'Abram', 'Accad', 'Achbor', 'Adah', ...]
>>> len(set(text3))
2789
>>>
By wrapping sorted() around the Python expression set(text3), we obtain a sorted
list of vocabulary items, beginning with various punctuation symbols and continuing
with words starting with A. All capitalized words precede lowercase words. We discover the size of
the vocabulary indirectly, by asking for the number of items in the set,
and again we can use len to obtain this number.
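
The ordering described here can be reproduced on a short token list. Python compares strings by character code, so a full stop sorts before uppercase letters, which in turn sort before lowercase letters. The token list below is an assumption made for illustration, not the tokenization NLTK stores for text3.

```python
# A short, hand-tokenized sample (illustrative only).
tokens = ["In", "the", "beginning", "God", "created", "the",
          "heaven", "and", "the", "earth", "."]

vocabulary = sorted(set(tokens))  # remove duplicates, then sort
print(vocabulary)
# ['.', 'God', 'In', 'and', 'beginning', 'created', 'earth', 'heaven', 'the']
print(len(tokens))       # 11 tokens
print(len(vocabulary))   # 9 distinct items
```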

Although it has 44,764 tokens, this book has only 2,789 distinct words, or “word types.”
A word type is the form or spelling of the word independently of its specific occurrences
in a text—that is, the word considered as a unique item of vocabulary.
Our count of 2,789 items will include punctuation symbols, so we will generally
call these unique items types instead of word types.

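Each type is used about 16 times on average (in Python 2 we need to make sure Python
uses floating-point division; in Python 3 the / operator already does so, and the
import below is unnecessary but harmless):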

>>> from __future__ import division
>>> len(text3) / len(set(text3))
16.050197203298673
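
The difference between the two kinds of division can be seen with the counts reported above. This sketch assumes Python 3, where `/` is true division and `//` is floor division; the variable names are chosen for this example.

```python
token_count = 44764   # len(text3), as reported above
type_count = 2789     # len(set(text3)), as reported above

print(token_count / type_count)    # true division: roughly 16.05
print(token_count // type_count)   # floor division: 16
```

In Python 2, `/` between two integers behaves like `//` unless the `__future__` import is used, which is why the notes include it.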

We can count how often a word occurs in a text,
and compute what percentage of the text is taken up by a specific word:

>>> text3.count("smote")
5
>>> 100 * text4.count('a') / len(text4)
1.4643016433938312


You may want to repeat such calculations on several texts, but it is tedious to keep
retyping the formula. Instead, you can come up with your own name for a task, like
“lexical_diversity” or “percentage”, and associate it with a block of code. Now you
only have to type a short name instead of one or more complete lines of Python code,
and you can reuse it as often as you like.
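
As a sketch of what naming such a task might look like, the two calculations shown earlier can be wrapped as follows. The names lexical_diversity and percentage come from the notes; the function bodies simply restate the formulas used above and are an assumption about how they would be written, demonstrated on a plain word list rather than an NLTK text.

```python
def lexical_diversity(tokens):
    """Average number of times each distinct token is used."""
    return len(tokens) / len(set(tokens))

def percentage(count, total):
    """Express `count` as a percentage of `total`."""
    return 100 * count / total

tokens = "to be or not to be".split()
print(lexical_diversity(tokens))                    # 1.5 (6 tokens / 4 types)
print(percentage(tokens.count("or"), len(tokens)))  # about 16.67
```

Once defined, the same short names can be applied to any list of tokens, including text3 or text4 after `from nltk.book import *`.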

The block of code that does a task for us is called a function, and we define a

Professor(s)
Lalith mohan
Contains
Introduction to nlp
