Class notes

BSc Data Science III year V semester NLP class Notes

Uploaded on

19-09-2024

Written in

2024/2025

Language Processing and Python: Computing with Language: Texts and Words, A Closer Look at Python: Texts as Lists of Words, Computing with Language: Simple Statistics, Back to Python: Making Decisions and Taking Control, Automatic Natural Language Understanding [Reference 1] Accessing Text Corpora and Lexical Resources: Accessing Text Corpora, Conditional Frequency Distributions, Lexical Resources, WordNet

Dr. B.R. Ambedkar College

NLTK Natural Language Processing Notes
Unit-I Language Processing and Python: Computing with Language: Texts and Words, A Closer Look
at Python: Texts as Lists of Words, Computing with Language: Simple Statistics, Back to Python:
Making Decisions and Taking Control, Automatic Natural Language Understanding [Reference 1]
Accessing Text Corpora and Lexical Resources: Accessing Text Corpora, Conditional Frequency
Distributions, Lexical Resources, WordNet

UNIT I
Getting Started with NLTK

>>> import nltk
>>> nltk.download()

>>> from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>>
Any time we want to find out about these texts, we just have to enter their names at
the Python prompt:
>>> text1
<Text: Moby Dick by Herman Melville 1851>
>>> text2
<Text: Sense and Sensibility by Jane Austen 1811>
>>>

Explain texts and words in NLTK.

1) Searching Text

There are many ways to examine the context of a text apart from simply reading it.
A concordance view shows us every occurrence of a given word, together with some
context. Here we look up the word monstrous in Moby Dick by entering text1 followed
by a period, then the term concordance, and then placing "monstrous" in parentheses:

>>> text1.concordance("monstrous")

Building index...

Displaying 11 of 11 matches:

ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
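
The idea behind a concordance view can be sketched in plain Python, without NLTK. This is only a minimal illustration, not NLTK's own implementation: the function name concordance_lines, the width parameter, and the toy sentence are all assumptions made for this example.

```python
def concordance_lines(tokens, word, width=3):
    """Return (left context, match, right context) for each occurrence of `word`.

    `width` is the number of tokens kept on each side (a hypothetical parameter).
    Matching ignores case, which is an assumption of this sketch.
    """
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == word.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Hypothetical toy sentence, not taken from any NLTK corpus.
toy_tokens = "the whale was a monstrous size and a monstrous fable".split()

for left, hit, right in concordance_lines(toy_tokens, "monstrous"):
    # Right-align the left context so the matched word lines up in a column.
    print(f"{left:>20} {hit} {right}")
```

Aligning the matched word in a column, as the output above does, is what makes the surrounding contexts easy to compare by eye.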

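A concordance shows a word in context. To find other words that appear in a similar range of contexts, we append the term similar to the name of the text and place the word in parentheses:

>>> text1.similar("monstrous")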

Building word-context index...
subtly impalpable pitiable curious imperial perilous trustworthy
abundant untoward singular lamentable few maddens horrible loving lazy
mystifying christian exasperate puzzled

>>> text2.similar("monstrous")
Building word-context index...
very exceedingly so heartily a great good amazingly as sweet
remarkably extremely vast
>>>

Observe that we get different results for different texts. Austen uses this word quite
differently from Melville; for her, monstrous has positive connotations, and sometimes
functions as an intensifier like the word very.

The term common_contexts allows us to examine just the contexts that are shared by
two or more words, such as monstrous and very. We have to enclose these words in
square brackets as well as parentheses, and separate them with a comma:

>>> text2.common_contexts(["monstrous", "very"])
be_glad am_glad a_pretty is_pretty a_lucky
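
Each item in this output, such as am_glad, names the word before and the word after the target. A rough sketch of the idea in plain Python follows. It is not how NLTK computes the result; the helper name contexts and the toy sentence are hypothetical.

```python
def contexts(tokens, word):
    """Collect the (previous, next) token pairs around each occurrence of `word`."""
    found = set()
    # Skip the first and last positions, which lack a neighbour on one side.
    for i in range(1, len(tokens) - 1):
        if tokens[i] == word:
            found.add((tokens[i - 1], tokens[i + 1]))
    return found


# Hypothetical toy sentence, not taken from any NLTK corpus.
toy_tokens = "I am monstrous glad and she is very pretty but I am very glad".split()

# Contexts shared by both words are the intersection of the two sets.
shared = contexts(toy_tokens, "monstrous") & contexts(toy_tokens, "very")
print(sorted(f"{left}_{right}" for left, right in shared))  # ['am_glad']
```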

2) Counting Vocabulary
We begin by finding out the length of a text from start to finish, in terms of the words
and punctuation symbols that appear. We use the function len to get the length of a text.

>>> len(text3)
44764

>>>
So Genesis (text3) has 44,764 words and punctuation symbols, or “tokens.”
A token is the technical name for a sequence of characters (such as hairy, his, or :))
that we want to treat as a group. When we count the number of tokens in a text, say, the phrase to
be or not to be, we are counting occurrences of these sequences. Thus, in our example
phrase there are two occurrences of to, two of be, and one each of or and not. But there
are only four distinct vocabulary items in this phrase.
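
The counts for the example phrase can be checked directly in Python. This small sketch assumes the phrase is split on spaces; the variable name phrase_tokens is chosen only for illustration.

```python
# Split the example phrase into tokens on whitespace.
phrase_tokens = "to be or not to be".split()

print(len(phrase_tokens))          # 6 tokens in total
print(phrase_tokens.count("to"))   # 2 occurrences of "to"
print(phrase_tokens.count("be"))   # 2 occurrences of "be"
print(sorted(set(phrase_tokens)))  # ['be', 'not', 'or', 'to']
print(len(set(phrase_tokens)))     # 4 distinct vocabulary items
```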

How many distinct words does the book of Genesis contain?

The vocabulary of a text is just the set of tokens that it uses, since in a set, all duplicates are
collapsed together.

In Python we can obtain the vocabulary items of text3 with the command: set(text3).
When you do this, many screens of words will fly past:

>>> sorted(set(text3))

['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)',
'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech',
'Abr', 'Abrah', 'Abraham', 'Abram', 'Accad', 'Achbor', 'Adah', ...]
>>> len(set(text3))
2789
>>>
By wrapping sorted() around the Python expression set(text3), we obtain a sorted
list of vocabulary items, beginning with various punctuation symbols and continuing
with words starting with A. All capitalized words precede lowercase words. We discover the size of
the vocabulary indirectly, by asking for the number of items in the set,
and again we can use len to obtain this number.
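
The ordering described here follows from how Python compares strings, character by character, by code point. A small sketch on a made-up token list (toy_tokens and vocab are hypothetical names, not part of NLTK) shows punctuation first, then capitalized words, then lowercase words:

```python
# Hypothetical token list with punctuation, capitalized and lowercase words.
toy_tokens = ["the", "Abel", ",", "and", "Abram", ".", "the", "And"]

# set() collapses the duplicate "the"; sorted() orders by character code point.
vocab = sorted(set(toy_tokens))

print(vocab)       # [',', '.', 'Abel', 'Abram', 'And', 'and', 'the']
print(len(vocab))  # 7
```

Note that And and and count as two separate items, because the comparison is case-sensitive.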

Although it has 44,764 tokens, this book has only 2,789 distinct words, or “word types.”
A word type is the form or spelling of the word independently of its specific occurrences
in a text; that is, the word considered as a unique item of vocabulary.
Our count of 2,789 items will include punctuation symbols, so we will generally
call these unique items types instead of word types.

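On average, each type is used about 16 times. In Python 3 the / operator already
performs floating-point division; the from __future__ import division line below is
needed only on Python 2, where / between two integers would truncate the result:

>>> from __future__ import division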
>>> len(text3) / len(set(text3))
16.050197203298673

We can count how often a word occurs in a text,
and compute what percentage of the text is taken up by a specific word:

>>> text3.count("smote")
5
>>> 100 * text4.count('a') / len(text4)
1.4643016433938312

You may want to repeat such calculations on several texts, but it is tedious to keep
retyping the formula. Instead, you can come up with your own name for a task, like
“lexical_diversity” or “percentage”, and associate it with a block of code. Now you
only have to type a short name instead of one or more complete lines of Python code,
and you can reuse it as often as you like.
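
A minimal sketch of what such named tasks might look like is given below. The names lexical_diversity and percentage come from the paragraph above, but the function bodies and parameter names are assumptions that simply mirror the two formulas already used in these notes.

```python
def lexical_diversity(tokens):
    """Average number of times each distinct token is used (tokens per type)."""
    return len(tokens) / len(set(tokens))


def percentage(count, total):
    """Express `count` as a percentage of `total`."""
    return 100 * count / total


# Hypothetical toy text; with NLTK loaded, text3 or text4 could be passed instead.
toy_tokens = "to be or not to be".split()

print(lexical_diversity(toy_tokens))  # 1.5 (6 tokens / 4 types)
print(percentage(toy_tokens.count("to"), len(toy_tokens)))
```

With the NLTK texts loaded, the same functions could be called as lexical_diversity(text3) or percentage(text4.count('a'), len(text4)).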

The block of code that does a task for us is called a function, and we define a

Written for

Institution: Osmania University
Course: BSC

Document information

Uploaded on: September 19, 2024
Number of pages: 24
Written in: 2024/2025
Type: Class notes
Professor(s): Lalith mohan
Contains: Introduction to nlp
