Natural Language Processing (CS4990) Questions and Answers 100% Solved

Pages: 29
Grade: A+
Uploaded on: 31-10-2024
Written in: 2024/2025

Natural Language Processing (CS4990)

Why Python? - answer- Shallow learning curve
- Good string handling
- Combines object-oriented (OO), aspect-oriented and functional programming (FP) paradigms
- Extensive libraries (a rich standard library plus third-party packages such as NLTK)
- Great support for Deep Learning

Human language - answer The ultimate interface for interaction and communication.
But it is hard to understand, because it's:
- highly ambiguous at all levels
- complex, with subtle use of context to convey meaning
- fuzzy and probabilistic

Understanding a language requires domain knowledge, discourse knowledge, world
knowledge and linguistic knowledge.

Word-level ambiguity - answer- Spelling (e.g. colour vs color)
- Pronunciation
• 1 word can have multiple pronunciations (e.g. abstract, desert)
• Multiple words can share the same pronunciation (e.g. flower/flour)
- Meaning (1 word can have multiple meanings, i.e. homonyms; e.g. date, crane,
leaves)

Natural Language Processing (NLP) - answer A subfield of linguistics, computer science,
information engineering and AI concerned with the interactions between computers and
human (natural) languages, in particular how to program computers to process and
analyse large amounts of natural language data.

NLP tasks & applications - answer- Writing assistance (spell/grammar/style checking,
auto-completion).
- Text classification (spam detection, sentiment analysis, fake news/propaganda
detection, news topic classification, customer review category classification).
- Information retrieval (search engines).
- NL understanding (argumentation mining, question answering, NL inference,
humorous/ironic/metaphoric language analysis).
- NL generation (document summarisation, machine translation, sentence
paraphrasing/simplification, dialogue/exercise generation).

NLP limits & outlook - answer- Language problems are hard: for most of them there is
still no fully accurate solution (as is also the case in Physics, History and Psychology).

Data types (based on structure) - answer- Structured data
- Semi-structured data
- Unstructured data

Corpus (=body) - answer A large body of text.
It usually contains raw text and any metadata associated with the text (e.g. timestamp,
source, index, ...).
It's also known as a dataset.

Text cleaning & normalisation - answer Remove useless information (e.g. email
headers) and extract useful information (e.g. words, word sequences, verbs, nouns,
adjectives, names, locations, organisations, ...).

1. Tokenization (sentence, words)
2. Stemming / Lemmatization
3. Stop-words removal
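
The three steps above can be sketched in plain Python. This is only a minimal illustration, not NLTK's implementation: the stop-word list `STOP_WORDS`, the suffixes used in `toy_stem` and all helper names are assumptions made for the example.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "am", "an", "and", "are", "is", "the", "to"}


def tokenize(text):
    """Step 1: lowercase the text and extract word tokens (punctuation is dropped)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def toy_stem(word):
    """Step 2: crude suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(text):
    """Run tokenization, stemming and stop-word removal in sequence."""
    stems = [toy_stem(token) for token in tokenize(text)]
    # Step 3: drop stop-words.
    return [stem for stem in stems if stem not in STOP_WORDS]


print(clean("The students are cleaning the tweets and emails."))
# ['student', 'clean', 'tweet', 'email']
```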

Tokenization - answer Process of splitting text into its constituents, i.e. tokens,
which are meaningful segments (in English, generally done by separating on white-space
or punctuation characters).

Type - answer Element in the vocabulary. Also known as the form or spelling of the
token (including words and punctuation) independently of its specific occurrences in a
text.

Token - answer Instance of a type in a text, i.e. a sequence of characters that is
treated as a single group (words and punctuation).

E.g. "To be or not to be" has 6 tokens but only 4 types (ignoring case):
- 2x to, be
- 1x or, not
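
The type/token distinction in this example can be checked with a few lines of Python. This is a small sketch that assumes lowercasing (so that "To" and "to" count as the same type) and plain white-space splitting.

```python
from collections import Counter

text = "To be or not to be"

# Lowercase first so that "To" and "to" are counted as one type.
tokens = text.lower().split()
type_counts = Counter(tokens)

print(len(tokens))        # number of tokens: 6
print(len(type_counts))   # number of types: 4
print(type_counts["to"], type_counts["be"], type_counts["or"], type_counts["not"])
# 2 2 1 1
```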

Simple tokenization - answer Split on white-space (for English texts).

Pros: simple and natively supported by Python.
Cons: it does not separate punctuation from words and cannot handle hyphenated words
(e.g. "state-of-the-art") in any special way.
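
A short sketch of white-space tokenization with Python's built-in `str.split`, next to one possible regular-expression alternative. The sample sentence and the regex `pattern` are assumptions chosen for illustration, not a recommended tokenizer.

```python
import re

text = "NLP is state-of-the-art, isn't it?"

# Plain white-space split: punctuation stays glued to the neighbouring word.
whitespace_tokens = text.split()
print(whitespace_tokens)
# ['NLP', 'is', 'state-of-the-art,', "isn't", 'it?']

# Regex alternative: a word (optionally with internal hyphens or apostrophes)
# or any single punctuation character.
pattern = r"\w+(?:[-']\w+)*|[^\w\s]"
regex_tokens = re.findall(pattern, text)
print(regex_tokens)
# ['NLP', 'is', 'state-of-the-art', ',', "isn't", 'it', '?']
```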

Natural Language Toolkit (NLTK) - answer A free and open-source (FOSS) Python library
for building programs that work with natural language.
It can perform different operations such as tokenization, stemming, classification,
parsing, tagging and semantic reasoning.

Word tokenizer (from NLTK) - answer NLTK's standard tokenizer.
Pros: successfully tokenizes punctuation.
Cons: it splits hashtags into separate tokens (e.g. #70thRepublic_Day into "#" and
"70thRepublic_Day") and fails to identify widely used symbol combinations (e.g. ":)" is
split into 2 symbols).

Tweet tokenizer (from NLTK) - answer Pros: correctly handles hashtags and mentions
(`@someone`)
Cons: it fails at abbreviations (e.g. U.K.)
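
To illustrate what "handling hashtags, mentions and emoticons" means, here is a regular-expression sketch written with the standard `re` module. It is not NLTK's tweet tokenizer: `TWEET_PATTERN`, `tweet_tokenize` and the small emoticon set are assumptions made for this example.

```python
import re

TWEET_PATTERN = re.compile(
    r"""
    [@#]\w+                 # mentions and hashtags kept whole
    | [:;=][-']?[()DPp]     # a few common emoticons such as :) ;-) :D
    | \w+(?:[-']\w+)*       # ordinary words
    | [^\w\s]               # any other single symbol
    """,
    re.VERBOSE,
)


def tweet_tokenize(text):
    """Return tokens, trying the alternatives of TWEET_PATTERN in order."""
    return TWEET_PATTERN.findall(text)


print(tweet_tokenize("@someone loved #70thRepublic_Day :) so much!"))
# ['@someone', 'loved', '#70thRepublic_Day', ':)', 'so', 'much', '!']
```

The order of the alternatives matters: the hashtag and emoticon patterns come first so that they win over the single-symbol fallback.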

Sentence tokenization - answer For long documents, we may not be interested in words
but instead in the sentences therein:
- Check whether a sentence's sentiment is positive or negative.
- Check whether a sentence contains propaganda content.
- Check the grammatical correctness of a sentence.
- ...
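
A naive sentence splitter can be sketched with one regular expression. The function name `split_sentences` and the splitting rule (end punctuation, then white-space, then an uppercase letter) are assumptions for illustration; real sentence tokenizers use more robust methods.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by white-space and an uppercase letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())


doc = "I loved the film. The ending was great! Would you watch it again?"
for sentence in split_sentences(doc):
    print(sentence)
```

This rule is easy to fool: an abbreviation followed by a capitalised word (e.g. "Dr. Smith") is wrongly split in two.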

Stemming - answer Process of reducing the inflection in words to their root forms, so
that a group of related words maps to the same stem even if the stem itself isn't a
valid word in the language.
NLTK includes 2 widely used ones: the Porter Stemmer and the Lancaster Stemmer (younger
and more aggressive); they both regard an input text as a single word.

Pros: quick to run (because it's based on simple rules) and suitable for processing a
large amount of text
Cons: the resulting words may not carry any meaning (or be actual words)
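
The rule-based idea behind stemming can be sketched as ordered suffix rules applied to one word at a time. The `RULES` table below is an illustrative assumption and is much simpler than the Porter or Lancaster rule sets.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
# Illustrative only: these are NOT the Porter or Lancaster rules.
RULES = [("ies", "i"), ("sses", "ss"), ("ing", ""), ("ed", ""), ("ly", ""), ("s", "")]


def stem(word):
    """Strip one suffix from a single word using the first matching rule."""
    word = word.lower()
    for suffix, replacement in RULES:
        # Keep at least two characters in front of the suffix.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)] + replacement
    return word


for w in ["studies", "classes", "jumped", "cats"]:
    print(w, "->", stem(w))
# studies -> studi
# classes -> class
# jumped -> jump
# cats -> cat
```

Note that "studi" is not an English word, which is exactly the drawback listed under Cons.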

Inflection (in grammar) - answer Modification of a word to express different grammatical
categories such as tense, case, voice, aspect, person, number, gender and mood.
It expresses one or more grammatical categories with a prefix, suffix, infix or another
internal modification such as a vowel change.
Examples of affixes removed when reducing words: the suffixes -ed, -ize and -s, and the
prefixes de- and mid-.

Lemmatization - answer Process of reducing the inflection in words to their dictionary
root forms (called lemmas), ensuring that the root word belongs to the language.

Pros: the derived root word is meaningful
Cons: it's more expensive (hence slower) to run
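
Unlike rule-based stemming, lemmatization relies on looking words up in a vocabulary. The sketch below uses a tiny hand-made table, `LEMMAS`, as a stand-in for a real lexical resource; the table and the `lemmatize` helper are assumptions for illustration.

```python
# Hand-made lookup table standing in for a real dictionary (an assumption).
LEMMAS = {
    "am": "be", "are": "be", "is": "be", "was": "be", "were": "be",
    "better": "good", "best": "good",
    "mice": "mouse", "geese": "goose", "studies": "study",
}


def lemmatize(word):
    """Return the dictionary form if the word is known, else the word unchanged."""
    word = word.lower()
    return LEMMAS.get(word, word)


print([lemmatize(w) for w in ["Mice", "were", "better", "table"]])
# ['mouse', 'be', 'good', 'table']
```

Every result is a valid word, but each lookup needs a vocabulary that covers the input.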

Lemma - answer Canonical/dictionary/citation form of a set of words.

Stop-words - answer Small, very common words in a NL query (such as "am", "the", "to"
and "are") that help to construct sentences but contribute little to the meaning of
the sentence.

Regular Expression (RegEx or RegExp) - answer A sequence of characters that defines a
search pattern.
Usually, such patterns are used by string searching algorithms for "find" and "find and
replace" operations on strings or for input validation.
A key characteristic is its compact notation (especially the Kleene closures written
with + and *).
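
The uses mentioned above (find, find and replace, input validation) can be tried with Python's standard `re` module. The sample strings and variable names are assumptions chosen for illustration.

```python
import re

# '+' means one or more, '*' means zero or more of the preceding item.
laughs = re.findall(r"ha+", "ha haa haaa h")
spellings = re.findall(r"colou*r", "color colour colouur")
print(laughs)      # ['ha', 'haa', 'haaa']
print(spellings)   # ['color', 'colour', 'colouur']

# Find and replace: collapse runs of white-space into a single space.
squeezed = re.sub(r"\s+", " ", "too    many   spaces")
print(squeezed)    # too many spaces

# Input validation: the whole string must consist of one or more digits.
is_number = bool(re.fullmatch(r"[0-9]+", "2024"))
print(is_number)   # True
```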
