SCSB1231 – DATA AND INFORMATION SCIENCE
UNIT 4 - HANDLING TEXT DATA
Bag-of-words - Regular Expressions - Sentence Splitting and Tokenization - Punctuations
and Stop words, Incorrect spellings - Properties of words and Word cloud -
Lemmatization and Term-Document TxD computation - Sentiment Analysis (Case
Study)
BAG OF WORDS
Bag of words is a Natural Language Processing technique used for text modeling; that is,
a method of extracting features from the text of documents. It involves two things: first,
a vocabulary of known words, and second, a measure of the presence of those known words.
The process of converting text into numbers is called vectorization in machine learning.
There are several ways of converting text into vectors, including:
Counting the number of times each word appears in a document, and calculating the
frequency with which each word appears in a document out of all the words in the document.
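The two scoring options just described (raw counts and relative frequencies) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `word_counts` and `word_frequencies` and the sample sentence are assumptions made for this example, and tokens are obtained by simple lowercasing and whitespace splitting.

```python
from collections import Counter

def word_counts(text):
    """Count how many times each lowercased, whitespace-separated token occurs."""
    return Counter(text.lower().split())

def word_frequencies(text):
    """Divide each word count by the total number of words in the document."""
    counts = word_counts(text)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical sample document, chosen only for illustration.
doc = "the cat sat on the mat"
counts = word_counts(doc)
freqs = word_frequencies(doc)

print(counts["the"])  # "the" occurs twice
print(freqs["the"])   # 2 occurrences out of 6 words
```

Counts grow with document length, while frequencies are normalized by it, which is one reason a frequency score may be preferred when documents differ a lot in size.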
The bag-of-words model is a way of representing text data when modelling text with machine
learning algorithms.
The bag-of-words model is simple to understand and implement and has seen great success in
problems such as language modelling and document classification.
Bag-of-words is a model for feature extraction in natural language processing.
1. The Problem with Text
2. What is a Bag-of-Words?
3. Example of the Bag-of-Words Model
4. Managing Vocabulary
5. Scoring Words
6. Limitations of Bag-of-Words
The Problem with Text
A problem with modeling text is that it is messy, and techniques like machine learning algorithms
prefer well-defined, fixed-length inputs and outputs.
Machine learning algorithms cannot work with raw text directly; the text must be converted into
numbers, specifically vectors of numbers.
In language processing, the vectors x are derived from textual data, in order to reflect various
linguistic properties of the text.
This is called feature extraction or feature encoding.
A popular and simple method of feature extraction with text data is called the bag-of-words model
of text.
What is a Bag-of-Words?
A bag-of-words model, or BoW for short, is a way of extracting features from text for use in
modeling, such as with machine learning algorithms.
The approach is very simple and flexible, and can be used in a myriad of ways for extracting
features from documents.
A bag-of-words is a representation of text that describes the occurrence of words within a
document. It involves two things:
1. A vocabulary of known words.
2. A measure of the presence of known words.
It is called a “bag” of words, because any information about the order or structure of words
in the document is discarded. The model is only concerned with whether known words occur
in the document, not where in the document.
A very common feature extraction procedure for sentences and documents is the bag-of-words
approach (BoW). In this approach, we look at the histogram of the words within the text, i.e.,
we treat each word count as a feature.
The intuition is that documents are similar if they have similar content. Further, from the
content alone we can learn something about the meaning of the document.
The bag-of-words can be as simple or complex as you like. The complexity comes both in
deciding how to design the vocabulary of known words (or tokens) and how to score the presence
of known words.
Example of the Bag-of-Words Model
Let’s make the bag-of-words model concrete with a worked example.
Step 1: Collect Data
Below is a snippet of the first few lines of text from the book “A Tale of Two Cities” by Charles
Dickens, taken from Project Gutenberg.
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
For this small example, let’s treat each line as a separate “document” and the 4 lines as our entire
corpus of documents.
Step 2: Design the Vocabulary
Now we can make a list of all of the words in our model vocabulary.
The unique words here (ignoring case and punctuation) are:
• “it”
• “was”
• “the”
• “best”
• “of”
• “times”
• “worst”
• “age”
• “wisdom”
• “foolishness”
That is a vocabulary of 10 words from a corpus containing 24 words.
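The vocabulary step can be reproduced with a short Python sketch. It is only one possible approach under stated assumptions: the helper names `tokenize` and `build_vocabulary` are introduced here for illustration, punctuation is stripped with the standard `string.punctuation` set, and words are kept in order of first appearance so that the result matches the list above.

```python
import string

corpus = [
    "It was the best of times,",
    "it was the worst of times,",
    "it was the age of wisdom,",
    "it was the age of foolishness,",
]

def tokenize(text):
    """Lowercase the text, remove ASCII punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def build_vocabulary(documents):
    """Collect unique tokens in order of first appearance across the corpus."""
    vocabulary = []
    seen = set()
    for doc in documents:
        for token in tokenize(doc):
            if token not in seen:
                seen.add(token)
                vocabulary.append(token)
    return vocabulary

vocabulary = build_vocabulary(corpus)
total_words = sum(len(tokenize(doc)) for doc in corpus)

print(vocabulary)
print(len(vocabulary), total_words)  # vocabulary size and corpus word count
```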
Step 3: Create Document Vectors
The next step is to score the words in each document.
The objective is to turn each document of free text into a vector that we can use as input or output
for a machine learning model.
Because we know the vocabulary has 10 words, we can use a fixed-length document
representation of 10, with one position in the vector to score each word.
The simplest scoring method is to mark the presence of words as a boolean value, 0 for absent, 1
for present.
Using the arbitrary ordering of words listed above in our vocabulary, we can step through the first
document (“It was the best of times”) and convert it into a binary vector.
The scoring of the document would look as follows:
• “it” = 1
• “was” = 1
• “the” = 1
• “best” = 1
• “of” = 1
• “times” = 1
• “worst” = 0
• “age” = 0
• “wisdom” = 0
• “foolishness” = 0
As a binary vector, this would look as follows:
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
The other three documents would look as follows:
"it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
"it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
"it was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
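The binary scoring of all four documents can be checked with a small Python sketch. The vocabulary is written out in the same order as in Step 2; the function name `binary_vector` is an assumption made for this illustration, and tokenization is simplified to lowercasing, dropping commas, and splitting on whitespace.

```python
VOCABULARY = ["it", "was", "the", "best", "of", "times",
              "worst", "age", "wisdom", "foolishness"]

def binary_vector(document, vocabulary=VOCABULARY):
    """Return 1 for each vocabulary word present in the document, else 0."""
    tokens = set(document.lower().replace(",", "").split())
    return [1 if word in tokens else 0 for word in vocabulary]

documents = [
    "It was the best of times,",
    "it was the worst of times,",
    "it was the age of wisdom,",
    "it was the age of foolishness,",
]

vectors = [binary_vector(doc) for doc in documents]
for doc, vec in zip(documents, vectors):
    print(doc, "=", vec)
```

Every vector has the same length as the vocabulary, regardless of how long the document is, which is what makes the representation usable as fixed-length input.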
All ordering of the words is nominally discarded and we have a consistent way of extracting
features from any document in our corpus, ready for use in modelling.
New documents that overlap with the vocabulary of known words, but may also contain words outside
of the vocabulary, can still be encoded: only the occurrences of known words are scored,
and unknown words are ignored.
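As an illustration of how unknown words are handled, the sketch below encodes a new document against the fixed 10-word vocabulary and also reports which tokens were ignored. The function name `encode_known_words` and the new sentence are assumptions chosen for this example; the sentence is not part of the four-line corpus used above.

```python
VOCABULARY = ["it", "was", "the", "best", "of", "times",
              "worst", "age", "wisdom", "foolishness"]

def encode_known_words(document, vocabulary=VOCABULARY):
    """Score known words as 0/1 and list tokens that fall outside the vocabulary."""
    tokens = document.lower().replace(",", "").split()
    present = set(tokens)
    known = set(vocabulary)
    vector = [1 if word in present else 0 for word in vocabulary]
    ignored = [token for token in tokens if token not in known]
    return vector, ignored

# Illustrative new document containing two out-of-vocabulary words.
vector, ignored = encode_known_words("it was the season of light,")

print(vector)   # only "it", "was", "the" and "of" are scored
print(ignored)  # out-of-vocabulary tokens, dropped from the encoding
```

The vector length stays at 10, so the new document can be fed to the same model, but any information carried by the ignored words is lost.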
Managing Vocabulary
As the vocabulary size increases, so does the length of the vector representation of documents.
In the previous example, the length of the document vector is equal to the number of known
words.
You can imagine that for a very large corpus, such as thousands of books, the length of the
vector might be thousands or millions of positions. Further, each document may contain very few
of the known words in the vocabulary.
This results in a vector with lots of zero scores, called a sparse vector or sparse representation.
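One common way to cope with sparse vectors is to store only the non-zero positions. The sketch below is a minimal illustration using a plain dictionary; the helper names `to_sparse` and `to_dense` and the 1000-word vocabulary size are assumptions made for the example, not part of the text above.

```python
def to_sparse(vector):
    """Keep only non-zero entries as {position: score}."""
    return {index: value for index, value in enumerate(vector) if value != 0}

def to_dense(sparse, length):
    """Rebuild the full fixed-length vector from the sparse mapping."""
    dense = [0] * length
    for index, value in sparse.items():
        dense[index] = value
    return dense

# Hypothetical document vector over a 1000-word vocabulary:
# six known words are present, the remaining positions are zero.
vocabulary_size = 1000
dense_vector = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1] + [0] * (vocabulary_size - 10)

sparse_vector = to_sparse(dense_vector)
zero_fraction = dense_vector.count(0) / vocabulary_size

print(sparse_vector)  # six stored entries instead of 1000
print(zero_fraction)  # fraction of positions that are zero
```

Libraries for numerical computing typically offer dedicated sparse matrix types that follow the same idea with more efficient storage.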