Processing
Heaps' Law: $|V| = kN^{\beta}$, N = nb of tokens, $\beta = 0.5$, $10 < k < 100$
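To get a feel for how the vocabulary size grows with the number of tokens under Heaps' Law, here is a minimal sketch. The function name `heaps_vocab_size` and the default `k = 50` are assumptions chosen for illustration (k only needs to fall in the range given above); real values are fitted per corpus.

```python
def heaps_vocab_size(n_tokens: int, k: float = 50.0, beta: float = 0.5) -> float:
    """Estimate vocabulary size |V| = k * N**beta (Heaps' Law).

    k and beta are corpus-dependent; the defaults here are illustrative only.
    """
    return k * n_tokens ** beta


if __name__ == "__main__":
    # Vocabulary grows much more slowly than the token count.
    for n in (10_000, 1_000_000):
        print(n, round(heaps_vocab_size(n)))
```

With beta = 0.5, multiplying the number of tokens by 100 only multiplies the estimated vocabulary by 10.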
Text Tokenization -> splitting a sentence into an ordered list of individual words -> tokens
· Space-based tokenization
· Subword tokenization
Byte Pair Encoding -> subword tokenization method that splits words into smaller parts by merging the most frequent pairs of characters or subwords
lowering -> low er ing
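The merge loop behind Byte Pair Encoding can be sketched as follows. This is a simplified illustration, not a production tokenizer: the function name `learn_bpe` is hypothetical, words are split into plain characters with no end-of-word marker, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words."""
    # Each word starts as a tuple of single characters, weighted by its count.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Rewrite every word, fusing occurrences of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


if __name__ == "__main__":
    merges, vocab = learn_bpe(["low", "low", "lower", "lowering"], num_merges=2)
    print(merges)
    print(dict(vocab))
```

On this toy corpus the first two merges build up the subword "low", after which "lowering" is segmented as "low" followed by single characters; more merges would be needed to reach pieces such as "er" and "ing".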
WordPiece Tokenizer -> starts with individual characters and learns which subword combinations to merge based on maximizing the likelihood of the training data
playing -> play ##ing
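Once a WordPiece vocabulary has been learned, a word is typically split by greedy longest-match-first lookup. The sketch below shows only that tokenization step, not the likelihood-based training; the function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions made up for illustration, and the `##` prefix marks a piece that continues a word.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split `word` into the longest vocabulary pieces, left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


if __name__ == "__main__":
    toy_vocab = {"play", "##ing", "##ed", "low", "##er"}  # hypothetical vocabulary
    print(wordpiece_tokenize("playing", toy_vocab))
```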
SentencePiece Tokenizer -> treats the entire sentence as a stream of characters and breaks it into subwords or tokens based on frequency
I am learning -> I_am_learning
Word Normalization
Stemming -> chopping off affixes (prefix, suffix, infix)
Lemmatization -> canonical form, dictionary form
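The difference between the two normalizations can be sketched in a few lines. Everything named here (`SUFFIXES`, `LEMMAS`, `crude_stem`, `lookup_lemma`) is a hypothetical toy: real stemmers such as Porter's use many more rules, and real lemmatizers rely on a full dictionary and part-of-speech information.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, far from complete
LEMMAS = {"was": "be", "mice": "mouse", "better": "good"}  # hypothetical dictionary


def crude_stem(word):
    """Stemming: chop off a known suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lookup_lemma(word):
    """Lemmatization: map a word to its dictionary form when it is known."""
    return LEMMAS.get(word, word)


if __name__ == "__main__":
    print(crude_stem("playing"), crude_stem("cats"))
    print(lookup_lemma("mice"), lookup_lemma("was"))
```

Stemming works by rule and may produce non-words; lemmatization returns a real dictionary entry but needs a lookup resource.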
Sentence Segmentation -> splitting text into a set of sentences using a specific sentence splitter.
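A very simple sentence splitter can be written with a regular expression. This is a sketch under a strong assumption: every ".", "!" or "?" followed by whitespace ends a sentence, which fails on abbreviations such as "Dr. Smith". The function name `split_sentences` is made up for this example.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    # The lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


if __name__ == "__main__":
    print(split_sentences("I am learning. Are you? Yes!"))
```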
Regular Expressions
Vector Space Models
· represent words & docs as vectors that capture relative meaning
· uses embeddings as the representation of word meaning
Bag of Words (BOW) Representation
· representation model that counts the nb of occurrences, or frequency, of each word in the given corpus of docs.
· complexity: how to create the vocabulary of known words and how to score the presence of these words
· vocabulary: remove words on the stop-word list
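As a minimal sketch of the bag-of-words idea, the snippet below lowercases the text, splits on whitespace, drops stop words, and counts what remains. The `STOP_WORDS` set and the function name `bag_of_words` are assumptions for illustration; real stop-word lists are much longer.

```python
from collections import Counter

STOP_WORDS = {"the", "is", "in", "a"}  # hypothetical, tiny stop-word list


def bag_of_words(text, stop_words=STOP_WORDS):
    """Count occurrences of each non-stop word; word order is discarded."""
    tokens = text.lower().split()
    return Counter(t for t in tokens if t not in stop_words)


if __name__ == "__main__":
    print(bag_of_words("The cat is in the room"))
    print(bag_of_words("cat chases cat"))
```

The score used here is the raw count; a binary presence flag or a weighted frequency are other common choices.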
One-hot encoding
cleaned text -> tokens -> one-hot encoding
"the": [1, 0, 0, 0, 0]
"cat": [0, 1, 0, 0, 0]
"is": [0, 0, 1, 0, 0]
"in": [0, 0, 0, 1, 0]
"noon": [0, 0, 0, 0, 1]
· Each word in the vocabulary is represented by a one-hot vector of size ("V" x 1), where "V" is the total number of words in the vocabulary
· Each one-hot vector -> unique
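The mapping from tokens to one-hot vectors can be sketched as below. The function name `one_hot_vectors` is hypothetical, and the vocabulary order is simply the order in which words first appear, which is one choice among several.

```python
def one_hot_vectors(tokens):
    """Return {word: one-hot list of length V} for the unique words in tokens."""
    vocab = list(dict.fromkeys(tokens))  # unique words, first-seen order
    index = {word: i for i, word in enumerate(vocab)}
    size = len(vocab)  # V, the vocabulary size
    return {
        word: [1 if i == index[word] else 0 for i in range(size)]
        for word in vocab
    }


if __name__ == "__main__":
    vectors = one_hot_vectors(["the", "cat", "is", "in", "the", "noon"])
    for word, vec in vectors.items():
        print(word, vec)
```

Every vector has exactly one 1, and no two words share the same position, which is why each vector is unique.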
Term-Document Matrix
· measurement of how frequently a term (word) occurs within the document
· counting nb of times a word appears
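A term-document matrix can be built directly from these counts: one row per term, one column per document. The sketch below assumes whitespace tokenization and lowercase text; the function name `term_document_matrix` is made up for this example.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] = count of terms[i] in docs[j]."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))  # all distinct terms, alphabetical
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix


if __name__ == "__main__":
    terms, matrix = term_document_matrix(["the cat sat", "the cat saw the dog"])
    for term, row in zip(terms, matrix):
        print(term, row)
```

Each column of the matrix is the bag-of-words vector of one document, and each row shows how one term is distributed across the documents.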