NLP midterm
What is the high-level goal of NLP? - answer The high-level goal of NLP is to program
computers to perform useful tasks involving human language.
word type - answer distinct words in a corpus (unique words)
word token - answer a specific instance of a word type in a corpus
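The type/token distinction can be checked by counting on a toy corpus. This is a minimal sketch: the sentence is made up, and splitting on whitespace is a simplifying assumption rather than a real tokenizer.

```python
# Toy corpus (made up); whitespace splitting is a simplifying assumption.
corpus = "the cat sat on the mat the end"

tokens = corpus.split()   # every occurrence counts as a token
types = set(tokens)       # distinct words only

num_tokens = len(tokens)
num_types = len(types)
print(num_tokens, num_types)  # 8 tokens, 6 types ("the" appears 3 times)
```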
tokenization - answer separating tokens from the raw text
n-gram - answer contiguous sequence of n words
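As a rough illustration of the n-gram definition, the sketch below slides a window of size n over a token list. The helper name `ngrams` and the example sentence are assumptions made for illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Example sentence (made up), split on whitespace for simplicity.
words = "I like natural language processing".split()

bigrams = ngrams(words, 2)   # 5 words give 4 bigrams
trigrams = ngrams(words, 3)  # 5 words give 3 trigrams
print(bigrams[0])            # ('I', 'like')
```

A sequence shorter than n yields an empty list, since no full window fits.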
morpheme - answer the smallest meaning-bearing unit of a word
stem - answer the central morpheme of a word; provides the main meaning for the word
lemma - answer the base form under which the word is entered in the dictionary (ex: is -> to be)
Why is it problematic to only count tokens as "words" separated by spaces and
containing only letters (A-Z)? - answer If we use this approach, we will miss many
valuable tokens, including numbers, as well as any word with punctuation attached to it.
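A small sketch of what the letters-only rule drops. The example sentence and the "permissive" pattern are assumptions chosen for illustration, not a recommended tokenizer.

```python
import re

# Example text (made up).
text = "Flight 370 departs at 9:45, gate B12."

# Rule from the card: split on whitespace, keep chunks made only of A-Z letters.
letters_only = [t for t in text.split() if re.fullmatch(r"[A-Za-z]+", t)]

# Illustrative alternative: runs of letters/digits, or a single punctuation mark.
permissive = re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)

print(letters_only)  # ['Flight', 'departs', 'at', 'gate']
print(permissive)
```

The flight number, the time, and the gate all disappear under the letters-only rule, which is the loss the answer describes.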
Why is it advantageous to tokenize via subwords? - answer Because this approach
allows us to identify that "dog" and "dogs" have similar meanings. We can infer the
meanings of out-of-vocabulary words if we have words in the vocabulary with shared
subwords.
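To make the shared-subword idea concrete, here is a toy greedy longest-match segmenter. The vocabulary and the `segment` helper are hypothetical, and this is not how BPE or WordPiece vocabularies are actually learned; it only shows how an unseen word can be broken into known pieces.

```python
def segment(word, vocab):
    """Greedily split word into the longest pieces found in vocab."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[i])
            i += 1
    return pieces

# Hypothetical subword vocabulary; "dogs" itself is not in it.
vocab = {"dog", "cat", "walk", "s", "ing", "ed"}

dogs_pieces = segment("dogs", vocab)
walking_pieces = segment("walking", vocab)
print(dogs_pieces, walking_pieces)  # ['dog', 's'] ['walk', 'ing']
```

Because "dogs" shares the piece "dog" with an in-vocabulary word, a model can reuse what it knows about "dog".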
When your computer encodes characters using UTF-8, what is it doing? - answer
Variable length encoding, or representing each Unicode codepoint as 1-4 bytes.
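The 1-4 byte range can be seen directly with Python's built-in `str.encode`. The four sample characters are chosen for illustration, one for each byte length.

```python
# One sample character per UTF-8 length, written as escapes to avoid ambiguity.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, emoji

# Map each character to the number of bytes in its UTF-8 encoding.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_lengths.items():
    print(f"U+{ord(ch):04X} -> {n} byte(s)")
# U+0041 -> 1 byte(s)
# U+00E9 -> 2 byte(s)
# U+20AC -> 3 byte(s)
# U+1F600 -> 4 byte(s)
```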
Accuracy - answer (tp + tn) / (tp + fp + tn + fn)
Precision - answer tp / (tp + fp)
Recall - answer tp / (tp + fn)
F1 score - answer 2 * precision * recall / (precision + recall), the harmonic mean of precision and recall
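The four metrics can be sketched as one function over confusion-matrix counts. The function name and the example counts are made up for illustration. F1 is taken here as the usual harmonic mean of precision and recall, and the sketch assumes every denominator is nonzero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    Assumes all denominators are nonzero (no zero-division handling).
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Made-up counts: 100 examples in total.
accuracy, precision, recall, f1 = classification_metrics(tp=8, fp=2, tn=85, fn=5)
print(f"{accuracy:.2f} {precision:.2f} {recall:.3f} {f1:.3f}")
```

With these counts accuracy is high (0.93) while recall is much lower (8/13), which is why precision, recall, and F1 are reported alongside accuracy on imbalanced data.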