MODULE 1: INTRODUCTION TO NLP, CORPORA, CLASSIFICATION, EVALUATION, AND NAIVE BAYES ............ 7
1. WHAT NLP IS AND WHY IT IS HARD..................................................................................................................................7
2. CORPORA: WHERE NLP GETS ITS EVIDENCE.....................................................................................................................8
3. CLASSIFICATION: TURNING TEXT INTO DECISIONS .......................................................................................................... 10
4. EVALUATING CLASSIFIERS: HOW DO YOU KNOW WHAT WORKS? ...................................................................................... 12
5. NAIVE BAYES: THE FIRST FULL PROBABILISTIC CLASSIFIER IN YOUR COURSE ...................................................................... 15
6. HOW ALL PARTS OF THIS MODULE CONNECT ................................................................................................................. 18
7. KEY TERMS AND SEARCHABLE DEFINITIONS ................................................................................................................... 19
8. FINAL MENTAL MAP FOR THIS MODULE .......................................................................................................................... 22
MODULE 2: NORMALISATION, TOKENIZATION, LEMMATIZATION, EDIT DISTANCE, AND REGULAR
EXPRESSIONS ...................................................................................................................................................... 24
1. NORMALISATION: REDUCING VARIATION ON PURPOSE .................................................................................................... 24
2. WHY THIS MODULE MATTERS IN THE FULL NLP PIPELINE ................................................................................................. 25
3. TOKENIZATION: WHAT IS A WORD? ............................................................................................................................... 26
4. LEMMATIZATION AND STEMMING: REDUCING WORD FORMS ............................................................................................ 28
5. SPELLING CORRECTION AND EDIT DISTANCE: MEASURING STRING SIMILARITY .................................................................... 30
6. REGULAR EXPRESSIONS: FINDING STRUCTURED PATTERNS IN TEXT .................................................................................. 33
7. HOW THE CONCEPTS CONNECT TO EACH OTHER ........................................................................................................... 35
8. WHAT THIS MODULE SOLVES CHRONOLOGICALLY IN THE COURSE ................................................................................... 36
9. KEY TERMS AND SEARCHABLE DEFINITIONS ................................................................................................................... 37
10. FINAL MENTAL MAP FOR THIS MODULE ........................................................................................................................ 39
MODULE 3: LANGUAGE MODELING, N-GRAMS, MARKOV MODELS, PERPLEXITY, OOV WORDS,
UNOBSERVED TRANSITIONS, AND SMOOTHING ............................................................................................... 40
1. LANGUAGE MODELING: WHAT PROBLEM IS IT SOLVING? .................................................................................................. 40
2. N-GRAMS: THE BASIC BUILDING BLOCKS ....................................................................................................................... 42
3. THE CHAIN RULE: HOW TO GET THE PROBABILITY OF A WHOLE SENTENCE .......................................................................... 43
4. MARKOV MODELS: APPROXIMATING THE HISTORY .......................................................................................................... 44
5. MAXIMUM LIKELIHOOD ESTIMATION IN LANGUAGE MODELS ............................................................................................. 45
6. CHOICE OF N: WHY NOT JUST MAKE N VERY LARGE? ....................................................................................................... 46
7. MARKOV CHAINS: STATES, TRANSITIONS, AND THE TRANSITION MATRIX............................................................................. 46
8. EVALUATING LANGUAGE MODELS: INTRINSIC AND EXTRINSIC .......................................................................................... 47
9. PERPLEXITY: THE MAIN INTRINSIC EVALUATION METRIC ................................................................................................... 48
10. THE VOCABULARY CONSTRAINT: WHY PERPLEXITY COMPARISONS CAN BE MISLEADING .................................................... 49
11. OOV WORDS: OUT-OF-VOCABULARY WORDS ............................................................................................................. 49
12. UNOBSERVED TRANSITIONS: THE DEEPER SPARSITY PROBLEM ....................................................................................... 50
13. SMOOTHING: MOVING PROBABILITY MASS TO UNSEEN EVENTS ...................................................................................... 51
14. HOW ALL THE CONCEPTS CONNECT ........................................................................................................................... 53
15. BIGGER-PICTURE COURSE MAP ................................................................................................................................. 53
16. KEY TERMS AND SEARCHABLE DEFINITIONS ................................................................................................................. 54
17. FINAL MENTAL MAP FOR THIS MODULE ........................................................................................................................ 57
MODULE 4: PART-OF-SPEECH TAGGING, HIDDEN MARKOV MODELS, DECODING, AND THE VITERBI
ALGORITHM ......................................................................................................................................................... 58
1. WHAT IS PART-OF-SPEECH TAGGING AND WHY DO WE NEED IT? ..................................................................................... 58
2. WORD CLASSES, TAGSETS, AND ANNOTATED CORPORA.................................................................................................. 59
3. TYPES, TOKENS, AND AMBIGUITY: WHY TAGGING IS WORTH DOING ................................................................................... 60
4. BASELINES: THE SIMPLEST POSSIBLE TAGGING SYSTEMS ................................................................................................. 60
5. CONTEXT AND THE CONNECTION TO LANGUAGE MODELING ............................................................................................ 61
6. UNKNOWN WORDS IN POS TAGGING ........................................................................................................................... 62
7. HIDDEN MARKOV MODELS: THE STATISTICAL MODEL BEHIND SEQUENCE TAGGING ............................................................ 62
8. TRANSITION PROBABILITIES AND EMISSION PROBABILITIES............................................................................................... 64
9. THE BAYESIAN LOGIC INSIDE HMM TAGGING ................................................................................................................ 65
10. THE TWO KEY ASSUMPTIONS OF HMMS...................................................................................................................... 65
11. HOW HMM PARAMETERS ARE ESTIMATED .................................................................................................................. 66
12. DECODING: FINDING THE BEST HIDDEN TAG SEQUENCE ................................................................................................ 67
13. VITERBI ALGORITHM: DYNAMIC PROGRAMMING FOR HMM DECODING ........................................................................... 67
14. STEP-BY-STEP INTUITION WITH A TINY EXAMPLE ........................................................................................................... 69
15. THE TRELLIS: WHAT IT REALLY MEANS ......................................................................................................................... 69
16. HOW THIS MODULE CONNECTS TO EARLIER AND LATER COURSE TOPICS ......................................................................... 70
17. WHAT PROBLEM DOES HMM TAGGING SOLVE, AND WHAT LIMITATIONS REMAIN? ............................................................ 71
18. KEY TERMS AND SEARCHABLE DEFINITIONS ................................................................................................................. 71
19. FINAL MENTAL MAP FOR THIS MODULE ........................................................................................................................ 74
MODULE 5: PARSING, FORMAL GRAMMARS, PROBABILISTIC CONTEXT-FREE GRAMMARS, PROBLEMS OF
PCFGS, AND DEPENDENCY PARSING ................................................................................................................ 76
1. WHAT IS PARSING AND WHAT PROBLEM DOES IT SOLVE? ................................................................................................ 76
2. CONSTITUENTS AND PARSE TREES: HOW SENTENCE STRUCTURE CAN BE REPRESENTED ..................................................... 77
3. TREEBANKS: WHERE PARSERS GET SUPERVISION ........................................................................................................... 78
4. PARSING AS RECOGNITION VS FULL PARSING ................................................................................................................. 78
5. FORMAL GRAMMARS: THE SYSTEM BEHIND PARSING ....................................................................................................... 79
6. CONTEXT-FREE GRAMMARS (CFGS): THE MAIN PHRASE-STRUCTURE FORMALISM ............................................................ 80
7. CHOMSKY NORMAL FORM (CNF): A USEFUL RESTRICTED GRAMMAR FORMAT................................................................... 81
8. TREEBANKS CAN DEFINE GRAMMARS ............................................................................................................................ 81
9. STRUCTURAL AMBIGUITY: WHY ONE SENTENCE CAN HAVE MULTIPLE PARSES ..................................................................... 82
10. PROBABILISTIC CONTEXT-FREE GRAMMARS (PCFGS): ADDING PROBABILITIES TO CFGS ................................................ 82
11. WHY PCFGS ARE USEFUL, AND WHAT THEY STILL MISS ................................................................................................ 83
12. HOW TO IMPROVE PCFGS A BIT: SPLITTING NON-TERMINALS AND PARENT ANNOTATION .................................................. 84
13. DEPENDENCY PARSING: A DIFFERENT VIEW OF SYNTAX ................................................................................................. 85
14. WHY DEPENDENCY PARSING IS USEFUL ...................................................................................................................... 86
15. SYNTACTIC ROLES IN DEPENDENCY PARSING .............................................................................................................. 86
16. DEPENDENCY TREES AS DIRECTED GRAPHS................................................................................................................. 87
17. TRAINING DATA FOR DEPENDENCY PARSING ................................................................................................................ 87
18. CONSTITUENT PARSING VS DEPENDENCY PARSING ...................................................................................................... 87
19. HOW THIS MODULE FITS INTO THE FULL COURSE TIMELINE ............................................................................................ 88
20. WHAT PROBLEM DOES THIS MODULE SOLVE, AND WHAT PROBLEMS REMAIN? ................................................................. 89
21. KEY TERMS AND SEARCHABLE DEFINITIONS ................................................................................................................. 89
22. FINAL MENTAL MAP FOR THIS MODULE ........................................................................................................................ 93
MODULE 6: LEXICA, WORDNET, WORD VECTORS, PMI, AND DIMENSIONALITY REDUCTION ........................ 94
1. THE BIG PROBLEM: WORDS ARE SYMBOLS, BUT MEANING IS GRADED ................................................................................ 94
2. LEXICA: MANUALLY COLLECTED SEMANTIC KNOWLEDGE ................................................................................................ 95
3. THESAURI AND WORDNET: EXPLICIT NETWORKS OF MEANING RELATIONS ........................................................................ 96
4. SIMILARITY IN WORDNET: PATH LENGTH, INFORMATION CONTENT, AND GLOSS OVERLAP ................................................... 97
5. DISTRIBUTIONAL SEMANTICS: MEANING FROM LANGUAGE USE ...................................................................................... 100
6. WORD VECTORS: TURNING WORDS INTO POINTS IN SPACE ............................................................................................ 100
7. SIMILARITY IN VECTOR SPACES: COSINE SIMILARITY ...................................................................................................... 102
8. WHY RAW FREQUENCY COUNTS ARE NOT ENOUGH ...................................................................................................... 103
9. PMI: WEIGHTING INFORMATIVE CO-OCCURRENCES ..................................................................................................... 103
10. DIMENSIONALITY REDUCTION: COMPRESSING THE SEMANTIC SPACE ............................................................................ 105
11. DISTRIBUTIONAL SEMANTIC MODELS: THE FULL PIPELINE ........................................................................................... 107
12. HOW THE WHOLE MODULE FITS INTO THE COURSE TIMELINE ....................................................................................... 107
13. KNOWLEDGE-BASED SEMANTICS VS DISTRIBUTIONAL SEMANTICS ................................................................................ 108
14. WHAT THIS MODULE SOLVES, AND WHAT LIMITATIONS REMAIN .................................................................................... 109
15. KEY TERMS AND SEARCHABLE DEFINITIONS ............................................................................................................... 109
16. FINAL MENTAL MAP FOR THIS MODULE ...................................................................................................................... 113
MODULE 7: LOGISTIC REGRESSION, FEED-FORWARD NEURAL NETWORKS, WORD2VEC, AND THE
EVALUATION OF DISTRIBUTIONAL SEMANTIC MODELS .................................................................................. 114
1. THE BIG SHIFT: FROM HAND-CRAFTED EVIDENCE TO LEARNED REPRESENTATIONS............................................................ 114
2. LOGISTIC REGRESSION: THE BASIC DISCRIMINATIVE CLASSIFIER ..................................................................................... 115
3. THE CORE EQUATION OF LOGISTIC REGRESSION........................................................................................................... 116
4. THE SIGMOID FUNCTION: TURNING A SCORE INTO A PROBABILITY ................................................................................... 117
5. THE FOUR ESSENTIAL COMPONENTS OF A CLASSIFIER ................................................................................................... 118
6. LOSS FUNCTIONS: HOW THE MODEL KNOWS IT IS WRONG ............................................................................................. 118
7. WHY LOGISTIC REGRESSION IS USEFUL, AND WHAT ITS MAIN LIMITATION IS...................................................................... 119
8. FEED-FORWARD NEURAL NETWORKS: STACKING NONLINEAR TRANSFORMATIONS ........................................................... 120
9. HIDDEN LAYERS: WHAT THEY ACTUALLY DO ................................................................................................................. 121
10. WHY STACKING LAYERS HELPS ................................................................................................................................ 121
11. FULLY CONNECTED NETWORKS ............................................................................................................................... 122
12. THE OUTPUT LAYER OF A NEURAL NETWORK .............................................................................................................. 122
13. PARAMETERS VS HYPERPARAMETERS ....................................................................................................................... 122
14. LOGISTIC REGRESSION VS FEED-FORWARD NEURAL NETWORKS .................................................................................. 123
15. FROM COUNT-BASED WORD VECTORS TO NEURAL EMBEDDINGS ................................................................................. 124
16. WORD2VEC: DENSE EMBEDDINGS FROM THE START ................................................................................................... 124
17. THE CENTRAL INTUITION OF WORD2VEC ................................................................................................................... 125
18. SKIP-GRAM WITH NEGATIVE SAMPLING (SGNS) ....................................................................................................... 125
19. NEGATIVE SAMPLING: WHY IT IS NEEDED ................................................................................................................... 126
20. WHY SGNS LEARNS SEMANTIC SIMILARITY ............................................................................................................... 127
21. REPRESENTATION LEARNING IN WORD2VEC ............................................................................................................. 127
22. TWO EMBEDDINGS PER WORD ................................................................................................................................. 127
23. CBOW: THE OTHER MAJOR WORD2VEC ARCHITECTURE............................................................................................. 128
24. CBOW VS SGNS ................................................................................................................................................. 128
25. HOW WORD2VEC DIFFERS FROM CLASSICAL COUNT-BASED DSMS ............................................................................ 129
26. EVALUATING EMBEDDINGS: HOW DO WE KNOW THEY ARE GOOD? ............................................................................... 129
27. THE INFLUENCE OF CONTEXT WINDOW SIZE .............................................................................................................. 130
28. SEMANTIC DRIFT: MEANING CHANGE OVER TIME ........................................................................................................ 131
29. INTRINSIC EVALUATION OF EMBEDDINGS .................................................................................................................. 132
30. EXTRINSIC EVALUATION OF EMBEDDINGS .................................................................................................................. 132
31. BIAS IN EMBEDDINGS ............................................................................................................................................. 133
32. HOW THIS MODULE FITS INTO THE FULL COURSE TIMELINE .......................................................................................... 133
33. WHAT THIS MODULE SOLVES, AND WHAT LIMITATIONS REMAIN .................................................................................... 134
34. KEY TERMS AND SEARCHABLE DEFINITIONS ............................................................................................................... 135
35. FINAL MENTAL MAP FOR THIS MODULE ...................................................................................................................... 138
MODULE 8: TRANSFORMERS, ATTENTION, BERT-STYLE ENCODERS, AND GENERATIVE LANGUAGE MODELS
............................................................................................................................................................................ 140
1. WHY MLPS WERE NOT ENOUGH ............................................................................................................................... 140
2. WHY SEQUENCE MODELS WERE NEEDED: LANGUAGE HAS LONG-DISTANCE STRUCTURE .................................................. 141
3. RECURRENT NEURAL NETWORKS: THE FIRST NEURAL ANSWER TO SEQUENCE MODELING .................................................. 141
4. PREDICTIVE TRAINING IN RECURRENT MODELS ............................................................................................................ 142
5. THE MAIN WEAKNESS OF RECURRENT NETWORKS ........................................................................................................ 143
6. LSTM: IMPROVING THE RECURRENT APPROACH .......................................................................................................... 143
7. ATTENTION: THE IDEA THAT BROKE THE BOTTLENECK .................................................................................................... 144
8. TRANSFORMER: ATTENTION BECOMES THE ARCHITECTURE ........................................................................................... 144
9. ENCODER, DECODER, AND ENCODER–DECODER TRANSFORMERS ................................................................................. 145
10. SELF-ATTENTION: THE CORE COMPUTATION .............................................................................................................. 145
11. QUERY, KEY, AND VALUE: HOW SELF-ATTENTION WORKS STEP BY STEP......................................................................... 146
12. MULTI-HEAD ATTENTION: DIFFERENT RELATION TYPES AT ONCE ................................................................................... 147
13. RESIDUAL CONNECTIONS AND THE RESIDUAL STREAM ................................................................................................ 148
14. POSITIONAL ENCODING: HOW TRANSFORMERS KNOW ORDER ..................................................................................... 148
15. MASKED ATTENTION VS FULL ATTENTION .................................................................................................................. 149
16. CROSS-ATTENTION IN ENCODER–DECODER MODELS ................................................................................................. 149
17. BERT: ENCODER-STYLE SELF-SUPERVISED PRETRAINING ........................................................................................... 150
18. USING BERT FOR DOWNSTREAM TASKS ................................................................................................................... 150
19. GENERATIVE LANGUAGE MODELS: DECODER-STYLE NEXT-TOKEN PREDICTION .............................................................. 151
20. FROM HIDDEN STATES TO VOCABULARY PROBABILITIES .............................................................................................. 151
21. GREEDY DECODING VS SAMPLING ............................................................................................................................ 152
22. TEMPERATURE: CONTROLLING PREDICTABILITY VS DIVERSITY ...................................................................................... 152
23. FINE-TUNING GENERATIVE MODELS FOR TASKS ......................................................................................................... 153
24. ZERO-SHOT PROMPTING: USING THE MODEL WITHOUT PARAMETER UPDATES ................................................................ 153
25. ZERO-SHOT VS IN-CONTEXT LEARNING ..................................................................................................................... 153
26. BERT VS GENERATIVE LMS..................................................................................................................................... 154
27. HOW THIS MODULE CONNECTS TO EARLIER COURSE TOPICS ....................................................................................... 155
28. WHAT THIS MODULE SOLVES, AND WHAT LIMITATIONS REMAIN .................................................................................... 155
29. KEY TERMS AND SEARCHABLE DEFINITIONS ............................................................................................................... 156
30. FINAL MENTAL MAP FOR THIS MODULE ...................................................................................................................... 159
MODULE 9: SPOKEN LANGUAGE, SELF-SUPERVISED SPEECH REPRESENTATION LEARNING, HUBERT,
WAV2VEC 2.0, AND ASR WITH CTC .................................................................................................................. 161
1. WHY SPEECH IS FUNDAMENTALLY DIFFERENT FROM WRITING ........................................................................................ 161
2. WHY SPOKEN LANGUAGE IS CONSIDERED PRIMARY ...................................................................................................... 162
3. MAIN SPEECH APPLICATIONS..................................................................................................................................... 163
4. WHY SPEECH AND TEXT NLP HISTORICALLY DEVELOPED SEPARATELY ............................................................................ 163
5. HOW SPEECH IS REPRESENTED IN A COMPUTER ........................................................................................................... 164
6. WHY BERT-LIKE OR GPT-LIKE MODELING IS HARDER FOR SPEECH ................................................................................ 165
7. SELF-SUPERVISED SPEECH REPRESENTATION LEARNING: THE GENERAL RECIPE ............................................................... 166
8. WHY SPOKEN LANGUAGE MODELING PROVED DIFFICULT .............................................................................................. 166
9. HUBERT: THE BERT-LIKE IDEA FOR SPEECH .............................................................................................................. 167
10. K-MEANS CLUSTERING IN HUBERT ......................................................................................................................... 167
11. WAV2VEC 2.0: SIMILAR GOAL, DIFFERENT STRATEGY .................................................................................................. 168
12. THREE FAMILIES OF SELF-SUPERVISED SPEECH OBJECTIVES ........................................................................................ 169
13. OFFLINE VS INTERNAL DISCRETIZATION..................................................................................................................... 170
14. HOW DO WE EVALUATE PRETRAINED SPEECH REPRESENTATIONS?............................................................................... 170
15. FINE-TUNING FOR ASR .......................................................................................................................................... 171
16. TWO WAYS TO BUILD AN ASR SYSTEM ON TOP OF A PRETRAINED SPEECH MODEL .......................................................... 171
17. CONNECTIONIST TEMPORAL CLASSIFICATION (CTC)................................................................................................. 172
18. WHY DYNAMIC PROGRAMMING IS NEEDED IN CTC .................................................................................................... 172
19. DRAWBACK OF CTC COMPARED WITH A FULL DECODER ............................................................................................ 173
20. HOW ASR OUTPUTS ARE EVALUATED ....................................................................................................................... 174
21. HOW THIS MODULE FITS INTO THE FULL COURSE TIMELINE .......................................................................................... 174
22. WHAT THIS MODULE SOLVES, AND WHAT LIMITATIONS REMAIN .................................................................................... 175
23. KEY TERMS AND SEARCHABLE DEFINITIONS ............................................................................................................... 176
24. FINAL MENTAL MAP FOR THIS MODULE ...................................................................................................................... 179
MODULE 10: GENERATIVE SPOKEN DIALOGUE LANGUAGE MODELING, TURN-TAKING, DIALOG STRUCTURE,
DGSLM, AND EVALUATION OF SPOKEN CONVERSATION MODELS ................................................................ 181