Cheat Sheet for Midterm Exam: 2025
Week 1:
Analytics: Answers important types of questions: 1) Descriptive Analytics: What happened? 2) Predictive Analytics: What will happen?
3) Prescriptive Analytics: What action(s) will be best?
Modeling: 1) Describe the real-life situation mathematically 2) Analyze the math 3) Turn the math answer back into the real-life situation.
Classification is putting things into categories; there can be more than 2 categories. Classification needs data.
Choosing a classifier: We need to choose a line that is as far as possible from making mistakes. If it is impossible to separate good from bad,
we need to use a soft classifier instead of a hard classifier. The more costly a bad decision, the further the line should be from making that mistake.
Data Definitions: Data Tables: Rows: Data points. Columns: Attributes (also called features, covariates, predictors). Response/Outcome: Special type
of column: the "answer" for each data point. Data Types: Structured: Data stored in a structure (quantitative: credit score; categorical: hair color
(non-numerical), zip codes (numerical); binary: two values (Y/N, Male/Female); time series data: recorded over time (daily sales)).
Unstructured: not easily described (written text).
Support Vector Machine (SVM):
Hard classification: Soft classification:
Equation: Maximize the margin, subject to the constraint being >= 1 for all m data points. For soft classification, the classification error of
each data point can be calculated: the amount less than 0 = amount of error. To trade off between them, choose a value of lambda and minimize the
combination of error and margin. As lambda gets large, the lambda * (sum of a_i^2) term gets large, so the importance of a large margin outweighs
avoiding mistakes; as lambda gets close to 0, the importance of minimizing mistakes outweighs having a large margin.
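The formulas themselves are not written out in these notes. As a sketch, the standard forms usually given for this trade-off are shown below. The notation is an assumption: coefficients a_1..a_n with intercept a_0, data points j = 1..m with attribute values x_ij, and labels y_j equal to +1 or -1.

```latex
% Hard classification: the widest margin corresponds to the smallest sum of squared coefficients
\min_{a_0,\dots,a_n} \; \sum_{i=1}^{n} a_i^2
\quad \text{subject to} \quad
\Big(\sum_{i=1}^{n} a_i x_{ij} + a_0\Big)\, y_j \ge 1 \quad \text{for all } j = 1,\dots,m

% Soft classification: total error plus lambda times the margin term
\min_{a_0,\dots,a_n} \; \sum_{j=1}^{m} \max\Big\{0,\; 1 - \Big(\sum_{i=1}^{n} a_i x_{ij} + a_0\Big)\, y_j\Big\}
\; + \; \lambda \sum_{i=1}^{n} a_i^2
```

In the soft form, a large lambda makes the second term dominate (margin matters most), while a lambda near 0 leaves mostly the error term.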
Hard Separation: The classifier we choose depends on the value of the intercept. Scaling: Data needs to be scaled before running SVM. Near-zero
coefficients: the corresponding attributes are probably not relevant for classification.
Kernel methods: Allow non-linear classifiers.
Scaling and standardization: Common scaling: Data between 0 and 1 (Data Range). Standardization: Mean: 0, SD: 1 (PCA, Clustering)
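The two transformations can be sketched in plain Python as follows. The function names and the sample column are made up for illustration, and using the population standard deviation is an assumption (the sample standard deviation is also common).

```python
from statistics import mean, pstdev


def scale_to_range(values, low=0.0, high=1.0):
    """Linearly rescale values so the minimum maps to low and the maximum to high."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        raise ValueError("cannot scale a constant column")
    return [low + (v - v_min) * (high - low) / (v_max - v_min) for v in values]


def standardize(values):
    """Subtract the mean and divide by the standard deviation (result: mean 0, SD 1)."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population SD
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]


# Hypothetical attribute column
incomes = [20.0, 40.0, 60.0, 100.0]
scaled = scale_to_range(incomes)
standardized = standardize(incomes)
```

Each attribute (column) is transformed separately, so that no single attribute dominates distance-based methods just because of its units.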
K-nearest neighbor algorithm: There is more than one way to measure distance; straight-line distance is common. Attributes can be weighted by importance.
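A minimal sketch of k-nearest-neighbor classification with a weighted straight-line distance, in plain Python. The function names, the weighting scheme (weights multiply squared differences), and the toy data are assumptions for illustration.

```python
from collections import Counter
from math import sqrt


def weighted_distance(p, q, weights):
    """Straight-line distance where each attribute's squared difference is weighted."""
    return sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(p, q, weights)))


def knn_classify(train_points, train_labels, query, k=3, weights=None):
    """Return the most common label among the k training points closest to query."""
    if weights is None:
        weights = [1.0] * len(query)  # unweighted: all attributes equally important
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: weighted_distance(pair[0], query, weights),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, already-scaled training data with two classes
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 5.5), (6.5, 7.0)]
labels = ["good", "good", "good", "bad", "bad", "bad"]
prediction = knn_classify(points, labels, (2.0, 2.0), k=3)
```

Setting a weight to 0 removes that attribute from the distance entirely; larger weights make differences in that attribute count for more.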
Week 2:
Validation: How good is the model? How accurate is it, and how well or how often does it correctly determine something?
Model validation: Data has 2 types of patterns: Real Effect and Random Effect. Fitting matches both the real effect (same in every data set) and the random effect (different in every data set).
Validation and Test Data Sets: Larger set of data to fit the model, smaller set to measure effectiveness.
Observed performance: Real quality + random effects (high-performing models are more likely to have above-average random effects, so their observed performance is too optimistic).
Test Sets: Training Set to FIT model, Validation Set to CHOOSE model, Test Set to est. PERFORMANCE. Splitting Data: With one model
(Training and test): 70-90% training, 10-30% test.
Methods: Random: Choose points at random according to the proportions. Rotation: Take turns selecting points (advantage of being equally separated).
Ex: Weekly sales: randomness may give one set more of the earlier data; rotation may introduce bias.
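The two splitting methods can be sketched as follows in plain Python. The function names, the 80/20 proportion, and the `period` parameter are assumptions for illustration.

```python
import random


def random_split(data, train_fraction=0.8, seed=0):
    """Shuffle the data, then cut it into training and test sets."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def rotation_split(data, period=5):
    """Take turns: every period-th point goes to the test set, the rest to training."""
    train, test = [], []
    for index, point in enumerate(data):
        (test if index % period == period - 1 else train).append(point)
    return train, test


# Hypothetical series of 20 consecutive daily sales figures
daily_sales = list(range(1, 21))
rand_train, rand_test = random_split(daily_sales, 0.8)
rot_train, rot_test = rotation_split(daily_sales, period=5)
```

With a period of 5 on five-day work-week data, every test point would fall on the same weekday, which is one way rotation can introduce bias.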
Cross-validation: To account for important data appearing only in the validation or test sets. K-Fold Cross Validation: Split the data into k parts;
train the model on k-1 parts and evaluate it on the one remaining part, repeating for each part. Average the k evaluations to est. model quality. K=10 common.
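A minimal sketch of k-fold cross-validation in plain Python. The helper names and the toy "predict the training mean" model are hypothetical; any fit/evaluate pair with the same shape could be plugged in.

```python
import random
from statistics import mean


def k_fold_indices(n_points, k, seed=0):
    """Shuffle the point indices and deal them into k folds of near-equal size."""
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)
    return [indices[fold::k] for fold in range(k)]


def cross_validate(xs, ys, k, fit, evaluate):
    """Train on k-1 folds, evaluate on the held-out fold, and average the k scores."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = fit(train_x, train_y)
        test_x = [xs[i] for i in held_out]
        test_y = [ys[i] for i in held_out]
        scores.append(evaluate(model, test_x, test_y))
    return mean(scores)


# Toy model (hypothetical): always predict the mean response of the training data
def fit_mean(train_x, train_y):
    return mean(train_y)


def mean_abs_error(model, test_x, test_y):
    return mean([abs(y - model) for y in test_y])


xs = list(range(10))
ys = [2.0 * x for x in xs]
cv_error = cross_validate(xs, ys, k=5, fit=fit_mean, evaluate=mean_abs_error)
```

Every data point is used for evaluation exactly once and for training k-1 times, so no important point is confined to a single set.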
Clustering: Grouping data points, using distance norms (example: robot moving through aisles).
K-means clustering: The k-means algorithm is a heuristic: fast and good, but not guaranteed to find the best solution. It is an expectation-maximization algorithm.
Practical k-means: Run several times with different initial cluster centers and choose the best solution found. Test different numbers of clusters k
(use an elbow graph).
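The practical advice above can be sketched in plain Python: run k-means from several random starts, keep the solution with the smallest total squared distance, and record that total for each k as the data for an elbow graph. The function names, the restart count, and the toy points are assumptions for illustration.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means_once(points, k, rng, max_iter=100):
    """One k-means run from random initial centers; returns (centers, total squared distance)."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster (keep it if the cluster is empty)
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centers[c]
            for c, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments no longer change
        centers = new_centers
    total = sum(min(squared_distance(p, c) for c in centers) for p in points)
    return centers, total


def k_means(points, k, restarts=20, seed=0):
    """Run k-means several times and keep the best solution found."""
    rng = random.Random(seed)
    runs = [k_means_once(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda result: result[1])


# Hypothetical data with three visible groups
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9),
          (4.8, 5.1), (9.0, 1.0), (9.1, 1.2), (8.9, 0.9)]
elbow = {k: k_means(points, k)[1] for k in range(1, 5)}
```

Plotting `elbow` (k on the horizontal axis, total squared distance on the vertical axis) should show the curve flattening after k = 3 for this data, which is the "elbow".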
Classification: Correct classification of data points is already known. Clustering: Correct classification of data points is not known.
Supervised Learning: Response is known for each data point. Unsupervised Learning: Response is not known.
Week 3:
Data Preparation: Using specific data in analysis (Predictors: regression, factors: classification). Scale the data and look for outliers; extraneous
information complicates the model and makes the solution harder to interpret.
Outliers: Data point different from the rest. Point outlier: a single data point far from the rest. Contextual outlier: in a time series, a value far from
the others near it in time. Collective outlier: something is missing in a range of points, e.g., there isn't a point where there should be one.
Finding outliers: 1D: box-and-whisker plot. Higher dimensions: fit a model, e.g., exponential smoothing: point with error