IN A BIG DATA WORLD (18/20)
Complete Course Summary — Chapters 1 through 11
KU Leuven · Prof. Seppe vanden Broucke
Chapter 1 — Introduction to Data Science
1. What is Data Science?
Data science is the discipline of extracting value and knowledge from data using three
ingredients: data (the raw material), an algorithm (the method), and a purpose (the business or
scientific goal).
Discovered patterns and models should be valid (generalise to new data), useful (actionable),
unexpected (non-obvious), and understandable (interpretable by humans).
2. Types of Data
Rather than simply calling data 'structured' or 'unstructured', it is more precise to speak of:
tabular/relational data, text, imagery & video, and audio. Working with non-tabular data requires
either featurization (converting raw data into a table) or end-to-end deep learning (architectures
that consume raw data directly).
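As a minimal sketch of featurization, assuming a simple bag-of-words representation (one common choice, not necessarily the one used in the course), the following turns raw text into a table of word counts. The function name `bag_of_words` and the example sentences are hypothetical.

```python
from collections import Counter

def bag_of_words(documents):
    """Turn raw text into a table: one row per document, one column per word."""
    tokenised = [doc.lower().split() for doc in documents]
    # The vocabulary defines the table's columns (sorted for a stable order).
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

docs = ["the cat sat", "the dog sat on the mat"]  # hypothetical documents
vocabulary, table = bag_of_words(docs)
print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
for row in table:
    print(row)     # [1, 0, 0, 0, 1, 1] then [0, 1, 1, 1, 1, 2]
```

In a real pipeline the vocabulary would be built from the training documents only and then reused for new text.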
3. Supervised vs. Unsupervised Learning
| Category | Target variable? | Goal | Examples |
|---|---|---|---|
| Supervised | Yes | Learn a mapping f: X → Y | Fraud detection, churn prediction |
| Unsupervised | No | Find hidden structure in data | Clustering, anomaly detection |
Supervised subtypes: Classification (categorical target) vs. Regression (continuous target).
Discriminative models learn P(Y|X) directly; generative models learn the joint P(X,Y) and apply
Bayes' rule.
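Written out, the generative route recovers the conditional from the joint as follows. The sum in the denominator assumes a discrete target Y; the lowercase y is just the summation index.

```latex
P(Y \mid X) = \frac{P(X, Y)}{P(X)} = \frac{P(X \mid Y)\, P(Y)}{\sum_{y} P(X \mid y)\, P(y)}
```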
4. Analytics Spectrum
| Type | Question answered | Typical techniques |
|---|---|---|
| Exploratory | What does the data look like? | Charts, distributions |
| Descriptive | What happened? | Clustering, association rules |
| Explanatory | Why did it happen? | Statistical tests, feature importance |
| Predictive | What will happen? | Supervised ML |
| Prescriptive | What should we do? | What-if analysis, optimisation |
5. The Analytics Process (CRISP-DM)
The standard lifecycle: Business Understanding → Data Understanding → Data Preparation →
Modelling → Evaluation → Deployment — and then back to the beginning. In practice the
process is messy and iterative; you will constantly loop back.
Purpose first!
Always ask: if I had a perfect model, how would it actually be used? A well-designed heuristic can
often deliver 50% of the gain of a full ML model, much faster.
6. Key AI Milestones
| Year | Milestone |
|---|---|
| 2012 | AlexNet wins ImageNet, the starting gun for deep learning |
| 2016–19 | AlphaGo, AlphaZero, AlphaStar master games via self-play RL |
| 2022 | ChatGPT launches (November) and reaches an estimated 100 million users within about 2 months |
| 2023–25 | GNoME, AlphaGeometry, DeepSeek, GPT-4o: continuous breakthroughs |
Chapter 2 — Preprocessing & Feature Engineering
1. The Golden Rule
Train/test split FIRST
Fit all preprocessing parameters (means, standard deviations, bin edges, encodings) on the
training set ONLY. Apply those exact same parameters to the test set and to new production
data. Never look at the test set during training.
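The rule can be sketched in plain Python, using standardisation as the example preprocessing step. The data, the 80/20 split ratio, and the helper name `standardise` are assumptions made for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible
values = [random.gauss(50, 10) for _ in range(100)]  # hypothetical numeric feature

# 1. Split FIRST (80/20 here), before computing any preprocessing parameter.
random.shuffle(values)
train, test = values[:80], values[80:]

# 2. Fit the parameters on the training set only.
train_mean = statistics.mean(train)
train_std = statistics.stdev(train)

def standardise(xs, mean, std):
    """Apply already-fitted parameters; nothing is re-estimated here."""
    return [(x - mean) / std for x in xs]

# 3. Apply the same fitted parameters to train, test, and later production data.
train_scaled = standardise(train, train_mean, train_std)
test_scaled = standardise(test, train_mean, train_std)
```

The scaled test set will generally not have mean 0 and standard deviation 1 exactly. That is expected, because its own statistics were never used.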
2. Missing Values
First understand WHY data is missing: 'Not applicable' vs. 'Unknown'. Then choose a strategy:
| Strategy | How it works | When to use | Risk |
|---|---|---|---|
| Delete | Remove rows/columns | Missing completely at random; low % | Loss of data; bias |
| Mean / Median | Replace with average of training set | Numerical features, roughly normal | Reduces variance; distorts correlations |
| Mode | Replace with most frequent value | Categorical features | Majority-class bias |
| Model-based | Predict missing values from other features | Missingness depends on other features | Computationally expensive |
| Missingness indicator | Add a binary flag 'was this missing?' | When missingness itself is informative | Doubles features for that column |
Missing values can NEVER be ignored
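Most algorithms need a value in every cell; the few implementations that accept missing values natively (e.g. XGBoost, LightGBM) still apply a handling rule of their own.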
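A minimal sketch of two of the strategies above combined: median imputation plus a missingness indicator, with the median learned from the training set only. Missing values are represented as `None`; the column name, the data, and both helper functions are hypothetical.

```python
import statistics

def fit_median(train_column):
    """Learn the imputation value from the observed training values only."""
    observed = [v for v in train_column if v is not None]
    return statistics.median(observed)

def impute_with_flag(column, fill_value):
    """Return (imputed values, 0/1 'was this missing?' flags)."""
    imputed = [fill_value if v is None else v for v in column]
    was_missing = [1 if v is None else 0 for v in column]
    return imputed, was_missing

train_income = [2000, None, 3000, 2500, None, 10000]  # hypothetical data
test_income = [None, 2700]

median_income = fit_median(train_income)  # fitted on the training data only
train_imputed, train_flag = impute_with_flag(train_income, median_income)
test_imputed, test_flag = impute_with_flag(test_income, median_income)
```

The median is used rather than the mean because the single large income (10000) would pull the mean upwards.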
3. Outliers
| Type | Recommended action |
|---|---|
| Invalid outlier (e.g. age = 999) | Treat as missing and apply your missing value strategy |
| Valid outlier, algorithm is sensitive (e.g. linear regression, k-NN) | Cap/winsorise, or apply a robust transformation |
| Valid outlier, algorithm is robust (tree-based models) | Can often leave as-is |
| Finding outliers is the goal | Treat as an anomaly detection task (unsupervised) |
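Capping (winsorising) can be sketched as follows. The choice of the 5th and 95th percentiles as caps is an assumption made for illustration, as are the helper names and the data. The caps are learned from the training set and then reused unchanged.

```python
import statistics

def fit_caps(train_values, n_quantiles=20):
    """Learn lower/upper caps from the training set.

    With n_quantiles=20 the first and last cut points are the
    5th and 95th percentiles.
    """
    cut_points = statistics.quantiles(train_values, n=n_quantiles)
    return cut_points[0], cut_points[-1]

def winsorise(values, lower, upper):
    """Clip every value into the interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

# Hypothetical training feature: 1..100 plus one extreme but valid value.
train = [float(i) for i in range(1, 101)] + [5000.0]
lower, upper = fit_caps(train)
train_capped = winsorise(train, lower, upper)

# New data is capped with the SAME limits learned on the training set.
new_capped = winsorise([50.0, 9999.0], lower, upper)
```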
4. Numerical Transformations
| Transformation | What it does |
|---|---|
| Standardisation (z-score) | Subtract mean, divide by σ. Strongly recommended for distance-based and gradient-based models (k-NN, SVM, neural nets, logistic regression). |
| Min-Max normalisation [0, 1] | Scales values to [0, 1]. Sensitive to outliers. |
| Log transform | Compresses right-skewed distributions. Useful for income, counts, transaction amounts. |
| Box-Cox / Yeo-Johnson | Generalised power transform to make distributions more normal-shaped. Box-Cox needs strictly positive values; Yeo-Johnson also handles zero and negative values. |
Tree-based models are scale-invariant
Decision trees, Random Forest, and gradient boosting are not affected by scaling or
normalisation.
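To illustrate why Min-Max scaling is sensitive to outliers and how a log transform helps with right-skewed data, here is a small sketch in plain Python. The amounts and helper names are hypothetical, and `log1p` (log of 1 + x) is one common variant chosen because it is defined at zero.

```python
import math

def fit_min_max(train_values):
    """Learn the scaling range from the training values."""
    return min(train_values), max(train_values)

def min_max_scale(values, lo, hi):
    """Map values linearly so that lo -> 0 and hi -> 1."""
    return [(v - lo) / (hi - lo) for v in values]

amounts = [10, 20, 30, 40, 1000]  # hypothetical right-skewed amounts

lo, hi = fit_min_max(amounts)
scaled = min_max_scale(amounts, lo, hi)
# The single large value squeezes the other four into roughly the bottom 3% of [0, 1].

logged = [math.log1p(a) for a in amounts]
log_lo, log_hi = fit_min_max(logged)
log_scaled = min_max_scale(logged, log_lo, log_hi)
# After the log transform the same four values spread over a wider part of [0, 1].
```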
5. Encoding Categorical Variables
| Encoding | # New features | Best for |
|---|---|---|
| Integer encoding | 1 | Ordinal features (use with caution on nominal) |
| One-hot encoding | k (k−1 if one reference level is dropped) | Nominal features with few categories |
| Binary encoding | ⌈log₂(k)⌉ | High-cardinality nominals |
| Weights of Evidence (WoE) | 1 | Binary classification; logistic regression |
| Embeddings | d (user-defined) | Very high cardinality; deep learning pipelines |
| Hashing trick | n (user-defined) | Online learning; very large feature spaces |
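The first two encodings in the table can be sketched in a few lines of plain Python. The feature values and helper names are hypothetical, and mapping unseen categories to an all-zero row is one possible policy, not the only one.

```python
def fit_one_hot(train_categories):
    """Fix the column order from the categories seen in the training set."""
    return sorted(set(train_categories))

def one_hot(value, categories, drop_first=False):
    """Encode one value; unseen categories map to all zeros (one possible policy)."""
    row = [1 if value == c else 0 for c in categories]
    return row[1:] if drop_first else row

colours = fit_one_hot(["red", "green", "blue", "green"])  # hypothetical nominal feature
print(colours)                                     # ['blue', 'green', 'red']
print(one_hot("green", colours))                   # [0, 1, 0]   (k columns)
print(one_hot("green", colours, drop_first=True))  # [1, 0]      (k-1 columns)
print(one_hot("purple", colours))                  # [0, 0, 0]   (unseen category)

# Integer encoding suits ordinal features, where the order carries meaning.
size_order = {"small": 0, "medium": 1, "large": 2}  # hypothetical ordinal scale
sizes_encoded = [size_order[s] for s in ["medium", "small", "large"]]
```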
Weights of Evidence (WoE)
WoE replaces each category with a number reflecting its association with the positive class.
Formula: WoE(c) = ln( (events in c / all events) / (non-events in c / all non-events) )
A positive WoE means category c holds a larger share of the events (e.g. fraud) than of the
non-events. Popular in credit scoring. Always fit on the training set only.
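A minimal sketch of fitting WoE on a training set, assuming the usual credit-scoring reading of the formula in which each probability is the category's share of all events (respectively all non-events). The additive smoothing term is an assumption of this sketch, not part of the formula: it avoids taking ln(0) when a category contains only events or only non-events. The function name and data are hypothetical.

```python
import math
from collections import Counter

def fit_woe(categories, targets, smoothing=0.5):
    """Fit WoE per category; targets are 1 (event) or 0 (non-event)."""
    events = Counter(c for c, y in zip(categories, targets) if y == 1)
    non_events = Counter(c for c, y in zip(categories, targets) if y == 0)
    total_events = sum(events.values())
    total_non_events = sum(non_events.values())
    levels = set(categories)
    k = len(levels)
    woe = {}
    for c in levels:
        # Smoothed share of all events / non-events that fall in category c.
        share_events = (events[c] + smoothing) / (total_events + smoothing * k)
        share_non_events = (non_events[c] + smoothing) / (total_non_events + smoothing * k)
        woe[c] = math.log(share_events / share_non_events)
    return woe

# Hypothetical training data: channel "A" is mostly fraud, channel "B" mostly not.
channel = ["A", "A", "A", "A", "B", "B", "B", "B"]
fraud = [1, 1, 1, 0, 1, 0, 0, 0]
woe = fit_woe(channel, fraud)

# Encode data with the fitted mapping (fitted on the training set only).
encoded = [woe[c] for c in channel]
```

Here `woe["A"]` is positive and `woe["B"]` is negative, matching the interpretation that a positive value signals association with the event.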
6. Feature Selection Methods
| Method | Approach |
|---|---|
| Filter methods | Rank features by a statistical measure (correlation, information gain, chi-squared) independently of any model. |
| Wrapper methods | Evaluate subsets of features using a model (forward selection, backward elimination, RFE). More accurate but computationally expensive. |
| Embedded methods | Feature selection is built into the model training (Lasso L1 penalty drives coefficients to zero; tree importance scores). |
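As an illustration of a filter method, the sketch below ranks features by the absolute Pearson correlation with the target, without training any model. Correlation is only one of the measures listed above; the feature names and data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def rank_features(features, target):
    """Filter method: score each feature on its own, then sort by score."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical training data.
features = {
    "noise": [5, 1, 4, 2, 6, 3],
    "strong": [1, 2, 3, 4, 5, 6],
    "moderate": [2, 1, 4, 3, 6, 5],
}
target = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

ranking = rank_features(features, target)
top_two = [name for name, _ in ranking[:2]]  # keep the two best-scoring features
```

Because each feature is scored in isolation, a filter like this is cheap but cannot detect redundancy between features or interactions among them.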