
Summary Advanced Analytics in a Big Data World - 18/20

Rating: -
Sold: 6
Pages: 31
Uploaded on: 13-03-2026
Written in: 2024/2025

This document contains a concise summary of the course Advanced Analytics in a Big Data World. The notes are based on lecture materials and provide an overview of the key concepts discussed throughout the course. The course is designed for master’s students in Information Management, Business and Information Systems Engineering, and Data Science at KU Leuven. The summary covers the main topics from the lectures, including preprocessing, supervised learning, model evaluation, ensemble methods, neural networks, representation learning, and applications such as text mining and social network analysis. These notes aim to help students review the essential ideas, understand the course structure, and prepare efficiently for exams or assignments.

Preview of the content

ADVANCED ANALYTICS
IN A BIG DATA WORLD (18/20)
Complete Course Summary — Chapters 1 through 11

KU Leuven · Prof. Seppe vanden Broucke




Chapter 1 — Introduction to Data Science
1. What is Data Science?
Data science is the discipline of extracting value and knowledge from data using three
ingredients: data (the raw material), an algorithm (the method), and a purpose (the business or
scientific goal).
Discovered patterns and models should be valid (generalise to new data), useful (actionable),
unexpected (non-obvious), and understandable (interpretable by humans).



2. Types of Data
Rather than simply calling data 'structured' or 'unstructured', it is more precise to speak of:
tabular/relational data, text, imagery & video, and audio. Working with non-tabular data requires
either featurization (converting raw data into a table) or end-to-end deep learning (architectures
that consume raw data directly).



3. Supervised vs. Unsupervised Learning
Category | Target variable? | Goal | Examples
Supervised | Yes | Learn a mapping f: X → Y | Fraud detection, churn prediction
Unsupervised | No | Find hidden structure in data | Clustering, anomaly detection

Supervised subtypes: Classification (categorical target) vs. Regression (continuous target).
Discriminative models learn P(Y|X) directly; generative models learn the joint P(X,Y) and apply
Bayes' rule.
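
The step from the joint distribution to a prediction can be written out as follows. This is only a sketch of the standard identity, using the same P(X,Y) and P(Y|X) notation as above; the summation index y' is introduced here for illustration.

```latex
P(Y = y \mid X = x)
  = \frac{P(X = x,\, Y = y)}{P(X = x)}
  = \frac{P(X = x \mid Y = y)\, P(Y = y)}{\sum_{y'} P(X = x \mid Y = y')\, P(Y = y')}
```

A generative model estimates the terms on the right-hand side and derives the left-hand side from them, whereas a discriminative model estimates the left-hand side directly.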



4. Analytics Spectrum
Type | Question answered | Typical techniques
Exploratory | What does the data look like? | Charts, distributions
Descriptive | What happened? | Clustering, association rules
Explanatory | Why did it happen? | Statistical tests, feature importance
Predictive | What will happen? | Supervised ML
Prescriptive | What should we do? | What-if analysis, optimisation




5. The Analytics Process (CRISP-DM)
The standard lifecycle: Business Understanding → Data Understanding → Data Preparation →
Modelling → Evaluation → Deployment — and then back to the beginning. In practice the
process is messy and iterative; you will constantly loop back.
Purpose first!
Always ask: if I had a perfect model, how would it actually be used? A well-designed heuristic can
often deliver 50% of the gain of a full ML model, much faster.



6. Key AI Milestones
Year | Milestone
2012 | AlexNet wins ImageNet, the starting gun for deep learning
2016–19 | AlphaGo, AlphaZero, AlphaStar master games via self-play RL
2022 | ChatGPT reaches 100 million users in 2 months
2023–25 | GNoME, AlphaGeometry, DeepSeek, GPT-4o: continuous breakthroughs




Chapter 2 — Preprocessing & Feature Engineering
1. The Golden Rule

Train/test split FIRST
Fit all preprocessing parameters (means, standard deviations, bin edges, encodings) on the
training set ONLY. Apply those exact same parameters to the test set and to new production
data. Never look at the test set during training.
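
The rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's own code: the function names (`train_test_split`, `fit_standardiser`, `apply_standardiser`) and the income figures are made up for the example.

```python
import random
import statistics

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_standardiser(train_values):
    """Estimate mean and standard deviation from the training values only."""
    return {
        "mean": statistics.fmean(train_values),
        "std": statistics.pstdev(train_values),
    }

def apply_standardiser(values, params):
    """Apply previously fitted parameters; nothing is re-estimated here."""
    std = params["std"] or 1.0  # guard against a constant column
    return [(v - params["mean"]) / std for v in values]

# Hypothetical data: monthly incomes.
incomes = [1800.0, 2100.0, 2500.0, 3200.0, 4000.0, 5200.0, 7500.0, 12000.0]

train, test = train_test_split(incomes)          # 1. split first
params = fit_standardiser(train)                 # 2. fit on train only
train_scaled = apply_standardiser(train, params) # 3. apply to train ...
test_scaled = apply_standardiser(test, params)   # ... and reuse for test
```

The same `params` dictionary would also be stored and reused for new production data, so that every record is scaled exactly as the training data was.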



2. Missing Values
First understand WHY data is missing: 'Not applicable' vs. 'Unknown'. Then choose a strategy:
Strategy | How it works | When to use | Risk
Delete | Remove rows/columns | Missing completely at random; low % | Loss of data; bias
Mean / Median | Replace with average of training set | Numerical features, roughly normal | Reduces variance; distorts correlations
Mode | Replace with most frequent value | Categorical features | Majority-class bias
Model-based | Predict missing values from other features | Missingness depends on other features | Computationally expensive
Missingness indicator | Add a binary flag 'was this missing?' | When missingness itself is informative | Doubles features for that column

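Missing values can NEVER be ignored
Most algorithms need a value in every cell. Only a few implementations (for example some gradient boosting libraries) handle missing values natively, so always choose a strategy explicitly.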
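
Two of the strategies from the table, median imputation and the missingness indicator, combine naturally. The sketch below assumes missing values are represented as `None`; the helper names and the age data are hypothetical.

```python
import statistics

def fit_median_imputer(train_values):
    """Compute the fill value from the observed training values only."""
    observed = [v for v in train_values if v is not None]
    return statistics.median(observed)

def impute_with_indicator(values, fill_value):
    """Return (imputed values, 0/1 flags marking which cells were missing)."""
    imputed = [fill_value if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing

# Hypothetical age column with gaps.
train_age = [23, None, 35, 41, None, 52]
test_age = [None, 30]

median_age = fit_median_imputer(train_age)
train_filled, train_flag = impute_with_indicator(train_age, median_age)
# The test set reuses the training median; it is never re-estimated.
test_filled, test_flag = impute_with_indicator(test_age, median_age)
```

Keeping the flag column lets a model still pick up on the fact that a value was absent, even after the gap has been filled.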



3. Outliers
Type | Recommended action
Invalid outlier (e.g. age = 999) | Treat as missing and apply your missing value strategy
Valid outlier, algorithm is sensitive (e.g. linear regression, k-NN) | Cap/winsorise, or apply a robust transformation
Valid outlier, algorithm is robust (tree-based models) | Can often leave as-is
Finding outliers is the goal | Treat as an anomaly detection task (unsupervised)
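
Capping (winsorising) a valid outlier could look like the sketch below. The 5th/95th percentile cut-offs, the linear-interpolation percentile, and the function names are assumptions chosen for illustration; the summary does not prescribe specific thresholds.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile for q in [0, 1] on pre-sorted data."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])

def fit_caps(train_values, low_q=0.05, high_q=0.95):
    """Learn the lower and upper caps from the training set only."""
    ordered = sorted(train_values)
    return percentile(ordered, low_q), percentile(ordered, high_q)

def winsorise(values, caps):
    """Clip every value into the [low, high] interval given by the caps."""
    low, high = caps
    return [min(max(v, low), high) for v in values]

# Hypothetical transaction amounts with one extreme but valid value.
train_amounts = [10.0, 12.0, 13.0, 15.0, 16.0, 18.0, 20.0, 22.0, 25.0, 400.0]

caps = fit_caps(train_amounts)
train_capped = winsorise(train_amounts, caps)
test_capped = winsorise([5.0, 17.0, 900.0], caps)  # same caps reused
```

Values inside the caps pass through unchanged; only the tails are pulled in, which limits their leverage on sensitive models such as linear regression or k-NN.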




4. Numerical Transformations
Transformation | What it does
Standardisation (z-score) | Subtract mean, divide by σ. Required for distance-based and gradient-based models (k-NN, SVM, neural nets, logistic regression).
Min-Max normalisation [0, 1] | Scales values to [0, 1]. Sensitive to outliers.
Log transform | Compresses right-skewed distributions. Useful for income, counts, transaction amounts.
Box-Cox / Yeo-Johnson | Generalised power transform to make distributions more normal-shaped. Yeo-Johnson also handles negative values.
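
To see why min-max scaling is sensitive to outliers and how a log transform helps, consider this small sketch. The data and function names are made up, and `log1p` (log of 1 + x) is used here as one common way to keep zero values valid.

```python
import math

def fit_min_max(train_values):
    """Learn the minimum and maximum from the training values."""
    return min(train_values), max(train_values)

def apply_min_max(values, bounds):
    """Rescale with previously fitted bounds; training values land in [0, 1]."""
    low, high = bounds
    span = (high - low) or 1.0  # guard against a constant column
    return [(v - low) / span for v in values]

def log_transform(values):
    """log(1 + x): defined at zero and compresses a long right tail."""
    return [math.log1p(v) for v in values]

# Hypothetical right-skewed amounts: one large value dominates the range.
amounts = [0.0, 10.0, 20.0, 30.0, 1000.0]

scaled_raw = apply_min_max(amounts, fit_min_max(amounts))
logged = log_transform(amounts)
scaled_logged = apply_min_max(logged, fit_min_max(logged))
```

In `scaled_raw` the four ordinary values are squeezed into the bottom few percent of the range, while in `scaled_logged` they spread over roughly the lower half, so the bulk of the data keeps usable resolution.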

Tree-based models are scale-invariant
Decision trees, Random Forest, and gradient boosting are not affected by scaling or
normalisation.



5. Encoding Categorical Variables
Encoding | # New features | Best for
Integer encoding | 1 | Ordinal features (use with caution on nominal)
One-hot encoding | k−1 (one reference level dropped; k otherwise) | Nominal features with few categories
Binary encoding | ⌈log₂(k)⌉ | High-cardinality nominals
Weights of Evidence (WoE) | 1 | Binary classification; logistic regression
Embeddings | d (user-defined) | Very high cardinality; deep learning pipelines
Hashing trick | n (user-defined) | Online learning; very large feature spaces
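
The first two rows of the table can be sketched as follows. The helper names, the city and size values, and the choice to drop the alphabetically first level as the reference are all assumptions for illustration.

```python
def fit_one_hot(train_categories, drop_first=True):
    """Learn the output columns from the training categories only."""
    levels = sorted(set(train_categories))
    # Dropping one level gives k-1 columns; the dropped level is the reference.
    return levels[1:] if drop_first else levels

def apply_one_hot(categories, columns):
    """One row of 0/1 indicators per value; unseen categories become all zeros."""
    return [[1 if c == col else 0 for col in columns] for c in categories]

def integer_encode(values, order):
    """Map ordinal levels to integers following an explicit, meaningful order."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]

# Nominal feature: no natural order, so one-hot encode it.
train_city = ["Leuven", "Gent", "Brussel", "Gent"]
columns = fit_one_hot(train_city)
train_encoded = apply_one_hot(train_city, columns)
test_encoded = apply_one_hot(["Gent", "Antwerpen"], columns)

# Ordinal feature: the order carries meaning, so integers are appropriate.
sizes = integer_encode(["small", "large", "medium"], order=["small", "medium", "large"])
```

One caveat of this sketch: with a dropped reference level, an unseen test category such as "Antwerpen" is indistinguishable from the reference level, so a separate "unknown" column may be preferable in practice.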



Weights of Evidence (WoE)
WoE replaces each category with a number reflecting its association with the positive class.
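Formula: WoE(c) = ln( (events in c / all events) / (non-events in c / all non-events) ), i.e. the share of all events that fall in category c divided by the share of all non-events that fall in c.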
A positive WoE means the category is associated with the event (e.g. fraud). Popular in credit
scoring. Always fit on training set only.
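
A minimal sketch of the WoE computation, following the sign convention above (positive means associated with the event). The additive `smoothing` term is an assumption added here to avoid taking the log of zero for categories with no events or no non-events, and mapping unseen categories to 0.0 is likewise an illustrative choice.

```python
import math
from collections import Counter

def fit_woe(categories, targets, smoothing=0.5):
    """Learn one WoE value per category from training data (targets are 0/1)."""
    events = Counter(c for c, y in zip(categories, targets) if y == 1)
    non_events = Counter(c for c, y in zip(categories, targets) if y == 0)
    levels = set(categories)
    # Smoothed totals keep the per-category shares summing to one.
    total_events = sum(events.values()) + smoothing * len(levels)
    total_non_events = sum(non_events.values()) + smoothing * len(levels)
    woe = {}
    for level in levels:
        event_share = (events[level] + smoothing) / total_events
        non_event_share = (non_events[level] + smoothing) / total_non_events
        woe[level] = math.log(event_share / non_event_share)
    return woe

def apply_woe(categories, woe, default=0.0):
    """Replace each category by its fitted WoE; unseen levels get a neutral 0."""
    return [woe.get(c, default) for c in categories]

# Hypothetical training data: sales channel and a fraud label.
channel = ["web", "web", "web", "web", "shop", "shop", "shop", "shop"]
fraud = [1, 1, 1, 0, 0, 0, 0, 1]

woe = fit_woe(channel, fraud)
test_values = apply_woe(["web", "phone"], woe)
```

Here "web" holds most of the fraud cases and therefore receives a positive WoE, while "shop" receives a negative one.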



6. Feature Selection Methods
Method | Approach
Filter methods | Rank features by a statistical measure (correlation, information gain, chi-squared) independently of any model.
Wrapper methods | Evaluate subsets of features using a model (forward selection, backward elimination, RFE). More accurate but computationally expensive.
Embedded methods | Feature selection is built into the model training (Lasso L1 penalty drives coefficients to zero; tree importance scores).
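
As an illustration of a filter method, the sketch below ranks features by the absolute Pearson correlation with the target, without training any model. The feature names and values are invented, and absolute correlation is just one of the measures listed above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def rank_features(features, target):
    """Score each feature by |correlation with target|, highest first."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical training data for a churn problem.
train_features = {
    "n_calls": [1, 2, 3, 4, 5, 6],
    "tenure": [6, 5, 4, 3, 2, 2],
    "noise": [3, 1, 4, 1, 5, 2],
}
churned = [0, 0, 0, 1, 1, 1]

ranking = rank_features(train_features, churned)
top_two = [name for name, _ in ranking[:2]]
```

Because each feature is scored on its own, this approach is cheap but cannot detect features that are only useful in combination, which is where wrapper and embedded methods come in.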
