Summary

Summary Advanced Analytics in a Big Data World - 18/20

Uploaded on: 13-03-2026

Written in: 2024/2025

This document contains a concise summary of the course Advanced Analytics in a Big Data World. The notes are based on lecture materials and provide an overview of the key concepts discussed throughout the course. The course is designed for master’s students in Information Management, Business and Information Systems Engineering, and Data Science at KU Leuven. The summary covers the main topics from the lectures, including preprocessing, supervised learning, model evaluation, ensemble methods, neural networks, representation learning, and applications such as text mining and social network analysis. These notes aim to help students review the essential ideas, understand the course structure, and prepare efficiently for exams or assignments.

Preview of the content

ADVANCED ANALYTICS
IN A BIG DATA WORLD (18/20)
Complete Course Summary — Chapters 1 through 11

KU Leuven · Prof. Seppe vanden Broucke

Chapter 1 — Introduction to Data Science
1. What is Data Science?
Data science is the discipline of extracting value and knowledge from data using three
ingredients: data (the raw material), an algorithm (the method), and a purpose (the business or
scientific goal).
Discovered patterns and models should be valid (generalise to new data), useful (actionable),
unexpected (non-obvious), and understandable (interpretable by humans).

2. Types of Data
Rather than simply calling data 'structured' or 'unstructured', it is more precise to speak of:
tabular/relational data, text, imagery & video, and audio. Working with non-tabular data requires
either featurization (converting raw data into a table) or end-to-end deep learning (architectures
that consume raw data directly).

3. Supervised vs. Unsupervised Learning
Category | Target variable? | Goal | Examples
Supervised | Yes | Learn a mapping f: X → Y | Fraud detection, churn prediction
Unsupervised | No | Find hidden structure in data | Clustering, anomaly detection

Supervised subtypes: Classification (categorical target) vs. Regression (continuous target).
Discriminative models learn P(Y|X) directly; generative models learn the joint P(X,Y) and apply
Bayes' rule.
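
As a minimal sketch of the generative route, the snippet below estimates the joint P(X, Y) from counts and then applies Bayes' rule to obtain P(Y | X). The observations, the feature values ("night", "day") and the function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical labelled observations: (feature value x, class label y).
observations = [
    ("night", "fraud"), ("night", "fraud"), ("night", "legit"),
    ("day", "legit"), ("day", "legit"), ("day", "legit"),
    ("day", "fraud"), ("night", "legit"),
]

def joint_distribution(pairs):
    """Estimate the joint P(X, Y) from co-occurrence counts."""
    counts = Counter(pairs)
    total = len(pairs)
    return {xy: count / total for xy, count in counts.items()}

def posterior(joint, x):
    """Bayes' rule: P(Y | X = x) = P(x, Y) / sum over y' of P(x, y')."""
    evidence = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / evidence for (xi, y), p in joint.items() if xi == x}

joint = joint_distribution(observations)
print(posterior(joint, "day"))
```

A discriminative model would skip the joint table and fit P(Y | X) directly, for example with logistic regression.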

4. Analytics Spectrum
Type | Question answered | Typical techniques
Exploratory | What does the data look like? | Charts, distributions
Descriptive | What happened? | Clustering, association rules
Explanatory | Why did it happen? | Statistical tests, feature importance
Predictive | What will happen? | Supervised ML
Prescriptive | What should we do? | What-if analysis, optimisation

5. The Analytics Process (CRISP-DM)
The standard lifecycle: Business Understanding → Data Understanding → Data Preparation →
Modelling → Evaluation → Deployment — and then back to the beginning. In practice the
process is messy and iterative; you will constantly loop back.
Purpose first!
Always ask: if I had a perfect model, how would it actually be used? A well-designed heuristic can
often deliver 50% of the gain of a full ML model, much faster.

6. Key AI Milestones
Year | Milestone
2012 | AlexNet wins ImageNet: the starting gun for deep learning
2016–19 | AlphaGo, AlphaZero, AlphaStar master games via self-play RL
2022 | ChatGPT reaches 100 million users in 2 months
2023–25 | GNoME, AlphaGeometry, DeepSeek, GPT-4o: continuous breakthroughs

Chapter 2 — Preprocessing & Feature Engineering
1. The Golden Rule

Train/test split FIRST
Fit all preprocessing parameters (means, standard deviations, bin edges, encodings) on the
training set ONLY. Apply those exact same parameters to the test set and to new production
data. Never look at the test set during training.
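
The rule above can be sketched in a few lines of plain Python. This is an illustrative outline, not a prescribed pipeline: the income values, the 25% test fraction, the seed and the helper names are all assumptions.

```python
import random
import statistics

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and split them before any preprocessing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_standardiser(train_values):
    """Learn the mean and standard deviation from the training set only."""
    return {"mean": statistics.fmean(train_values),
            "std": statistics.pstdev(train_values)}

def apply_standardiser(params, values):
    """Reuse the stored training parameters on any data set."""
    return [(v - params["mean"]) / params["std"] for v in values]

incomes = [1800, 2100, 2500, 2600, 3000, 3200, 4100, 5200]  # hypothetical data
train, test = train_test_split(incomes)
params = fit_standardiser(train)                  # fitted on the training set only
train_scaled = apply_standardiser(params, train)
test_scaled = apply_standardiser(params, test)    # same parameters, never refitted
```

The key design choice is that `params` is computed once and then passed around; the test set and production data never influence it.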

2. Missing Values
First understand WHY data is missing: 'Not applicable' vs. 'Unknown'. Then choose a strategy:
Strategy | How it works | When to use | Risk
Delete | Remove rows/columns | Missing completely at random; low % | Loss of data; bias
Mean / Median | Replace with average of training set | Numerical features, roughly normal | Reduces variance; distorts correlations
Mode | Replace with most frequent value | Categorical features | Majority-class bias
Model-based | Predict missing values from other features | Missingness depends on other features | Computationally expensive
Missingness indicator | Add a binary flag 'was this missing?' | When missingness itself is informative | Doubles features for that column

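Missing values can NEVER be ignored
Most algorithms need a value in every cell. Even implementations that accept missing values natively (e.g. XGBoost, LightGBM) apply their own built-in handling, so the choice should still be a deliberate one.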
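
A minimal sketch of two strategies from the table, median imputation combined with a missingness indicator, might look as follows. The age values and function names are hypothetical; `None` is assumed to mark a missing cell.

```python
import statistics

def fit_median_imputer(train_column):
    """Learn the median from the observed (non-None) training values."""
    observed = [v for v in train_column if v is not None]
    return statistics.median(observed)

def impute_with_indicator(column, fill_value):
    """Return (imputed values, 0/1 flags marking which cells were missing)."""
    imputed = [fill_value if v is None else v for v in column]
    was_missing = [1 if v is None else 0 for v in column]
    return imputed, was_missing

train_age = [23, None, 35, 41, None, 29]   # hypothetical training column
test_age = [None, 52]                      # hypothetical test column

median_age = fit_median_imputer(train_age)                 # training set only
train_filled, train_flag = impute_with_indicator(train_age, median_age)
test_filled, test_flag = impute_with_indicator(test_age, median_age)
```

The test column is filled with the training median, in line with the golden rule of fitting on the training set only.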

3. Outliers
Type | Recommended action
Invalid outlier (e.g. age = 999) | Treat as missing and apply your missing value strategy
Valid outlier, algorithm is sensitive (e.g. linear regression, k-NN) | Cap/winsorise, or apply a robust transformation
Valid outlier, algorithm is robust (tree-based models) | Can often leave as-is
Finding outliers is the goal | Treat as an anomaly detection task (unsupervised)
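
Capping (winsorising) a valid outlier can be sketched as below. The 5th and 95th percentile cut-offs, the simple floor-based quantile rule and the transaction amounts are assumptions made for illustration; statistical libraries usually interpolate quantiles differently.

```python
def fit_caps(train_values, lower_q=0.05, upper_q=0.95):
    """Learn lower/upper caps as empirical quantiles of the training set.

    Uses a simple floor-based rank (an assumption, no interpolation).
    """
    ordered = sorted(train_values)
    last = len(ordered) - 1
    lower = ordered[int(lower_q * last)]
    upper = ordered[int(upper_q * last)]
    return lower, upper

def winsorise(values, caps):
    """Clip every value into the [lower, upper] range learned on train."""
    lower, upper = caps
    return [min(max(v, lower), upper) for v in values]

train_amounts = [10, 12, 13, 15, 15, 16, 18, 19, 20, 22, 950]  # hypothetical
test_amounts = [5, 17, 400]                                    # hypothetical

caps = fit_caps(train_amounts)              # fitted on the training set only
train_capped = winsorise(train_amounts, caps)
test_capped = winsorise(test_amounts, caps)
```

Here the extreme value 950 is pulled down to the upper cap instead of being deleted, so the row stays available for training.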

4. Numerical Transformations
Transformation | What it does
Standardisation (z-score) | Subtract mean, divide by σ. Required for distance-based and gradient-based models (k-NN, SVM, neural nets, logistic regression).
Min-Max normalisation [0, 1] | Scales values to [0, 1]. Sensitive to outliers.
Log transform | Compresses right-skewed distributions. Useful for income, counts, transaction amounts.
Box-Cox / Yeo-Johnson | Generalised power transform to make distributions more normal-shaped. Yeo-Johnson also handles negative values.

Tree-based models are scale-invariant
Decision trees, Random Forest, and gradient boosting are not affected by scaling or
normalisation.
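
To make two rows of the transformation table concrete, the sketch below applies min-max normalisation and a log transform to the same right-skewed column. The amounts are hypothetical, and `log1p` (log of 1 + x) is used as one common choice so that zero values remain valid.

```python
import math

def fit_min_max(train_values):
    """Learn the minimum and maximum from the training set only."""
    return min(train_values), max(train_values)

def apply_min_max(values, bounds):
    """Scale values with the stored training bounds."""
    low, high = bounds
    return [(v - low) / (high - low) for v in values]

def log_transform(values):
    """log(1 + x): compresses large values and is defined at x = 0."""
    return [math.log1p(v) for v in values]

train_amounts = [0, 5, 20, 100, 10000]     # hypothetical, right-skewed
bounds = fit_min_max(train_amounts)
scaled = apply_min_max(train_amounts, bounds)
logged = log_transform(train_amounts)

print(scaled)   # the single extreme value squeezes the rest close to 0
print(logged)   # the log scale keeps the smaller values distinguishable
```

Note that values in new data that fall outside the training range will land outside [0, 1] after min-max scaling.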

5. Encoding Categorical Variables
Encoding | # New features | Best for
Integer encoding | 1 | Ordinal features (use with caution on nominal)
One-hot encoding | k−1 (one reference category dropped; k otherwise) | Nominal features with few categories
Binary encoding | ⌈log₂(k)⌉ | High-cardinality nominals
Weights of Evidence (WoE) | 1 | Binary classification; logistic regression
Embeddings | d (user-defined) | Very high cardinality; deep learning pipelines
Hashing trick | n (user-defined) | Online learning; very large feature spaces
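
The feature counts in the table can be checked with a small sketch of one-hot encoding (with one reference category dropped) and binary encoding. The colour column and helper names are hypothetical, and the alphabetical category order is an arbitrary assumption.

```python
import math

def fit_categories(train_column):
    """Fix the category order from the training set only."""
    return sorted(set(train_column))

def one_hot_drop_first(value, categories):
    """k-1 indicator columns; the first category is the all-zeros reference."""
    return [1 if value == c else 0 for c in categories[1:]]

def binary_encode(value, categories):
    """Write the category's integer index in base 2 using ceil(log2(k)) bits."""
    n_bits = max(1, math.ceil(math.log2(len(categories))))
    index = categories.index(value)
    return [(index >> bit) & 1 for bit in reversed(range(n_bits))]

colours = ["red", "green", "blue", "green", "yellow"]   # hypothetical column
cats = fit_categories(colours)     # k = 4 categories

print(one_hot_drop_first("red", cats))   # 3 columns (k-1)
print(binary_encode("red", cats))        # 2 columns (log2 of 4)
```

In this sketch a category unseen during training maps to all zeros under one-hot encoding (indistinguishable from the reference category) and raises a `ValueError` under binary encoding, so a real pipeline would need an explicit policy for unseen values.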

Weights of Evidence (WoE)
WoE replaces each category with a number reflecting its association with the positive class.
Formula: WoE = ln( (events in category / all events) / (non-events in category / all non-events) )
A positive WoE means the category is associated with the event (e.g. fraud). Popular in credit
scoring. Always fit on training set only.
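
A worked sketch of the WoE calculation is given below, reading the formula as the share of all events that fall in a category divided by the share of all non-events in that category. The channel and fraud data are hypothetical, and the code assumes every category contains at least one event and one non-event; otherwise some form of smoothing would be needed to avoid taking the log of zero.

```python
import math
from collections import Counter

def fit_woe(categories, targets):
    """Compute WoE per category from training data (targets are 0/1)."""
    events = Counter(c for c, y in zip(categories, targets) if y == 1)
    non_events = Counter(c for c, y in zip(categories, targets) if y == 0)
    total_events = sum(events.values())
    total_non_events = sum(non_events.values())
    woe = {}
    for category in set(categories):
        share_events = events[category] / total_events
        share_non_events = non_events[category] / total_non_events
        woe[category] = math.log(share_events / share_non_events)
    return woe

# Hypothetical training data: sales channel and fraud flag (1 = fraud).
channel = ["web", "web", "web", "web", "store", "store", "store", "store"]
fraud = [1, 1, 1, 0, 1, 0, 0, 0]

woe = fit_woe(channel, fraud)                    # fitted on the training set only
encoded = [woe[c] for c in channel]              # one numeric feature replaces the category
```

With these numbers "web" holds 3 of the 4 fraud cases but only 1 of the 4 non-fraud cases, so its WoE is ln(3), which is positive; "store" gets the mirror value ln(1/3).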

6. Feature Selection Methods
Method | Approach
Filter methods | Rank features by a statistical measure (correlation, information gain, chi-squared) independently of any model.
Wrapper methods | Evaluate subsets of features using a model (forward selection, backward elimination, RFE). More accurate but computationally expensive.
Embedded methods | Feature selection is built into the model training (Lasso L1 penalty drives coefficients to zero; tree importance scores).
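
As an illustration of a filter method, the sketch below ranks features by the absolute Pearson correlation with the target, without training any model. The feature names, values and target are hypothetical, and correlation is only one of the possible statistical measures named above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def rank_features(features, target):
    """Score each feature independently and sort from strongest to weakest."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical training data.
features = {
    "tenure": [1, 2, 3, 4, 5, 6],
    "random_id": [5, 1, 4, 2, 5, 1],
}
target = [10, 20, 30, 40, 50, 65]

ranking = rank_features(features, target)
print(ranking)
```

Because each feature is scored on its own, this approach is cheap but cannot detect features that only matter in combination, which is where wrapper and embedded methods come in.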

Written for

Institution: Katholieke Universiteit Leuven (KU Leuven)
Study: Master Handelsingenieur In De Beleidsinformatica
Course: Advanced Analytics in a Big Data World (D0S06B)

Document information

Uploaded on: 13 March 2026
Number of pages: 31
Written in: 2024/2025
Type: SUMMARY

Topics

neural networks
data analytics
data preprocessing
feature engineering
supervised learning
unsupervised learning
ensemble modeling
deep learning
hadoop
spark
text mining
graph learning

Seller: driesfroidmont
