Summary

Summary Introduction to Analytics D0H61a

Uploaded on: 15-06-2025

Written in: 2024/2025

Course by prof. Jochen de Weerdt. Very comprehensive summary comprising all of the theory required for the exam. If something is not included in the document (which will be rare), I always refer to the slides. Many subjects that may be confusing at first when going through the slides are described intuitively in this summary. Almost every link (URL) that was used in the slides as an example or clarification is included as well. 82 pages may look like a lot, but that includes many tables and figures, which make the summary very easy and intuitive to go through.


Introduction to Analytics

2024-2025 Finn Germe

Contents

1 The Data Analytics Process
1.1 What is it all about?
1.2 Data
1.3 Algorithms
1.3.1 Supervised Algorithms
1.3.2 Unsupervised algorithms
1.4 Purpose
1.5 The Data Analytics Process
1.5.1 MLOps
1.5.2 Involved Parties and Roles
1.6 Challenges
1.7 Big Data
1.8 Conclusion

2 Data Preprocessing
2.1 Data Selection
2.1.1 Data Leakage
2.2 Data Cleaning
2.3 Data Transformation
2.3.1 Standardization, Normalization & Categorization
2.3.2 Dummy Variables and Encodings
2.3.3 Feature Engineering
2.3.4 Feature Reduction
2.3.5 Feature Selection
2.4 Conclusion

3 Exploratory Data Analysis (EDA)
3.1 Data Visualization
3.1.1 The Essence: Human Perception
3.1.2 Human limitations
3.1.3 What makes a good visualization?
3.2 An example EDA

4 Predictive Analytics – Decision Trees
4.1 Supervised Learning
4.2 Decision Trees
4.3 The ID3 Algorithm
4.3.1 Entropy
4.3.2 Information Gain
4.3.3 Impurity Measures
4.4 C4.5 Algorithm
4.5 Countering Overfitting
4.5.1 Possible Fixes
4.5.2 Non-Linearity and Decision Boundaries
4.6 Tools
4.7 Conclusion

5 Predictive Analytics – Model Evaluation
5.1 Classification Performance
5.1.1 Threshold Dependency
5.2 Cost-Sensitive Evaluation
5.2.1 Inverse Class Distribution
5.2.2 Classifying Using Costs
5.3 Threshold-Independent Evaluation
5.3.1 Receiver Operating Characteristic (ROC) Curve
5.3.2 Precision-Recall Curve
5.4 Cross-Validation and Tuning Analytical Models
5.5 Model Monitoring
5.6 Conclusion

6 Predictive Analytics – Regression
6.1 Linear Regression
6.2 Logistic Regression
6.3 Regularization
6.3.1 Stepwise Regression
6.3.2 Elastic Net
6.4 Regression Techniques
6.4.1 Confidence intervals for β
6.5 Regression Trees
6.5.1 Splitting Criterion
6.6 Conclusion

7 Predictive Analytics – Other
7.1 k-Nearest Neighbours – kNN
7.1.1 Weighted Voting
7.1.2 Conclusion
7.2 Support Vector Machines – SVM
7.2.1 The Dual Problem and Kernels
7.2.2 Prediction
7.2.3 Conclusion
7.3 Naïve Bayes and Bayesian Networks
7.3.1 Bayesian Networks
7.4 Others: RF, XGB, Deep Learning?
7.5 Conclusion

8 Descriptive Analytics I – Clustering
8.1 Hierarchical Clustering
8.2 Partitional Clustering
8.2.1 K-means
8.2.2 Choosing the number of clusters K
8.2.3 K-means++
8.3 Other Clustering Methods
8.4 Cluster Validation
8.4.1 Internal Cluster Validation
8.4.2 External Cluster Validation
8.4.3 Domain Knowledge
8.5 Conclusion


9 Descriptive Analytics II – Association Rules
9.1 Association Rule Mining
9.1.1 Support and Confidence
9.1.2 Mining Association Rules
9.1.3 Extensions
9.2 Sequential Pattern Mining
9.2.1 Temporal Data Mining
9.2.2 Sequential Pattern Mining
9.2.3 Algorithms for sequential pattern mining
9.2.4 Timing Constraints
9.3 Conclusion – ARM and Seq. Mining

10 Fraud Analytics
10.1 Predictive Analytics for Fraud Detection
10.1.1 Synthetic Minority Oversampling Technique (SMOTE)
10.1.2 Descriptive Analytics for Fraud Detection
10.1.3 Outlier Detection
10.1.4 Local Outlier Factor (LOF)
10.1.5 Isolation Forests (IF)
10.2 Benford’s Law
10.2.1 First-two digits?
10.3 Conclusion

11 Business Applications
11.1 Credit Risk Modelling
11.1.1 Retail Credit Risk Modelling
11.1.2 Corporate Credit Risk Modelling
11.2 Marketing Analytics
11.2.1 Customer churn prediction
11.2.2 Customer lifetime value modeling
11.2.3 Response modelling
11.2.4 Uplift modeling – prescriptive analytics
11.3 Conclusion


1 The Data Analytics Process

Figure 1: Using data analytics to create real business value

Data contains value and knowledge. Some claim that data is the new oil; the professor does not agree. To extract this knowledge you have to be able to:
∗ Store it
∗ Manage it
∗ Analyze it

1.1 What is it all about?
What is AI?

AI is a field of computer science dedicated to solving problems which otherwise require human intelligence, for example pattern recognition, learning, and generalization.
Machine Learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead.

∗ ML is seen as a subset of AI
∗ ML algorithms build a mathematical model based on training data, to make predictions or decisions without being explicitly programmed to perform the task
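
As a minimal sketch of what "building a mathematical model based on training data" can look like, the snippet below fits a straight line to a handful of points with ordinary least squares and then uses it to predict. The data and the names `fit_line`, `train_x`, `train_y` and `predict` are made up for illustration and do not come from the course.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b (closed-form solution)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical training data (assumption): the points lie on y = 2x + 1.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

# "Learning": the parameters come from the data, not from hand-written rules.
slope, intercept = fit_line(train_x, train_y)


def predict(x):
    """Apply the learned model to an unseen input."""
    return slope * x + intercept


print(slope, intercept, predict(5.0))
```

Nobody programmed the rule "multiply by 2 and add 1"; the model recovers it from the examples, which is the point the definition above makes.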


STATISTICS
∗ Describing data (average, sd, ...)
∗ Quantitative modeling of distributions and relationships between variables
∗ Models often parametric and meant for human understanding

ML
∗ Algorithms for pattern extraction and modeling
∗ Many of which use statistical (↑) underpinning
∗ Models often non-parametric and black box
∗ Meant to optimize predictive power

AI
∗ At first (1960-): programming paradigms to make machines appear intelligent
∗ Later: NN and SVM based approaches. True intelligence in reach, then lost
∗ Now: Deep Learning. These are ML algorithms as well, but often applied in difficult non-traditional settings where simpler statistical approaches failed

→ We will focus on analytics from a business perspective: extracting useful patterns and models from data, using
1. Data
2. An algorithm
3. A purpose

1.2 Data
Structured vs Unstructured Data
There are two main approaches for dealing with non-tabular data:
1. Making it tabular → featurization
2. Using models that can directly utilize data as-is (Deep learning models)
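
To make the first approach concrete, here is a small sketch of featurization for text: each document becomes a row of word counts (a bag of words). This is only one possible featurization, chosen for illustration; the function name `featurize` and the example documents are assumptions, not taken from the slides.

```python
from collections import Counter


def featurize(documents):
    """Turn raw text documents into a table of word counts (bag of words)."""
    tokenized = [doc.lower().split() for doc in documents]
    # The columns of the table: every distinct word, in alphabetical order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Hypothetical unstructured input (assumption): two short product reviews.
docs = ["great product great price", "bad product"]
columns, table = featurize(docs)

print(columns)
for row in table:
    print(row)
```

The result is an ordinary table with one row per document and one column per word, which any algorithm for tabular data can consume.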

Figure 2: Datasets, attributes, instances

A tabular data set (structured data) has
∗ Instances (examples, rows, observations, customers, ...)
∗ Features (attributes, fields, variables, predictors, explanatory variables, regressors, ...). These can be
  – Numeric (continuous)
  – Categorical (discrete, factor), and either
    ∗ Nominal
    ∗ Ordinal
∗ A target variable, which is present most of the time as well
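
The terminology above can be illustrated with a tiny, made-up data set. In this sketch every dictionary is one instance, the keys are the features, and `churned` plays the role of the target. All names and values are hypothetical, and the ordering of the education levels is an assumption made for the example.

```python
# Hypothetical customer data set: each dict is one instance (row).
instances = [
    {"age": 34, "region": "north", "education": "bachelor", "churned": "no"},
    {"age": 51, "region": "south", "education": "master", "churned": "yes"},
    {"age": 27, "region": "north", "education": "secondary", "churned": "no"},
]

TARGET = "churned"

# Ordinal features carry an explicit order; nominal features have none.
ORDINAL_LEVELS = {"education": ["secondary", "bachelor", "master"]}


def feature_type(name):
    """Label a column as numeric, ordinal or nominal."""
    values = [row[name] for row in instances]
    if all(isinstance(v, (int, float)) for v in values):
        return "numeric"
    if name in ORDINAL_LEVELS:
        return "categorical (ordinal)"
    return "categorical (nominal)"


features = [name for name in instances[0] if name != TARGET]
types = {name: feature_type(name) for name in features}

# Split the table into the feature matrix X and the target vector y.
X = [[row[name] for name in features] for row in instances]
y = [row[TARGET] for row in instances]

print(types)
print(y)
```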

1.3 Algorithms
“An algorithm is a set of step-by-step instructions designed to solve a problem or accomplish a specific
task. It defines a sequence of actions to be performed in a particular order to process inputs and produce
desired outputs. Algorithms are foundational in computer science and programming, where they serve as
the logic behind software and systems.”

Figure 3: The data analysis spectrum


1.3.1 Supervised Algorithms
→ for which you will need a target/label

Categorical target
∗ Binary classification
∗ Multiclass classification
∗ Ordinal classification
∗ Multilabel classification

Numerical target → regression
∗ Absolute values
∗ Delta values
∗ Quantile regression
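
The sketch below shows how the type of target changes the task while the learning procedure can stay the same. It uses a bare-bones nearest-neighbour approach on a single numeric feature: a majority vote for a categorical target, an average for a numerical one. The data and function names are invented for illustration, and kNN is used here only because it is short to write down.

```python
from collections import Counter


def nearest_targets(train, x, k):
    """Targets of the k training instances closest to x (one numeric feature)."""
    ranked = sorted(train, key=lambda pair: abs(pair[0] - x))
    return [target for _, target in ranked[:k]]


def classify(train, x, k=3):
    """Categorical target: majority vote among the k nearest neighbours."""
    return Counter(nearest_targets(train, x, k)).most_common(1)[0][0]


def regress(train, x, k=3):
    """Numerical target: average of the k nearest neighbours' targets."""
    targets = nearest_targets(train, x, k)
    return sum(targets) / len(targets)


# Hypothetical training sets (assumption): (feature value, target) pairs.
labelled = [(1.0, "low"), (2.0, "low"), (3.0, "low"), (8.0, "high"), (9.0, "high")]
valued = [(1.0, 10.0), (2.0, 12.0), (3.0, 14.0), (8.0, 40.0), (9.0, 44.0)]

print(classify(labelled, 2.5))
print(classify(labelled, 8.5))
print(regress(valued, 2.5))
```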
ML is about generalizable correlation → not causation per se! ML can find correlations between variables that have absolutely nothing to do with each other.
Examples:
∗ Tanks
∗ An RL agent in the Udacity self-driving car, rewarded for speed, learns to spin in circles
∗ NASA Mars mission planning, optimizing food/water/electricity consumption for total man-days of survival, yields an "optimal" plan of killing 2/3 of the crew and keeping the remaining crew alive as long as possible
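
A quick numerical sketch of the correlation-versus-causation point: two series that have nothing to do with each other, but that both happen to grow over time, end up almost perfectly correlated. The figures and variable names are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up yearly figures (assumption): both simply increase year after year.
ice_cream_sales = [20, 24, 27, 33, 38, 41]
software_licences = [100, 104, 109, 113, 120, 123]

r = pearson(ice_cream_sales, software_licences)
print(round(r, 3))
```

A model trained on such data would happily use one series to "predict" the other, even though neither causes the other; the shared upward trend does all the work.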

1.3.2 Unsupervised algorithms
→ Extract patterns from the data as is
Clustering: construct groups over the data set
Association/sequence/rule... mining: find rules of antecedents and consequents that describe the data
Anomaly detection: find outliers
Dimensionality reduction: from many variables to fewer
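
As an illustration of extracting patterns without a target, the sketch below runs a plain k-means loop on a single numeric feature. The spending values, the starting centers and the function name `kmeans_1d` are assumptions made for this example; it is a simplified sketch, not a full clustering implementation.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on one numeric feature, starting from the given centers."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put every value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical monthly spend per customer (assumption): no labels are given.
spend = [10, 12, 11, 50, 52, 49]
centers, clusters = kmeans_1d(spend, centers=[0.0, 60.0])

print(centers)
print(clusters)
```

The two groups (low spenders and high spenders) emerge from the data alone, which is what "extract patterns from the data as is" refers to.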

Figure 4: Clustering & association rules

1.4 Purpose
Unsupervised learning techniques for descriptive analytics and supervised learning techniques for predictive analytics
Yes... but why?

Exploratory: plots, distributions, quick charts, basic correlations – very visual

Written for

Institution: Katholieke Universiteit Leuven (KU Leuven)
Study: Bachelor Handelsingenieur
Course: Introduction to Analytics (D0H61A)


Document information

Uploaded on: 15 June 2025
Number of pages: 82
Written in: 2024/2025
Type: SUMMARY

Topics

data analytics
algorithms
data
supervised algorithms
unsupervised algorithms
big data
data preprocessing
exploratory data analysis
eda
predictive analytics
decision trees
model evaluation
regression
k
