
Summary Introduction to Analytics D0H61a

Pages: 82
Uploaded on: 15-06-2025
Written in: 2024/2025

Course by prof. Jochen de Weerdt. A very comprehensive summary covering all of the theory required to take the exam. If something is not included in the document (which will be rare), I always refer to the slides. Many subjects that may be confusing at first when going through the slides are described intuitively in this summary. Almost every link (URL) that was used in the slides as an example or clarification is included as well. 82 pages may look like a lot, but that includes many tables and figures, which make the summary very easy and intuitive to go through.


Introduction to Analytics

2024-2025 Finn Germe



Contents

1 The Data Analytics Process
    1.1 What is it all about?
    1.2 Data
    1.3 Algorithms
        1.3.1 Supervised Algorithms
        1.3.2 Unsupervised algorithms
    1.4 Purpose
    1.5 The Data Analytics Process
        1.5.1 MLOps
        1.5.2 Involved Parties and Roles
    1.6 Challenges
    1.7 Big Data
    1.8 Conclusion

2 Data Preprocessing
    2.1 Data Selection
        2.1.1 Data Leakage
    2.2 Data Cleaning
    2.3 Data Transformation
        2.3.1 Standardization, Normalization & Categorization
        2.3.2 Dummy Variables and Encodings
        2.3.3 Feature Engineering
        2.3.4 Feature Reduction
        2.3.5 Feature Selection
    2.4 Conclusion

3 Exploratory Data Analysis (EDA)
    3.1 Data Visualization
        3.1.1 The Essence: Human Perception
        3.1.2 Human limitations
        3.1.3 What makes a good visualization?
    3.2 An example EDA

4 Predictive Analytics – Decision Trees
    4.1 Supervised Learning
    4.2 Decision Trees
    4.3 The ID3 Algorithm
        4.3.1 Entropy
        4.3.2 Information Gain
        4.3.3 Impurity Measures
    4.4 C4.5 Algorithm
    4.5 Countering Overfitting
        4.5.1 Possible Fixes
        4.5.2 Non-Linearity and Decision Boundaries
    4.6 Tools
    4.7 Conclusion

5 Predictive Analytics – Model Evaluation
    5.1 Classification Performance
        5.1.1 Threshold Dependency
    5.2 Cost-Sensitive Evaluation
        5.2.1 Inverse Class Distribution
        5.2.2 Classifying Using Costs
    5.3 Threshold-Independent Evaluation
        5.3.1 Receiver Operating Characteristic (ROC) Curve
        5.3.2 Precision-Recall Curve
    5.4 Cross-Validation and Tuning Analytical Models
    5.5 Model Monitoring
    5.6 Conclusion

6 Predictive Analytics – Regression
    6.1 Linear Regression
    6.2 Logistic Regression
    6.3 Regularization
        6.3.1 Stepwise Regression
        6.3.2 Elastic Net
    6.4 Regression Techniques
        6.4.1 Confidence intervals for β
    6.5 Regression Trees
        6.5.1 Splitting Criterion
    6.6 Conclusion

7 Predictive Analytics – Other
    7.1 k-Nearest Neighbours – kNN
        7.1.1 Weighted Voting
        7.1.2 Conclusion
    7.2 Support Vector Machines – SVM
        7.2.1 The Dual Problem and Kernels
        7.2.2 Prediction
        7.2.3 Conclusion
    7.3 Naïve Bayes and Bayesian Networks
        7.3.1 Bayesian Networks
    7.4 Others: RF, XGB, Deep Learning?
    7.5 Conclusion

8 Descriptive Analytics I – Clustering
    8.1 Hierarchical Clustering
    8.2 Partitional Clustering
        8.2.1 K-means
        8.2.2 Choosing the number of clusters K
        8.2.3 K-means++
    8.3 Other Clustering Methods
    8.4 Cluster Validation
        8.4.1 Internal Cluster Validation
        8.4.2 External Cluster Validation
        8.4.3 Domain Knowledge
    8.5 Conclusion





9 Descriptive Analytics II – Association Rules
    9.1 Association Rule Mining
        9.1.1 Support and Confidence
        9.1.2 Mining Association Rules
        9.1.3 Extensions
    9.2 Sequential Pattern Mining
        9.2.1 Temporal Data Mining
        9.2.2 Sequential Pattern Mining
        9.2.3 Algorithms for sequential pattern mining
        9.2.4 Timing Constraints
    9.3 Conclusion – ARM and Seq. Mining

10 Fraud Analytics
    10.1 Predictive Analytics for Fraud Detection
        10.1.1 Synthetic Minority Oversampling TEchnique (SMOTE)
        10.1.2 Descriptive Analytics for Fraud Detection
        10.1.3 Outlier Detection
        10.1.4 Local Outlier Factor (LOF)
        10.1.5 Isolation Forests (IF)
    10.2 Benford’s Law
        10.2.1 First-two digits?
    10.3 Conclusion

11 Business Applications
    11.1 Credit Risk Modelling
        11.1.1 Retail Credit Risk Modelling
        11.1.2 Corporate Credit Risk Modelling
    11.2 Marketing Analytics
        11.2.1 Customer churn prediction
        11.2.2 Customer lifetime value modeling
        11.2.3 Response modelling
        11.2.4 Uplift modeling – prescriptive analytics
    11.3 Conclusion





1 The Data Analytics Process




Figure 1: Using data analytics to create real business value

Data contains value and knowledge. Some claim that data is the new oil; the professor does not agree. To extract this knowledge you have to be able to:
∗ Store it
∗ Manage it
∗ Analyze it

1.1 What is it all about?
What is AI?

AI is a field of computer science dedicated to solving problems which otherwise require human intelligence, for example pattern recognition, learning, and generalization.
Machine Learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead.

∗ ML is seen as a subset of AI
∗ ML algorithms build a mathematical model based on training data, to make predictions or decisions without being explicitly programmed to perform the task
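
To make the contrast with explicit programming concrete, here is a minimal sketch (not taken from the slides) in which the relation between input and output is estimated from training data by ordinary least squares instead of being hard-coded. The data points and the function names `fit_line` and `predict` are made up for illustration.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Estimate slope and intercept of y = a*x + b by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model: Tuple[float, float], x: float) -> float:
    """Apply the learned line to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data: the pattern behind it is never written in the code.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

model = fit_line(train_x, train_y)
prediction = predict(model, 10.0)
print(model, prediction)
```

The only thing the programmer specifies is the model family (a straight line); the parameters come from the data.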





STATISTICS
∗ Describing data (average, sd...)
∗ Quantitative modeling of distributions and relationships between variables
∗ Models often parametric and meant for human understanding

ML
∗ Algorithms for pattern extraction and modeling
∗ Many of which use statistical (↑) underpinning
∗ Models often non-parametric and black box
∗ Meant to optimize predictive power

AI
∗ At first (1960-): programming paradigms to make machines appear intelligent
∗ Later NN and SVM based approaches. True intelligence in reach, then lost
∗ Now: Deep Learning. These are ML algorithms as well but often applied in difficult non-traditional settings where simpler statistical approaches failed


→ We will focus on analytics from a business perspective: extracting useful patterns and models from data, using:
1. Data
2. An algorithm
3. A purpose

1.2 Data
Structured vs Unstructured Data
Two main approaches for dealing with non-tabular data:
1. Making it tabular → featurization
2. Using models that can directly utilize data as-is (deep learning models)
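
As an illustration of the first approach, the sketch below featurizes free text into a table with one word-count column per vocabulary word (a simple bag-of-words). This is only one possible featurization, chosen here as an assumption; the function name `featurize` and the example sentences are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def featurize(documents: List[str]) -> List[Dict[str, int]]:
    """Turn free text into rows with one word-count column per vocabulary word."""
    tokenized = [doc.lower().split() for doc in documents]
    # The column set is the sorted vocabulary over all documents.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for words that do not occur in this document.
        rows.append({word: counts[word] for word in vocabulary})
    return rows


docs = ["great product great price", "bad product"]
table = featurize(docs)
for row in table:
    print(row)
```

Every document becomes one instance with the same set of numeric features, so any algorithm that expects tabular input can be applied afterwards.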




Figure 2: Datasets, attributes, instances

A tabular data set (structured data) has
∗ Instances (examples, rows, observations, customers...)
∗ Features (attributes, fields, variables, predictors, explanatory variables, regressors...). These can be
    – Numeric (continuous)
    – Categorical (discrete, factor), and then either
        ∗ Nominal
        ∗ Ordinal
∗ A target variable, which is present most of the time as well
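
The terminology above can be sketched on a tiny, made-up customer table. The field names, the `EDUCATION_LEVELS` ordering and the values are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List

# Ordered levels of the ordinal feature; the order carries meaning.
EDUCATION_LEVELS = ["secondary", "bachelor", "master"]


@dataclass
class Customer:        # one instance (row)
    age: float         # numeric (continuous) feature
    region: str        # nominal categorical feature: no natural order
    education: str     # ordinal categorical feature: ordered levels
    churned: bool      # target variable


dataset: List[Customer] = [
    Customer(34.0, "north", "master", False),
    Customer(51.0, "south", "secondary", True),
    Customer(27.0, "north", "bachelor", False),
]

# Ordinal values can be compared through their rank; nominal values cannot.
ranks = [EDUCATION_LEVELS.index(c.education) for c in dataset]
n_instances = len(dataset)
n_features = 3  # age, region, education (the target is not a feature)
print(n_instances, n_features, ranks)
```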

1.3 Algorithms
“An algorithm is a set of step-by-step instructions designed to solve a problem or accomplish a specific
task. It defines a sequence of actions to be performed in a particular order to process inputs and produce
desired outputs. Algorithms are foundational in computer science and programming, where they serve as
the logic behind software and systems.”




Figure 3: The data analysis spectrum


1.3.1 Supervised Algorithms
→ for which you will need a target/label

Categorical target
∗ Binary classification
∗ Multiclass classification
∗ Ordinal classification
∗ Multilabel classification

Numerical target → regression
∗ Absolute values
∗ Delta values
∗ Quantile regression

ML is about generalizable correlation, not causation per se! ML can find correlations between variables that have absolutely nothing to do with each other.
Examples:
∗ Tanks
∗ An RL agent in a Udacity self-driving car, rewarded for speed, learns to spin in circles
∗ NASA Mars mission planning, optimizing food/water/electricity consumption for total man-days of survival, yields an "optimal" plan of killing 2/3 of the crew and keeping the rest alive as long as possible
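
The point about correlation without causation can be reproduced in a few lines. The sketch below computes the Pearson correlation between two made-up series that have nothing to do with each other but both happen to grow over time; the numbers and variable names are invented for illustration.

```python
import math
from typing import List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Made-up yearly figures for two unrelated quantities that both trend upward.
ice_cream_sales = [100.0, 120.0, 135.0, 160.0, 180.0]
software_bugs = [40.0, 46.0, 55.0, 61.0, 70.0]

r = pearson(ice_cream_sales, software_bugs)
print(round(r, 3))
```

The coefficient comes out close to 1, yet neither quantity drives the other: a shared trend is enough. A model trained on such data would happily use one to predict the other.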

1.3.2 Unsupervised algorithms
→ Extract patterns from the data as is
Clustering: construct groups over the data set
Association/sequence/rule... mining: find rules of antecedents and consequents that describe the data
Anomaly detection: find outliers
Dimensionality reduction: from many variables to fewer
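
As a small illustration of extracting patterns without a target, here is a simplified one-dimensional k-means sketch. It is not the formulation from the slides: the start centers are fixed by hand, the number of iterations is an arbitrary assumption, and the spend values are made up.

```python
from typing import List, Tuple


def kmeans_1d(
    values: List[float], centers: List[float], iterations: int = 10
) -> Tuple[List[float], List[int]]:
    """Very small k-means for one-dimensional data with given start centers."""
    centers = list(centers)
    assignment: List[int] = []
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest center.
        assignment = [
            min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            for v in values
        ]
        # Update step: each center moves to the mean of its assigned values.
        for k in range(len(centers)):
            members = [v for v, a in zip(values, assignment) if a == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = sum(members) / len(members)
    return centers, assignment


# Hypothetical customer spend values; no labels are provided.
spend = [1.0, 2.0, 1.5, 10.0, 11.0, 9.5]
centers, labels = kmeans_1d(spend, centers=[0.0, 5.0])
print(centers, labels)
```

The algorithm receives only the feature values; the two groups (low and high spenders) emerge from the data itself.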




Figure 4: Clustering & association rules


1.4 Purpose
Unsupervised learning techniques for descriptive analytics and supervised learning techniques for predictive analytics.
Yes... but why?

Exploratory: plots, distributions, quick charts, basic correlations – very visual