C1 - The Data Analytics Process
1. Introduction
Use of AI skyrockets
➔ More efficient
➔ More affordable
➔ More accessible
However, 95% of GenAI pilots fail
Can you create business value with AI?
➔ Real business value = in applications
o What do people look into when looking at simple implementations?
o Marketing, risk management, government, web, logistics, …
Data analytics
- Data contains value and knowledge
- Some claim data is the new oil (but I don’t agree)
- But to extract this knowledge, you need to be able to
o Store it
o Manage it
o Analyze it → remains a big issue: data in itself is not valuable, you need to extract the
information from it to realize the value
- Data Mining ≈ Big Data ≈ Data Analytics ≈ Data Science ≈ Machine Learning ≈ Deep Learning ≈
Decision Science ≈ AI ?
AI = Artificial Intelligence = a field of computer science focused on building systems that perform tasks that
normally require human intelligence (for example, pattern recognition, learning, and generalization)
ML = Machine Learning = methods that learn patterns from data to make predictions or decisions, without
being explicitly programmed with rules
Data Analytics = the application of data analysis and machine learning to extract insights from data and
support decision-making
Statistics
- Explain relationships in data (does variable X influence Y?)
- Emphasis on assumptions, uncertainty, interpretability
- Often smaller data, parametric models
- Primary goal: explanation and understanding
ML / AI
- Predict outcomes or support decisions
- Emphasis on performance and generalization
- Often larger data, flexible models
Statistics: explanation & inference
ML: prediction & decision-making
Data science = umbrella term for statistics + ML + AI
Business perspective of analytics:
- Given (lots of) data, extracting useful patterns and models from data
o Instead of hand-coding, let the data speak
o To help predict something, explain something, decide something (and more?)
- Using
1. Data
2. An algorithm
3. A purpose
The extracted patterns and models should be
o Valid: hold on new data with some certainty (i.e. generalizable)
o Useful: should be possible to act on the item (i.e. actionable)
o Unexpected: non-obvious to the system (i.e. interesting)
o Understandable: humans should be able to interpret the pattern (i.e. explainable)
1) Data
o Structured or unstructured? Tabular, relational, text, imagery, audio, …
o Two main approaches to deal with non-tabular data
▪ Making it tabular (“featurization”)
▪ Using models that can directly utilize data as-is (“deep learning”)
o A tabular data set (“structured data”)
▪ Instances (examples, rows, observations, customers, cases, …)
▪ Features (attributes, fields, variables, predictors, covariates, explanatory variables,
regressors, independent variables)
• Numeric (continuous)
• Categorical (discrete, factor), either nominal (binary as a special case) or ordinal
▪ Target (label, class, dependent variable, response variable) can also be present
• = the feature that you want to predict
• Numeric, categorical, …
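A minimal sketch of the "featurization" approach described above: raw text (non-tabular) is turned into a table with one row per instance and one numeric feature per word, plus a categorical target. The example texts, labels, and the helper name `featurize` are made up for illustration; a word-count table is only one of many ways to featurize text.

```python
from collections import Counter

# Hypothetical raw, non-tabular data: short texts with a categorical target.
raw_instances = [
    ("great product fast delivery", "positive"),
    ("slow delivery bad product", "negative"),
    ("great service", "positive"),
]

def featurize(texts):
    """Turn texts into a table: one row per instance, one numeric column per word."""
    vocabulary = sorted({word for text in texts for word in text.split()})
    rows = []
    for text in texts:
        counts = Counter(text.split())
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

texts = [text for text, _ in raw_instances]
target = [label for _, label in raw_instances]   # the column we want to predict
feature_names, table = featurize(texts)

print(feature_names)                 # the features (columns)
for row, label in zip(table, target):
    print(row, "->", label)          # the instances (rows) with their target
```

Each printed row is one instance, each position in the row is one numeric feature, and the label after the arrow is the target.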
2) Algorithms
o Data analysis spectrum
▪ BI = Business intelligence = what is shown is decided upfront by humans: you design
what you want to see yourself
▪ AI / ML / analytics → you don’t design it yourself or make assumptions upfront, the
algorithm decides
1. Descriptive analytics = finding hidden structure in data (e.g. clustering, pattern
mining, …)
2. Predictive analytics = build models that predict what will happen (ML techniques
like classification, regression, forecasting, …)
3. Prescriptive analytics = build models that predict what you should do (decision
making, recommender systems, reinforcement learning, …)
4. Cognitive analytics = self-learning systems, cognitive computing, artificial
general intelligence
o 3 big types
▪ (Reinforcement learning: learn by interacting with an environment)
▪ Supervised learning: learn from labeled data → predictive analytics
• Key idea: learn a function that maps inputs X to a known target Y
• Need labels!
• 2 problem types
o Classification → target is categorical (e.g. binary, multiclass, ordinal, …)
o Regression → target is numerical (continuous) (e.g. absolute values,
changes (deltas), quantiles)
• Generalizability to “unseen” data (= data not previously used for the training)
o ML is all about generalizable correlation (the model learns patterns), not
causation (there is no proof that a particular variable influences
another)
o E.g. identifying pictures of tanks: model focused on the clouds & weather
instead of the tank patterns themselves
• Example algorithm: decision tree learner
▪ Unsupervised learning: find structure in data → descriptive analytics
• Extract patterns from the data as is
o Clustering : construct groups over the data set
o Association/sequence/… rule mining : find rules of antecedents and
consequents that describe the data
o Anomaly detection: find outliers in the data set
o (Dimensionality reduction : from many variables to fewer)
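A minimal sketch of the supervised learning idea above: learn a function from labeled data (X → Y) and check whether it generalizes to unseen data. It uses a decision stump (a decision tree with a single split), which is a simplification and not a full decision tree learner. The feature (monthly spend), the labels, and all values are made up for illustration.

```python
# Hypothetical labeled data: (monthly_spend, churned) pairs; values are made up.
train = [(10, 1), (15, 1), (22, 1), (35, 0), (40, 0), (55, 0)]
# Data that was NOT used for training, to check generalization.
unseen = [(12, 1), (30, 1), (50, 0), (20, 1)]

def accuracy(threshold, data):
    """Share of instances where the rule 'spend < threshold -> churn' is correct."""
    hits = sum((x < threshold) == bool(y) for x, y in data)
    return hits / len(data)

def fit_stump(data):
    """Learn the split point that best separates the classes on the training data."""
    values = sorted(x for x, _ in data)
    # Candidate thresholds: midpoints between consecutive feature values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(candidates, key=lambda t: accuracy(t, data))

threshold = fit_stump(train)
print("learned threshold:", threshold)
print("training accuracy:", accuracy(threshold, train))
print("accuracy on unseen data:", accuracy(threshold, unseen))
```

In this toy data the rule is perfect on the training set but makes a mistake on the unseen set, which is why performance should be judged on data not used for training.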
3) Purpose
o Business question? Business problem?
o Types
▪ Exploratory: plots, distributions, quick charts, basic correlations – very visual
But who says you couldn’t build a supervised model to help here?
▪ Descriptive: unsupervised – clustering, association rules
Depends on which style of descriptions you want to get; very often you already
have some hypothesis going on
▪ Explanatory: unsupervised again?
Depending on target definition and model type used, a supervised model can be
used as an explanatory means with not much generalization power going forward
▪ Predictive: supervised for sure (right?)
Though in many cases unsupervised techniques can be used here as a
featurization or pre-processing step
▪ Prescriptive: “what should I do”
What-if analysis using a supervised model, or using good ole’ operations research
o ML isn’t the solution for every problem!
2. The data analytics process
KDD process = knowledge discovery in databases
➔ Linear process
CRISP-DM = cross-industry standard process for data mining
➔ Not a linear process but an iterative one (you won't get it completely
right on the first try)
(SEMMA = Sample, Explore, Modify, Model & Assess)
(The drivetrain approach)
The real data analytics process: complicated, a lot of skipping & going back
Where does it go wrong?
- Misaligned objectives
o Data science teams often optimize model accuracy
o Business teams care about value, insight, and usability
o Accuracy is easy to measure, impact is not
➔ Collaborate with business teams
- Wrong project mindset
o Data science is often treated as an execution task (but this won’t guarantee delivery)
o In reality, it is an exploratory learning process
o Models, features, and parameters are discovered through iteration
➔ Data science teams need the freedom to learn what works as they go (not before they go!)
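A small worked example of the misaligned-objectives point above (accuracy versus business value). Two hypothetical models are compared on a made-up marketing campaign. The contact cost and response revenue are assumptions chosen purely for illustration.

```python
# Hypothetical outcomes for 10 customers: 1 = would respond to a campaign.
actual  = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
model_a = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # never targets anyone
model_b = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]   # targets 5 customers, 3 of them wrongly

# Assumed economics (made-up numbers): each contact costs 5, each response earns 50.
CONTACT_COST = 5
RESPONSE_REVENUE = 50

def accuracy(predicted, actual):
    """Technical metric: share of correct predictions."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def campaign_profit(predicted, actual):
    """Business metric: revenue from reached responders minus contact costs."""
    contacted = sum(predicted)
    responses = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
    return responses * RESPONSE_REVENUE - contacted * CONTACT_COST

for name, model in [("A", model_a), ("B", model_b)]:
    print(name, "accuracy:", accuracy(model, actual),
          "profit:", campaign_profit(model, actual))
```

Under these assumptions model A has the higher accuracy (0.8 versus 0.7) but generates no profit, while model B yields a profit of 75. The model that wins on the technical metric is not the one the business would choose.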
Managing data science:
- Data science is not a linear project
o Goals, data, and models evolve during the project
o You cannot fully specify requirements upfront
- Key management challenges
o Bridging business goals and technical metrics
o Supporting experimentation and iteration
o Moving from prototype to production reliably
➔ Managing data science requires processes and infrastructure, not just algorithms
MLOps = a set of techniques and practices used to design, build and deploy machine learning models in an
efficient, optimized, and organized manner
➔ How to serve/deliver your models?
➔ Integrated thinking across the entire chain
➔ Key focus: deployment
business problem → data engineering → ML model engineering → code engineering
➔ MLOps technologies:
o open source (TensorFlow, Airflow, Kubeflow, …)
o commercial (Databricks, Azure ML, …)
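A minimal sketch of one small piece of "how to serve/deliver your models": a trained model is saved as a versioned artifact and loaded again on the serving side. It uses only the Python standard library and a hypothetical threshold rule as the "model". The file name, the artifact fields, and the functions `save_model` and `load_model` are assumptions for illustration and do not reflect how any of the tools listed above work.

```python
import json
import tempfile
from pathlib import Path

def save_model(path, threshold, version):
    """Persist the learned parameter together with metadata as a JSON artifact."""
    artifact = {"model_type": "threshold_rule", "threshold": threshold, "version": version}
    Path(path).write_text(json.dumps(artifact))

def load_model(path):
    """Load the artifact and return a prediction function for the serving side."""
    artifact = json.loads(Path(path).read_text())
    threshold = artifact["threshold"]

    def predict(value):
        # Hypothetical rule: values below the threshold are classified as 1.
        return int(value < threshold)

    return predict, artifact["version"]

with tempfile.TemporaryDirectory() as folder:
    model_path = Path(folder) / "model.json"
    save_model(model_path, threshold=28.5, version="0.1.0")   # "training" side
    predict, version = load_model(model_path)                 # "serving" side
    print(version, predict(12), predict(40))
```

Storing a version next to the parameters is one simple way to keep track of which model is running in production.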