CONTENTS
Lecture 1: Introduction + Data Hardware
  Intro to Data Engineering
  Some Fundamentals
Lecture 2: Data Governance / Data Ingestion
  Data governance
  Data ingestion
Lecture 3+4: Storage
  Databases: why do we use them?
  Relational databases
  SQL
  Non-relational databases (NoSQL)
  Data Formats
  Storage architectures
  Physical storage
Lecture 5+6: Computing
  Algorithmic complexity
  Parallel computing
  Distributed computing
  Batching and streaming
  Cloud computing
  Containerization
  Pipelines
  What do we need?
Lecture 7: Analytics - Graphs
  Graphs are everywhere
  What are graphs?
  Graph metrics
  Graph analytics
  How to use graphs for predictions
  Tools and libraries
  Applications
Lecture 8: Analytics – NLP, time series, other modalities
  Predictive modeling: recap
  NLP
  Behavioral analytics
  Time series analytics
Lecture 8: Visualisation
  Why visualise?
  Foundations of visualisations
  Types of visualisation
  Statistical visualisations
  Displaying high-dimensional data
  Design best practices
  Ethics & misrepresentation
  Tools
Guest Lecture: Modern Data Platforms
  1. What is a data platform?
  2. Ingestion
  3. Storage
  4. Processing & Insights
  5. Data Governance in practice
  6. AI architectures
LECTURE 1: INTRODUCTION + DATA HARDWARE
INTRO
ABOUT ME
• Prof. dr. Sofie Goethals
• University of Antwerp, Faculty of Business & Economics, Department of Engineering Management
• Education
o Master in Business Engineering (KULeuven)
o Master in Artificial Intelligence (KULeuven)
• 2020 – 2024: PhD in Responsible AI (University of Antwerp)
• 2024 – 2025: Postdoctoral researcher (Columbia Business School)
• 2025 - …: Assistant Professor at University of Antwerp (Deep Learning, Data Engineering, Python for Machine Learning)
ABOUT THIS COURSE
• Goal
o General principles
▪ A lot of the tools change rapidly
o Tools
▪ Python
▪ SQL
▪ Linux
▪ PySpark
▪ A lot of other tools
• What you should know before
o Machine Learning principles
▪ ML models for tabular data
▪ Preprocessing, training and test set, …
o Python
➔ Python for Machine Learning or Machine Learning For Business
• Structure
o Fundamentals of data engineering
▪ Hardware principles
▪ Pipelines
▪ Storage
▪ Computing
o Big Data Analytics
▪ Graph
▪ NLP
▪ Time series
▪ Visualization
• Inspiration
o Data Engineering course from 2018 to 2023 (Len Feremans)
o Books:
▪ Designing Data-Intensive Applications
▪ Fundamentals of Data Engineering
▪ The Data Engineering Cookbook
• Lab sessions
o Thursday 10h30-12h30 → Check schedule! (Not every week)
o Python and other tools
▪ Jupyter notebooks
• Lesson plan
WEEK | DATE   | LECTURE (Tuesday)                                                                      | DATE | LAB SESSION (Thursday)
1    | 10/2   | Intro / Data Hardware                                                                  | 12/2 | /
2    | 17/2   | Data Governance / Data Ingestion                                                       | 19/2 | Github + Data Ingestion
3    | 24/2   | STORAGE: databases: SQL                                                                | 26/2 | /
4    | 3/3    | STORAGE: NoSQL, Storage abstractions, physical data formats                            | 5/3  | Storage
5    | 10/3   | COMPUTE: Algorithmic complexity, Parallel computing, Distributed Computing, MapReduce, Spark | 12/3 | /
6    | 17/3   | COMPUTE: Batch and Streaming, Cloud Computing, Linux                                   | 19/3 | Compute
7    | 24/3   | ANALYTICS: Graphs                                                                      | 26/3 | Graph analytics
8    | 31/3   | ANALYTICS: NLP, Time series, Other modalities                                          | 2/4  | Other Analytics
9    | 21/4   | VISUALISATION                                                                          | 23/4 | Visualisation
10   | 28/4   | DE IN PRACTICE: Guest Lecture (MANDATORY)                                              | 30/4 | /
11   | 5&12/5 | PROJECT PRESENTATIONS                                                                  | 7/5  | /
EVALUATION (EXAM + PROJECT)
• Closed-book exam (15/20)
o Combination of multiple choice and open questions
o Multiple choice questions
▪ Examples throughout the lectures
o Open questions
▪ Build a SQL query
▪ Graph analytics calculation
▪ Explain the difference between a CPU and GPU
▪ For a specific company set-up, discuss whether you would go for cloud or not
▪ …
• Project (5/20)
o Coding project
o Groups of 3-4 persons
Each lecture includes example exam questions, and some of them will come back on the exam.
INTRO TO DATA ENGINEERING
DATA, DATA, DATA
• 5.52B people on the internet
• What happens every minute?
o 18.8M text messages sent
o 251.1M emails
o 5.9M Google searches
o 3.5M YouTube video views
o 852 Airbnb stays booked
o 9,000 members apply for jobs on …
o Lots of data, which can be valuable, but you need to be able to work with it
• Why is it important?
o Top 5 companies in the S&P 500 are tech companies
▪ Microsoft
▪ Nvidia
▪ Apple
▪ Amazon
▪ Alphabet
➔ the largest companies work with data
• 5 V’s of Big Data
o Volume: size and amount of data
o Velocity: speed of data generation and movement → data is generated quickly
o Variety: diversity and range of different data types (unstructured, semi-structured, structured, …)
o Veracity: varying levels of data quality and accuracy
o Value: turning data into actionable value
• 3 types of data
o Structured data: regular databases with rows and columns (like an Excel sheet); what we have worked with so far
o Unstructured data: images, videos, … → information needs to be extracted first
o Semi-structured data: transaction files, log files, … → more structured than an image, but not like an Excel sheet
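The three types can be illustrated with a minimal Python sketch. All data below (the customer rows, the log events, the four "pixel" bytes) is made up for illustration and does not come from the lecture; the point is only how much work is needed before each type can be analysed.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields (hypothetical data).
csv_text = "customer_id,country,amount\n1,BE,19.99\n2,NL,5.00\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total = sum(float(r["amount"]) for r in rows)

# Semi-structured: self-describing records whose fields can differ per record.
log_lines = [
    '{"event": "login", "user": 1}',
    '{"event": "purchase", "user": 2, "items": ["book", "pen"]}',
]
events = [json.loads(line) for line in log_lines]
fields_per_event = [sorted(e.keys()) for e in events]

# Unstructured: raw bytes without any fields; features must be extracted first.
image_bytes = bytes([0, 255, 128, 64])  # stand-in for real pixel data
mean_intensity = sum(image_bytes) / len(image_bytes)

print(total)
print(fields_per_event)
print(mean_intensity)
```

The structured rows can be aggregated directly by column name, the log events need a check of which fields each record actually has, and the raw bytes only become useful after a feature (here a simple mean) is computed from them.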
• Power of Big Data: Gathering Google Searches → example of getting value out of data
o Seth Stephens-Davidowitz examines Google Search data to uncover hidden patterns:
▪ Underreported racism in the United States (areas with higher rates of searches for racist jokes
show less support for Obama)
▪ Parenting concerns (different search terms for sons and daughters)
▪ Understanding mental health (‘depression symptoms’ search term peaks early in the week)
• Application: Netflix
o Netflix has a lot of information about us (what we watch, what we skip, what we search for, what we
rewatch, what time we watch, membership, …)
o Using this information (data), they can ‘easily’ predict what we will watch next
• You can also make a hit TV show by analyzing data: create a TV show that people would like based on
their streaming behavior → Use data to decide what to create ➔ TED TALK
o Goal is to produce TV shows close to 10 on the IMDb scale
▪ Amazon selected 8 candidate TV shows and made the first episode of each. They put them
online for free for everyone to watch and tracked everyone’s behavior (when someone pressed
pause or play, which parts were skipped, which parts were watched again, …)
• Used this data to decide which show to make
• Didn’t work that well (7.5 score; the average score is 7.2)
▪ Netflix → looked at all the data they already had about Netflix viewers (ratings, viewing history,
…) to discover all the little bits and pieces about the audience → worked very well (9.1 rating)
o Data analysis does not always guarantee optimal results!
▪ Difference between successful & unsuccessful decision-making with data
• Solving a complex problem consists of 2 parts
o 1: Take the problem apart into its bits and pieces so you can analyze those
o 2: Put all the bits and pieces back together to come to a conclusion
• Data & data analysis are only good for the first part
o Data & data analysis can only help you take a problem apart and
understand its pieces
• The brain can do step 2
• Netflix used data and brains where they belong in the process
o Data to understand pieces of their audience
o Decision to take the bits & pieces and put them back together was nowhere in
the data
• Amazon used data all the way to drive their decision making
o Safe decision because they can always point at the data, but it didn’t lead to the
desired result
▪ Things go wrong when data starts to drive decisions → data is just a tool
▪ In the end, it’s not data, it’s risk that will bring you to the higher ratings
▪ Data should not drive the decisions; reason about the data instead