
Summary Data Engineering | University of Antwerp | 2025/26

Pages: 146
Uploaded on: 28-05-2026
Written in: 2025/2026

Comprehensive summary for Data Engineering at Universiteit Antwerpen covering the complete course curriculum from 2025/26. Topics include data hardware fundamentals, data governance, data ingestion, relational and non-relational databases, SQL, storage architectures, algorithmic complexity, parallel and distributed computing, cloud computing, containerization, data pipelines, and graph analytics. Essential study material for understanding core data engineering concepts and ideal for exam preparation or reinforcing lectures. Contains slides and lecture notes.


DATA ENGINEERING
CONTENTS

Lecture 1: Introduction + Data Hardware
    Intro to Data Engineering
    Some Fundamentals
Lecture 2: Data Governance / Data Ingestion
    Data governance
    Data ingestion
Lecture 3+4: Storage
    Databases: why do we use them?
    Relational databases
    SQL
    Non-relational databases (NoSQL)
    Data Formats
    Storage architectures
    Physical storage
Lecture 5+6: Computing
    Algorithmic complexity
    Parallel computing
    Distributed computing
    Batching and streaming
    Cloud computing
    Containerization
    Pipelines
    What do we need?
Lecture 7: Analytics - Graphs
    Graphs are everywhere
    What are graphs?
    Graph metrics
    Graph analytics
    How to use graphs for predictions
    Tools and libraries
    Applications
Lecture 8: Analytics – NLP, time series, other modalities
    Predictive modeling: recap
    NLP
    Behavioral analytics
    Time series analytics
Lecture 8: Visualisation
    Why visualise?
    Foundations of visualisations
    Types of visualisation
    Statistical visualisations
    Displaying high-dimensional data
    Design best practices
    Ethics & misrepresentation
    Tools
Guest Lecture: Modern Data Platforms
    1. What is a data platform?
    2. Ingestion
    3. Storage
    4. Processing & Insights
    5. Data Governance in practice
    6. AI architectures





LECTURE 1: INTRODUCTION + DATA HARDWARE

INTRO


ABOUT ME
• Prof. dr. Sofie Goethals
• University of Antwerp, Faculty of Business & Economics, Department of Engineering Management
• Education
o Master in Business Engineering (KULeuven)
o Master in Artificial Intelligence (KULeuven)
• 2020 – 2024: PhD in Responsible AI (University of Antwerp)
• 2024 – 2025: Postdoctoral researcher (Columbia Business School)
• 2025 - …: Assistant Professor at University of Antwerp (Deep Learning, Data Engineering, Python for Machine Learning)


ABOUT THIS COURSE
• Goal
o General principles
▪ A lot of the tools change rapidly
o Tools
▪ Python
▪ SQL
▪ Linux
▪ PySpark
▪ A lot of other tools
• What you should know before
o Machine Learning principles
▪ ML models for tabular data
▪ Preprocessing, training and test set, …
o Python
➔ Python for Machine Learning or Machine Learning For Business
• Structure
o Fundamentals of data engineering
▪ Hardware principles
▪ Pipelines
▪ Storage
▪ Computing
o Big Data Analytics
▪ Graph
▪ NLP
▪ Time series
▪ Visualization
• Inspiration
o Data Engineering course 2018-2023 (Len Feremans)
o Books:
▪ Designing Data-Intensive Applications
▪ Fundamentals of Data Engineering
▪ The Data Engineering Cookbook
• Lab sessions
o Thursday 10h30-12h30 → Check schedule! (Not every week)
o Python and other tools
▪ Jupyter notebooks



• Lesson plan
WEEK | DATE   | LECTURE (Tuesday)                                                                            | DATE | LAB SESSION (Thursday)
1    | 10/2   | Intro/ Data Hardware                                                                         | 12/2 | /
2    | 17/2   | Data Governance/ Data Ingestion                                                              | 19/2 | Github + Data Ingestion
3    | 24/2   | STORAGE: databases: SQL                                                                      | 26/2 | /
4    | 3/3    | STORAGE: NoSQL, Storage abstractions, physical data formats                                  | 5/3  | Storage
5    | 10/3   | COMPUTE: Algorithmic complexity, Parallel computing, Distributed Computing, MapReduce, Spark | 12/3 | /
6    | 17/3   | COMPUTE: Batch and Streaming, Cloud Computing, Linux                                         | 19/3 | Compute
7    | 24/3   | ANALYTICS: Graphs                                                                            | 26/3 | Graph analytics
8    | 31/3   | ANALYTICS: NLP, Time series, Other modalities                                                | 2/4  | Other Analytics
9    | 21/4   | VISUALISATION                                                                                | 23/4 | Visualisation
10   | 28/4   | DE IN PRACTICE: Guest Lecture (MANDATORY)                                                    | 30/4 | /
11   | 5&12/5 | PROJECT PRESENTATIONS                                                                        | 7/5  | /


EVALUATION (EXAM + PROJECT)
• Closed-book exam (15/20)
o Combination of multiple choice and open questions
o Multiple choice questions
▪ Examples throughout the lectures
o Open questions
▪ Build a SQL query
▪ Graph analytics calculation
▪ Explain the difference between a CPU and GPU
▪ For a specific company set-up, discuss whether you would go for cloud or not?
▪ ..
• Project (5/20)
o Coding project
o Groups of 3-4 persons
➔ Each lecture includes example exam questions, and some of them will come back on the exam

INTRO TO DATA ENGINEERING


DATA, DATA, DATA
• 5.52B people on the internet
• What happens every minute?
o 18.8M text messages sent
o 251.1M emails
o 5.9M Google searches
o 3.5M YouTube video views
o 852 Airbnb stays booked
o 9,000 members apply for jobs on LinkedIn

o Lots of data, which can be valuable, but you need to be able to work with it
• Why is it important?
o Top 5 companies in the S&P 500 are tech companies
▪ Microsoft
▪ Nvidia
▪ Apple
▪ Amazon
▪ Alphabet
➔ the largest companies work with data





• 5 V’s of Big Data
o Volume: Size and amounts of big data
o Velocity: Speed of data generation and movement → generated in a quick way
o Variety: Diversity and range of different data types (unstructured, semi-structured, structured, …)
o Veracity: Varying levels of data quality and accuracy
o Value: Turning data into actionable value

• 3 types of data
o Structured data: Regular databases with rows and columns (like an Excel sheet); what we have worked with so far
o Unstructured data: Images, videos, … → need to be extracted first
o Semi-structured data: Transaction files, log files, … → more structured than an image, but not like an Excel sheet
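The three data types above can be illustrated with a minimal Python sketch. All records, field names, and the helper functions `field_names` and `has_fixed_schema` are hypothetical and only serve the illustration; they are not taken from the course material.

```python
import json

# Structured: every row has the same columns (like a spreadsheet or SQL table).
structured_rows = [
    {"user_id": 1, "country": "BE", "purchases": 3},
    {"user_id": 2, "country": "NL", "purchases": 0},
]

# Semi-structured: log lines carrying JSON; the fields can differ per record.
log_lines = [
    '{"user_id": 1, "event": "play", "meta": {"device": "tv"}}',
    '{"user_id": 2, "event": "search", "query": "documentaries"}',
]

# Unstructured: raw bytes (a stand-in for an image file); there are no fields
# to read, so information has to be extracted first.
image_bytes = bytes([137, 80, 78, 71])


def field_names(records):
    """Return the sorted union of keys found across all records."""
    keys = set()
    for record in records:
        keys.update(record)
    return sorted(keys)


def has_fixed_schema(records):
    """True when every record has exactly the same set of keys."""
    return all(set(r) == set(records[0]) for r in records)


events = [json.loads(line) for line in log_lines]

print(field_names(structured_rows), has_fixed_schema(structured_rows))
print(field_names(events), has_fixed_schema(events))
print(type(image_bytes).__name__, len(image_bytes))
```

The structured rows share one fixed schema, while the parsed log events have keys that vary per record; the byte string has no keys at all, which is why unstructured data needs an extraction step before analysis.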
• Power of Big Data: Gathering Google Searches → example of getting value out of data
o Seth Stephens-Davidowitz examines Google Search data to uncover hidden patterns:
▪ Underreported racism in the United States (areas with higher rates of searches for racist jokes show less support for Obama)
▪ Parenting concerns (different search terms for sons and daughters)
▪ Understanding mental health (‘depression symptoms’ search term peaks early in the week)
• Application: Netflix
o Netflix has a lot of information about us (what we watch, what we skip, what we search for, what we rewatch, what time we watch, membership, …)
o Using this information (data), they can ‘easily’ predict what we will watch next
• You can also make a hit TV show by analyzing data: you can create a TV show that people would like based on their streaming behavior → Use data to decide what to create ➔ TED talk
o Goal is to produce TV shows close to 10 on the IMDb scale
▪ Amazon selected 8 candidates for TV shows and made the first episode of each. They put them online for free for everyone to watch and tracked everyone’s behavior (when someone presses pause or play, what parts are skipped, what parts are watched again, …)
• Data to decide which show they would make
• Didn’t work that well (7.5 score, the average score is 7.2)
▪ Netflix → looked at all the data they already had about Netflix viewers (ratings, viewing history, …) to discover all the little bits and pieces about the audience → worked very well (9.1 rating)
o Data analysis does not always guarantee optimal results!
▪ Difference between successful & unsuccessful decision-making with data
• Solving a complex problem consists of 2 parts
o 1: Take that problem apart into its bits and pieces so you can analyze those
o 2: Put all the bits and pieces back together to come to the conclusion
• Data & data analysis is only good for the first part
o Data & data analysis can only help you take a problem apart and understand its pieces
• The brain can do step 2
• Netflix used data and brains where they belong in the process
o Data to understand pieces of their audience
o Decision to take the bits & pieces and put them back together was nowhere in the data
• Amazon used data all the way to drive their decision making
o Safe decision because they can always point at the data, but it didn’t lead to the desired result
▪ Things go wrong when data starts to drive decisions → data is just a tool
▪ In the end, it’s not data, it’s risk that will bring you to the higher ratings
▪ Data should not drive the decisions; reason about the data



File last updated on: June 5, 2026