Vrije Universiteit Amsterdam
Minor: Data Science (2022-23)
Course: Intro to Data Science
Course Code: XB_0018
Week 1 Lecture 1
Data Science History
Statistical approach (1900s): the classical method, although today a computer simulation will often give a quicker answer
- More modern approach: Data science approach
- Handles large amounts of data
- Focuses on visualising trends and variance
Dataset → make decisions based on data
- Data science does not replace experts, but augments expertise with knowledge derived
from data
- Data scientists often help other experts handle data through data hacking skills
- If you are a data scientist, you should consider building up domain expertise
[Figure: Venn diagram describing the field of data science]
Data visualisation can help solve problems (see historical examples from medicine)
Using Data to make decisions: MBA fees
Target group: 12 students
Make profit or make a loss?
Marginal cost level
Or: each student pays what they can afford, estimated to within 1000 euros from their company,
work experience, current job, country and previous study
→ individual fee for each student
- Let go of the assumption that we must pick one solution that works for everybody
Data Science → look for the patterns
Check for correlations in the data
Correlation ranges from -1 to +1
Correlation expresses whether the values of two variables are related. If variables are
correlated, one of the values can be used to predict the other.
Correlation can be zero, positive or negative:
- Negative correlation means the variables move in opposite directions: e.g. house price and
distance to city centre
- Positive correlation means the variables move in the same direction: e.g. house size and tax
value
- Near-zero correlation means that there is no linear relation. There can still be other
(non-linear) relations
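The three cases above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the helper `pearson` and all numbers are made up for the example, and it assumes the correlation meant here is the Pearson coefficient.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two variables around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up illustrative values, not real housing data
distance_km = [1, 2, 4, 6, 9, 12]
price_keur = [520, 480, 430, 390, 340, 300]
size_m2 = [45, 60, 75, 90, 120, 150]
tax_value_keur = [210, 260, 330, 380, 490, 600]

r_neg = pearson(distance_km, price_keur)   # negative: price falls with distance
r_pos = pearson(size_m2, tax_value_keur)   # positive: tax value rises with size

# y depends completely on x, yet the relation is not linear
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]
r_zero = pearson(xs, ys)                   # near zero despite a clear relation

print(round(r_neg, 2), round(r_pos, 2), round(r_zero, 2))
```

The last case shows why a near-zero correlation does not prove that two variables are unrelated.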
Correlation is a small first step towards data science. You could try to manually transform your
data and construct features until you have good correlation between your inputs and output. Or
you could use actual data science and use algorithms that can handle complex relations.
We will compute correlations in our practical examples to get experience.
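As a sketch of what "manually constructing a feature" can look like: the data below is invented, and the choice of squaring the input is an assumption made for this example. The raw input barely correlates with the output, while the hand-made feature correlates strongly.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

# Invented measurements: the output roughly follows the square of the input
inputs = [-3, -2, -1, 0, 1, 2, 3]
output = [9.2, 3.9, 1.1, 0.1, 0.9, 4.2, 8.8]

r_raw = pearson(inputs, output)

# Hand-made feature: square the input before correlating
squared = [x ** 2 for x in inputs]
r_feature = pearson(squared, output)

print(f"raw input: {r_raw:.2f}, squared feature: {r_feature:.2f}")
```

Finding such a transformation by hand requires guessing the shape of the relation, which is the work that algorithms able to handle complex relations take over.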
Week 1 Lecture 2
Data Science Basics
Programming Languages:
Maturity level of the programming languages a company chooses
- The programming language a company chooses depends on its needs and on what is
available
- Common roles: big data engineer - BI (business intelligence) developer - data analyst -
data scientist - machine learning engineer
- Python: most popular
- There is technology for each phase of data
Data Science Role Descriptions
- Big Data Engineer
- BI Developer / Data Analyst
- Data Scientist
- Machine Learning Engineer
Many different programming languages have been invented, each with different technical
strengths and weaknesses:
- Performance / resource efficiency: C is most efficient, Python probably least efficient
- Interactivity: R, Matlab and SPSS are designed for interactive use, but do not work well
as a microservice to support a website
- Licensing costs: Many companies and universities prefer open source since it avoids
legal issues and vendor lock-in
- Cloud / on premise: Not all data can be shared with the cloud
- Integration: Some companies want to have compatible technologies, e.g. all
AWS-based, web-friendly or Microsoft-based
Fortran: 1957, invented almost before computer screens existed (fastest code)
Python: Most popular - best image processing/computer vision
C / C++: high performance, structured language that allows for detailed memory control and
inline assembly (hard to learn, easy to crash, not for beginners)
Java: the best programming language in the world 1994 - 2005