Samirah Bakker
Introduction:
Data Science focuses on exploiting the modern deluge of data for prediction, exploration,
understanding, and intervention.
“(...) the practice of data science is not just a single step of analyzing a dataset. Rather, it
cycles between data preprocessing, exploration, selection, transformation, analysis, interpretation, and
communication. One of the main priorities for data science is to develop the tools and methods that
facilitate this cycle.”
Python:
- List: ordered and mutable collection of objects, e.g. [a, b, c]
  - Can store any type of object
  - Flexible yet inefficient for numerical work → therefore we have NumPy
- Tuple: ordered and immutable collection of objects, e.g. (a, b, c)
- Set: unordered collection of unique values, e.g. {a, b, c}
- Dictionary: collection of key: value pairs, e.g. {a: 1, c: 2}
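A minimal sketch of the four collections above. The variable names and values are made up for illustration.

```python
# Illustrative values only; none of these names come from the course material.
items = [1, "two", 3.0]      # list: ordered, mutable, holds any object type
items.append(4)              # lists can be changed in place

point = (1, 2, 3)            # tuple: ordered, immutable
# point[0] = 9 would raise a TypeError

unique = {1, 2, 2, 3}        # set: duplicates are dropped on construction
counts = {"a": 1, "c": 2}    # dictionary: maps keys to values

print(items)
print(unique)
print(counts["c"])
```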
NumPy:
- NumPy arrays are at the core of any data science tool in Python
- Efficient interface to store and operate on numerical data
  - Efficient storage of numerical data
  - Efficient manipulation of numerical data
- Implements efficient operations (e.g., matrix multiplication)
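As a small illustration of the points above, the sketch below creates two arrays and applies an element-wise operation and a matrix multiplication without any explicit Python loop. The arrays themselves are arbitrary examples.

```python
import numpy as np

# Arbitrary example arrays, chosen only to make the shapes easy to follow.
a = np.arange(6, dtype=float).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
b = np.ones((3, 2))                           # 3x2 array of ones

doubled = a * 2      # element-wise operation on the whole array
product = a @ b      # matrix multiplication, result has shape (2, 2)

print(doubled)
print(product)
```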
NumPy slicing:
- a[start:stop:step, start:stop:step, …]
- Some values can be omitted; by default: start=0, stop=end, step=1
- Values can be negative
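The slicing rules above can be tried on a small array. The 3x4 array below is an assumed example, not one from the lecture.

```python
import numpy as np

# Assumed example array: [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
a = np.arange(12).reshape(3, 4)

first_two_rows = a[:2]         # start omitted, so it defaults to 0
every_other_col = a[:, ::2]    # all rows, columns 0 and 2 (step=2)
last_row = a[-1]               # negative index counts from the end
reversed_rows = a[::-1]        # negative step walks the rows backwards
block = a[1:3, 1:3]            # rows 1-2 and columns 1-2

print(every_other_col)
print(block)
```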
NumPy array aggregation / reduction:
- In aggregation operations, the axis specifies which dimension to collapse!
- a.sum(axis=0) → array([4.,4.,4.])
- a.sum(axis=1) → array([3.,3.,3.,3.])
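The notes do not define a. The results above are consistent with a 4x3 array of ones, so the sketch below assumes a = np.ones((4, 3)).

```python
import numpy as np

# Assumption: a is a 4x3 array of ones (not stated in the notes).
a = np.ones((4, 3))

col_sums = a.sum(axis=0)   # collapses the 4 rows, leaving one value per column
row_sums = a.sum(axis=1)   # collapses the 3 columns, leaving one value per row

print(col_sums)   # three entries, each equal to 4
print(row_sums)   # four entries, each equal to 3
```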
NumPy Broadcasting:
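The notes give no detail on broadcasting, so the following is only a sketch of the standard NumPy behavior: arrays with different shapes are combined by stretching dimensions of size 1 (or missing dimensions) to match. The arrays are assumed examples.

```python
import numpy as np

# Assumed example arrays, not taken from the notes.
a = np.ones((4, 3))                    # shape (4, 3)
row = np.array([0.0, 1.0, 2.0])        # shape (3,)
col = np.arange(4.0).reshape(4, 1)     # shape (4, 1)

plus_row = a + row    # row is applied to each of the 4 rows
plus_col = a + col    # col is applied to each of the 3 columns
outer = col + row     # (4, 1) and (3,) broadcast to (4, 3)

print(plus_row)
print(outer.shape)
```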
Pandas:
- Pandas is built on top of NumPy, providing easy manipulation of labeled arrays (with 1 or
multiple dimensions) with heterogeneous data.
Data structures:
- Series → One-dimensional array of indexed data. The index need not be a sequence of
integers (it can consist of strings, for example).
- DataFrame → Two-dimensional array with flexible row indices and column names.
  - DataFrame = dictionary of Series with different labels (keys) and a common index
  - Can be seen as a collection of Series, all sharing the same index.
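A minimal sketch of both structures. The labels "x", "y", "z" and all numbers are placeholders, not real data.

```python
import pandas as pd

# Placeholder labels and values, for illustration only.
area = pd.Series({"x": 10.0, "y": 20.0, "z": 30.0})        # string index
population = pd.Series({"x": 100, "y": 200, "z": 300})

# A DataFrame built as a dictionary of Series that share the same index.
states = pd.DataFrame({"area": area, "population": population})

print(area["y"])
print(states)
```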
Indexing and selection:
- NumPy ndarray: array[0] selects row 0
- Pandas DataFrame: states['area'] selects column area
- For dictionary-style indexing use df['column_name']['index']
- For NumPy array-style indexing use loc and iloc: df.loc['index', 'column_name'], df.iloc[i, j]
  - .loc → array-style indexing, explicit indexing using labels
  - .iloc → array-style indexing, implicit indexing using positions
- iloc and loc → first access rows, then columns!
- Dictionary-style indexing → first access columns, then rows!
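The three access styles can be compared on one cell. The DataFrame below reuses the name states from the notes, but its labels and values are placeholders.

```python
import pandas as pd

# Placeholder data; only the column name "area" appears in the notes.
states = pd.DataFrame(
    {"area": [10.0, 20.0, 30.0], "population": [100, 200, 300]},
    index=["x", "y", "z"],
)

column = states["area"]            # dictionary-style: selects a column
v_dict = states["area"]["y"]       # column first, then row
v_loc = states.loc["y", "area"]    # labels: row first, then column
v_iloc = states.iloc[1, 0]         # positions: row 1, column 0

print(v_dict, v_loc, v_iloc)       # all three refer to the same cell
```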
Slicing and masking:
Handling missing data:
- df.notnull()
- df.isnull()
- df.dropna()
- df.dropna(axis='columns')
- df.fillna(0)
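A short sketch of what each call above does, using a made-up DataFrame with a single missing value.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in column "a".
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

missing = df.isnull()                   # True where a value is missing
present = df.notnull()                  # True where a value is present
rows_kept = df.dropna()                 # drops every row containing a NaN
cols_kept = df.dropna(axis="columns")   # drops every column containing a NaN
filled = df.fillna(0)                   # replaces NaN with 0

print(rows_kept)
print(cols_kept)
print(filled)
```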
Data science life-cycle:
- Does not consist of a single step
- Statistics and plotting are not everything, but simply a part of the cycle
- Problem driven: start by posing and understanding the question
- It is a cycle
The most frequent failure in data analysis is mistaking the type of question being considered.
- Any type of question can be interesting, but we need to define it upfront and be aware and
clear about its type
- Type of questions:
- Descriptive: what is out there? (e.g., national census; no interpretations are made)
- Exploratory: are there (apparently) trends, correlations, or relationships between the
measurements to generate ideas or hypotheses? Should we study further?
- Inferential: will an observed pattern likely hold beyond the data set we have? Any
significant correlation? Can we infer a population state from our small sample?
- Predictive: can we use features to predict an outcome?
- Causal: what happens to one measurement (statistically, on average) if we change
another?
- Mechanistic: what happens (deterministically) to one measurement if we change
another? How does a variable change another?
Exploratory data analysis (EDA):
Exploratory data analysis: (informal definition) process of transforming, describing and visualizing a
data set to better understand it, identify problems and inform subsequent hypotheses and analyses.
EDA steps:
- Formulate initial question
- Collect raw data and understand the format
- Clean and pre-process the data
- Describe the dataset
- Make plots to visualize the distribution of the data and the relationships between variables
- Is there any interesting trend that suggests further analysis? Do we have the right question and
data?
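The cleaning and describing steps above might look as follows in pandas. This is only a sketch on made-up toy values; the column names x and y are hypothetical, and the plotting step is left out.

```python
import pandas as pd

# Made-up toy values; one entry in "x" is missing.
raw = pd.DataFrame(
    {"x": [1.0, 2.0, None, 3.0, 4.0], "y": [2.0, 4.0, 5.0, 6.0, 8.0]}
)

print(raw.dtypes)                 # understand the format
clean = raw.dropna()              # clean: drop rows with missing values
summary = clean.describe()        # describe: count, mean, std, quartiles
corr = clean["x"].corr(clean["y"])  # a first look at a possible relationship

print(summary)
print(corr)
```

A strong correlation here would only suggest a relationship worth studying further; it does not answer an inferential or causal question.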
Principles of Data Visualization:
Rule 1: Know the audience
Rule 2: Identify your message beforehand
Rule 3: Adapt figure to medium
Rule 4: Caption is important
Rule 5: Do not trust the defaults
Rule 6: Use color effectively
- Use diverging shades if there is a meaningful middle point
- Use a sequential color scale for a more intuitive reading
Rule 7: Do not mislead the audience
- Scale and visual perception are important
Rule 8: Avoid “chartjunk” (unnecessary visual elements)
Rule 9: Choose message over beauty
Rule 10: Know and use the right tool
(t-) Stochastic neighbor embedding (t-SNE):
Data visualization of high-dimensional data: t-SNE:
Goal: visualize the data in a reduced number of dimensions while preserving its structure (e.g., so
that clusters can still be told apart).
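A minimal sketch of this goal, assuming scikit-learn's TSNE is used (the notes do not name a library). The data are synthetic: two Gaussian clusters in 10 dimensions, generated only for illustration. The perplexity value is an arbitrary choice that must stay below the number of samples.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic data: two well-separated clusters in 10 dimensions.
rng = np.random.default_rng(0)
cluster_a = rng.normal(0.0, 1.0, size=(30, 10))
cluster_b = rng.normal(8.0, 1.0, size=(30, 10))
X = np.vstack([cluster_a, cluster_b])

# Reduce to 2 dimensions; perplexity=10 is an assumed setting, not a recommendation.
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
embedding = tsne.fit_transform(X)

print(embedding.shape)   # one 2-D point per input sample
```

The two columns of embedding can then be passed to a scatter plot to check whether the clusters remain visually separated.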