Summary

Data Analytics (2IAB0) Summary Lectures 2020

Author: IsabelRutten

Uploaded on: 28-05-2020

Written in: 2019/2020

Data Analytics for Engineers (2IAB0) is a basic course of the Bachelor College at Eindhoven University of Technology (TU/e), which means that all TU/e Bachelor students have to complete it. It is given in the third quartile of the first year. The course covers how to analyze and display data using the programming language Python. It also discusses statistics and data visualizations and how to apply them in specific situations.


Data Analytics (2IAB0) Summary Lectures
Week 1: EDA
Descriptive data analytics uses data that has already been collected (and that may later be reused for other
purposes) to give insight into the past. Predictive data analytics looks into the future: it only makes
predictions and gives no indication of what we should do. Prescriptive data analytics consists of data-driven
advice on how to take action to influence or change the future.
Data are raw, unorganized numbers, facts, etc. Information is structured, meaningful and useful numbers
and facts.
There are two data forms/types:
- categorical: a) dichotomous: only two values (yes/no, male/female) b) nominal: no ordering (genre)
c) ordinal: has ordering (ratings, bad – good)
- numerical: a) interval: no fixed “zero point”, only differences have meaning (temperature in °F, ranking)
b) ratio: has a fixed “zero point”, so ratios also make sense (budget, running time)
A reference table stores “all” data in a table so that it can be looked up easily. A demonstration table is a
table to illustrate a point (so present just enough data). In a table, pay attention to the kind of data type,
units of measurement, whether the values make sense when comparing columns or rows and which
column/row has the largest/smallest values.
Asking what to expect is also an important way to spot errors. You can ask two questions: “What are
reasonable values?” (human age) and “Given one value, what could be others?” (time so what distance?).
Typical questions for statistical plots are about whether values are as expected, what typical sizes are,
the variation of the values, the distribution of the values and whether there are any exceptional values.
A scatter plot is good for showing actual values and structure of numerical variables but it is not suitable
for large data sets because of the many overlapping dots. The jitter option (which changes horizontal
placement) may help to avoid this.
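As a rough sketch of what a jitter option does, the snippet below adds a small random horizontal offset to each x-value before plotting. The function name `jitter`, the `amount` parameter and the fixed seed are assumptions made for this illustration; plotting libraries implement their own variants.

```python
import random

def jitter(xs, amount=0.2, seed=0):
    """Return a copy of xs with a small random horizontal offset per point."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    return [x + rng.uniform(-amount, amount) for x in xs]

# Many observations share the same x-value and would overlap in a scatter plot.
x_values = [1, 1, 1, 2, 2, 3, 3, 3, 3]
x_jittered = jitter(x_values, amount=0.2)
```

Only the horizontal position is changed, so the values on the vertical axis remain exact.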
The choice of plot depends on the data type: bar charts for categorical data, histograms for numerical data.
A histogram splits the range of data values into bins (intervals of values). You can choose the number of
bins or the bin size. The histogram shows the number of observations in the dataset for every bin. A
rule of thumb for choosing the number of bins is √n, where n is the number of observations. If the bin width
is too small, the histogram will be too wiggly. If it is too large, too few details remain visible.
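The binning described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming equal-width bins and that the maximum value is counted in the last bin; the function name `histogram_counts` is hypothetical.

```python
import math

def histogram_counts(values, num_bins=None):
    """Count observations per equal-width bin; defaults to about sqrt(n) bins."""
    n = len(values)
    if num_bins is None:
        num_bins = max(1, round(math.sqrt(n)))  # rule of thumb from the text
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # avoid a zero width if all values are equal
    counts = [0] * num_bins
    for v in values:
        # The maximum value goes into the last bin instead of opening a new one.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9, 10, 12, 13, 14, 15, 16]
counts = histogram_counts(data)  # 16 observations, so 4 bins: [8, 1, 3, 4]
```

Passing a different `num_bins` shows the trade-off from the text: many narrow bins give a wiggly picture, a few wide bins hide detail.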
In a cumulative histogram, the vertical axis reflects the share (or %) of the observations in a dataset with
values smaller than a value specified on the horizontal axis.
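The vertical-axis value of a cumulative histogram can be computed directly, as in the sketch below. It assumes a strict "smaller than" comparison, as worded above; the function name `cumulative_share` and the example ages are made up for illustration.

```python
def cumulative_share(values, threshold):
    """Share of observations with a value strictly smaller than threshold."""
    below = sum(1 for v in values if v < threshold)
    return below / len(values)

ages = [18, 19, 19, 20, 21, 22, 25, 30]
# One point of the cumulative histogram per chosen value on the horizontal axis.
shares = [(t, cumulative_share(ages, t)) for t in (20, 25, 35)]
# shares == [(20, 0.375), (25, 0.75), (35, 1.0)]
```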
Kernel density plots make use of a bandwidth. The idea is that each observation indicates that its value is
possible, and that values nearby could also occur (but are less likely). To construct the plot, choose a
bandwidth to be used around each observation, generate a kernel with that bandwidth for every observation
in the dataset, and sum the kernels: the result is the kernel density plot. Choosing the bandwidth is important!
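The construction above can be sketched as follows. The notes do not say which kernel shape is used, so the Gaussian kernel here is an assumption, as are the function names and the example data.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel shape (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kernel_density(observations, x, bandwidth):
    """Kernel density estimate at x: the scaled sum of one kernel per observation."""
    n = len(observations)
    total = sum(gaussian_kernel((x - obs) / bandwidth) for obs in observations)
    return total / (n * bandwidth)

observations = [1.0, 1.2, 1.9, 4.0, 4.3]
# Evaluate the density on a grid; a small bandwidth gives a wiggly curve,
# a large bandwidth smooths the two groups into one broad bump.
grid = [i / 10 for i in range(-20, 81)]
narrow = [kernel_density(observations, x, bandwidth=0.2) for x in grid]
wide = [kernel_density(observations, x, bandwidth=2.0) for x in grid]
```

Dividing by `n * bandwidth` keeps the total area under the curve equal to 1, so the result can be read as a density.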
Summary statistics are numbers to describe level (location statistics: what are “typical” values?) and
spread (scale statistics: how much do values vary?). Typical distribution shapes are as follows:
- unimodal distribution (1 peak) versus bimodal distribution (2 peaks, possibly due to 2 different groups)
- symmetric distribution (left = right) versus asymmetric distribution, for example right-skewed (peak at
the left, tail at the right; if the mean is bigger than the median, the distribution is right-skewed)
There are the following location statistics:
- mean (average): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
- mode (most frequently occurring value)
- median (the middle value, or the average of the two middle values)
- quartiles/percentiles: the 1st quartile is the cut-off point for 25% of the data; the pth percentile is the
cut-off point for p% of the data. For the pth percentile we compute its location in an ordered data set of n
observations: $L_p = \frac{p}{100}(n + 1)$. The pth percentile value is then computed by linear interpolation:
let l and h be the observations at positions $\lfloor L_p \rfloor$ and $\lceil L_p \rceil$ in the ordered data set, then
pth percentile value = $l + (L_p - \lfloor L_p \rfloor)(h - l)$.
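The percentile rule above translates almost literally into code. In this sketch, clamping the position to the range 1..n for very small or very large p is an assumption that the notes do not cover, and the function name `percentile` is chosen for illustration.

```python
import math

def percentile(values, p):
    """pth percentile via the location L_p = p/100 * (n + 1) and linear interpolation."""
    data = sorted(values)
    n = len(data)
    pos = p / 100 * (n + 1)  # L_p, a 1-based position in the ordered data
    # Assumption: positions outside 1..n are clamped to the first/last observation.
    lower = min(max(math.floor(pos), 1), n)
    upper = min(max(math.ceil(pos), 1), n)
    l, h = data[lower - 1], data[upper - 1]
    return l + (pos - math.floor(pos)) * (h - l)

data = [15, 20, 35, 40, 50]
q1 = percentile(data, 25)      # L_p = 1.5, halfway between 15 and 20 -> 17.5
median = percentile(data, 50)  # L_p = 3, the third observation -> 35
q3 = percentile(data, 75)      # L_p = 4.5, halfway between 40 and 50 -> 45.0
```

Other software may use a different location rule, so quartiles reported by a library can differ slightly from this hand calculation.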
There are the following scale statistics:
- range (max – min)
- interquartile range (IQR) (3rd quartile – 1st quartile)
- sample variance: $\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, where $\bar{x}$ is the mean
- sample standard deviation: $\sigma = \sqrt{\sigma^2}$
- median absolute deviation (MAD) (median of the absolute deviations from the median)

Data Analytics (2IAB0) Summary Q3 2020 by Isabel Rutten

The higher these scale statistics, the more spread/variability there is in the data. Variance and standard
deviation are sensitive to outliers; IQR and MAD are not.
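The difference in outlier sensitivity can be checked with a small experiment using Python's standard `statistics` module. The helper `scale_statistics` and both data sets are made up for this illustration; `statistics.quantiles` is used with its default "exclusive" method, which places quartiles at position p/100 * (n + 1) in the ordered data.

```python
import statistics

def scale_statistics(values):
    """Return the scale statistics named in the text for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # default method: "exclusive"
    med = statistics.median(values)
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "variance": statistics.variance(values),  # sample variance, divides by n - 1
        "std": statistics.stdev(values),
        "mad": statistics.median([abs(v - med) for v in values]),
    }

clean = [2, 4, 4, 5, 7, 9, 11]
with_outlier = [2, 4, 4, 5, 7, 9, 110]  # same data, last value replaced by an outlier

stats_clean = scale_statistics(clean)
stats_outlier = scale_statistics(with_outlier)
```

In this example the IQR (5) and MAD (2) are identical for both data sets, while the range, variance and standard deviation grow strongly once the outlier is present.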
Standardization / z-score normalization: the z-score transforms data from their original units into the
universal statistical unit of standard deviations from the mean, using the following formula:
$z = \frac{x - \bar{x}}{\sigma}$, where $\bar{x}$ is the mean and $\sigma$ the sample standard deviation.
The mean value of the transformed data set is 0 and its sample standard deviation is 1. A negative z-score
means that the value is below the mean, a positive one that it is above the mean. Observations with a
z-score larger than 2.5 in absolute value are considered outliers.
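A minimal sketch of standardization and the 2.5 rule, assuming the sample standard deviation (dividing by n - 1) and a cut-off on the absolute z-score. The function name `z_scores` and the measurements are made up for this example.

```python
import statistics

def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]

measurements = [9, 10, 10, 11, 10, 9, 11, 10, 10, 9, 11, 50]
z = z_scores(measurements)

# Flag observations whose z-score is larger than 2.5 in absolute value.
outliers = [v for v, score in zip(measurements, z) if abs(score) > 2.5]
```

Here only the value 50 is flagged. Note that in very small data sets no observation can reach a z-score of 2.5, because a single extreme value also inflates the standard deviation.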
A Box(-and-Whisker) plot is a convenient way to graphically display summary statistics since it shows the
median, the 1st and 3rd quartile and the minimum and maximum values. It is better than histograms or kernel
density plots for comparing groups, but those are better for showing the shape of a distribution.
QUIZ: The variance of 5 numbers is 10. If each number is divided by 2, then the variance of the new
numbers is 2.5. Dividing every number by 2 also divides every difference from the mean by 2, so each
squared difference is divided by 2^2 = 4 and the variance becomes 10/2^2 = 2.5.
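The quiz answer can be verified numerically. The five numbers below are an arbitrary example chosen so that their sample variance is exactly 10; they are not taken from the lecture.

```python
import statistics

numbers = [1, 3, 5, 7, 9]          # mean 5, squared deviations sum to 40, 40 / 4 = 10
halved = [x / 2 for x in numbers]  # every number divided by 2

var_original = statistics.variance(numbers)  # 10
var_halved = statistics.variance(halved)     # 10 / 2**2 = 2.5
```

The same reasoning shows that the standard deviation is simply divided by 2, since it is the square root of the variance.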

Week 2: VIS
We make data visualizations, not infographics (which focus on telling a story creatively rather than on the data).
Visualization has always been important in history, for example when recording a pulse signal or when
trying to find the reason for an epidemic (visualizing the deaths which strangely occur near a pump). Also,
communicating data effectively is of importance, for example for a subway map.
Visualization is the process that transforms (abstract) data into (interactive) graphical representations for
the purpose of exploration, confirmation or communication. Communication is done to inform humans and
shows specific aspects of a larger dataset to allow the reader to better connect the presented information to
their existing knowledge. Exploration is done when questions are not well-defined and shows a large,
complex dataset which is meant for professionals. Confirmation is a combination of those two.
Why do we visualize data?
In the case of high-level actions, we analyze the data:
- visualization for consuming information: - discover – present – enjoy (meant for end-users)
- visualization for producing data: - annotate – record – derive (extends the dataset)
In the case of lower-level actions, we search in the data:
Lookup: search in a dictionary how to spell a certain word
Browse: look for a synonym for a certain word
Locate: try to find your lost keys
Explore: unexpected patterns

There are two kinds of targets:
- All data: we look at the trends (which define the “mainstream”), the outliers (which stand out from the
mainstream) and the features (task-dependent structures of interest).
- Attributes: we look at one attribute (by analyzing its distribution or extremes) or at many attributes (by
analyzing dependency, correlation or similarity).
Human perception can be influenced, which is studied in a psychological theory called “Gestalt
theory”. Proximity: objects close to each other are perceived as a group. Similarity: objects that are
similar (color, shape, etc.) are perceived as a group. Continuity: we unconsciously draw a line through
points that are in a graph. So: position and the arrangement of visual elements are the most important
channel for visualizations.
Perception of colors begins with 3 specialized retinal cells known as cone cells. The red cone cell
shows black-white, the green one shows green-purple and the blue one shows blue – yellow. Combining
them gives the right color. However, you can be color blind when one of those cone cell types is missing. It
is rare to miss the blue cone cell, so making visualizations in the colors blue-yellow is safe.
There are several ordering directions: - sequential (XS – S – M – etc.) – diverging (… -10 … 0 … 5 …)
– cyclic (days of the week)
A key attribute (also called an independent attribute) acts as an index that is used to look up value
attributes (also called dependent attributes).
Data visualization makes use of marks (geometric primitives: points, lines, areas, complex shapes) and
channels (appearance of marks: position, color, length, size, shape).



Written for

Institution: Technische Universiteit Eindhoven (TUE)
Study: Computer Science and Engineering
Course: Data Analytics 2IAB0 (2IAB0)

Document information

Uploaded on: 28 May 2020
File last updated on: 2 April 2021
Number of pages: 14
Written in: 2019/2020
Type: SUMMARY

Topics

data analytics
technical university
eindhoven
2iab0
bachelor college
data
python
statistics
data visualizations
