Week 1
Variables = the properties measured on each unit → each column represents one variable.
Units = the things, people, etc. that the measurements are taken from → each row represents one unit.
Levels of measurement
Categorical (entities are divided into distinct categories):
- Binary variable (2 outcomes) e.g. dead or alive, yes or no etc.
- Nominal variable (more than 2 outcomes, no natural order)
- Ordinal variable (same as nominal, but the categories have an order) e.g. bad, intermediate or good.
Numerical (entities get a numerical score):
- Discrete data (counts → whole numbers) e.g. number of defects
- Continuous data (scores can take any value on a scale, not only whole numbers) e.g. temperature, body length
Numerical data contain more information than categorical data. As a result, data with less
information (categorical data) need larger samples.
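As an illustration of the layout and the levels of measurement above, here is a minimal sketch with made-up data. The variable names (`passed_inspection`, `colour`, `quality`, `defects`, `length_cm`) and all values are hypothetical, chosen only to give one variable per level.

```python
# Hypothetical dataset: each dict is one unit (row),
# each key is one variable (column).
units = [
    {"passed_inspection": True,  "colour": "red",   "quality": "good",         "defects": 0, "length_cm": 12.3},
    {"passed_inspection": False, "colour": "blue",  "quality": "bad",          "defects": 4, "length_cm": 11.8},
    {"passed_inspection": True,  "colour": "green", "quality": "intermediate", "defects": 1, "length_cm": 12.05},
]

# Level of measurement for each variable, following the notes above.
levels = {
    "passed_inspection": "categorical / binary",
    "colour": "categorical / nominal",
    "quality": "categorical / ordinal",
    "defects": "numerical / discrete",
    "length_cm": "numerical / continuous",
}

variables = list(units[0])  # column names
n_units = len(units)        # number of rows

for name in variables:
    print(f"{name}: {levels[name]}")
print(f"{n_units} units, {len(variables)} variables")
```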
Data collection
1. Point of statistics → generalize findings in a sample to an entire population.
2. Representativeness → statistics only gives conclusions about the population you
have sampled from.
3. Questions to ask → what is the population, and how do I make my sample representative
of that population? Use random sampling.
4. Validity → do the data reflect what they should reflect? And can they be used to
answer the research question?
Measurement error
Systematic measurement error
Difference between the average measurement result and the true value.
Random measurement error
Unsystematic deviations due to imprecision of the measurement system.
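The two kinds of error can be told apart in a small simulation. This is only a sketch under stated assumptions: the true value, the bias and the noise level (`TRUE_VALUE`, `BIAS`, `NOISE_SD`) are invented numbers, and the random error is assumed to be normally distributed.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

TRUE_VALUE = 100.0  # assumed true value of the measured quantity
BIAS = 2.0          # assumed systematic offset of the instrument
NOISE_SD = 0.5      # assumed imprecision of the instrument

# Every measurement = true value + systematic offset + random deviation.
measurements = [TRUE_VALUE + BIAS + random.gauss(0, NOISE_SD) for _ in range(1000)]

# Systematic error: average measurement result minus the true value.
systematic_error = statistics.fmean(measurements) - TRUE_VALUE

# Random error: spread of the measurements around their own average.
random_error_sd = statistics.pstdev(measurements)

print(f"estimated systematic error: {systematic_error:.2f}")
print(f"estimated random error (sd): {random_error_sd:.2f}")
```

Averaging many measurements reduces the effect of the random error, but it does not remove the systematic error.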
Median → the middle score when the data are ordered.
Mean → the sum of the data divided by the number of data points.
Range → the smallest value subtracted from the largest (very sensitive to outliers).
Interquartile range (IQR) → the range of the middle 50% of the data.
Variance → the average squared distance between each point and the mean of the data.
Standard deviation → the square root of the variance.
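The definitions above can be tried out on a small made-up dataset. This is a sketch, not a reference implementation: the data are invented, the variance follows the definition in these notes (dividing by the number of data points), and the quartiles use the default convention of Python's `statistics.quantiles`, which is only one of several in use.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up example scores

median = statistics.median(data)    # middle score of the ordered data
mean = sum(data) / len(data)        # sum divided by number of data points
data_range = max(data) - min(data)  # largest minus smallest value

# Quartiles: other software may use a slightly different convention.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                       # range of the middle 50% of the data

# Variance as defined in the notes: average squared distance to the mean.
variance = sum((x - mean) ** 2 for x in data) / len(data)
std_dev = variance ** 0.5           # square root of the variance

print(median, mean, data_range, iqr, variance, std_dev)
```

Note that many packages divide by n - 1 instead of n when the variance is estimated from a sample (for example `statistics.variance` in Python), so their result will be slightly larger.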
Confidence interval
- When we estimate something (mean, standard deviation, correlation etc.) we make a
sampling error (a different sample will give different estimates).
- This means the sample mean (sample statistic) is generally not equal to the population mean
(population parameter).
- However, the sample mean (sample statistic) will be close to or around the population mean
(population parameter).
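Sampling error can be made visible by drawing many samples from the same population and comparing their means. The sketch below assumes a normally distributed population with invented parameters (`POP_MEAN`, `POP_SD`) and an arbitrary sample size.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

POP_MEAN = 50.0   # assumed population mean (population parameter)
POP_SD = 10.0     # assumed population standard deviation
SAMPLE_SIZE = 25  # arbitrary sample size

def draw_sample_mean():
    """Draw one random sample and return its mean (sample statistic)."""
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    return statistics.fmean(sample)

sample_means = [draw_sample_mean() for _ in range(200)]

# Each sample gives a different estimate ...
print("first five sample means:", [round(m, 1) for m in sample_means[:5]])
# ... but the estimates scatter around the population mean.
print("average of all sample means:", round(statistics.fmean(sample_means), 2))
```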
Skew
- The asymmetry of the distribution.
- Positive skew (scores bunched at low values with the tail pointing to high values).
- Negative skew (scores bunched at high values with the tail pointing to low values).
Mode: the most frequent score.
Bimodal: having two modes.
Multimodal: having several modes.
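Skew and modes can be checked on small made-up datasets. In this sketch the `skewness` helper is an assumption: it uses the common moment-based formula (average cubed deviation divided by the cubed standard deviation), which is not given in these notes. Its sign is what matters here: positive for a tail to the high values, negative for a tail to the low values.

```python
import statistics

def skewness(values):
    """Moment-based skewness (assumed formula, not from the notes)."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((x - m) ** 3 for x in values) / len(values) / sd ** 3

positive_skew = [1, 1, 1, 2, 2, 3, 10]            # bunched low, tail to high values
negative_skew = [11 - x for x in positive_skew]   # mirror image: tail to low values

print("skew (tail to high values):", round(skewness(positive_skew), 2))
print("skew (tail to low values):", round(skewness(negative_skew), 2))

# multimode returns every score that shares the highest frequency.
unimodal = [1, 2, 2, 3]
bimodal = [1, 1, 2, 3, 3]
print("mode(s) of unimodal data:", statistics.multimode(unimodal))
print("mode(s) of bimodal data:", statistics.multimode(bimodal))
```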
Plotting data
- Just like descriptive statistics, plots are a popular way to concisely display an entire dataset.
- Follow best practices for plots. The best way to display the data depends on whether you have:
  - 1 or 2 variables to display
  - categorical or numerical data
Week 2
Probability is assigned to events. For example, when you flip a coin, there are two possible
events:
- Heads
- Tails
Both occur with a probability of 1/2. You denote the probability of event A occurring as
P[A], for example P[Heads] = 1/2.
Univariate categorical probability distributions
Sometimes you use a ‘random variable’ to refer to events that may be about to happen, for
example the outcome of a dice roll.
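There are two kinds of examples (with X the outcome of one roll of a fair six-sided dice):
P[X=1 or X=2] = 1/3 (either outcome counts, so the probabilities 1/6 + 1/6 are added)
P[X=1 and X=2] = 0 (a single roll cannot show 1 and 2 at the same time)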
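Both probabilities can be checked by counting outcomes. The sketch below assumes a fair six-sided dice, so every outcome is equally likely. The helper `prob` is a hypothetical name introduced only for this example.

```python
from fractions import Fraction

outcomes = range(1, 7)  # assumed: fair six-sided dice, outcomes 1..6

def prob(event):
    """Probability of an event = favourable outcomes / all outcomes."""
    favourable = sum(1 for x in outcomes if event(x))
    return Fraction(favourable, len(outcomes))

p_or = prob(lambda x: x == 1 or x == 2)    # roll shows 1 or 2
p_and = prob(lambda x: x == 1 and x == 2)  # roll shows 1 and 2 at once

print("P[X=1 or X=2]  =", p_or)
print("P[X=1 and X=2] =", p_and)
```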