CATEGORICAL: data that is grouped into distinct categories or labels
→ e.g. the gender of students in Global Studies (how many are women)
QUALITATIVE (categorical)
ordinal = specific order (education level)
nominal = no specific order (colours)
QUANTITATIVE (numerical)
discrete: a number that cannot be divided or split into parts (counted)
continuous: a number that can be divided (measured)
- ratio: equal intervals and a true zero, so ratios are meaningful (4 is twice 2)
- interval: equal intervals but no true zero, so only differences are meaningful (the distance between 2 and 4)
STRING: text, written between quotes ""
INTEGER: whole numbers
FLOAT: decimal numbers
LIST: several elements (mixed kinds or one single kind) [1, 2, 3] or ["apple", 1, 1.3]
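A minimal sketch of the four data types above in Python. The variable names and values are made up for illustration.

```python
# Illustrative values only; the names are hypothetical.
name = "apple"               # STRING: text between quotes
count = 3                    # INTEGER: whole number
price = 1.3                  # FLOAT: decimal number
numbers = [1, 2, 3]          # LIST: elements of one single kind
mixed = ["apple", 1, 1.3]    # LIST: elements of mixed kinds

# Print the type name next to each value.
for value in (name, count, price, numbers, mixed):
    print(type(value).__name__, value)
```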
DISTRIBUTIONS
- normal distribution
- skewed distribution (left- or right-skewed)
BAR PLOT: categorical data
x-axis = the various categories or labels
y-axis = count per category
HISTOGRAM: continuous data
x-axis = one variable, split into bins (value ranges)
y-axis = frequency per bin
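The difference between the two plots comes down to what is counted. The sketch below computes the bar heights without drawing anything. The data, the bin width of 10 and the starting point of 150 are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical data, invented for this sketch.
colours = ["red", "blue", "red", "green", "blue", "red"]          # categorical
heights_cm = [158.2, 161.5, 164.9, 170.1, 171.3, 176.8, 183.4]    # continuous

# Bar plot: one bar per category, bar height = count of that category.
bar_heights = Counter(colours)

# Histogram: cut the numeric range into equal-width bins,
# bar height = how many values fall into each bin.
bin_start = 150   # assumed lower edge of the first bin
bin_width = 10    # assumed bin width
histogram = Counter(
    bin_start + bin_width * int((h - bin_start) // bin_width)
    for h in heights_cm
)

print(dict(bar_heights))           # counts per colour
print(dict(sorted(histogram.items())))  # counts per bin, keyed by lower bin edge
```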
DESCRIPTIVE statistics: description of a dataframe
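mean: average (the value that a t-test compares)
mode: most common value
median: middle value when the data are sorted
standard deviation: how spread out the data are around the mean (square root of the variance)
range: maximum value - minimum value
interquartile range (IQR): 75th percentile - 25th percentile (Q3 - Q1)
variance: average squared difference between the individual points and the mean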
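The measures above can be sketched with Python's built-in `statistics` module. The sample is hypothetical. The notes do not say whether the population or the sample formula is meant, so the population versions (`pvariance`, `pstdev`) are used here as an assumption.

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample

mean = statistics.mean(scores)             # sum of values / number of values
mode = statistics.mode(scores)             # most common value
median = statistics.median(scores)         # middle value of the sorted data
value_range = max(scores) - min(scores)    # maximum - minimum
variance = statistics.pvariance(scores)    # average squared distance from the mean
std_dev = statistics.pstdev(scores)        # square root of the variance

# Quartiles: the cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1                              # interquartile range

print(mean, mode, median, value_range, variance, std_dev, iqr)
```

For the sample formulas (dividing by n - 1), `statistics.variance` and `statistics.stdev` would be used instead.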
Z-SCORE: how many standard deviations a specific value lies away from the mean
→ easier to compare how distant two values are from the mean
formula: (observed value - mean) / standard deviation
a negative z-score (-) indicates the value is below the mean
a positive z-score (+) indicates the value is above the mean
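The z-score formula as a small function. The exam numbers (mean 70, standard deviation 10) are invented for illustration.

```python
def z_score(observed, mean, std_dev):
    """Return how many standard deviations `observed` lies from `mean`."""
    return (observed - mean) / std_dev

# Hypothetical exam with mean 70 and standard deviation 10.
above = z_score(85, 70, 10)   # positive: above the mean
below = z_score(60, 70, 10)   # negative: below the mean

print(above, below)
```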
HYPOTHESIS TEST:
Test an assumption about a population based on a sample
CONDITIONS:
- Normal distribution
- Random sample
T-TEST
comparison of two means
H₀: μ₁ = μ₂
H₁: μ₁ ≠ μ₂
If the p-value < 0.05, you reject the null hypothesis: if H₀ were true, a difference at least
this large would show up in fewer than 5% of samples (everything below 0.05 is
statistically significant). If the p-value ≥ 0.05, you cannot reject the null hypothesis.
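A rough sketch of the test statistic and the decision rule. The notes do not say which t-test variant is meant. Welch's version (no equal-variance assumption) is used here as an assumption, and the two groups are made-up data. Turning t into a p-value needs the t-distribution, which a statistics package normally supplies, so `decide` takes the p-value as input.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """t statistic for the difference between two sample means (Welch's variant)."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    # Standard error of the difference between the two means.
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error

def decide(p_value, alpha=0.05):
    """Decision rule: reject H0 only when the p-value is below alpha."""
    return "reject H0" if p_value < alpha else "cannot reject H0"

# Hypothetical measurements for two groups.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.1, 4.4, 3.9, 4.6, 4.0]

print(welch_t(group_a, group_b))
print(decide(0.03), "|", decide(0.20))
```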
F-TEST/REGRESSION
compares variances: between two groups, or in regression the variance explained by the
model against the unexplained variance (whether there is a relation)
P-VALUE (probability that a result like this arises by coincidence alone)
the chance of observing an effect as large as, or more extreme than, the one in the sample →
under the assumption that the null hypothesis is true
e.g. null hypothesis: there is no relation between sex and self-esteem
small p-value: the probability of the observed result under the null hypothesis is less than 5% →
therefore reject the null hypothesis
high p-value: the observed result is quite likely if the null hypothesis is true, so there is
little evidence against it → therefore cannot reject the null hypothesis (this does not prove
that the null hypothesis is true)
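One way to make the definition concrete is an exact permutation test. This is a different technique from the t-test, used here only because it counts "as extreme or more extreme" directly. If the null hypothesis is true, the group labels are interchangeable, so every way of splitting the pooled values into two groups is equally likely. The data are made up.

```python
from itertools import combinations
from statistics import mean

# Hypothetical scores for two groups.
group_a = [12, 14, 15, 17]
group_b = [8, 9, 10, 11]

pooled = group_a + group_b
observed = abs(mean(group_a) - mean(group_b))   # observed difference in means

extreme = 0   # splits with a difference at least as large as the observed one
total = 0     # all possible splits
for idx in combinations(range(len(pooled)), len(group_a)):
    chosen = [pooled[i] for i in idx]
    rest = [pooled[i] for i in range(len(pooled)) if i not in idx]
    total += 1
    if abs(mean(chosen) - mean(rest)) >= observed:
        extreme += 1

# p-value: share of splits that are as extreme or more extreme than the sample.
p_value = extreme / total
print(extreme, total, p_value)
```

Here only 2 of the 70 possible splits are as extreme as the observed one, so the p-value is below 0.05 and the null hypothesis would be rejected.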
CONFIDENCE INTERVAL
Estimates a population parameter:
formula = mean ± (critical value x standard error)
e.g. 100 students' exam scores
mean = 80
standard error = 2
critical value for 95% confidence interval = 1.96
confidence interval: upper bound = 80 + (1.96 x 2) = 83.92
lower bound = 80 - (1.96 x 2) = 76.08
→ 95% confident that the true population mean falls within [76.08; 83.92]
0.05 = 5% significance level ↔ 95% confidence interval (if you repeat the experiment
100 times under the same conditions, about 95 of the CIs will contain the true value and
about 5 will not)
- can also be computed for population parameters other than the mean
- is presented in regression output
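The exam-score example worked out in Python. The critical value is taken from the standard normal distribution (an assumption that matches the 1.96 in the example), so it is computed rather than typed in.

```python
from statistics import NormalDist

mean = 80             # sample mean from the example
standard_error = 2    # standard error from the example
confidence = 0.95     # 95% confidence level

# Two-sided critical value of the standard normal distribution (about 1.96).
critical_value = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

margin = critical_value * standard_error
lower, upper = mean - margin, mean + margin

print(round(critical_value, 2), round(lower, 2), round(upper, 2))
```

Changing `confidence` to 0.99 gives a larger critical value and therefore a wider interval.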
CHI SQUARE
tests whether two categorical variables are independent → it is based on counts or
frequencies, because means cannot be calculated for categorical variables
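A sketch of the chi-square calculation for a 2 x 2 table of counts. The table is invented. No continuity correction is applied, and the p-value shortcut with `math.erfc` only holds for 1 degree of freedom (a 2 x 2 table). Both are assumptions of this example.

```python
import math

# Hypothetical counts: rows = two groups, columns = answer yes / no.
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Count expected in this cell if the two variables were independent.
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (obs - expected) ** 2 / expected

# For 1 degree of freedom: P(chi-square > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

print(chi_square, round(p_value, 4))
```

With these counts the p-value falls just under 0.05, so independence would be rejected at the 5% level.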