Mann-Whitney/Wilcoxon rank sum test: 2 independent groups, non-parametric (= not normal), unequal variances.
H0: median(male) = median(female)
Kruskal-Wallis: 3+ groups, non-parametric, not normal, unequal variances.
H0: median(group1) = median(group2) = median(group3)
Wilcoxon signed rank test: 2 measurements of 1 group (e.g. the post-pre test difference), not normal, compares medians, non-parametric.
Option 1: H0: median(jan_2023) = median(jan_2022); option 2: H0: median(diff) = 0
ANOVA: 3+ groups, normal, equal variances. Focuses on means; e.g. examining the effect of 3 lessons on school performance.
H0: mean(group1) = mean(group2) = mean(group3)
Independent sample t-test: means of 2 independent groups, normal and equal variances. H0: mean(group1) = mean(group2)
One sample t-test: mean of 1 group compared to the population mean; normal distribution. H0: mean(sample) = mean(population)
Welch t-test: compares means of 2 independent groups, unequal variances. H0: mean(group1) = mean(group2)
Hypotheses – 1-sample t-test of differences: notations
Population proportion: π; sample proportion: p; population mean: μ (β1 for a slope)
H0: μ2 − μ1 = 0 (no change); HA: μ2 − μ1 ≠ 0 (change)
2-sided hypothesis: H0: μ = 100; HA: μ ≠ 100 (for a slope: H0: β1 = 0; HA: β1 ≠ 0; likewise H0: β2 = 0)
1-sided hypothesis: H0: μ ≥ 100; HA: μ < 100
Code for scatterplot: Healthdata %>% ggplot(aes(x = pred, y = res)) + geom_jitter() OR plot(model2, 1)
t-value (standardised effect of b): t = (b − β) / s.e., where s.e. = the SD of the sampling distribution
Standardised t-value for a mean: t = (x̄ − μ) / s.e.
Chi-square statistic: χ² = Σ (O − E)² / E
Standard error (mean): s.e. = s / √n; if you use s, use the t-distribution with n − 1 degrees of freedom
S.E. (proportion): √(p(1 − p) / n)
S.E. of r when ρ = 0: evaluated with the t-distribution
Effect size: horizontal % difference between 2 horizontal values (calculate the % first from the column totals)
Load data: dataset <- read.csv("data_scot.csv", sep = ",")
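A minimal sketch of how these tests are called in base R (the variables health, gender, group, grade, lesson and the columns jan_2022/jan_2023 are hypothetical, matching the hypotheses above):
wilcox.test(health ~ gender, data = dataset)                    # Mann-Whitney / rank sum
kruskal.test(health ~ group, data = dataset)                    # Kruskal-Wallis, 3+ groups
wilcox.test(dataset$jan_2023, dataset$jan_2022, paired = TRUE)  # Wilcoxon signed rank
summary(aov(grade ~ lesson, data = dataset))                    # ANOVA (lesson as factor)
t.test(health ~ gender, data = dataset, var.equal = TRUE)       # independent sample t-test
t.test(health ~ gender, data = dataset)                         # Welch (default var.equal = FALSE)
t.test(dataset$health, mu = 100)                                # one sample t-test, 2-sided
t.test(dataset$health, mu = 100, alternative = "less")          # one sample t-test, 1-sided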
t-value > 2: it is very unlikely these data come from a population where β is 0 (different groups) -> we reject the null hypothesis.
t-value < 2: it is very likely these data come from a population where there is no association between the variables (similar groups, β = 0).
The bigger t is, the smaller the p-value will be. t is calculated with: t = estimate / standard error. If the estimate is big -> t will be big -> the p-value small.
Levene's test tests variances: p > 0.05 = homoskedasticity in the residuals = equal spread.
Breusch-Pagan tests variances: p > 0.05 = homoskedasticity = equal variances = do not reject H0.
In general: p-value < 0.05 = reject H0 = violation of equal variances = heteroskedasticity; p-value > 0.05 = do not reject H0 = equal variances, homoskedasticity.
Shapiro-Wilk tests normality: p < 0.05 = reject H0 = not normal. W runs from 0 to 1; the closer to 1, the more normal.
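As a quick check of the t = estimate / standard error logic, a sketch with made-up numbers (the estimate 0.8, the s.e. 0.3 and the 100 degrees of freedom are hypothetical):
t <- 0.8 / 0.3              # t ≈ 2.67, so |t| > 2
2 * pt(-abs(t), df = 100)   # two-sided p-value ≈ 0.009 -> p < 0.05, reject H0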
"The p-value associated with X1 is smaller than the p-value associated with X2, because the estimate (see output) of X1 is larger (in absolute terms) than X2." = incorrect: t (and hence p) also depends on the standard error, so a larger estimate alone is not enough.
Looking at X1 and X2: if p is low (< 0.05), X1 is statistically significantly associated with the dependent variable; if p > 0.05, it is not associated with the dependent variable.
Ways to phrase a non-significant result for H0: median(male) = median(female):
-there is no difference in health between genders
-the medians between genders are not significantly different
-the medians between genders are similar
-we are 95% sure that NO significant health differences exist between males and females
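These phrasings correspond to a non-significant test; a sketch of that check in R (same hypothetical health/gender variables as above):
res <- wilcox.test(health ~ gender, data = dataset)
res$p.value > 0.05   # TRUE -> do not reject H0: medians not significantly different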
What does it mean that a case is influential? And why do we need to check whether they are present?
Means: it affects the estimates in the model.
Check: because it means that the estimates are strongly affected by only one or two data points.
What is the expected probability of smoking for someone who is 50 years old?
b0 = -2.87
b1 = -0.015
x = 50
logodds = b0 + b1*x
p = exp(logodds)/(1 + exp(logodds))
Log odds are a way of expressing a probability (p) as the logarithm of the ratio between success (p) and non-success (1 − p); p lies between 0 and 1.
By hand: log odds = b0 + b1*x; probability = e^(logodds) / (1 + e^(logodds)) = ....
Income = B2 + B3*Private − B4*Unemp + B5*Education − B6*Private*Education − B7*Unemp*Education
R studio – interaction/moderation
(1) Create dummies: data$x_dummy_1 <- ifelse(data$x == "public", 1, 0) (codes public as 1; the remaining category serves as the reference category)
(2) Use lm() to estimate the model parameters: data_sample %>% lm(income ~ private + unemp + educ + educ*private + educ*unemp, data = .) %>% summary() (check: are the t-values > 2 and the p-values < 0.05?)
(3) Add residuals: data_clean$residuals <- model$residuals
(4) Add predictions: data_clean$pred <- model$fitted.values
OR in one go: data <- data %>% add_predictions(model1) %>% add_residuals(model1)
Check residuals: data_clean %>% ggplot(aes(x = pred, y = residuals)) + geom_point()
Low vs high level of education (1): model_low_educ <- data_clean %>% filter(education == 0) %>% lm(support ~ campaign, data = .)
Create the interaction term: data_clean <- data_clean %>% mutate(interaction = campaign*education)
model_withinteraction <- data_clean %>% lm(support ~ campaign + education + interaction, data = .)
summary(model_withinteraction)
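Filled in, this gives (the coefficients come from the example above; plogis() is the base R shortcut for the same formula):
b0 <- -2.87
b1 <- -0.015
logodds <- b0 + b1 * 50             # = -3.62
exp(logodds) / (1 + exp(logodds))   # ≈ 0.026, about a 2.6% probability of smoking
plogis(logodds)                     # same result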
Unit 554 – Multiple regression and non-linearity
Studying improves your grade (non-linearity): the effect decreases when you study more.
Grade = β0 + β1 * HoursStud
β1 = β2 − β3 * HoursStud
Grade = β0 + (β2 − β3*HoursStud) * HoursStud = β0 + β2*HoursStud − β3*HoursStud^2
Non-linearity: (1) inspect via a scatterplot, (2) the residuals don't have equal variances -> solution: add a squared (^2) term.
For example (X = 75): Y = 20 + 0.6*X + 0.002*X^2 -> 20 + 0.6*75 + 0.002*(75*75) = 76.25
R studio – non-linearity
Scatterplot: data_sample %>% ggplot(aes(x = size, y = conflicts)) + geom_point() + geom_smooth(method = "lm")
Estimate model: model1 <- data_sample %>% lm(conflicts ~ size, data = .)
Add residuals: data_sample <- data_sample %>% add_residuals(model1) OR data_sample$resid2 <- model1$residuals
Detecting non-linearity: data_sample %>% ggplot(aes(x = size, y = resid2)) + geom_point() + geom_smooth()
Unit 560 – Non-normality of residuals & omitted variables (assumptions of normality & equal variance)
Errors (εi) are in the population; residuals (ei) are in the sample.
If a model is good the errors will be random. Deviations are problematic: (1) the mean of the residuals (zero) is affected by outliers/skew, (2) this mean is associated with b (the estimate), (3) we are less confident in the S.E. based on these means -> execute the steps below.
1. Visual histogram inspection.
2. QQ plot: shows the relationship between what you would expect to find (x-axis) under a normal distribution and what you observe (y-axis). It can answer whether the distribution is normal (a straight line); if the data deviate from normality, the line will display strong curvature. (The formal test for normality is the Shapiro-Wilk test.)
3. Shapiro-Wilk test (goodness-of-fit test): the chance of finding a W in a sample smaller than the critical value. It tests the hypothesis that the distribution of the data deviates from a comparable normal distribution. If p < 0.05, reject the null hypothesis: the data are not normally distributed. As the sample size increases, SW will reject the null hypothesis more easily. H0 = normal distribution.
R studio
Create a model, create & store the residuals (steps 3 and 4 above), then make a histogram: data %>% ggplot(aes(x = residuals)) + geom_histogram()
Transforming Y: model2 <- data %>% lm(log(punish + 4) ~ crime, data = .) (this isn't easy) OR the same query with sqrt instead of log; then data$residuals <- model2$residuals
QQ plot: data %>% ggplot(aes(sample = residuals)) + geom_qq() + geom_qq_line()
Shapiro-Wilk test (test normality): shapiro.test(data$residuals)
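A sketch of fitting the squared term directly in lm() (the grade/hours names are hypothetical; I() keeps ^2 as arithmetic inside the formula):
model_sq <- lm(grade ~ hours + I(hours^2), data = data_sample)  # quadratic model
summary(model_sq)
20 + 0.6*75 + 0.002*75^2   # the worked example above, by hand: 76.25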
Unit 561 – Heteroscedasticity – non-equal variances and interaction effects
Homoscedasticity (residual variance) = homogeneity (of variances) = equal variances.
Heteroscedasticity = heterogeneity (of variances) = unequal variances. It is bad because: (1) we only have one S.E. for the slope, and (2) the S.E. is used to evaluate the 'quality' of the slope / find p-values. It occurs because of (1) measurement error in Y (which is related to X) and (2) interaction effects.
Detect: make a graph with the predicted values on X and the residuals on Y; save pred and resid and create it.
R studio
Create the differences t2_t1, create dummies, estimate the lm with read_diff, add residuals.
Levene's (groups only): leveneTest(resid ~ as.factor(DV), data) OR use read_diff instead of resid: levenetest <- leveneTest(read_diff ~ treatment, data)
Breusch-Pagan (for lm): bptest <- bptest(model1); bptest
Unit 470
True positives (TP): 108 (correctly predicted 1s)
True negatives (TN): 84 (correctly predicted 0s)
False positives (FP): 60 (0s wrongly predicted as 1s)
False negatives (FN): 69 (1s wrongly predicted as 0s)
Prevalence: total number of 1s : total observations = (108+69):321 = 177:321
Accuracy: correctly predicted observations : total observations = (108+84):321
Specificity: true negatives : (true negatives + false positives) = 84:(84+60)
Sensitivity: true positives : (true positives + false negatives) = 108:(108+69)
Precision: true positives : (true positives + false positives) = 108:(108+60)
In R:
TP <- 108
TN <- 84
FP <- 60
FN <- 69
Total <- 321
prevalence <- (TP + FN) / Total
accuracy <- (TP + TN) / Total
specificity <- TN / (TN + FP)
sensitivity <- TP / (TP + FN)
precision <- TP / (TP + FP)
list(prevalence = prevalence, accuracy = accuracy, specificity = specificity, sensitivity = sensitivity, precision = precision)
Unit 563 – Outliers, Influential cases, and Multicollinearity
Residual: the extent to which a data point is away from the estimated line.
Leverage: an outlier on the x (IV), say beyond +/- 2 S.E.; how much the observation's value on the predictor variable differs from the mean of the predictor variable. Look at whether it is different from the rest.
Influence: the extent to which the slope of the line is affected by the data point. Determined by residual and leverage; when both are high, you have influence.
Cook's distance in R studio (unit 563):
1st: Create a model, for example: model <- lm(y ~ x, data = data563)
2nd: Plot the graph using one of the following: plot(model) # then hit enter to step through the diagnostic plots, OR plot(model, 4) for the Cook's distance plot directly.
3rd (optional): Add all the Cook's distances to the dataset so you can find the influential case directly in the table: data563$cd <- cooks.distance(model). This can help to identify other information about this case.
A case is influential if:
-high leverage: far out on the x-axis
-high residual: far out on the y-axis
-high impact: removing/including it would change the slope & estimate
If a case has high leverage, residual and impact, Cook's distance will be > 0.5 and may even exceed 1.
Two curved lines = a quadratic relationship (not linear).
Do not square the dummy variable!
Always start with b0 (the intercept).
Interaction = differences in effects between groups; addition = each effect is considered separately, e.g. lm(health ~ bmi + smoking)
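A minimal sketch contrasting that additive model with an interaction (moderation) model (the dataset name is hypothetical):
model_add <- lm(health ~ bmi + smoking, data = dataset)   # addition: each effect separate
model_int <- lm(health ~ bmi * smoking, data = dataset)   # expands to bmi + smoking + bmi:smoking
summary(model_int)                                        # the bmi:smoking row tests the moderation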