Uploaded on 23-04-2026 · 3 pages · Written in 2024/2025
Cheatsheet test 1 - Inferential Statistics, pre-master Psychology, University of Twente


Overview of tests
Mann-Whitney / Wilcoxon rank sum test: 2 independent groups, non-parametric (not normal, unequal variances).
H0: median(male) = median(female)
Kruskal-Wallis: 3+ groups, non-parametric (not normal, unequal variances).
H0: median(group1) = median(group2) = median(group3)
Wilcoxon signed rank test: 2 measurements of 1 group (e.g. the post-pre test difference), non-parametric (not normal), compares medians.
Option 1: H0: median(jan_2023) = median(jan_2022); option 2: H0: median(diff) = 0
ANOVA: 3+ groups, normal, equal variances. Focused on means, e.g. investigating the effect of 3 types of lessons on school performance.
H0: mean(group1) = mean(group2) = mean(group3)
Independent sample t-test: means of 2 independent groups, normal, equal variances.
H0: mean(group1) = mean(group2)
One sample t-test: mean of 1 group compared to the population mean; normal distribution.
H0: mean(sample) = mean(population)
Welch t-test: compares means of 2 independent groups, unequal variances.
H0: mean(group1) = mean(group2)

Notation
Population proportion: π; sample proportion: p. Population mean: μ (for a regression slope: β1).
1-sample t-test of differences: H0: μ2 − μ1 = 0 (no change); HA: μ2 − μ1 ≠ 0 (change)
2-sided hypothesis: H0: μ = 100; HA: μ ≠ 100 (for a slope: H0: β1 = 0, or H0: β2 = 0)
1-sided hypothesis: H0: μ ≥ 100; HA: μ < 100

Formulas
Standard error (mean): s.e. = s / √n = SD of the sampling distribution
Standardised t-value: t = (x − μ) / s.e.
T-value (standardised effect of b): t = (b − β) / s.e., with β = 0 under H0
S.E. of ρ when ρ = 0: 1 / √(n − 1); if you use s instead of σ, use the t-distribution
Chi-square statistic – effect size: the % difference between 2 horizontal values (calculate the percentages first, from the totals in each column)

R code
Scatterplot of residuals: Healthdata %>% ggplot(aes(x = pred, y = res)) + geom_jitter()
or: plot(model2, 1)
Load data: dataset <- read.csv("data_scot.csv", sep = ",")
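The standard-error and t formulas above can be applied by hand; a minimal sketch in Python (the sample size, mean, and SD are made-up illustration values, not from the cheatsheet):

```python
from math import sqrt

# Hypothetical example: n = 25 scores, sample mean 104, sample SD 10,
# tested against a population mean of 100 (H0: mean(sample) = 100).
n = 25
sample_mean = 104.0
sample_sd = 10.0
mu_0 = 100.0

# Standard error (mean): s.e. = s / sqrt(n) = SD of the sampling distribution
se = sample_sd / sqrt(n)            # 10 / 5 = 2.0

# Standardised t-value: t = (x - mu) / s.e.
t = (sample_mean - mu_0) / se       # (104 - 100) / 2 = 2.0
print(se, t)                        # 2.0 2.0
```

By the cheatsheet's rule of thumb, this t of exactly 2 sits right at the boundary for rejecting H0.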

Interpreting t-values
*T-value > 2: it is very unlikely these data come from a population where β is 0 (the groups differ) -> we reject the null hypothesis.
*T-value < 2: it is very likely these data come from a population where there is no association between the variables (similar groups, β = 0).
The bigger t is, the smaller the p-value will be. t is calculated as: t = estimate / standard error.
If the estimate is big -> t will be big -> the p-value small. But: "the p-value associated with X1 is smaller than the p-value associated with X2 because the estimate (see output) of X1 is larger (in absolute terms) than X2" = incorrect, because the standard error matters as well.
Looking at X1 and X2: if p is low (< 0.05), X1 is statistically significantly associated with the dependent variable; if p > 0.05, it is not associated with the dependent variable.

Assumption tests
Levene's test measures variances: p > 0.05 = homoskedasticity in the residuals = equal spread.
Breusch-Pagan test measures variances: p > 0.05 = homoskedasticity = equal variances = do not reject H0; p < 0.05 = reject H0 = violation of the equal-variances assumption = heteroskedasticity.
Shapiro-Wilk test measures normality: p < 0.05 = reject H0 = not normal. W runs from 0 to 1; the closer to 1, the more normal.

Conclusions when medians do not differ significantly (e.g. health between genders):
-there is no difference in health between genders
-the medians between genders are not significantly different
-the medians between genders are similar
-we are 95% sure that NO significant health differences exist between males and females
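The warning above (a larger estimate does not guarantee a smaller p-value) can be made concrete with t = estimate / standard error; a small sketch with invented coefficients:

```python
# X1 has the larger estimate but also a much larger standard error,
# so its t-value is smaller (and its p-value therefore larger) than X2's.
est_x1, se_x1 = 4.0, 4.0
est_x2, se_x2 = 1.0, 0.2

t_x1 = est_x1 / se_x1   # 1.0 -> |t| < 2: no evidence of association
t_x2 = est_x2 / se_x2   # 5.0 -> |t| > 2: significant association
print(t_x1, t_x2)
```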
What does it mean that a case is influential, and why do we need to check whether influential cases are present?
Influential = it affects the estimates in the model.
Check = because it means that the estimates are strongly affected by only one or two data points.

Interaction / moderation
Income = B2 + B3*Private − B4*Unemp + B5*Education − B6*Private*Education − B7*Unemp*Education
R studio – interaction/moderation
(1) Create dummies: data$x_dummy_1 <- ifelse(data$x == "public", 1, 0)  (this selects public as the reference category)
(2) Use lm() to estimate the model parameters: data_sample %>% lm(income ~ private + unemp + educ + educ*private + educ*unemp, data = .) %>% summary()  -> are the t-values > 2 and the p-values < 0.05?
(3) Add residuals: data_clean$residuals <- model$residuals
(4) Add predictions: data_clean$pred <- model$fitted.values
OR in one step: data <- data %>% add_predictions(model1) %>% add_residuals(model1)
Check residuals: data_clean %>% ggplot(aes(x = pred, y = residuals)) + geom_point()
Low vs high level of education: model_low_educ <- data_clean %>% filter(education == 0) %>% lm(support ~ campaign, data = .)
Create an interaction term: data_clean <- data_clean %>% mutate(interaction = campaign*education)
model_withinteraction <- data_clean %>% lm(support ~ campaign + education + interaction, data = .)
summary(model_withinteraction)

Logistic regression
What is the expected probability of smoking for someone who is 50 years old?
b0 = -2.87; b1 = -0.015; x = 50
logodds = b0 + b1*x
p = exp(logodds) / (1 + exp(logodds))
Log-odds express a probability (p) as the logarithm of the ratio between success (p) and non-success (1 − p); p lies between 0 and 1.
By hand: log odds = b0 + b1*x; probability = e^logodds / (1 + e^logodds).
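The smoking example can be computed step by step; a quick check in Python using the b0, b1, and x given in the text:

```python
from math import exp

b0 = -2.87     # intercept (log-odds of smoking at age 0)
b1 = -0.015    # slope per year of age
x = 50         # age of interest

logodds = b0 + b1 * x                    # -2.87 - 0.75 = -3.62
p = exp(logodds) / (1 + exp(logodds))    # back-transform log-odds to a probability
print(round(p, 3))                       # 0.026 -> about a 2.6% probability
```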

Unit 554 – Multiple regression and non-linearity
Studying improves your grade (non-linearity) – the effect decreases when you study more:
Grade = β0 + β1*HoursStud, with β1 = β2 − β3*HoursStud
Grade = β0 + (β2 − β3*HoursStud)*HoursStud = β0 + β2*HoursStud − β3*HoursStud²
Non-linearity: (1) inspect via a scatterplot, (2) the residuals don't have equal variances -> solution: add a squared (^2) term.
For example (X = 75): Y = 20 + 0.6*X + 0.002*X² -> 20 + 0.6*75 + 0.002*(75*75) = 76.25
R studio – non-linearity
Scatterplot: data_sample %>% ggplot(aes(x = size, y = conflicts)) + geom_point() + geom_smooth(method = "lm")
Estimate the model: model1 <- data_sample %>% lm(conflicts ~ size, data = .)
Add residuals: data_sample <- data_sample %>% add_residuals(model1) OR data_sample$resid2 <- model1$residuals
Detect non-linearity: data_sample %>% ggplot(aes(x = size, y = resid2)) + geom_point() + geom_smooth()

Unit 560 – Non-normality of residuals & omitted variables (assumptions of normality & equal variance)
Errors (εi) are in the population; residuals (ei) are in the sample.
If a model is good, the errors will be random. Deviations are problematic because: (1) the mean of the residuals (zero) is affected by outliers/skew, (2) this mean is associated with b (the estimate), (3) we are less confident in the S.E. based on these means -> execute the steps below.
1. Visual histogram inspection.
2. QQ plot (range in the variable): shows the relationship between what you would expect to find (x-axis) under a normal distribution and what you observe (y-axis). It answers whether the distribution is normal (straight line); if the data deviate from normality, the line will display strong curvature. (The formal test for normality is the Shapiro-Wilk test.)
3. Shapiro-Wilk test (goodness-of-fit): the chance of finding a W in a sample smaller than the critical value. It tests the hypothesis that the distribution of the data deviates from a comparable normal distribution. If p < 0.05 -> reject the null hypothesis -> the data are not normally distributed. As the sample size increases, SW will lead to a greater probability of rejecting the null hypothesis. H0 = normal distribution.
R studio
Create a model -> create & store the residuals (steps 3 and 4 above) -> histogram (x = residuals)
Transforming Y: model2 <- data %>% lm(log(punish + 4) ~ crime, data = .) (this isn't easy); data$residuals <- model2$residuals OR the same query with sqrt instead of log
QQ plot: data %>% ggplot(aes(sample = residuals)) + geom_qq() + geom_qq_line()
Shapiro-Wilk test (test normality): shapiro.test(data$residuals)
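The worked quadratic example can be verified directly; a sketch in Python of the prediction equation Y = 20 + 0.6·X + 0.002·X²:

```python
def predict(x):
    # Quadratic prediction: intercept 20, linear term 0.6, squared term 0.002
    return 20 + 0.6 * x + 0.002 * x ** 2

y = predict(75)   # 20 + 0.6*75 + 0.002*5625 = 20 + 45 + 11.25
print(y)          # 76.25
```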
Unit 561 – Heteroscedasticity – non-equal variances and interaction effects
Homoscedasticity (residual variance) = homogeneity (of variances) = equal variances.
Heteroscedasticity = heterogeneity (of variances) = unequal variances -> this is bad because: (1) we only have one S.E. for the slope, and (2) the S.E. is used to evaluate the 'quality' of the slope and to find p-values.
It occurs because of (1) measurement error in Y (which is related to x) and (2) interaction effects.
-> Detect it by making a graph of X (predicted) against Y (residuals): save pred and resid and create the plot.
R studio
Create the differences t2_t1, create dummies, estimate an lm with read_diff, add residuals.
Levene's test (groups only): leveneTest(resid ~ as.factor(group), data) OR use read_diff instead of resid, e.g. levenetest <- leveneTest(read_diff ~ treatment, data)
Breusch-Pagan (for lm): bptest <- bptest(model1); bptest

Unit 563 – Outliers, influential cases, and multicollinearity
Residual: the extent to which a data point lies away from the estimated line.
Leverage: an outlier on x (the IV), say beyond +/- 2 S.E.: how much the observation's value on the predictor variable differs from the mean of the predictor variable. Look at whether it differs from the rest.
Influence: the extent to which the slope of the line is affected by the data point. Determined by residual and leverage: when both are high, you have influence.
A case is influential if it has:
-high leverage: far out on the x axis
-high residual: far out on the y axis
-high impact: removing/including it would change the slope & estimate
If a case has high leverage, residual and impact, Cook's distance will be > 0.5 or even > 1.
Cook's distance in R studio (unit 563):
1st: Create a model, for example: model <- lm(y ~ x, data = data563)
2nd: Plot the graph using one of the following:
plot(model)  (then hit enter until you reach the diagnostic plots)
plot(model, 4)
3rd (optional): add all the Cook's distances to the dataset so you can find the influential case directly in the table: data563$cd <- cooks.distance(model). This can help to identify other information about this case.

Unit 470 – Classification measures
True positives (TP): 108 (correctly predicted 1's)
True negatives (TN): 84 (correctly predicted 0's)
False positives (FP): 60 (0's wrongly predicted as 1's)
False negatives (FN): 69 (1's wrongly predicted as 0's)
Prevalence = total number of 1's / total observations = (108 + 69) / 321
Accuracy = correctly predicted observations / total observations = (108 + 84) / 321
Specificity = TN / (TN + FP) = 84 / (84 + 60)
Sensitivity = TP / (TP + FN) = 108 / (108 + 69)
Precision = TP / (TP + FP) = 108 / (108 + 60)
In R:
TP <- 108
TN <- 84
FP <- 60
FN <- 69
Total <- 321
prevalence <- (TP + FN) / Total
accuracy <- (TP + TN) / Total
specificity <- TN / (TN + FP)
sensitivity <- TP / (TP + FN)
precision <- TP / (TP + FP)
list(prevalence = prevalence, accuracy = accuracy, specificity = specificity, sensitivity = sensitivity, precision = precision)

Other notes
2 curved lines = a quadratic (non-linear) relationship.
Do not square the dummy variable!
Always start with b0 (the intercept).
Interaction = differences between groups/effects; addition = each effect is considered separately, e.g. lm(health ~ bmi + smoking).
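The Unit 470 formulas give the following values for the example table; the same arithmetic, sketched in Python:

```python
# Confusion-matrix counts from the Unit 470 example
TP, TN, FP, FN = 108, 84, 60, 69
total = TP + TN + FP + FN                # 321 observations

prevalence  = (TP + FN) / total          # share of actual 1's
accuracy    = (TP + TN) / total          # share of correct predictions
specificity = TN / (TN + FP)             # correct among actual 0's
sensitivity = TP / (TP + FN)             # correct among actual 1's
precision   = TP / (TP + FP)             # correct among predicted 1's

print(round(prevalence, 3), round(accuracy, 3), round(specificity, 3),
      round(sensitivity, 3), round(precision, 3))
```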
