Summary

Experimental Design and Data Analysis (X_405078): Complete summary with Code implementation in R

Uploaded on

12-04-2021

Written in

2020/2021

A full overview of the course content, with implementations in R.

Experimental Design and Data Analysis Course VU
Tim de Boer
February-March 2021

1 Lecture 1
Contents: recap of statistical concepts.

• The normal density curve is given by:
$$f_{\mu,\sigma}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\,\frac{(x-\mu)^2}{\sigma^2}\right)$$
where µ determines the position of the peak on the x-axis and σ determines the width of the density curve.
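
As a quick check of the density formula, the sketch below writes it out by hand and compares it with R's built-in dnorm(). The helper name f_norm and the values µ = 70, σ = 3 are assumptions chosen for illustration only.

```r
# Normal density written out term by term (f_norm is a hypothetical helper name)
f_norm <- function(x, mu, sigma) {
  1 / sqrt(2 * pi * sigma^2) * exp(-0.5 * (x - mu)^2 / sigma^2)
}

x <- seq(60, 80, by = 0.5)                      # grid of x values around the mean
hand  <- f_norm(x, mu = 70, sigma = 3)          # formula from the notes
built <- dnorm(x, mean = 70, sd = 3)            # R's built-in normal density

max_diff <- max(abs(hand - built))              # should be numerically zero
peak_x   <- x[which.max(hand)]                  # the peak sits at x = mu
```

Changing sigma in this sketch widens or narrows the curve, while changing mu only shifts the peak.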

• An α-quantile is a number qα such that P (X ≤ qα ) = α. The upper α-quantile is defined from the
other side: P (X > qα ) = α. For α = 0.5, qα is the median. For α = 0.25, qα is the value that splits
the data into 25% below and 75% above qα . The function qnorm() returns the boundary value A in
P(X < A) = p for a given probability p. For example, suppose you want to find the 85th percentile
of a normal distribution whose mean is 70 and whose standard deviation is 3. Then you ask for:

qnorm(0.85,mean=70,sd=3)
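
To make the link between quantiles and probabilities concrete, the following sketch checks that pnorm() inverts qnorm() and shows the upper quantile via the lower.tail argument. The variable names are illustrative assumptions.

```r
alpha <- 0.85

# Lower quantile: P(X <= q) = 0.85 for X ~ N(70, 3^2); q is about 73.11
q <- qnorm(alpha, mean = 70, sd = 3)

# pnorm is the inverse operation: it maps the quantile back to the probability
p_back <- pnorm(q, mean = 70, sd = 3)

# Upper quantile: P(X > q_upper) = 0.85, i.e. the same as the lower 0.15-quantile
q_upper <- qnorm(alpha, mean = 70, sd = 3, lower.tail = FALSE)

# For alpha = 0.5 the quantile is the median, which equals the mean for a normal
med <- qnorm(0.5, mean = 70, sd = 3)
```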

• A QQ-plot can reveal whether data follow a certain distribution P . It plots the theoretical
quantiles of that distribution (here the normal distribution) on the x-axis against the ordered
sample values (sample quantiles) on the y-axis. Points close to a straight line indicate that the
data could have been sampled from that distribution. Use qqnorm together with a histogram to judge
whether the data are normally distributed. Also plot a boxplot to get an idea of differences and spread.

par(mfrow=c(1,3)); qqnorm(data); hist(data); boxplot(hours~environment) # three panels side by side
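
What qqnorm() draws can be reproduced by hand, which may help when reading the plot. This is a minimal sketch on simulated data (the samples x and e are assumptions, not course data): it pairs theoretical normal quantiles with the sorted sample and uses the correlation as a rough measure of straightness.

```r
set.seed(1)
x <- rnorm(200, mean = 70, sd = 3)     # simulated sample that really is normal
e <- rexp(200, rate = 1)               # simulated right-skewed sample for contrast

theo <- qnorm(ppoints(length(x)))      # theoretical N(0,1) quantiles (x-axis)

r_norm <- cor(theo, sort(x))           # near 1: points lie close to a straight line
r_exp  <- cor(theo, sort(e))           # lower: the skewed sample bends away from the line

# qqnorm computes the same coordinates; plot.it = FALSE returns them without drawing
qq <- qqnorm(x, plot.it = FALSE)
```

Calling qqnorm(x); qqline(x) draws the same comparison with a reference line added.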

• Central Limit Theorem: repeatedly draw samples from an (unknown) distribution and calculate the
mean of each sample. The distribution of these sample means is approximately normal, and the
approximation improves as the sample size grows. When a sample of size n is taken from the
distribution N(µ, σ²), the sample mean is distributed as N(µ, σ²/n). In other words, the sample
mean varies less than a single observation.
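
A small simulation can illustrate this. The sketch below assumes an exponential distribution with rate 1 (mean 1, standard deviation 1) as the "unknown" distribution; the sample size n and number of repetitions B are arbitrary choices.

```r
set.seed(42)
n <- 50      # size of each sample
B <- 5000    # number of repeated samples

# Draw B samples of size n from a clearly non-normal distribution and keep each mean
means <- replicate(B, mean(rexp(n, rate = 1)))

mean_of_means <- mean(means)   # close to mu = 1
sd_of_means   <- sd(means)     # close to sigma / sqrt(n) = 1 / sqrt(50)

# hist(means) or qqnorm(means) would show an approximately normal shape
```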

• In a real dataset, the population standard deviation σ is unknown. We replace σ with the sample
standard deviation s, which gives the T-statistic:

$$T = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$

which does not have a N(0,1) distribution because of the extra uncertainty from estimating σ.
Instead, for normally distributed data, T has a t-distribution with n − 1 degrees of freedom.
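
The T-statistic can be computed directly from its definition and compared with what t.test() reports. The data and the hypothesized mean mu0 below are simulated assumptions for illustration.

```r
set.seed(7)
n   <- 20
mu0 <- 10                                  # hypothesized mean (assumption)
x   <- rnorm(n, mean = 10, sd = 2)         # simulated sample

# T = (sample mean - mu) / (s / sqrt(n))
T_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))
T_ref  <- unname(t.test(x, mu = mu0)$statistic)   # same statistic from t.test

# The t-distribution has heavier tails than N(0,1), so its quantiles are larger
t_q <- qt(0.975, df = n - 1)    # about 2.09 for 19 degrees of freedom
z_q <- qnorm(0.975)             # about 1.96
```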

• A point estimate for an unknown parameter (for example the mean) is a function of only the observed
data, seen as a random variable. Denote it with a hat, e.g. µ̂.
• A confidence interval of level 1 − α, e.g. 95%, is a random interval based only on the observed data
that contains the true value of the parameter with probability 1 − α. If σ is unknown (which is true in
almost all cases), the t-confidence interval becomes [X̄ − t · s/√n, X̄ + t · s/√n], with t the upper
α/2 quantile of the t-distribution with n − 1 degrees of freedom. The same idea applies to a
proportion: how confident are we that the true proportion lies within about 2 standard errors of the
sample proportion p̂? To calculate a 95% confidence interval for a normally distributed population
with known σ, we need the 97.5th percentile:
$$\mathrm{CI}_{\text{range}} = \bar{X} \pm \mathrm{qnorm}(0.975) \cdot \frac{\sigma}{\sqrt{n}}$$
When σ is estimated from the sample, we use the t-distribution:
$$\mathrm{CI}_{\text{range}} = \bar{X} \pm \mathrm{qt}(0.975,\, n-1) \cdot \frac{s}{\sqrt{n}}$$
with s the standard deviation of the sample instead of σ, which we use for the whole population.
In R, for a normally distributed population:

mu = mean(birthweights); s = sd(birthweights); size = length(birthweights)
error = qnorm(0.975)*s / sqrt(size) # with sigma estimated from the sample: qt(0.975, df=size-1)
lowerbound = mu - error; upperbound = mu + error
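
For the t-based interval, the hand calculation can be checked against the interval that t.test() returns. This sketch uses a simulated vector named weights as a hypothetical stand-in for the birthweights data, which is not included here.

```r
set.seed(3)
weights <- rnorm(40, mean = 2900, sd = 700)   # hypothetical stand-in for birthweights

n    <- length(weights)
xbar <- mean(weights)
s    <- sd(weights)

# 95% t-interval: xbar +/- qt(0.975, n - 1) * s / sqrt(n)
margin  <- qt(0.975, df = n - 1) * s / sqrt(n)
ci_hand <- c(xbar - margin, xbar + margin)

# t.test reports the same interval in its conf.int element
ci_ref <- as.numeric(t.test(weights)$conf.int)
```

Replacing qt(0.975, df = n - 1) with qnorm(0.975) gives the slightly narrower normal-based interval.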

• Strong outcome: H0 is rejected, so we conclude H1. Weak outcome: H0 is not rejected (this does not
show that H0 is true). Type 1 error: rejecting H0 while it is true. Type 2 error: not rejecting H0
while it is false.
• Power = 1 − P(type 2 error): the probability of correctly rejecting H0 (detecting an effect that is
really there). Power depends on the amount of data. To estimate the power of a test by simulation,
we repeat the test 1000 times: each time we generate samples x and y from distributions with chosen
parameters, do a t-test, and record the p-value. The power estimate is the fraction of p-values
below our threshold of 0.05, which we calculate as the mean of the indicator p < 0.05. For this
example, the null hypothesis we are testing is H0 : nu = mu.

B = 1000; nu = 175; mu = 180; m = n = 30; sd = 5; p = numeric(B)
for (b in 1:B) {
  x = rnorm(n,mu,sd); y = rnorm(m,nu,sd)
  p[b] = t.test(x,y,var.equal=TRUE)[[3]]  # 3rd element is the p-value for our H_0
}
power = mean(p<0.05)  # fraction of simulated tests that reject H_0
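
Two cross-checks on this simulation may be useful. The sketch below, under the same assumed parameters (n = 30 per group, difference 5, standard deviation 5), computes the power analytically with power.t.test() and then estimates the type 1 error rate by simulating with H0 true. The name sigma and the number of repetitions are illustrative choices.

```r
set.seed(11)
B <- 2000; n <- 30; sigma <- 5

# Analytic power of the two-sample, two-sided t-test for a true difference of 5
analytic <- power.t.test(n = n, delta = 5, sd = sigma, sig.level = 0.05)$power

# Type 1 error rate: simulate with H0 true (both means equal) and count rejections
p0 <- numeric(B)
for (b in 1:B) {
  x <- rnorm(n, 180, sigma)
  y <- rnorm(n, 180, sigma)
  p0[b] <- t.test(x, y, var.equal = TRUE)$p.value   # same value as [[3]]
}
type1_rate <- mean(p0 < 0.05)   # should be close to the 0.05 threshold
```

The simulated power from the loop in the notes should land near the analytic value, up to simulation noise.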

• There are three equivalent ways to reject H0 : the absolute t-value is bigger than the critical
quantile, the p-value is lower than 0.05, or the hypothesized mean is not in the confidence interval
(so the mean we want to test is not within roughly 2 standard errors of the sample mean).
• Since we don’t know σ, we generally use the t-distribution with quantile t0.025,n−1 (upper 2.5%,
n − 1 degrees of freedom); this makes the CI wider (more conservative) since t > z, where z = 1.96.
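• Two-sample t-test: subtract the sample means and divide by the standard error of the difference.
With equal variances, the pooled variance is $s_p^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2}$ and
$T = \frac{\bar{X} - \bar{Y}}{s_p\sqrt{1/n + 1/m}}$, which has n + m − 2 degrees of freedom. Unreliable
for sample sizes below 20. In R it is simple: t.test(x, y), where x and y can be simulated with
x = rnorm(size, mean, sd) (note that rnorm takes the standard deviation, not the variance).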
• For a one-sample test (is the data mean equal to / smaller than / bigger than a certain value?) we
can use the t-test or the sign test. For normal data, the t-test has bigger power (closer to 1): it
makes a stronger assumption (data must be normal) and therefore performs better than the sign test,
which does not assume a normal distribution. We can do a one-sided t-test as follows, in this case
to check if the mean is bigger than 2800:
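
A minimal sketch of such a one-sided test, not taken from the original notes: the vector w is simulated as a hypothetical stand-in for the course data. It also checks that the three rejection criteria from the list above agree, and runs a sign test through binom.test(), which is one common way to do it.

```r
set.seed(5)
w   <- rnorm(50, mean = 2950, sd = 600)   # hypothetical data, not the course dataset
mu0 <- 2800
n   <- length(w)

# One-sided t-test, H0: mu <= 2800 against H1: mu > 2800
res <- t.test(w, mu = mu0, alternative = "greater")

# Three equivalent rejection criteria at level 0.05
t_val     <- (mean(w) - mu0) / (sd(w) / sqrt(n))
reject_t  <- t_val > qt(0.95, df = n - 1)    # t-value beyond the critical quantile
reject_p  <- res$p.value < 0.05              # p-value below the threshold
reject_ci <- mu0 < res$conf.int[1]           # mu0 below the one-sided interval [lower, Inf)

# Sign test: count observations above mu0 and test against a fair coin
sign_p <- binom.test(sum(w > mu0), n, p = 0.5, alternative = "greater")$p.value
```

For a one-sided test the critical quantile is qt(0.95, n − 1) rather than qt(0.975, n − 1), because all of the 5% rejection probability lies in one tail.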


Written for

Institution: Vrije Universiteit Amsterdam (VU)
Study: MSc Artificial Intelligence
Course: Experimental Design and Data Analysis (X_405078)

Document information

Uploaded on: April 12, 2021
Number of pages: 15
Written in: 2020/2021
Type: SUMMARY

Subjects

statistics
experimental design
introduction to r
r
