Summary

Experimental Design and Data Analysis (X_405078): Complete summary with Code implementation in R

Pages
15
Uploaded on
12-04-2021
Written in
2020/2021

A full overview of the course content, with implementations in R.


Experimental Design and Data Analysis Course VU
Tim de Boer
February-March 2021


1 Lecture 1
Contents: recap of statistical concepts.

• The normal density curve is given by:
$$f_{\mu,\sigma}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}(x-\mu)^2/\sigma^2\right)$$
where µ determines the position of the peak on the x-axis and σ determines the width of the density curve.
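
As a quick check of the density formula above, the following sketch writes it out in R and compares it with the built-in `dnorm()`. The function name `normal_density` and the grid of x values are assumptions made for illustration.

```r
# Normal density written out from the formula above
normal_density <- function(x, mu, sigma) {
  1 / sqrt(2 * pi * sigma^2) * exp(-0.5 * (x - mu)^2 / sigma^2)
}

x <- seq(60, 80, by = 0.5)            # hypothetical grid of x values
manual <- normal_density(x, mu = 70, sigma = 3)
builtin <- dnorm(x, mean = 70, sd = 3)

all.equal(manual, builtin)            # the two should agree up to rounding
x[which.max(manual)]                  # the peak sits at x = mu
```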

• The α-quantile is the number q_α such that P(X ≤ q_α) = α. The upper α-quantile is defined from the
other side: P(X > q_α) = α. For α = 0.5, q_α is the median. For α = 0.25, q_α is the value that splits
the distribution into 25% below and 75% above q_α. The function qnorm() finds the boundary value A
in P(X < A) = p for a given probability p. For example, to find the 85th percentile of a normal
distribution with mean 70 and standard deviation 3, call:

qnorm(0.85,mean=70,sd=3)
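
A small sketch of how `qnorm()` and `pnorm()` relate, using the same mean and standard deviation as the example above. The value of `alpha` and the variable names are assumptions chosen for illustration.

```r
q85 <- qnorm(0.85, mean = 70, sd = 3)       # 85th percentile
pnorm(q85, mean = 70, sd = 3)               # maps back to the probability 0.85

# Upper quantile: the value with probability alpha to its right
alpha <- 0.05
upper <- qnorm(alpha, mean = 70, sd = 3, lower.tail = FALSE)
lower_equiv <- qnorm(1 - alpha, mean = 70, sd = 3)
all.equal(upper, lower_equiv)               # upper alpha-quantile = (1 - alpha)-quantile

qnorm(0.5, mean = 70, sd = 3)               # the median, equal to the mean here
```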


• A QQ-plot can reveal whether data follow a certain distribution P. It plots the theoretical quantiles
of that distribution (here the normal distribution) on the x-axis against the ordered sample values
(sample quantiles) on the y-axis. If the points lie roughly on a straight line, the data are consistent
with that distribution. Use the QQ-plot together with a histogram to judge whether the data are
normally distributed. Also draw a boxplot to get an idea of group differences and spread.

par(mfrow=c(1,3)); qqnorm(data); hist(data); boxplot(hours~environment) # three panels side by side
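
To see what the QQ-plot looks like for normal and non-normal data, here is a minimal sketch on simulated samples. The sample sizes, the seed, and the helper `qq_cor` are assumptions made for illustration and are not part of the course material.

```r
set.seed(1)
normal_data <- rnorm(100, mean = 70, sd = 3)   # simulated normal sample
skewed_data <- rexp(100, rate = 1)             # simulated right-skewed sample

par(mfrow = c(2, 2))
qqnorm(normal_data, main = "QQ-plot, normal sample"); qqline(normal_data)
hist(normal_data, main = "Histogram, normal sample")
qqnorm(skewed_data, main = "QQ-plot, skewed sample"); qqline(skewed_data)
hist(skewed_data, main = "Histogram, skewed sample")
par(mfrow = c(1, 1))

# Correlation of the QQ-plot points: close to 1 when the points lie near a line
qq_cor <- function(x) { qq <- qqnorm(x, plot.it = FALSE); cor(qq$x, qq$y) }
qq_cor(normal_data); qq_cor(skewed_data)
```

The points of the normal sample should follow the reference line drawn by `qqline()`, while the skewed sample should bend away from it in the upper tail.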


• Central Limit Theorem: draw repeated samples from an (unknown) distribution and calculate the
mean of each sample. The distribution of these sample means is closer to a normal distribution than
the original distribution, and the larger the sample size n, the better the normal approximation.
When a sample is taken from the distribution N(µ, σ²), the sample mean is exactly N(µ, σ²/n)
distributed. In both cases the sample mean varies less than the individual observations: its variance
is σ²/n instead of σ².
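
The effect can be illustrated with a short simulation. This is a sketch under assumed settings: the exponential distribution, the number of repetitions `B`, and the two sample sizes are choices made here, not values from the course.

```r
set.seed(1)
B <- 2000                                   # number of repeated samples (assumed)
sample_means <- function(n) replicate(B, mean(rexp(n, rate = 1)))

means_small <- sample_means(5)
means_large <- sample_means(100)

# Exp(1) has mean 1 and sd 1, so the sample mean has sd close to 1/sqrt(n)
c(sd(means_small), 1 / sqrt(5))
c(sd(means_large), 1 / sqrt(100))

par(mfrow = c(1, 2))
hist(means_small, main = "n = 5"); hist(means_large, main = "n = 100")
par(mfrow = c(1, 1))
```

The histogram for n = 5 should still look skewed, while the one for n = 100 should look close to a normal curve and be much narrower.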

• In a real dataset, the population standard deviation σ is unknown. We replace σ with the sample
standard deviation s, which gives the T-statistic for the sample mean:

$$T = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$

which does not have the N(0,1) distribution, because estimating σ by s adds uncertainty. Instead, for
normally distributed data, T has the t-distribution with n − 1 degrees of freedom.
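
A minimal sketch of the T-statistic computed by hand and compared with the value reported by `t.test()`. The simulated sample `x` and the value `mu0` are assumptions for illustration.

```r
set.seed(1)
x <- rnorm(15, mean = 70, sd = 3)      # simulated sample, true mean 70
mu0 <- 70                              # value of mu used in the statistic
n <- length(x)

t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(n))
t_builtin <- unname(t.test(x, mu = mu0)$statistic)
all.equal(t_manual, t_builtin)

# t quantiles are larger than the normal one and approach it as n grows
qt(0.975, df = c(4, 14, 99)); qnorm(0.975)
```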



• A point estimate for an unknown parameter (for example the mean µ) is a function of only the observed
data, seen as a random variable. It is denoted with a hat, e.g. µ̂.
• A confidence interval of level 1 − α, e.g. 95%, is a random interval based only on the observed data that
contains the true value of the parameter with probability 95%. If σ is unknown (which is true in
almost all cases), the t-confidence interval is [X̄ − t·s/√n, X̄ + t·s/√n], with t the upper α/2-quantile
of the t-distribution with n − 1 degrees of freedom. For a proportion this reads: how confident are we
that the true proportion lies within about 2 standard errors of the sample proportion p̂. For a 95%
confidence interval we need the 97.5th percentile, because 2.5% is cut off in each tail. With known σ
and a normally distributed population:
$$CI = \bar{X} \pm \mathrm{qnorm}(0.975) \cdot \frac{\sigma}{\sqrt{n}}$$
When σ is estimated from the sample we use the t-distribution:
$$CI = \bar{X} \pm \mathrm{qt}(0.975,\, n-1) \cdot \frac{s}{\sqrt{n}}$$
with s the standard deviation of the sample, used in place of the population standard deviation σ.
In R for a normally distributed population:

xbar = mean(birthweights); s = sd(birthweights); n = length(birthweights)
error = qnorm(0.975)*s/sqrt(n) # normal quantile; use qt(0.975, df=n-1) when sigma is estimated by s
lowerbound = xbar - error; upperbound = xbar + error
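
The t-based version of the interval can be checked against the interval that `t.test()` reports. This is a sketch on simulated data: `birthweights_sim` and its parameters are hypothetical stand-ins, since the course dataset is not included here.

```r
set.seed(1)
birthweights_sim <- rnorm(50, mean = 2900, sd = 700)   # hypothetical stand-in data

xbar <- mean(birthweights_sim)
s <- sd(birthweights_sim)
n <- length(birthweights_sim)

margin <- qt(0.975, df = n - 1) * s / sqrt(n)          # t quantile, n - 1 df
ci_manual <- c(xbar - margin, xbar + margin)

ci_builtin <- as.numeric(t.test(birthweights_sim, conf.level = 0.95)$conf.int)
all.equal(ci_manual, ci_builtin)
```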


• Strong outcome: H0 is rejected, so we conclude H1. Weak outcome: H0 is not rejected. Type 1 error:
rejecting H0 while it is true; type 2 error: not rejecting H0 while it is false.
• Power = 1 − P(type 2 error), the probability of correctly rejecting H0 (detecting an effect that is
really there). Power depends on the amount of data, among other things. To estimate the power of a
test by simulation, we repeat the test 1000 times: each time we generate samples x and y from
distributions with chosen parameters, do a t-test, and record the p-value. The estimated power is the
fraction of p-values below the threshold of 0.05, computed as the mean of the indicator p < 0.05. For
this example, the null hypothesis we are testing is H0 : nu = mu.

B = 1000; nu = 175; mu = 180; m = n = 30; sd = 5; p = numeric(B)
for (b in 1:B) {
x = rnorm(n,mu,sd); y = rnorm(m,nu,sd)
p[b] = t.test(x,y,var.equal=TRUE)[[3]]} # 3rd element is the p-value for our H_0
power = mean(p<0.05) # fraction of simulated tests that reject H_0
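
As a cross-check of the simulation approach, the sketch below estimates both the type 1 error rate (H0 true) and the power (H0 false) with the same settings, and compares the simulated power with the analytical value from `power.t.test()` in the `stats` package. The helper `simulate_rejection_rate` and the seed are assumptions made for illustration.

```r
set.seed(1)
B <- 1000; n <- 30; sd_true <- 5

# Fraction of B simulated two-sample t-tests that reject at level 0.05
simulate_rejection_rate <- function(mu, nu) {
  p <- replicate(B, t.test(rnorm(n, mu, sd_true), rnorm(n, nu, sd_true),
                           var.equal = TRUE)$p.value)
  mean(p < 0.05)
}

type1_rate <- simulate_rejection_rate(180, 180)   # H0 true: should be near 0.05
power_sim  <- simulate_rejection_rate(180, 175)   # H0 false: estimated power

# Analytical power for the same setting
power_theory <- power.t.test(n = n, delta = 5, sd = sd_true,
                             sig.level = 0.05, type = "two.sample")$power
c(type1_rate, power_sim, power_theory)
```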


• There are three equivalent ways to reject H0: the absolute t-value exceeds the quantile, the p-value is
below 0.05, or the hypothesized mean lies outside the confidence interval (the sample mean plus or
minus about 2 standard errors).
• Since σ is unknown, we generally use the t-distribution with quantile t_{0.025,n−1} (upper 2.5%, n − 1
degrees of freedom); this makes the CI wider (more conservative), since t > z = 1.96.
• The two-sample t-test statistic is the difference of the sample means divided by its standard error.
With equal variances, the pooled variance is s_p² = ((n − 1)s_x² + (m − 1)s_y²)/(n + m − 2) and
T = (X̄ − Ȳ)/(s_p·√(1/n + 1/m)), with n + m − 2 degrees of freedom. The test can be unreliable for
sample sizes below about 20 when the data are not normal. In R it is simple: t.test(x, y), where test
data can be created with x = rnorm(size, mean, sd).
• For a one-sample test (is the population mean equal to, smaller than, or larger than a given value?) we
can use the t-test or the sign test. For normal data the t-test has more power (closer to 1): it makes a
stronger assumption (the data must be normal) and therefore performs better than the sign test,
which does not assume a normal distribution. We can do a one-sided t-test as follows, in this case
to check whether the mean is larger than 2800:
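
A minimal sketch of such a one-sided test, run on simulated data because the course dataset is not available here; the vector `weights` and its parameters are assumptions. It also shows that the three rejection rules agree, and runs a sign test through `binom.test()` for comparison.

```r
set.seed(1)
weights <- rnorm(40, mean = 2950, sd = 600)   # hypothetical data, not the course dataset
mu0 <- 2800; n <- length(weights)

# One-sided t-test of H0: mu <= 2800 against H1: mu > 2800
tt <- t.test(weights, mu = mu0, alternative = "greater")

# Three equivalent rejection rules at level 0.05
reject_t  <- unname(tt$statistic) > qt(0.95, df = n - 1)   # t-value above quantile
reject_p  <- tt$p.value < 0.05                             # p-value below 0.05
reject_ci <- mu0 < tt$conf.int[1]                          # mu0 below the one-sided CI
c(reject_t, reject_p, reject_ci)

# Sign test: count observations above mu0 and test against a fair coin
sign_test <- binom.test(sum(weights > mu0), n, p = 0.5, alternative = "greater")
sign_test$p.value
```

The sign test only uses whether each observation lies above or below 2800, which is why it needs no normality assumption and typically has less power on normal data.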


