STATISTICS AND METHODOLOGY
Recap of RBMS:
- RQ → hypotheses about a population → based on that we design a study and collect
data on a sample of the population → descriptive statistics to describe what we see in
the sample → inferential statistics to determine how likely it is that the things we
observe in the sample are true for the population.
- p-value: if the data are unlikely to occur given that the H0 is true, we reject the H0;
if they are likely to occur, we retain the H0.
- We can’t prove a negative; we support a positive by rejecting a negative. And only
‘probably’, because there is a certain level of uncertainty; statistics help us quantify
that uncertainty, i.e. how probable the observation is when the H0 is true. It doesn’t
tell us about reality: we only find support for hypotheses, we never prove things.
- Never say that the H0 is true, only that we retain it.
- RQ: PICOS: population, intervention, comparison, outcomes, study design.
- H0: no effect, no difference, no association; H1: an effect in either direction. Based
on literature.
- Two-sided: an effect in either direction; One-sided: only one direction, only when the
other direction is biologically impossible.
- Study design: is it about a causal effect or an association; what are the dependent and
independent variables (for the independent variable, also how it is manipulated); is the
design paired or unpaired.
- Observational: cross-sectional (all measures at the same time), case-control (you look
back in time), prospective (you follow the sample over time); or experimental: RCT
(participants are assigned to one condition only), cross-over design (participants get all
conditions, but the order is randomly assigned). Experimental designs are the only ones
from which you can draw causal conclusions.
- Descriptive statistics: mean, median, mode; range, interquartile range, variance, SD;
graphs and figures.
- Inferential statistics: the p-value represents the probability of the data given that the
null hypothesis is true; if the data are unlikely, we reject H0 and accept H1, namely
when p-value < α (e.g. 0.05). TS = (point estimate - expected value)/SE, i.e. the
deviation of the observed data from the data expected under the H0.
- p-value: we use probability distributions, empirical is based on the data in the
sample, while the theoretical is the hypothetical population distribution. If the
empirical data resemble the normal distribution, then the properties of the normal
distribution can be used to draw conclusions about the population based on the
sample using parametric statistics, and if not then we use nonparametric.
- Normal distribution: symmetrical around the midpoint; 95% of the observations are
between the mean - 1.96 SD and the mean + 1.96 SD, 2.5% are lower than the (-)
and 2.5% are higher than the (+).
- Z-scores: used to standardize a normal distribution to the standard normal distribution,
so that scores from two different normal distributions can be compared: z = (x - mean)/SD.
We can then calculate the probability of a value relative to this z-score, either from the
area under the curve or from the standard normal table (see the sketch after this list).
- If the test statistic t lies between the critical t values, the probability of observing
the data given that the H0 is true is not small (> 5%), so we retain the H0: a
non-significant result. If t is more extreme than the critical t, the data are very unlikely
under the H0, so we reject the H0: a significant result.
- Failing to reject the H0 doesn’t mean you can accept the H0 as true; you can only
say that there is not enough evidence to conclude that it is untrue. Also, you do not
know the size of the effect, i.e. whether it is a strong signal or just little noise.
- Test selection: difference in means, proportions, or an association? DV and IV level
of measurements? DV normally distributed? IV levels? Paired or unpaired?
- Errors: type I error: rejecting the H0 when we should have retained it; type II error:
retaining the H0 when we should have rejected it.
- Confidence intervals: we check whether the value hypothesized under the H0 falls in an
interval around the sample estimate, an interval of which we are very confident that it
contains the population parameter. CI = point estimate ± margin of error (critical statistic
value × SE). We retain the H0 when the hypothesized value lies in the CI. The interpretation
is that we are 95% confident that the population mean lies within the CI around the sample
mean, which is an advantage over a bare p-value (illustrated in the sketch after this list).
- Correlation: a measure of covariation, i.e. whether a change in one variable is associated
with a change in the other variable. Correlation coefficients tell us the direction of the
relationship (+ or -) and its strength (from -1 to 1): Pearson’s product-moment correlation
(parametric) or Spearman’s rho (non-parametric).
- Covariation is not causality: the direction of the effect is unknown, and another variable
or a triangular explanation may account for the association.
- Correlation coefficients are measures of linear association; for a nonlinear association
the coefficient can be (close to) 0, which then has no meaning.
- Correlation coefficients are sensitive to outliers, so always draw the data in a plot.
- Regression adds something: with the regression line it describes the direction and form
of the relationship. Still, it does not imply causality, which is determined only by the
research design (experimental only!).
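A minimal numerical sketch tying together the z-score, confidence-interval and correlation bullets above, assuming Python with NumPy and SciPy available; the numbers (mean 100, SD 15, the small samples) are hypothetical illustrations, not course data.

```python
import numpy as np
from scipy import stats

# Z-score: where does x = 130 sit in a normal distribution with mean 100, SD 15?
x, mu, sd = 130.0, 100.0, 15.0
z = (x - mu) / sd                      # standardize: (x - mean) / SD
p_above = 1 - stats.norm.cdf(z)        # area under the curve above z (upper tail)
print(f"z = {z:.2f}, P(X > {x}) = {p_above:.3f}")

# 95% CI around a sample mean: point estimate +/- critical value * SE
sample = np.array([102, 95, 110, 99, 104, 97, 108, 101])   # hypothetical sample
m, s, n = sample.mean(), sample.std(ddof=1), len(sample)
se = s / np.sqrt(n)                    # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)  # two-sided 95% critical t
print(f"mean = {m:.1f}, 95% CI = ({m - t_crit * se:.1f}, {m + t_crit * se:.1f})")
# Retain H0 (e.g. H0: population mean = 100) if that value lies inside the CI.

# Correlation: Pearson (parametric) vs Spearman's rho (non-parametric)
x_var = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y_var = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 7.0])
r, p_r = stats.pearsonr(x_var, y_var)
rho, p_rho = stats.spearmanr(x_var, y_var)
print(f"Pearson r = {r:.2f} (p = {p_r:.3f}), Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
```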
Regression I:
Statistical models: understanding the relationship between variables, both for experimental and
observational studies. Statistics tell you whether an observed relationship between X and Y
in your sample is likely to reflect a true relationship in the unobserved population, accounting
for chance effects of random sampling and uncovering possible biases. All statistical tests
develop a model that describes the data well and makes accurate predictions about new
data points in the population.
Outcome_i = model + error_i (for each individual data point).
The means model is the simplest model: the mean is the best guess, giving us the most accurate
prediction if we don't have any additional information. If we do have additional useful
information, we can make a better, more predictive model by including it:
outcome_i = (mean + b*Xi) + error_i, where b*Xi is the contribution of the additional
information, i.e. the deviation predicted for a particular individual (a minimal sketch
comparing the two models follows below).
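A minimal sketch of the "model + error" idea on a hypothetical toy dataset: the mean-only model versus a model that adds a predictor, compared by the squared error each leaves unexplained.

```python
import numpy as np

# Hypothetical data: outcome y and an additional piece of information x
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([3.1, 4.0, 4.8, 6.2, 6.9, 8.1, 8.8, 10.2])

# Means model: predict the same value (the mean) for everyone
pred_mean = np.full_like(y, y.mean())
sse_mean = np.sum((y - pred_mean) ** 2)     # error left over after the mean

# Model with a predictor: outcome_i = a + b * x_i + error_i
b, a = np.polyfit(x, y, deg=1)              # least-squares slope and intercept
pred_reg = a + b * x
sse_reg = np.sum((y - pred_reg) ** 2)       # error left over after using x

print(f"SSE mean-only model: {sse_mean:.2f}")
print(f"SSE regression model: {sse_reg:.2f}  (smaller = better predictions)")
```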
Regression uses predictors to fit a linear association model. Which statistical test applies
depends on whether the independent and dependent variables are continuous or categorical:
linear regression relates a continuous dependent variable to a continuous independent variable,
ANOVA compares a continuous dependent variable across the levels of one or more categorical
independent variables, and logistic regression is used when the dependent variable is
categorical (a small sketch follows below).
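A hedged sketch of how the variable types map onto analyses, with hypothetical toy data: SciPy's linregress and f_oneway for the continuous and group comparisons, and a simple logistic fit via scikit-learn (assuming scikit-learn is installed; the data are made up for illustration).

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Continuous DV ~ continuous IV -> linear regression
x = rng.normal(size=50)
y = 2 + 0.5 * x + rng.normal(scale=0.5, size=50)
print("linear regression slope:", stats.linregress(x, y).slope)

# Continuous DV compared across levels of a categorical IV -> ANOVA
group_a = rng.normal(loc=5.0, size=30)
group_b = rng.normal(loc=6.0, size=30)
group_c = rng.normal(loc=5.5, size=30)
print("ANOVA:", stats.f_oneway(group_a, group_b, group_c))

# Categorical (binary) DV -> logistic regression
x2 = rng.normal(size=(80, 1))
y2 = (x2[:, 0] + rng.normal(scale=0.8, size=80) > 0).astype(int)
print("logistic coefficient:", LogisticRegression().fit(x2, y2).coef_)
```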
Statistical inference: our data are just a sample from a much bigger population and might differ
from the population by chance, so we use statistical testing to check that our result and
conclusion are not likely due to lucky sampling. The sampling distribution is the distribution
of the many means found in many different samples taken from the population, and it is centred
around the population mean. But as researchers we can only work with ONE sample, which is why
we use the SE: a theoretical estimate of how much a sample estimate would deviate across
repeated random samples drawn from that population. The SE is the standard deviation of the
sampling distribution of the mean (the many means found in the many hypothetical samples), and
it indicates how far we expect the sample mean to be off from the population mean. Central
limit theorem: if samples are large enough, the sampling distribution is approximately normal
with SE = s/√N (illustrated in the simulation below).
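A small simulation sketch of the sampling distribution and SE = s/√N, using a hypothetical skewed population to show the central limit theorem at work; all numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=2.0, size=100_000)    # skewed, clearly non-normal

n = 50                                                   # size of each sample
draws = rng.choice(population, size=(5_000, n))          # 5000 hypothetical samples
sample_means = draws.mean(axis=1)                        # the sampling distribution

one_sample = rng.choice(population, size=n)
se_from_one_sample = one_sample.std(ddof=1) / np.sqrt(n) # SE = s / sqrt(N)

print(f"SD of the 5000 sample means (empirical SE): {sample_means.std(ddof=1):.3f}")
print(f"SE estimated from one single sample:        {se_from_one_sample:.3f}")
# Both are close, and the sample means are roughly normal even though the
# population is skewed: that is the central limit theorem.
```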
The same logic holds when we test a linear association. The test statistic also tells you the
effect size and its uncertainty, expressed in units on a known distribution. Degrees of freedom
shape that distribution: they represent the amount of available information relative to the
information estimated, and usually equal the number of observations minus the number of
parameters estimated.
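A tiny sketch of the "observations minus parameters" rule, assuming a simple linear regression (two estimated parameters: intercept and slope); the sample size is hypothetical.

```python
from scipy import stats

n_observations = 25
n_parameters = 2                      # intercept and slope in simple regression
df = n_observations - n_parameters    # degrees of freedom shape the t-distribution

t_crit = stats.t.ppf(0.975, df=df)    # two-sided 5% critical value for this df
print(f"df = {df}, critical t = {t_crit:.2f}")
```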
Regression: the relationship between two quantitative variables: when the independent variable
increases, what happens to the dependent one, and what are the magnitude and direction of the
association?
Option A describes it well but cannot be used to make predictions, which is why option C is the
best option. The purpose of regression is to find the line that best fits the points, minimizing
the squared error between the points and the line. In formula form: y = a + b*x + e, where a is
the intercept (the average y when x = 0), b is the slope (the effect of x on y), and e is the
error or residual, the difference between the observed and predicted value of the dependent
variable. The purpose of drawing a regression line is to fit a statistical model to our data and
describe the linear pattern of association between x and y, and we can use the formula to
predict the DV when the IV is known. If we change the units, the slope coefficient changes
accordingly, but the relation stays the same; in general, linear regression is insensitive to
linear transformations. For nonlinear transformations, such as a logarithmic transformation, the
relation is affected and the conclusion can change. In general, b = r * sy/sx; after
standardizing both predictor x and outcome y, b = r and a = 0 (checked in the sketch below).
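A minimal sketch that fits a least-squares line to hypothetical data and checks the relations above: b = r * sy/sx for the unstandardized variables, and b = r with a = 0 after standardizing both.

```python
import numpy as np
from scipy import stats

# Hypothetical data
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2.3, 2.9, 4.1, 4.5, 5.8, 6.1, 7.4, 7.9, 9.2, 9.6])

fit = stats.linregress(x, y)
r = fit.rvalue
print(f"a (intercept) = {fit.intercept:.3f}, b (slope) = {fit.slope:.3f}")

# b = r * sy / sx (unstandardized variables)
b_from_r = r * y.std(ddof=1) / x.std(ddof=1)
print(f"r * sy/sx = {b_from_r:.3f}  (matches the slope)")

# After standardizing both x and y: b = r and a = 0
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
fit_z = stats.linregress(zx, zy)
print(f"standardized: a = {fit_z.intercept:.3f}, b = {fit_z.slope:.3f}, r = {r:.3f}")
```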