– Introduction –
What statistics does:
Statistics is used to summarize patterns in data and approximate how reality works.
Because real data are noisy and variable, statistics relies on models rather than exact descriptions.
A statistical model is a simplified mathematical description of the data-generating process.
Statistical models:
Definition:
A statistical model is a simplified description of how a variable behaves in the real world.
It does not describe every individual perfectly, but captures the overall pattern in the data.
A model is built from two main components:
Variables:
Quantities that are measured
Vary across individuals
Represent observed data
Example: improvement in fatigue after therapy differs from one patient to another.
Parameters:
Fixed values within the model
Summarize important characteristics of the data
Are estimated from the observed data
Example: the average improvement in fatigue represents the typical therapy effect.
Sample vs population:
Definitions:
Sample:
the data that are actually observed (e.g., patients measured in a study)
Population:
the larger group we want to draw conclusions about (all patients who could receive the therapy).
Because population values are unknown, we use the sample to estimate them.
Key idea:
Model parameters are estimated in the sample.
These estimates are then used to infer values in the population.
Term          Meaning
Variable      Measured, varies across people
Parameter     Constant, estimated from data
Sample        What we observe
Population    What we want to know
Example: improvement in fatigue after therapy
Suppose we measure improvement in fatigue in a group of patients who received cognitive behavioral
therapy (CBT).
Each patient shows a different amount of improvement, so the observed scores vary across
individuals.
To summarize the overall effect of the therapy, we need a single value that represents the typical
improvement.
The mean as a statistical model
The mean improvement can be used as a model of the therapy effect.
→ Represents the typical outcome
→ Serves as the simplest statistical model
The mean does not perfectly describe each individual, but it provides a useful summary of the overall
pattern.
Model fit and error
Model fit
A statistical model makes predictions about the data.
Model fit refers to how closely these predictions match the observed values.
When the model is the mean, the prediction for every individual is the same: the average value.
Error
Because individuals differ, observed values usually deviate from the model prediction.
These deviations are called errors.
Error reflects how well the model represents the data
Some error is unavoidable due to individual variability
Quantifying error
Deviations from the mean:
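For each individual, the deviation from the mean is calculated as:
$$\text{deviation}_i = x_i - \bar{X}$$
where x_i is the observed value for individual i and \bar{X} is the sample mean.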
These deviations show how far each observation lies from the model prediction.
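As a minimal sketch, the deviations can be computed for a small set of made-up improvement scores. The values and the names scores, mean, and deviations are illustrative assumptions, not data from a study.

```python
# Hypothetical fatigue-improvement scores for five patients (illustrative values only).
scores = [4.0, 7.0, 5.0, 9.0, 5.0]

# The model prediction is the same for every patient: the sample mean.
mean = sum(scores) / len(scores)

# Deviation (error) for each patient: observed value minus model prediction.
deviations = [x - mean for x in scores]

print(mean)        # 6.0
print(deviations)  # [-2.0, 1.0, -1.0, 3.0, -1.0]
```

Patients who improved less than average get a negative deviation, and those who improved more get a positive one.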
Sum of squared errors (SS):
To summarize error across all individuals, deviations are squared and summed:
$$SS = \sum_{i=1}^{N} (x_i - \bar{X})^2$$
Squaring the deviations ensures that:
Positive and negative deviations do not cancel each other out
Larger deviations contribute more strongly to total error
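The cancellation problem can be seen in a short sketch. The scores are made-up values used only for illustration, and the variable names are assumptions.

```python
# Hypothetical fatigue-improvement scores (illustrative values only).
scores = [4.0, 7.0, 5.0, 9.0, 5.0]
mean = sum(scores) / len(scores)
deviations = [x - mean for x in scores]

# Raw deviations cancel: negative and positive values add up to zero.
sum_of_deviations = sum(deviations)

# Squared deviations are never negative, so they accumulate into a total error.
sum_of_squares = sum(d ** 2 for d in deviations)

print(sum_of_deviations)  # 0.0
print(sum_of_squares)     # 16.0
```

The largest deviation (3.0) contributes 9.0 of the total 16.0, which illustrates how squaring gives extra weight to large errors.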
Why the mean is the best-fitting value
The mean has a key mathematical property:
→ The mean is the value that minimizes the sum of squared errors.
No other single value produces a smaller total squared deviation from the observed data.
For this reason, the mean is used as the model estimate when summarizing a single variable.
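A short derivation shows why. It treats the single model value as an unknown c (a symbol introduced here for illustration) and looks for the c that makes the sum of squared errors as small as possible.

```latex
\begin{align*}
SS(c) &= \sum_{i=1}^{N} (x_i - c)^2 \\
\frac{d\,SS}{dc} &= -2 \sum_{i=1}^{N} (x_i - c) = 0 \\
\sum_{i=1}^{N} x_i &= N c
  \quad\Rightarrow\quad
  c = \frac{1}{N} \sum_{i=1}^{N} x_i = \bar{X}
\end{align*}
```

The second derivative is 2N, which is positive, so this stationary point is a minimum rather than a maximum.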
Mean squared error (MSE)
The sum of squared errors increases as the number of observations increases.
Because of this, total error is not directly comparable across samples of different sizes.
To describe the average amount of error per observation, the mean squared error (MSE) is
computed:
$$MSE = \frac{SS}{N - 1} = \frac{\sum_{i=1}^{N} (x_i - \bar{X})^2}{N - 1}$$
The mean squared error represents the average squared deviation of observations from the
mean.
Degrees of freedom (N − 1)
When calculating the MSE, the sum of squared errors is divided by N−1 rather than N.
This adjustment is necessary because:
The population mean is unknown
The sample mean is used as an estimate
Estimating the mean uses up one degree of freedom
Dividing by N−1 corrects for this loss of information, so the result is an unbiased estimate of the population variance and can be used for inference about the population.
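The effect of the correction can be checked with a small simulation. This is a sketch under stated assumptions: the population is taken to be normal with mean 0 and variance 1, and the sample size, number of repetitions, and all names are illustrative choices.

```python
import random

random.seed(0)   # fixed seed so the run is repeatable

N = 5            # small sample size, where the correction matters most
REPEATS = 20000  # number of simulated samples


def sum_of_squares(sample):
    """Sum of squared deviations from the sample mean."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample)


# Assumed population: normal with mean 0 and variance 1.
ss_values = [
    sum_of_squares([random.gauss(0.0, 1.0) for _ in range(N)])
    for _ in range(REPEATS)
]

# Average of the two candidate estimates across all simulated samples.
avg_div_n = sum(ss / N for ss in ss_values) / REPEATS
avg_div_n_minus_1 = sum(ss / (N - 1) for ss in ss_values) / REPEATS

print(round(avg_div_n, 2))          # expected to be near 0.8 (too small)
print(round(avg_div_n_minus_1, 2))  # expected to be near 1.0 (the true variance)
```

Dividing by N underestimates the population variance on average by a factor of (N−1)/N, here 4/5. Dividing by N−1 removes that bias.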
Variance
When the model is the mean, the mean squared error is called the variance.
Variance quantifies the spread of the data around the mean, indicating how much individual
observations typically differ from the model prediction.
Larger variance indicates poorer model fit (more dispersion).
Standard deviation
Because variance is expressed in squared units, it is difficult to interpret directly.
The standard deviation is obtained by taking the square root of the variance:
$$s = \sqrt{s^2} = \sqrt{\frac{SS}{N - 1}}$$
The standard deviation represents the typical distance of observations from the mean, expressed in
the original units of measurement.
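Variance and standard deviation can be computed with Python's standard statistics module, whose variance and stdev functions divide by N−1. The scores below are made-up values for illustration.

```python
import math
import statistics

# Hypothetical fatigue-improvement scores (illustrative values only).
scores = [4.0, 7.0, 5.0, 9.0, 5.0]

variance = statistics.variance(scores)  # sum of squared errors divided by N - 1
sd = statistics.stdev(scores)           # sample standard deviation

# The standard deviation is the square root of the variance.
sd_by_hand = math.sqrt(variance)

print(variance)  # 4.0 (squared units)
print(sd)        # 2.0 (original units)
```

With these values, a typical patient's improvement lies about 2 points from the mean improvement of 6.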
Standard deviation and distribution shape
A large standard deviation indicates a wide, spread-out distribution
A small standard deviation indicates a narrow, tightly clustered distribution
Two distributions can have the same mean but very different standard deviations.
Standard deviation in a normal distribution
Empirical rule (normal distribution):
~68% of observations lie within ±1 SD of the mean
~95% lie within ±2 SD
~99.7% lie within ±3 SD
→ This links standard deviation to the probability structure of the data.
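The three percentages follow from the normal distribution itself. A minimal sketch, using the standard identity P(|Z| ≤ k) = erf(k / √2) for a standard normal variable Z; the function name prob_within is an illustrative choice.

```python
import math


def prob_within(k):
    """Probability that a normal variable lies within k SDs of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```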
Transition to inference: What are we estimating?
So far, we have described sample data.
Statistical inference asks: what does this tell us about the population?
What are we estimating?
The goal of inference is to estimate an unknown population parameter using sample data.
Parameter of interest
The population mean is denoted by μ
μ is fixed but unknown
We never observe μ directly
Sample mean as an estimate
The sample mean X̄ is a point estimate of μ
A point estimate provides a single best guess
Why a single mean is not enough
Different samples produce different means.
Therefore:
X̄ varies from sample to sample
A single mean does not express precision or uncertainty
This motivates the need for:
A measure of precision → Standard Error
A way to express uncertainty → Confidence Intervals
How precise is the estimate? (Standard Error)
Key question:
Once we compute a sample mean X̄, we ask:
“How much would this mean change if we took another sample from the same population?”
This question concerns precision, not individual variability.
Variability                                   Precision
How spread out individual observations are    How stable the sample mean is across samples
📌 We are interested in the precision of the estimate, not the spread of raw scores.
Definition
The standard error of the mean (SE or SEM) measures how much the sample mean is expected to
vary from sample to sample.
Interpretation
Smaller SE → more precise estimate of μ
Larger SE → less precise estimate
What SEM is NOT                           What SEM IS
Not variability between individuals       Variability of the mean estimate
Not the same as the standard deviation    A measure of precision
Formula:
$$SE = \frac{s}{\sqrt{N}}$$
Where:
s = sample standard deviation
N = sample size
Key intuition:
Larger N → smaller SE
Smaller SE → more precise estimate of the population mean
As N increases, the sample mean tends to fall closer to μ.
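The relationship between sample size and standard error can be sketched in a few lines. The scores are made-up values, and the helper name standard_error is an illustrative assumption.

```python
import math
import statistics


def standard_error(sample):
    """Standard error of the mean: sample SD divided by the square root of N."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


# Hypothetical fatigue-improvement scores (illustrative values only).
scores = [4.0, 7.0, 5.0, 9.0, 5.0]
se = standard_error(scores)
print(round(se, 3))  # 0.894

# For a fixed SD, quadrupling the sample size halves the standard error.
s = 2.0
se_n25 = s / math.sqrt(25)    # 0.4
se_n100 = s / math.sqrt(100)  # 0.2
print(se_n25, se_n100)
```

Because N sits under a square root, precision improves more slowly than sample size: four times as many patients are needed to halve the SE.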
Why does the sample mean vary?
We have seen that:
Different samples produce different sample means
The standard error quantifies this variability
→ The next question is why this happens and what shape this variability follows.
The Central Limit Theorem (CLT)
Statement of the CLT:
According to the Central Limit Theorem:
For sufficiently large samples (a common rule of thumb is N ≥ 30), the sampling distribution of the sample
mean is approximately normal.
This holds:
Regardless of the shape of the population distribution (provided its variance is finite)
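A simulation can make this concrete. The sketch below assumes a strongly right-skewed population (exponential with mean 1 and SD 1), draws many samples of size 30, and checks whether the sample means behave the way a normal distribution would predict. The population, the number of repetitions, and all names are illustrative choices.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

N = 30          # sample size, matching the rule of thumb
REPEATS = 5000  # number of simulated samples

# Assumed population: exponential with rate 1 (right-skewed, mean 1, SD 1).
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(N)])
    for _ in range(REPEATS)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_se = 1.0 / math.sqrt(N)  # population SD divided by sqrt(N)

# Share of sample means that fall within one standard error of the population mean.
within_one_se = sum(abs(m - 1.0) <= theoretical_se for m in sample_means) / REPEATS

print(round(mean_of_means, 2))   # expected to be near 1.0
print(round(sd_of_means, 3))     # expected to be near theoretical_se
print(round(theoretical_se, 3))  # 0.183
print(round(within_one_se, 2))   # expected to be near 0.68
```

Although individual observations are far from normal, roughly 68% of the sample means land within one standard error of the population mean, as a normal sampling distribution would suggest.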