LECTURE NOTES 8
1 Statistical Inference
A central concern of statistics and machine learning is to estimate properties of an underlying population on the basis of samples. Formally, given a sample,
X1 , . . . , Xn ∼ F,
what can we infer about F ?
To make meaningful inferences about F from samples we typically restrict F in some natural
way. A statistical model is a set of distributions F. Broadly, there are two possibilities:
1. Parametric model: In a parametric model, the set of possible distributions F can
be described by a finite number of parameters. Here are a few examples:
(a) A Gaussian model: This is a simple two-parameter model. Here we suppose that:
F = { f(x; µ, σ) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)) : µ ∈ R, σ > 0 }.
(b) The Bernoulli model: This is a one-parameter model where:
F = { pθ(x) = θ^x (1 − θ)^(1−x) : 0 ≤ θ ≤ 1 }.
2. Non-parametric model: A non-parametric model is one where F cannot be
parameterized by a finite number of parameters. Here are a few popular examples:
(a) Estimating the CDF: Here the model consists of any valid CDF, i.e. a function
that takes values between 0 and 1, is non-decreasing, right-continuous, and equal
to 0 at −∞ and 1 at ∞. We are given samples X1 , . . . , Xn ∼ F and the goal is
to estimate F .
(b) Density estimation: In density estimation, we are given samples X1 , . . . , Xn ∼ fX ,
where fX is an unknown density that we would like to estimate. It turns out that
the class of all possible densities is too big for this problem to be well posed, so
we need to assume some smoothness on the density. A typical assumption is that
the model is given by:
F = { f : ∫ (f′′(x))² dx < ∞, ∫ f(x) dx = 1, f(x) ≥ 0 }.
A short simulation contrasting the parametric and non-parametric settings is sketched just after this list.
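Concretely, here is a minimal Python sketch (assuming NumPy is available; the function ecdf and the variable names are our own illustrative choices, not from the notes). For one sample it fits the two-parameter Gaussian model and also forms the non-parametric empirical CDF:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.5, size=500)  # sample X1, ..., Xn ~ F

    # Parametric: the fitted Gaussian model is summarized by two numbers.
    mu_hat, sigma_hat = x.mean(), x.std(ddof=1)

    # Non-parametric: the empirical CDF keeps (in effect) the whole sample;
    # no finite list of parameters describes it.
    def ecdf(t):
        return np.mean(x <= t)  # fraction of sample points <= t

    print(mu_hat, sigma_hat, ecdf(2.0))

The parametric route compresses the data into two numbers; in the non-parametric route, the estimate of F is the function ecdf itself.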
2 Point Estimation
Point estimation in statistics refers to calculating a single “best guess” of the value of an
unknown quantity of interest. The quantity of interest could be a parameter or, for instance,
a density function. Typically, we will use θ̂ or θ̂n to denote a point estimator. A point
estimator is a function of the data X1 , . . . , Xn :
θ̂n = g(X1 , . . . , Xn ),
so that θ̂n is a random variable. The bias of an estimator is written as:
b(θ̂n ) = Eθ (θ̂n ) − θ.
Similarly, the variance of an estimator is given by:
v(θ̂n ) = Eθ (θ̂n − θ̄n )²,
where θ̄n = Eθ (θ̂n ). The standard error is defined to be se = √(v(θ̂n )).
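For example, in the Bernoulli model with θ̂n = (1/n) Σi Xi , we have Eθ (θ̂n ) = θ, so b(θ̂n ) = 0, while v(θ̂n ) = θ(1 − θ)/n and se = √(θ(1 − θ)/n).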
In the olden days, there was a lot of emphasis on unbiased estimators, and we wanted to find
unbiased estimators with small (or even minimal) variance. In modern statistics, we often use
biased estimators because the reduction in variance often justifies the bias.
We call an estimator of a parameter consistent if the estimator converges to the true
parameter in probability, i.e. for any ε > 0:
Pθ (|θ̂n − θ| ≥ ε) → 0,
as n → ∞. In other words, θ̂n converges to θ in probability, i.e. θ̂n − θ = oP (1).
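Continuing the Bernoulli example, θ̂n = (1/n) Σi Xi is consistent: since it is unbiased, Chebyshev's inequality gives
Pθ (|θ̂n − θ| ≥ ε) ≤ v(θ̂n )/ε² = θ(1 − θ)/(nε²) → 0
as n → ∞, for any fixed ε > 0.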
3 The Bias-Variance Decomposition
One way to assess the quality of an estimator is via its mean squared error:
MSE = Eθ (θ − θ̂n )².
The MSE can be decomposed as the sum of the squared bias and the variance. Writing
θ̄n = Eθ (θ̂n ) as before:
MSE = Eθ (θ − θ̂n )²
= Eθ (θ − θ̄n + θ̄n − θ̂n )²
= (θ − θ̄n )² + 2(θ − θ̄n ) Eθ (θ̄n − θ̂n ) + Eθ (θ̄n − θ̂n )²
= b(θ̂n )² + v(θ̂n ),
where the cross term vanishes because θ − θ̄n is a constant and Eθ (θ̄n − θ̂n ) = 0.
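This identity is easy to check numerically. Below is a minimal Monte Carlo sketch (assuming NumPy; the true value θ = 0.5, the shrinkage factor 0.9, and all variable names are our illustrative choices) comparing the unbiased sample mean with a deliberately biased, shrunken version of it:

    import numpy as np

    rng = np.random.default_rng(0)
    theta, n, reps = 0.5, 20, 100_000

    # Each row is one sample X1, ..., Xn ~ N(theta, 1); take the row means.
    xbar = rng.normal(loc=theta, scale=1.0, size=(reps, n)).mean(axis=1)

    for name, est in [("sample mean", xbar), ("shrunken mean", 0.9 * xbar)]:
        bias = est.mean() - theta
        var = est.var()
        mse = np.mean((est - theta) ** 2)
        # By the decomposition, bias**2 + var matches mse (up to floating point);
        # both approach the true MSE as reps grows.
        print(f"{name}: bias^2 + var = {bias ** 2 + var:.5f}, MSE = {mse:.5f}")

Here the shrunken estimator has nonzero bias but smaller variance, and for this particular θ and n it attains a smaller MSE than the unbiased sample mean, illustrating the trade-off mentioned in Section 2. Whether shrinkage helps depends on the true θ.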