A Solution Manual and Notes for:
The Elements of Statistical Learning
by Jerome Friedman, Trevor Hastie,
and Robert Tibshirani
John L. Weatherwax∗
David Epstein†
Introduction
The Elements of Statistical Learning is an influential and widely studied book in the fields of
machine learning, statistical inference, and pattern recognition. It is a standard recom-
mended text in many graduate courses on these topics. It is also very challenging, particularly
if one faces it without the support of teachers who are expert in the subject matter. For
various reasons, both authors of these notes felt the need to understand the book well, and
therefore to produce notes on the text when we found the text difficult at first reading,
and answers to the exercises. Gaining understanding is time-consuming and intellectually
demanding, so it seemed sensible to record our efforts in LaTeX, and make it available on
the web to other readers. A valuable by-product of writing up mathematical material, is that
often one finds gaps and errors in what one has written.
Now it is well-known in all branches of learning, but in particular in mathematical learning, that
the way to learn is to do, rather than to read. It is all too common to read throughsome
material, convince oneself that one understands it well, but then find oneself at sea when trying
to apply it in even a slightly different situation. Moreover, material understood only at a shallow
level is easily forgotten.
It is therefore our strong recommendation that readers of the book should not look at our
responses to any of the exercises before making a substantial effort to understand it without
this aid. Any time spent in this way, even if it ends without success, will make our solutions
∗
†
1
,easier to understand and more memorable. Quite likely you will find better solutions than those
provided here—let us know when you do!
As far as teachers of courses based on the text are concerned, our notes may be regarded as
a disadvantage, because it may be felt that the exercises in the book can no longer be used
as a source of homework or questions for tests and examinations. This is a slight conflict
of interest between teachers of courses on the one hand, and independent learners on the other
hand. Finding ourselves in the ranks of the independent learners, the position we takeis hardly
surprising. Teachers of courses might benefit by comparing their own solutions with ours, as
might students in classes and independent learners. If you, the reader, finda problem difficult
and become stuck, our notes might enable you to unstick yourself. In addition, there is a myriad
of materials on any of the topics this book covers. A search for “statistical learning theory” on
Google (as of 1 January 2010) gave over 4 million hits. The possible disadvantage of not being
able to use the book’s problems in an academic courseis not really such a large one. Obtaining
additional supplementary problems at the right level for an academic course is simply a matter
of a visit to the library, or a little time spent surfing the net. And, of course, although one is not
supposed to say this, teachers of courses can also get stuck.
We are hoping that people will find errors, offer insightful suggestions, and generally improve
the understanding of the text, the exercises and of our notes, and that they write to us about
this. We will incorporate additions that we think are helpful. Our plan is to gradually add
material as we work through the book. Comments and criticisms are always welcome.
We use the numbering found in the on-line (second edition) version of this text. The first
edition should be similar but may have some differences.
2
,Chapter 2 (Overview of Supervised Learning)
Statistical Decision Theory
We assume a linear model: that is we assume y = f (x) + ε, where ε is a random variable with
mean 0 and variance σ2, and f (x) = xT β. Our expected predicted error (EPE) under the
squared error loss is
∫
EPE(β) = (y − xT β)2Pr(dx, dy) . (1)
We regard this expression as a function of β, a column vector of length p + 1. In order to find
the value of β for which it is minimized, we equate to zero the vector derivative with respect
to β. We have
∫ ∫
∂EPE
= 2 (y − xT β) (−1)x Pr(dx, dy) = −2 (y − xT β)xPr(dx, dy) . (2)
∂β
Now this expression has two parts. The first has integrand yx and the second has integrand
(xT β)x.
Before proceeding, we make a quick general remark about matrices. Suppose that A, B and
C are matrices of size 1×p matrix, p ×
1 and q 1×respectively, where p and q are positive
integers. Then AB can be regarded as a scalar, and we have (AB)C = C(AB), each of these
expressions meaning that each component of C is multiplied by the scalar AB. If q > 1,
the expressions BC, A(BC) and ABC are meaningless, and we must avoid writing them.
On the other hand, CAB is meaningful as a product of three matrices, and the result is the
q × 1 matrix (AB)C = C(AB) = CAB. In our situation we obtain (xT β)x = xxT β.
From ∂EPE/∂β = 0 we deduce
E[yx] − E[xxT β] = 0 (3)
for the particular value of β that minimizes the EPE. Since this value of β is a constant, it
can be taken out of the expectation to give
β = E[xxT ]−1E[yx] , (4)
which gives Equation 2.16 in the book.
We now discuss some points around Equations 2.26 and 2.27. We have
β̂ = (X T X)−1 X T y = (X T X)−1 X T (Xβ + ε) = β + (X T X)−1 X T ε.
So
ŷ0 = xT β̂ = xT β + xT (X T X)−1 X T ε. (5)
0 0 0
This immediately gives
ΣN
ŷ0 = x0T β + i=1
3
, ℓi(x0)εi
4
The Elements of Statistical Learning
by Jerome Friedman, Trevor Hastie,
and Robert Tibshirani
John L. Weatherwax∗
David Epstein†
Introduction
The Elements of Statistical Learning is an influential and widely studied book in the fields of
machine learning, statistical inference, and pattern recognition. It is a standard recom-
mended text in many graduate courses on these topics. It is also very challenging, particularly
if one faces it without the support of teachers who are expert in the subject matter. For
various reasons, both authors of these notes felt the need to understand the book well, and
therefore to produce notes on the text when we found the text difficult at first reading,
and answers to the exercises. Gaining understanding is time-consuming and intellectually
demanding, so it seemed sensible to record our efforts in LaTeX, and make it available on
the web to other readers. A valuable by-product of writing up mathematical material, is that
often one finds gaps and errors in what one has written.
Now it is well-known in all branches of learning, but in particular in mathematical learning, that
the way to learn is to do, rather than to read. It is all too common to read throughsome
material, convince oneself that one understands it well, but then find oneself at sea when trying
to apply it in even a slightly different situation. Moreover, material understood only at a shallow
level is easily forgotten.
It is therefore our strong recommendation that readers of the book should not look at our
responses to any of the exercises before making a substantial effort to understand it without
this aid. Any time spent in this way, even if it ends without success, will make our solutions
∗
†
1
,easier to understand and more memorable. Quite likely you will find better solutions than those
provided here—let us know when you do!
As far as teachers of courses based on the text are concerned, our notes may be regarded as
a disadvantage, because it may be felt that the exercises in the book can no longer be used
as a source of homework or questions for tests and examinations. This is a slight conflict
of interest between teachers of courses on the one hand, and independent learners on the other
hand. Finding ourselves in the ranks of the independent learners, the position we takeis hardly
surprising. Teachers of courses might benefit by comparing their own solutions with ours, as
might students in classes and independent learners. If you, the reader, finda problem difficult
and become stuck, our notes might enable you to unstick yourself. In addition, there is a myriad
of materials on any of the topics this book covers. A search for “statistical learning theory” on
Google (as of 1 January 2010) gave over 4 million hits. The possible disadvantage of not being
able to use the book’s problems in an academic courseis not really such a large one. Obtaining
additional supplementary problems at the right level for an academic course is simply a matter
of a visit to the library, or a little time spent surfing the net. And, of course, although one is not
supposed to say this, teachers of courses can also get stuck.
We are hoping that people will find errors, offer insightful suggestions, and generally improve
the understanding of the text, the exercises and of our notes, and that they write to us about
this. We will incorporate additions that we think are helpful. Our plan is to gradually add
material as we work through the book. Comments and criticisms are always welcome.
We use the numbering found in the on-line (second edition) version of this text. The first
edition should be similar but may have some differences.
2
,Chapter 2 (Overview of Supervised Learning)
Statistical Decision Theory
We assume a linear model: that is we assume y = f (x) + ε, where ε is a random variable with
mean 0 and variance σ2, and f (x) = xT β. Our expected predicted error (EPE) under the
squared error loss is
∫
EPE(β) = (y − xT β)2Pr(dx, dy) . (1)
We regard this expression as a function of β, a column vector of length p + 1. In order to find
the value of β for which it is minimized, we equate to zero the vector derivative with respect
to β. We have
∫ ∫
∂EPE
= 2 (y − xT β) (−1)x Pr(dx, dy) = −2 (y − xT β)xPr(dx, dy) . (2)
∂β
Now this expression has two parts. The first has integrand yx and the second has integrand
(xT β)x.
Before proceeding, we make a quick general remark about matrices. Suppose that A, B and
C are matrices of size 1×p matrix, p ×
1 and q 1×respectively, where p and q are positive
integers. Then AB can be regarded as a scalar, and we have (AB)C = C(AB), each of these
expressions meaning that each component of C is multiplied by the scalar AB. If q > 1,
the expressions BC, A(BC) and ABC are meaningless, and we must avoid writing them.
On the other hand, CAB is meaningful as a product of three matrices, and the result is the
q × 1 matrix (AB)C = C(AB) = CAB. In our situation we obtain (xT β)x = xxT β.
From ∂EPE/∂β = 0 we deduce
E[yx] − E[xxT β] = 0 (3)
for the particular value of β that minimizes the EPE. Since this value of β is a constant, it
can be taken out of the expectation to give
β = E[xxT ]−1E[yx] , (4)
which gives Equation 2.16 in the book.
We now discuss some points around Equations 2.26 and 2.27. We have
β̂ = (X T X)−1 X T y = (X T X)−1 X T (Xβ + ε) = β + (X T X)−1 X T ε.
So
ŷ0 = xT β̂ = xT β + xT (X T X)−1 X T ε. (5)
0 0 0
This immediately gives
ΣN
ŷ0 = x0T β + i=1
3
, ℓi(x0)εi
4