A GUIDE AND SOLUTION MANUAL TO THE ELEMENTS OF STATISTICAL
LEARNING
by
JAMES CHUANBING MA
Under the direction of WILLIAM MCCORMICK
ABSTRACT
This Master's thesis provides R code and graphs that reproduce some of the figures in the book The Elements of Statistical Learning. Selected topics are also outlined and summarized to make them more readable. Additionally, it covers some of the solutions to the problems in Chapters 2, 3, and 4.
INDEX WORDS:
Elements of Statistical Learning, Solution Manual, Guide, ESL Guide
A GUIDE AND SOLUTION MANUAL TO THE ELEMENTS OF STATISTICAL
LEARNING
By
JAMES CHUANBING MA
B.S., Emory University, 2008
A Thesis Submitted to the Graduate Faculty of The University of Georgia in Partial
Fulfillment of the Requirements for the Degree
MASTER OF SCIENCE
ATHENS, GEORGIA
2014
© 2014
James Chuanbing Ma
All Rights Reserved
A GUIDE AND SOLUTION MANUAL TO THE ELEMENTS OF STATISTICAL
LEARNING
by
JAMES CHUANBING MA
Electronic Version Approved:
Julie Coffield, Interim Dean of the Graduate School
The University of Georgia
December 2014
Major Professor:
William McCormick
Committee:
Jaxk Reeves
Kim Love-Myers
TABLE OF CONTENTS

CHAPTER
1 INTRODUCTION
2 OVERVIEW OF SUPERVISED LEARNING
2.1 Introduction
2.2 Mathematical Notation
2.3 Least Squares and Nearest Neighbors
2.4 Statistical Decision Theory
2.5 Problems and Solutions
3 LINEAR METHODS FOR REGRESSION
3.1 Introduction
3.2 Linear Regression Models and Least Squares
3.3 Shrinkage Methods
3.4 Methods Using Derived Input Directions
3.5 Problems and Solutions
4 LINEAR METHODS FOR CLASSIFICATION
4.1 Introduction
4.2 Linear Discriminant Analysis
4.3 Problems and Solutions
REFERENCES
CHAPTER 1
INTRODUCTION
The Elements of Statistical Learning is a popular book on data mining and machine learning written by three statistics professors at Stanford. The book is intended for researchers in the field and for people who want to build robust machine learning libraries, and it is therefore inaccessible to many people who are new to the field.
On Amazon, roughly 1 out of 5 reviewers find the book too difficult to read, and some go as far as to call it a "disaster". There are many instances of the expressions "it is easy to show" or "the exercise is left to the reader", and for the novice reader these steps are often not so easy to show. My goal in writing this thesis is to provide clarity to the book for novice readers. Therefore, my goal is not to reproduce the book, but to act as a supplement that covers parts of the book the authors gloss over. In particular, my contributions are as follows:
• Derive results that the book states without proof ("it is easy to show").
• Provide solutions to exercises in a form that can be understood by the novice reader.
• Provide R code that the reader can copy and paste.
With that said, there is a level requirement for reading this thesis. Specifically, I assume that the reader has taken courses in calculus, probability, linear algebra, and linear regression. I will assume that the reader is comfortable with calculus topics such as gradients, derivatives, and vector spaces. For probability, an understanding of random variables, probability mass functions, cumulative distribution functions, expectation, variance and covariance, and basic multivariate distributions is required. For linear algebra, the reader should have a solid understanding of basic matrix operations, inverses, norms, determinants, eigenvalues and eigenvectors, and definiteness of matrices. And for linear regression, the reader should be familiar with sum-of-squared errors, least squares estimation of parameters, and general hypothesis testing.
This thesis is an introduction and covers Chapters 2 (Overview of Supervised Learning), 3 (Linear Regression), and 4 (Classification). An updated copy with Chapters 7 (Model Validation), 8 (Model Inference), 9 (Additive Models), and 10 (Boosted Trees) is planned for mid-January.
This manual is in no way complete and will be an ongoing project after graduation. Please refer to the site below if you are interested in new updates or in contributing to the project. Suggestions and comments for improvement are always welcome!
http://eslsolutionmanual.weebly.com
CHAPTER 2
OVERVIEW OF SUPERVISED LEARNING
2.1 Introduction
This section goes over mathematical notation, least squares and nearest neighbors,
statistical decision theory, and the bias-variance decomposition.
2.2 Mathematical Notation
The mathematical notation adopted in this guide is identical to the one used in the book
and is summarized below.
• We bold matrices: $\mathbf{X} \in \mathbb{R}^{n\times p}$ is a matrix with $n$ rows and $p$ columns.
• We denote the $j$th column of the matrix $\mathbf{X}$ as $\mathbf{x}_j$, so that $\mathbf{X} = [\,\mathbf{x}_1 \;\; \mathbf{x}_2 \;\; \cdots \;\; \mathbf{x}_p\,]$.
• We denote the $i$th row of the matrix $\mathbf{X}$ as $\mathbf{x}_i^T$, so that the rows of $\mathbf{X}$ are $\mathbf{x}_1^T, \mathbf{x}_2^T, \ldots, \mathbf{x}_n^T$.
• Generic vectors are capitalized ($X$) and observed vectors are in lowercase ($x$).
• The $i$th observation of a vector $x$ is denoted $x_i$, so that $x = (x_1, x_2, \ldots, x_n)^T$.
• We denote by $Y$ a quantitative output, which takes values in the real numbers.
• We denote by $G$ a qualitative output.
In the text there are many terms that are not explained, so we will try to define them here. We denote by $x_i$ the input variables (also called features) and by $y_i$ the output variable (also called the target) that we are trying to predict. A training example is the pair $(x_i, y_i)$, and the training set is a list of $n$ training examples $\{(x_i, y_i) : i = 1, 2, \ldots, n\}$, often denoted $T$ in the text.
Given a training set, our goal is to learn a function (also called a model) $f: X \to Y$ that is "good" at mapping the inputs to the corresponding outputs. We will define what "good" means in Section 2.4. Supervised learning refers to learning such a function from labeled data.
2.3 Least Squares and Nearest Neighbors
On page 12, in Equation (2.6), the author provides the unique solution for the coefficient vector $\beta$ as
$$\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T y. \tag{1}$$
Recall that in linear regression we find the solution for the parameter vector $\beta$ by minimizing the sum-of-squared errors,
$$RSS(\beta) = \sum_{i=1}^{N}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2. \tag{2}$$
Now define the vector $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$ so that $\beta \in \mathbb{R}^{p+1}$. It is important to note that we have included the intercept term in this vector, and thus each observed vector $x_i$ carries an implicit 1 as its first element,
$$x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T \;\to\; (1, x_{i1}, x_{i2}, \ldots, x_{ip})^T, \tag{3}$$
after including the intercept term. Then we can rewrite the above quantity in vector form as
$$RSS(\beta) = \sum_{i=1}^{N}\big(y_i - x_i^T\beta\big)^2. \tag{4}$$
To get the above quantity in matrix form, we introduce the design matrix $\mathbf{X} \in \mathbb{R}^{N\times(p+1)}$, which contains the training input vectors along its rows, and the vector $y \in \mathbb{R}^{N}$ of training labels:
$$\mathbf{X} = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}.$$
Then we can easily show that
$$y - \mathbf{X}\beta = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} - \begin{bmatrix} x_1^T\beta \\ x_2^T\beta \\ \vdots \\ x_N^T\beta \end{bmatrix} = \begin{bmatrix} y_1 - x_1^T\beta \\ y_2 - x_2^T\beta \\ \vdots \\ y_N - x_N^T\beta \end{bmatrix},$$
so that we can rewrite the residual sum-of-squared errors in matrix form as
$$RSS(\beta) = (y - \mathbf{X}\beta)^T(y - \mathbf{X}\beta) \tag{5}$$
$$= y^Ty - 2\beta^T\mathbf{X}^Ty + \beta^T\mathbf{X}^T\mathbf{X}\beta. \tag{6}$$
Now, to minimize the function, set the derivative to zero:
$$\frac{\partial RSS(\beta)}{\partial\beta} = -2\mathbf{X}^Ty + 2\mathbf{X}^T\mathbf{X}\beta = 0 \tag{7}$$
$$\Rightarrow\; \mathbf{X}^T\mathbf{X}\beta = \mathbf{X}^Ty \tag{8}$$
$$\Rightarrow\; \hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^Ty, \tag{9}$$
and if $\mathbf{X}$ is full rank, we get the unique solution for $\beta$ shown in (1).
It is important to point out that for linear regression our model is $f(x_i) = x_i^T\beta$, so that the prediction at an arbitrary point $x_0$ is $\hat{y}_0 = x_0^T\hat{\beta}$.
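As a quick numerical check of (9), the following R sketch (added here for illustration; the simulated data and variable names are arbitrary, not from the book) computes $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^Ty$ directly and compares it with the coefficients returned by lm().

# Numerical check of the normal equations (illustrative, simulated data)
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2*x1 - 3*x2 + rnorm(n)
X <- cbind(1, x1, x2)                      # design matrix with intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # (X^T X)^{-1} X^T y
beta_hat
coef(lm(y ~ x1 + x2))                      # should match beta_hat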
On page 14, Equation (2.8) states that the k-nearest-neighbors model has the form
$$\hat{Y}(x) = \frac{1}{k}\sum_{x_i \in N_k(x)} y_i, \tag{10}$$
where $N_k(x)$ is the neighborhood of $x$ defined by the $k$ closest points $x_i$ in the training sample. The word "closest" implies a metric, and here it is the Euclidean distance. More formally, define $D = \{d(x_i, x) : x_i \in T\}$, where $d(x_i, x)$ is any metric; for the Euclidean distance it is $d(x_i, x) = \|x_i - x\|_2$. Then the neighborhood is
$$N_k(x) = \{x_i \in T : d(x_i, x) \le d_k\}, \quad \text{where } d_k \text{ is the } k\text{th smallest element of } D.$$
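To make the definition concrete, here is a short R sketch (an illustration, not taken from the thesis; the data and the function name knn_predict are made up) that forms $N_k(x)$ with the Euclidean metric and returns the k-nearest-neighbor average.

# k-nearest-neighbor prediction at a query point x0 (illustrative sketch)
knn_predict <- function(X, y, x0, k) {
  d   <- sqrt(rowSums(sweep(X, 2, x0)^2))   # Euclidean distances d(x_i, x0)
  idx <- order(d)[1:k]                      # indices of the k closest training points
  mean(y[idx])                              # average their responses
}
set.seed(2)
X <- matrix(rnorm(200), ncol = 2)
y <- X[, 1] + rnorm(100)
knn_predict(X, y, x0 = c(0, 0), k = 15)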
2.4 Statistical Decision Theory
On page 18, Equation (2.9) of the book defines the squared error loss function as
$$L(Y, f(X)) = (Y - f(X))^2, \tag{11}$$
which gives us the following expected prediction error:
$$EPE(f) = E\big[(Y - f(X))^2\big] \tag{12}$$
$$= \int\!\!\int [y - f(x)]^2 \Pr(y, x)\,dy\,dx. \tag{13}$$
Now we will fill in the steps that are skipped in the book. Recall that we can factor the joint density as $\Pr(X, Y) = \Pr(Y|X)\Pr(X)$. Then
$$EPE(f) = \int_x\!\int_y [y - f(x)]^2 \Pr(y|x)\Pr(x)\,dy\,dx = E_X\,E_{Y|X}\big([Y - f(X)]^2 \,\big|\, X\big). \tag{14}$$
Notice that by conditioning on $X$ we can minimize the inner expectation pointwise in $x$, and since the quantity $[Y - f]^2$ is convex in $f$, there is a unique minimizer. We can now minimize to solve for $f$:
$$f(x) = \arg\min_f E_{Y|X}\big([Y - f]^2 \,\big|\, X = x\big) \tag{15}$$
$$\Rightarrow\; \frac{\partial}{\partial f}\int [y - f]^2 \Pr(y|x)\,dy = \int \frac{\partial}{\partial f}[y - f]^2 \Pr(y|x)\,dy = 0 \tag{16, 17}$$
$$\Rightarrow\; -2\int y\,\Pr(y|x)\,dy + 2f\int \Pr(y|x)\,dy = 0 \tag{18}$$
$$\Rightarrow\; 2E[Y|X = x] = 2f \tag{19}$$
$$\Rightarrow\; f = E[Y|X = x]. \tag{20}$$
Thus we get Equation (2.13) from the book.
Let's continue down this path and work out the problem from page 20: in Equation (2.18) the author replaces the squared loss function with the absolute loss function, $E[|Y - f(X)|]$. The equation from the book is
$$\hat{f}(x) = \operatorname{median}(Y | X = x), \tag{21}$$
and the expected prediction error for absolute loss is handled almost identically to that for squared loss. If you become confused with this example, refer back to the squared loss derivation for the first few steps. We can write the expected prediction error for absolute loss as
$$EPE(f) = E[|Y - f(X)|] \tag{22}$$
$$= \int\!\!\int |y - f(x)|\,\Pr(y, x)\,dy\,dx; \tag{23}$$
then, recalling that we can factor the joint density, we rewrite the above quantity as
$$EPE(f) = E_X\,E_{Y|X}\big(|Y - f(X)| \,\big|\, X\big), \tag{24}$$
and again it suffices to minimize the inner conditional expectation pointwise:
$$f(x) = \arg\min_f E_{Y|X}\big(|Y - f| \,\big|\, X = x\big) \tag{25}$$
$$\Rightarrow\; \frac{\partial}{\partial f}\int |y - f|\,\Pr(y|x)\,dy = 0. \tag{26}$$
From here the integration is a little delicate and, handled rigorously, requires measure-theoretic arguments. For this reason we will give an approximation that does not address the smaller details. By the law of large numbers,
$$\int |y - f|\,\Pr(y|x)\,dy = \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}|Y_i - f| \approx \frac{1}{n}\sum_{i=1}^{n}|Y_i - f|, \tag{27}$$
so we have an approximation that converges when $n$ is large; for this example we will work with the last term in the above equation. Notice that the absolute value is piecewise,
$$|Y_i - f| = \begin{cases} Y_i - f, & Y_i - f > 0 \\ f - Y_i, & Y_i - f < 0 \\ 0, & Y_i = f, \end{cases} \tag{28}$$
so that its derivative with respect to $f$ is not continuous at zero:
$$\frac{\partial}{\partial f}|Y_i - f| = \begin{cases} -1, & Y_i - f > 0 \\ \;\;\,1, & Y_i - f < 0 \\ \;\;\,0, & Y_i = f. \end{cases} \tag{29}$$
Now we can introduce the sign function to make the definition clearer,
$$\operatorname{sign}(x) = \begin{cases} \;\;\,1, & x > 0 \\ -1, & x < 0 \\ \;\;\,0, & \text{otherwise}, \end{cases} \tag{30}$$
so that substituting it back in we get
$$\frac{\partial}{\partial f}\,\frac{1}{n}\sum_{i=1}^{n}|Y_i - f| = \frac{1}{n}\sum_{i=1}^{n} -\operatorname{sign}(Y_i - f) = 0 \tag{31, 32}$$
$$\Rightarrow\; \sum_{i=1}^{n}\operatorname{sign}(Y_i - f) = 0. \tag{33}$$
At what value of $f$ does the above equation hold? It holds when there is an equal number of positive and negative terms; that is, where
$$\operatorname{card}(\{i : Y_i - f > 0\}) = \operatorname{card}(\{i : Y_i - f < 0\}). \tag{34}$$
The value of $f$ where this is true is the median. Recall that the median can be found by sorting a finite list of numbers from lowest to highest and picking the middle one. When there is an odd number of observations, a single number divides the set (e.g., the median of $\{2,3,4,5,6\}$ is 4). When there is an even number, a whole range of values divides the set (e.g., for $\{2,4,6,8\}$ any value $f \in (4,6)$ works). In conclusion, we have shown that $\hat{f}(x) = \operatorname{median}(Y|X = x)$.
On page 24, Equation (2.25) is nested in Equation (2.27), so we will derive Equation (2.27). The derivation is a little tedious and algebraic, but it is an important one. The expected (squared) prediction error at an arbitrary point $x_0$ is
$$EPE(x_0) = E_{y_0|x_0} E_T (y_0 - \hat{y}_0)^2. \tag{35}$$
Here $T$ is the training set that we defined earlier, $y_0|x_0$ and $y_0$ are identical quantities representing the response conditioned on $x_0$, and $\hat{y}_0$ is our prediction at $x_0$, that is, $\hat{y}_0 = \hat{f}(x_0)$, where $\hat{f}$ is our estimate of the true model $f$ fitted on the training data. We assume the model
$$Y = f(X) + \epsilon, \tag{36}$$
where $\epsilon \sim N(0, \sigma^2)$ and the errors are independent. To start the proof, we use a little trick and insert the true function $f(x_0)$:
$$EPE(x_0) = E_{y_0|x_0} E_T (y_0 - \hat{y}_0)^2 = E_{y_0|x_0} E_T \big((y_0 - f(x_0)) + (f(x_0) - \hat{y}_0)\big)^2. \tag{37}$$
Notice that by subtracting and adding $f(x_0)$, the value of the expression does not change. It is also important to point out that the model $\hat{f}(x_0) = \hat{y}_0$ is fitted on the training set by our definition in the introduction, so the only term above that depends on $T$ is $\hat{y}_0$. We now expand the square:
$$= E_{y_0|x_0} E_T\big[(y_0 - f(x_0))^2\big] + 2\,E_{y_0|x_0} E_T\big[(y_0 - f(x_0))(f(x_0) - \hat{y}_0)\big] + E_{y_0|x_0} E_T\big[(f(x_0) - \hat{y}_0)^2\big]. \tag{38}$$
The first quantity, $E_{y_0|x_0} E_T[(y_0 - f(x_0))^2]$, is independent of the training set, so we can reduce it to
$$E_{y_0|x_0}\big[(y_0 - f(x_0))^2\big] = \int (y_0 - f(x_0))^2 \Pr(y_0|x_0)\,dy_0, \tag{39}$$
where $E[y_0] = f(x_0)$ is the true mean. Notice that this is just the conditional variance, which we specified to be $\sigma^2$.
For the second quantity, we can expand the middle term:
$$E_{y_0|x_0} E_T\big[(y_0 - f(x_0))(f(x_0) - \hat{y}_0)\big] = E_{y_0|x_0} E_T\big[y_0 f(x_0) - y_0\hat{y}_0 - f(x_0)^2 + f(x_0)\hat{y}_0\big] \tag{40}$$
and, by linearity of expectation,
$$= E_{y_0|x_0} E_T[y_0 f(x_0)] - E_{y_0|x_0} E_T[y_0\hat{y}_0] - E_{y_0|x_0} E_T[f(x_0)^2] + E_{y_0|x_0} E_T[f(x_0)\hat{y}_0]. \tag{41}$$
We stated that only $\hat{y}_0$ depends on $T$. Additionally, $f(x_0) = E[y_0]$ is a constant, so we can reduce the above quantity to
$$= f(x_0)E_{y_0|x_0}[y_0] - E_{y_0|x_0}\big[y_0 E_T[\hat{y}_0]\big] - f(x_0)^2 + f(x_0)E_{y_0|x_0}E_T[\hat{y}_0]. \tag{42}$$
For the first term we have the relationship
$$E_{y_0|x_0}[y_0] = \int y_0 \Pr(y_0|x_0)\,dy_0 = f(x_0); \tag{43}$$
for the second term,
$$E_{y_0|x_0}\big[y_0 E_T[\hat{y}_0]\big] = E_T[\hat{y}_0]\,E_{y_0|x_0}[y_0] = E_T[\hat{y}_0]f(x_0); \tag{44}$$
and for the last term,
$$f(x_0)E_{y_0|x_0}E_T[\hat{y}_0] = f(x_0)E_T[\hat{y}_0]. \tag{45}$$
Adding these together, the first and third quantities cancel and the second and last quantities cancel, so the entire middle term equals 0.
Now we are left with the last term, $E_{y_0|x_0}E_T[(f(x_0) - \hat{y}_0)^2]$, which we can expand as
$$= E_{y_0|x_0}E_T[f(x_0)^2] - 2E_{y_0|x_0}E_T[f(x_0)\hat{y}_0] + E_{y_0|x_0}E_T[\hat{y}_0^2]. \tag{46}$$
We are going to use another identity here, namely
$$E_T\big[(f(x_0) - E_T[\hat{y}_0])^2\big] = f(x_0)^2 - 2f(x_0)E_T[\hat{y}_0] + \big(E_T[\hat{y}_0]\big)^2, \tag{47}$$
and substitute it back into the expression above to get
$$= E_T\big[(f(x_0) - E_T[\hat{y}_0])^2\big] + E_T[\hat{y}_0^2] - \big(E_T[\hat{y}_0]\big)^2 \tag{48, 49}$$
$$= \mathrm{Bias}^2(\hat{f}) + \mathrm{Var}(\hat{f}). \tag{50}$$
(The outer $E_{y_0|x_0}$ drops out because nothing inside these terms depends on $y_0$.)
Recall that $\hat{y}_0 = \hat{f}(x_0)$, so we use the two interchangeably. The first quantity is called the squared bias of an estimator (not to be confused with the vernacular "bias", meaning being partial to something). Our estimator here is the approximating function $\hat{f}$, and the bias is the difference between the expected value of our estimator and the true function. Often it is desirable to have an unbiased estimator but, as we will soon see, this is not always the best choice. The variance is the familiar quantity from basic probability, $\mathrm{Var}(\hat{y}_0) = E[\hat{y}_0^2] - E[\hat{y}_0]^2$, and tells us how much the fitted function varies across different training sets. Notice also that the third quantity above is the same one as in Equation (2.5), so that we have solved for $MSE(\hat{f})$ here.
Recapping, the expected squared prediction error at $x_0$ equals
$$EPE(x_0) = \sigma^2 + \mathrm{Bias}^2(\hat{f}) + \mathrm{Var}(\hat{f}), \tag{51}$$
where $\sigma^2$ is the irreducible error. This is a famous result in statistics and is known as the bias-variance decomposition. This completes Equations (2.25) and (2.27).
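A small Monte Carlo sketch in R (added for illustration; the "true" function, noise level, and sample sizes are assumptions made up for the example) makes the decomposition concrete: at a fixed $x_0$ it estimates the irreducible error, squared bias, and variance of a fitted line and compares their sum with the expected prediction error.

# Monte Carlo check of EPE(x0) = sigma^2 + Bias^2 + Var (illustrative sketch)
set.seed(4)
f <- function(x) sin(2 * x)               # assumed "true" regression function
sigma <- 0.5; n <- 30; x0 <- 1; nsim <- 5000
yhat0 <- epe <- numeric(nsim)
for (s in 1:nsim) {
  x <- runif(n, 0, 2)
  y <- f(x) + rnorm(n, 0, sigma)
  fit <- lm(y ~ x)                        # a (biased) linear fit to a nonlinear f
  yhat0[s] <- predict(fit, data.frame(x = x0))
  y0 <- f(x0) + rnorm(1, 0, sigma)        # a fresh test response at x0
  epe[s] <- (y0 - yhat0[s])^2
}
bias2 <- (mean(yhat0) - f(x0))^2
vari  <- var(yhat0)
c(EPE = mean(epe), decomposition = sigma^2 + bias2 + vari)   # should be close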
On page 31, Equation (2.35) gives the log-likelihood of the data under the least squares model as
$$L(\theta) = -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\big(y_i - f_\theta(x_i)\big)^2. \tag{52}$$
We will derive that here. Recall that for least squares we assume the model
$$Y = X^T\beta + \epsilon,$$
where ๐œ– ~ ๐‘(0, ๐œŽ 2 ). Notice that the density of ๐œ–๐‘– is
Pr(๐œ–๐‘– ) =
1
๐œŽ√2๐œ‹
exp (−
๐œ–๐‘– 2
)
2๐œŽ 2
(53)
exp (−
(๐‘ฆ๐‘– − ๐‘ฅ๐‘–๐‘‡ ๐›ฝ)2
)
2๐œŽ 2
(54)
then ๐œ–๐‘– = ๐‘ฆ๐‘– − ๐‘ฅ๐‘– ๐‘‡ ๐›ฝ so that we get the density
Pr(๐‘ฆ๐‘– |๐‘ฅ๐‘– , ๐›ฝ) =
1
๐œŽ√2๐œ‹
where we assume that $x_i$ and $\beta$ are fixed and $y_i$ is random; equivalently, $y_i | x_i \sim N(x_i^T\beta, \sigma^2)$. Now suppose we have our training data $T$ and we want to estimate the parameters. As stated in the text, we would like to find the value of $\beta$ that maximizes the probability of the data. When we write the quantity as $\Pr(y | \mathbf{X}, \beta)$, this treats $\beta$ as fixed; instead we will call it the likelihood and write
$$L(\beta) = \Pr(y | \mathbf{X}, \beta), \tag{55}$$
now viewed as a function of $\beta$. Since we assume that our training examples are drawn independently, we can write the likelihood of the data as
$$L(\beta) = \prod_{i=1}^{N} p(y_i | x_i, \beta) \tag{56}$$
$$= \prod_{i=1}^{N}\frac{1}{\sigma\sqrt{2\pi}}\exp\Big(-\frac{(y_i - x_i^T\beta)^2}{2\sigma^2}\Big). \tag{57}$$
Any maximizer of a monotone increasing transformation of the likelihood is also a maximizer of the likelihood itself: given a monotone increasing function $g$, $L(\beta_1) > L(\beta_2)$ for $\beta_1, \beta_2$ in the domain of $L$ implies $g(L(\beta_1)) > g(L(\beta_2))$. A common transformation used in statistics is the log transformation, since it turns the product in the likelihood into a sum in the log-likelihood. Applying it, we get
$$\log(L(\beta)) = \ell(\beta) = \sum_{i=1}^{N}\log\Big(\frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{(y_i - x_i^T\beta)^2}{2\sigma^2}\Big)\Big) \tag{58}$$
$$= N\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - x_i^T\beta)^2 \tag{59}$$
$$= -\frac{N}{2}\log 2\pi - N\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - x_i^T\beta)^2, \tag{60}$$
which gives us Equation (2.35) from the book. Maximizing the above quantity over $\beta$ is the same as minimizing $\sum_{i=1}^{N}(y_i - x_i^T\beta)^2$, since the other terms do not involve $\beta$ and multiplying by $-1$ turns a maximization problem into a minimization. Notice the striking fact here: the least squares loss function for linear regression arises from maximum likelihood estimation under Gaussian errors.
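As a sanity check (an added sketch, not from the book; the simulated data are arbitrary), the R code below maximizes the Gaussian log-likelihood (60) numerically with optim() and compares the resulting coefficients with the least squares fit from lm().

# Maximum likelihood for the Gaussian linear model vs. least squares (sketch)
set.seed(5)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2*x1 - x2 + rnorm(n, sd = 1.5)
X  <- cbind(1, x1, x2)
negloglik <- function(theta) {            # theta = (beta_0, beta_1, beta_2, log sigma)
  beta <- theta[1:3]; sigma <- exp(theta[4])
  -sum(dnorm(y, mean = X %*% beta, sd = sigma, log = TRUE))
}
fit <- optim(c(0, 0, 0, 0), negloglik, method = "BFGS")
rbind(mle = fit$par[1:3], least_squares = coef(lm(y ~ x1 + x2)))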
To conclude this section, we illustrate the bias-variance tradeoff by reproducing Figure 2.11 from the book using nearest neighbors.
Figure 2.1. Bias-Variance Tradeoff.
The left side of Figure 2.1 corresponds to a large value of $k$ in nearest neighbors, which has low variance but high bias. The right side shows that as $k \to 1$ the prediction error on the training set falls to zero, but the error on the test set is high, so the model fails to generalize well.
Table 2.1. R Code for Figure 2.1.
library(FNN) # For nearest neighbor regression
library(lattice)
library(latticeExtra)
library(gridExtra)
# Color palette: http://www.cookbook-r.com/Graphs/Colors_%28ggplot2%29/
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442",
"#0072B2", "#D55E00", "#CC79A7")
# Define mean squared error
mse <- function(x, y) { mean((y - x)^2) }
# Generate random data
x1 <- rnorm(1000)
x2 <- x1 + rnorm(1000)
x3 <- rnorm(1000)
y <- x1 + rnorm(1000)
data <- data.frame(x1, x2, x3, y)
# Train/test (50/50) split
train <- data[1:500, ]
test <- data[501:1000, ]
# Predictor matrices (the response is not used as a feature)
train_x <- train[, c("x1", "x2", "x3")]
test_x <- test[, c("x1", "x2", "x3")]
# Instantiate error vectors
train_err <- NULL
test_err <- NULL
# Find train/test error for varying values of k (decreasing k = increasing complexity)
for (i in 15:1)
{
# Nearest neighbor regression fits
nearest_train <- knn.reg(train = train_x, test = train_x, y = train$y, k = i)
nearest_test <- knn.reg(train = train_x, test = test_x, y = train$y, k = i)
# MSE
train_err <- append(train_err, mse(nearest_train$pred, train$y))
test_err <- append(test_err, mse(nearest_test$pred, test$y))
}
# Plot
a1 <- xyplot(train_err ~ 1:15,
xlab = list(label = 'Model Complexity', cex = 0.7),
ylab = list(label = 'Prediction Error', cex = 0.7),
main = list(label = 'Training and Test Error', cex = 0.75),
scales = list(x = list(draw = FALSE), y = list(draw = FALSE)),
panel = function(...) {
panel.lines(train_err, col = cbPalette[6])
})
a2 <- xyplot(test_err ~ 1:15,
panel = function(...) {
panel.lines(test_err, col = cbPalette[7])
})
grid.arrange(a1 + a2, ncol = 1)
2.5 Problems and Solutions
Exercise 2.1. Suppose each of K classes has an associated target $t_k$, which is a vector of all zeros except for a one in the $k$th position. Show that classifying to the largest element of $\hat{y}$ amounts to choosing the closest target, $\min_k \|t_k - \hat{y}\|$, if the elements of $\hat{y}$ sum to one.
Proof: This is a classification problem defined in the book. Suppose we have a set $T$ with $K$ elements (or classes) that form the standard basis of $\mathbb{R}^K$; that is, $T = \{(1,0,\ldots,0)^T, (0,1,\ldots,0)^T, \ldots, (0,0,\ldots,1)^T\}$ and $t_k \in T$. In this problem $\hat{y}$ is a $K$-dimensional vector whose $i$th element $\hat{y}_i$ estimates the probability that the observation belongs to class $i$. It is not clear which model $\hat{y}$ is generated from, but presumably it is a regression, since $\hat{y}$ is continuous. We will not place any assumptions on the model.
To solve the problem, note that minimizing the norm is equivalent to minimizing its square:
$$\min_k \|t_k - \hat{y}\|^2 = \min_k \sum_{i=1}^{K}(t_{k,i} - \hat{y}_i)^2 = \min_k \sum_{i=1}^{K}\big(t_{k,i}^2 - 2t_{k,i}\hat{y}_i + \hat{y}_i^2\big).$$
Notice that for the first term, $t_{k,i}^2$ equals 1 when $i = k$ and 0 otherwise, so $\sum_i t_{k,i}^2 = 1$ for every value of $k$. Likewise, the last term $\sum_i \hat{y}_i^2$ is independent of $k$, so it is constant with respect to $k$. Finally, the middle term satisfies $t_{k,i}\hat{y}_i = \hat{y}_i$ when $i = k$ and 0 otherwise, so it is the only term that varies with $k$. We can therefore minimize over the middle term alone:
$$\min_k \sum_{i=1}^{K} -2t_{k,i}\hat{y}_i = \min_k\,(-2\hat{y}_k) \;\Leftrightarrow\; \min_k\,(-\hat{y}_k),$$
and multiplying by $-1$ turns the minimization into a maximization,
$$\Leftrightarrow\; \max_k \hat{y}_k,$$
so we have shown that classifying to the largest element of $\hat{y}$ amounts to choosing the closest target. $\blacksquare$
Exercise 2.2. Show how to compute the Bayes decision boundary for the simulation example in Figure 2.5.
Proof: For this example the data have two classes, and each class was generated from its own mixture of Gaussians: the class density is an equally weighted mixture of 10 Gaussians $N(m_k, \mathbf{I}/5)$, where the 10 means $m_k$ were themselves generated from a bivariate Gaussian $N((0,1)^T, \mathbf{I})$.
The generation process is described below; the two classes are identical except for the first step.
Class 1:
1. Generate 10 means $m_k$ from a bivariate Gaussian $N((0,1)^T, \mathbf{I})$.
2. Draw 100 samples as follows:
a. For each observation, select one of the $m_k$ with probability 1/10.
b. Generate the sample from the bivariate Gaussian $N(m_k, \mathbf{I}/5)$.
Class 2:
1. Generate 10 means $n_i$ from a bivariate Gaussian $N((1,0)^T, \mathbf{I})$.
2. Draw 100 samples as follows:
a. For each observation, select one of the $n_i$ with probability 1/10.
b. Generate the sample from the bivariate Gaussian $N(n_i, \mathbf{I}/5)$.
Recall that the Bayes classifier assigns each point to the most probable class under the conditional distribution $\Pr(G|X)$. Hence the decision boundary is the set of points that partitions the input space into one region per class; on the decision boundary itself, the output label is ambiguous. Thus the optimal Bayes decision boundary is the set of points at which the most probable class is tied between two or more classes. In our example there are only two classes, so $\operatorname{card}(G) = 2$ and we can write
$$\text{Boundary} = \{x : \Pr(g|X = x) = \Pr(k|X = x)\} = \Big\{x : \frac{\Pr(g|X = x)}{\Pr(k|X = x)} = 1\Big\}.$$
We can rewrite this ratio using Bayes' rule:
$$\frac{\Pr(g|X = x)}{\Pr(k|X = x)} = \frac{\Pr(X = x|g)\Pr(g)/\Pr(X = x)}{\Pr(X = x|k)\Pr(k)/\Pr(X = x)} = \frac{\Pr(X = x|g)\Pr(g)}{\Pr(X = x|k)\Pr(k)} = 1.$$
Because we have 100 points in each class, $\Pr(g) = \Pr(k)$, so the priors cancel and the boundary is $\{x : \Pr(X = x|g) = \Pr(X = x|k)\}$. In this example we know the generating density: each class-conditional density is an equally weighted mixture of 10 bivariate Gaussians with covariance $\mathbf{I}/5$, so
$$\Pr(X = x\,|\,g) = \frac{1}{10}\sum_{k=1}^{10}\frac{5}{2\pi}\exp\Big(-\frac{5}{2}\|x - m_k\|^2\Big),$$
and similarly for the other class with means $n_i$. The constants $1/10$ and $5/(2\pi)$ are common to both classes and cancel, so the Bayes decision boundary is
$$\text{Boundary} = \Big\{x : \sum_{k=1}^{10}\exp\Big(-\frac{5}{2}\|x - m_k\|^2\Big) = \sum_{i=1}^{10}\exp\Big(-\frac{5}{2}\|x - n_i\|^2\Big)\Big\}.$$
The exact boundary for Figure 2.5 depends on the particular means that were generated; in practice it is computed numerically by evaluating both mixture densities on a grid and tracing the contour where they are equal.
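The R sketch below (an added illustration under the simulation settings above; it is not the book's original code, and the grid range is arbitrary) generates the class means, evaluates both mixture densities on a grid, and draws the zero contour of their difference, which is the Bayes boundary.

# Numerical Bayes decision boundary for the two-class Gaussian mixture (sketch)
set.seed(6)
m  <- cbind(rnorm(10, 0, 1), rnorm(10, 1, 1))   # 10 means for class 1, from N((0,1)^T, I)
n2 <- cbind(rnorm(10, 1, 1), rnorm(10, 0, 1))   # 10 means for class 2, from N((1,0)^T, I)
mix_density <- function(x1, x2, means, sd = sqrt(1/5)) {
  # equally weighted mixture of 10 bivariate Gaussians with covariance I/5
  rowMeans(sapply(1:nrow(means), function(k)
    dnorm(x1, means[k, 1], sd) * dnorm(x2, means[k, 2], sd)))
}
g1 <- seq(-3, 4, length.out = 200)
g2 <- seq(-3, 4, length.out = 200)
grid <- expand.grid(x1 = g1, x2 = g2)
dens_diff <- mix_density(grid$x1, grid$x2, m) - mix_density(grid$x1, grid$x2, n2)
contour(g1, g2, matrix(dens_diff, 200, 200), levels = 0, drawlabels = FALSE)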
Exercise 2.4. The edge effect problem discussed on page 23 is not peculiar to uniform sampling from bounded domains. Consider inputs drawn from a spherical multinormal distribution $X \sim N(0, \mathbf{I}_p)$. The squared distance from any sample point to the origin has a $\chi^2_p$ distribution with mean $p$. Consider a prediction point $x_0$ drawn from this distribution, and let $a = x_0/\|x_0\|$ be an associated unit vector. Let $z_i = a^T x_i$ be the projection of each of the training points on this direction.
1. Show that the $z_i$ are distributed $N(0,1)$.
2. Show that the target point has expected squared distance $p$ from the origin.
Hence for $p = 10$, a randomly drawn test point is about 3.1 standard deviations from the origin, while all the training points are on average one standard deviation along direction $a$. So most prediction points see themselves as lying on the edge of the training set.
Proof: There is a well-known property stating that a linear combination of mutually independent normal random variables is itself normal. We will not derive that result here, but it can easily be found in a probability textbook. That is, if
$$z = \sum_{i=1}^{p} a_i x_i,$$
where the $x_i \sim N(0,1)$ are independent, then the mean is $E[z] = \sum_{i=1}^{p} a_i E[x_i]$, the variance is $\mathrm{Var}(z) = \sum_{i=1}^{p} a_i^2 \mathrm{Var}(x_i)$, and $z \sim N(E[z], \mathrm{Var}(z))$.
For (1), $z_i = a^T x_i$, so the expectation is
$$E[z_i] = E[a^T x_i] = a^T E[x_i] = a^T 0 = 0$$
by linearity, since $a$ is constant and $0$ is the $p\times 1$ vector of zeros. Notice that $a$ is treated as constant even though it is randomly drawn: once we draw it, we use the same value throughout, so it is no longer random. For the variance,
$$\mathrm{Var}(a^T x_i) = a^T \mathrm{Var}(x_i)\, a = a^T \mathbf{I}_p\, a = \|a\|_2^2 = \frac{x_0^T x_0}{\|x_0\|_2^2} = \frac{\|x_0\|_2^2}{\|x_0\|_2^2} = 1,$$
by the property of variance under linear maps. Therefore $z_i$ has mean 0 and variance 1, and since $z_i = a^T x_i$ is a linear combination of independent normal random variables, $z_i$ is itself normally distributed. We have shown that $z_i \sim N(0,1)$, which completes this part of the problem.
For (2), the "target" point is the prediction point $x_0$ (or, equivalently, any draw $X$ from the same distribution); we show it has expected squared distance $p$ from the origin. Since $X$ is a $p\times 1$ random vector generated from $N(0, \mathbf{I}_p)$, its squared distance from the origin can be written conveniently as $X^T X = \sum_{j=1}^{p} X_j^2$. The components $X_j$ are uncorrelated and, being jointly Gaussian, therefore independent, each with mean 0 and variance 1. Hence
$$X^T X = \sum_{j=1}^{p} X_j^2 \sim \chi^2_p,$$
which is distributed $\chi^2_p$ with mean $p$.
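A quick simulation in R (added for illustration; the sample sizes are arbitrary) supports both parts for $p = 10$: the projections $z_i$ behave like $N(0,1)$ draws, and the squared distance of a test point from the origin averages $p$, so its distance is about $\sqrt{10} \approx 3.1$ "standard deviations".

# Simulation check for Exercise 2.4 (illustrative sketch), p = 10
set.seed(7)
p <- 10; n <- 1000
X  <- matrix(rnorm(n * p), n, p)        # training points ~ N(0, I_p)
x0 <- rnorm(p)                          # a prediction point from the same distribution
a  <- x0 / sqrt(sum(x0^2))              # unit vector in the direction of x0
z  <- X %*% a                           # projections z_i = a^T x_i
c(mean_z = mean(z), var_z = var(as.vector(z)))        # approximately 0 and 1
mean(rowSums(matrix(rnorm(5000 * p), ncol = p)^2))    # approximately p = 10
sqrt(p)                                               # about 3.16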
Exercise 2.5. Derive Equation (2.27), where the expected prediction error at a point $x_0$ is
$$EPE(x_0) = E_{y_0|x_0} E_T (y_0 - \hat{y}_0)^2$$
$$= \mathrm{Var}(y_0|x_0) + E_T\big[\hat{y}_0 - E_T \hat{y}_0\big]^2 + \big[E_T \hat{y}_0 - x_0^T\beta\big]^2$$
$$= \mathrm{Var}(y_0|x_0) + \mathrm{Var}_T(\hat{y}_0) + \mathrm{Bias}^2(\hat{y}_0)$$
$$= \sigma^2 + E_T\, x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0\,\sigma^2 + 0^2. \tag{2.27}$$
Proof: In Section 2.4 of this guide we showed the derivation of the first three lines. For the last line, we assume that the relationship between $Y$ and $X$ is linear,
$$Y = X^T\beta + \epsilon,$$
where $\epsilon \sim N(0, \sigma^2)$, and that the model is fit by least squares. The first term $\mathrm{Var}(y_0|x_0)$ is the conditional variance, and since $\epsilon$ is distributed $N(0, \sigma^2)$ it equals $\sigma^2$. Least squares is unbiased under the linearity assumption, so the last term is 0.
This leaves the second term, which we can write as
$$\mathrm{Var}_T(\hat{y}_0) = \mathrm{Var}_T(x_0^T\hat{\beta}) = x_0^T\,\mathrm{Var}_T(\hat{\beta})\,x_0.$$
Here we used the multivariate analogue of the scalar rule for squaring constants under the variance, namely $\mathrm{Cov}(\mathbf{A}x + a) = \mathbf{A}\,\mathrm{Cov}(x)\,\mathbf{A}^T$ for $x \in \mathbb{R}^p$. Keep in mind that the $x_i$'s are held fixed and $y$ is random. Now we only need $\mathrm{Var}_T(\hat{\beta})$. Recall that
$$\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T y \quad\text{and}\quad y = \mathbf{X}\beta + \epsilon,$$
which implies
$$\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\beta + \epsilon) = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\beta + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\epsilon.$$
The first term, $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\beta = \beta$, is non-random, so the variance is
$$\mathrm{Var}_T(\hat{\beta}) = \mathrm{Var}\big((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\epsilon\big) = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\,\mathrm{Var}(\epsilon)\,\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}.$$
Since $\mathrm{Var}(\epsilon) = \sigma^2\mathbf{I}$,
$$= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\sigma^2 = (\mathbf{X}^T\mathbf{X})^{-1}\sigma^2,$$
which implies
$$\mathrm{Var}_T(\hat{y}_0) = x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0\,\sigma^2,$$
and since none of the remaining quantities are random with respect to the training data, this is equivalent to $E_T\,x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0\,\sigma^2$, which concludes the problem.
2. Derive Equation (2.28), making use of the cyclic property of the trace operator [$\mathrm{trace}(AB) = \mathrm{trace}(BA)$] and its linearity (which allows us to interchange the order of trace and expectation):
$$E_{x_0} EPE(x_0) \sim E_{x_0}\, x_0^T\mathrm{Cov}(X)^{-1}x_0\,\sigma^2/N + \sigma^2$$
$$= \mathrm{trace}\big[\mathrm{Cov}(X)^{-1}\mathrm{Cov}(x_0)\big]\sigma^2/N + \sigma^2$$
$$= \sigma^2\Big(\frac{p}{N}\Big) + \sigma^2, \tag{2.28}$$
which is the formula for the expected prediction error.
Our assumption here is that the matrix $\mathbf{X}$ has been centered so that each column has mean 0. Then, by definition, the covariance matrix of a random vector $X$ is
$$\mathrm{Cov}(X) = E[(X - \mu)(X - \mu)^T] = E[XX^T] - \mu\mu^T,$$
where $\mu = 0$ here since the matrix is centered. The sample covariance matrix is then
$$\widehat{\mathrm{Cov}}(X) = \frac{\mathbf{X}^T\mathbf{X}}{N}.$$
To see this, note that the $(j, \ell)$ entry of $\mathbf{X}^T\mathbf{X}/N$ is $\frac{1}{N}\sum_{i=1}^{N} x_{ij}x_{i\ell}$, which is exactly the sample covariance of the (centered) $j$th and $\ell$th columns, so that
$$\frac{\mathbf{X}^T\mathbf{X}}{N} = \begin{bmatrix} \widehat{\mathrm{Cov}}(X_1, X_1) & \widehat{\mathrm{Cov}}(X_1, X_2) & \cdots & \widehat{\mathrm{Cov}}(X_1, X_p) \\ \widehat{\mathrm{Cov}}(X_2, X_1) & \widehat{\mathrm{Cov}}(X_2, X_2) & \cdots & \widehat{\mathrm{Cov}}(X_2, X_p) \\ \vdots & \vdots & \ddots & \vdots \\ \widehat{\mathrm{Cov}}(X_p, X_1) & \widehat{\mathrm{Cov}}(X_p, X_2) & \cdots & \widehat{\mathrm{Cov}}(X_p, X_p) \end{bmatrix}.$$
So, for the first line, if $N$ is large and $E(X) = 0$, then $\mathbf{X}^T\mathbf{X} \to N\,\mathrm{Cov}(X)$.
For the second line, $x_0^T\mathrm{Cov}(X)^{-1}x_0$ is a scalar and hence equal to its own trace, so by linearity we can interchange expectation and trace,
$$E_{x_0}\,\mathrm{trace}\big[x_0^T\mathrm{Cov}(X)^{-1}x_0\big] = \mathrm{trace}\big[E_{x_0}\,x_0^T\mathrm{Cov}(X)^{-1}x_0\big],$$
and then use the cyclic property of the trace operator to rearrange terms:
$$= \mathrm{trace}\big[E_{x_0}[x_0 x_0^T]\,\mathrm{Cov}(X)^{-1}\big] = \mathrm{trace}\big[\mathrm{Cov}(X)\,\mathrm{Cov}(X)^{-1}\big] = \mathrm{trace}[\mathbf{I}_p] = p.$$
This gives $\mathrm{trace}[\mathrm{Cov}(X)^{-1}\mathrm{Cov}(x_0)]\sigma^2/N = \sigma^2(p/N)$, and so we have finished the problem. It is important to understand the motivation behind the problem: for linear regression models, the extra (average) prediction error beyond $\sigma^2$ grows only like $p/N$, so it is relatively small when $N$ is large compared to $p$.
Exercise 2.7. Suppose we have a sample of $N$ pairs $x_i, y_i$ drawn i.i.d. from the distribution characterized as follows:
$$x_i \sim h(x), \text{ the design density};$$
$$y_i = f(x_i) + \epsilon_i, \text{ where } f \text{ is the regression function};$$
$$\epsilon_i \sim (0, \sigma^2) \text{ (mean zero, variance } \sigma^2\text{)}.$$
We construct an estimator for $f$ that is linear in the $y_i$,
$$\hat{f}(x_0) = \sum_{i=1}^{N}\ell_i(x_0; X)\,y_i,$$
where the weights $\ell_i(x_0; X)$ do not depend on the $y_i$ but do depend on the entire training sequence of $x_i$, denoted here by $X$.
(a). Show that linear regression and k-nearest-neighbor regression are members of this class of estimators. Describe explicitly the weights $\ell_i(x_0; X)$ in each of these cases.
For linear regression, the coefficient vector $\hat{\beta} = (\hat{\beta}_0, \ldots, \hat{\beta}_p)^T$ depends on the entire training set $\mathbf{X}$ through its derivation
$$\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T y,$$
since each training example is a row of the design matrix $\mathbf{X}$. Then the prediction at a point $x_0$ is
$$\hat{f}(x_0) = x_0^T\hat{\beta} = x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T y.$$
The row vector $x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ multiplies $y$, so the $i$th weight is its $i$th entry,
$$\ell_i(x_0; X) = \big[x_0^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\big]_i = x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_i,$$
which depends on the entire training set but not on the $y_i$.
For k-nearest-neighbor regression, the estimator is
$$\hat{f}(x_0) = \frac{1}{k}\sum_{x_i \in N_k(x_0)} y_i,$$
so the weight is the scaled indicator $\ell_i(x_0; X) = \frac{1}{k}\,\mathbf{1}[x_i \in N_k(x_0)]$, where, as defined earlier, $D = \{d(x_i, x_0) : x_i \in T\}$ for any metric $d$ and
$$N_k(x_0) = \{x_i \in T : d(x_i, x_0) \le d_k\}, \quad d_k \text{ the } k\text{th smallest element of } D.$$
Notice that to construct the set $D$ for any new point $x_0$ we have to search through the entire training set, so the weights depend on all of $X$ but not on the $y_i$.
(b). Decompose the conditional mean-squared error
$$E_{Y|X}\big(f(x_0) - \hat{f}(x_0)\big)^2$$
into a conditional squared bias and a conditional variance component. Let $X$, $Y$ represent the entire training sequence.
Proof: For readability write $f = f(x_0)$ and $\hat{f} = \hat{f}(x_0)$, and recall that only $\hat{f}$ depends on the training set. Also refer back to the bias-variance derivation in Section 2.4, where we solved essentially the same problem. Then
$$E_{Y|X}\big(f - \hat{f}\big)^2 = f^2 - 2f\,E_{Y|X}(\hat{f}) + E_{Y|X}(\hat{f}^2)$$
$$= \big(f - E_{Y|X}(\hat{f})\big)^2 + E_{Y|X}(\hat{f}^2) - \big(E_{Y|X}(\hat{f})\big)^2$$
$$= \mathrm{Bias}^2_{Y|X}(\hat{f}) + \mathrm{Var}_{Y|X}(\hat{f}).$$
(c). Decompose the unconditional mean-squared error.
Proof: Using the same notation, again only $\hat{f}$ depends on the training data, and we get
$$E_{Y,X}\big(f - \hat{f}\big)^2 = f^2 - 2f\,E_{Y,X}(\hat{f}) + E_{Y,X}(\hat{f}^2)$$
$$= \big(f - E_{Y,X}(\hat{f})\big)^2 + E_{Y,X}(\hat{f}^2) - \big(E_{Y,X}(\hat{f})\big)^2$$
$$= \mathrm{Bias}^2_{Y,X}(\hat{f}) + \mathrm{Var}_{Y,X}(\hat{f}).$$
CHAPTER 3
LINEAR METHODS FOR REGRESSION
3.1 Introduction
This section goes over linear regression models, subset selection, shrinkage methods, and
methods using derived input directions.
3.2 Linear Regression Models and Least Squares
On page 47, Equation (3.10) states that for the linear regression model
$$Y = \mathbf{X}\beta + \epsilon,$$
where the error $\epsilon \sim N(0, \sigma^2\mathbf{I})$, the least squares estimate satisfies $\hat{\beta} \sim N(\beta, (\mathbf{X}^T\mathbf{X})^{-1}\sigma^2)$. We will show that this statement holds. Consider the estimate
$$\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T y, \tag{61}$$
where $y$ is defined as above; then
$$\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\beta + \epsilon) \tag{62}$$
$$= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\beta + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\epsilon \tag{63}$$
$$= \beta + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\epsilon. \tag{64}$$
We can now use the property that a linear transformation of a multivariate normal random vector also has a multivariate normal distribution. We won't prove this fact here, but it states that given $X \sim N(\mu, \mathbf{V})$ and a linear transformation $y = A + \mathbf{B}X$, then $y$ also has a multivariate normal distribution with mean $E[y] = A + \mathbf{B}\mu$ and covariance matrix $\mathrm{Var}[y] = \mathbf{B}\mathbf{V}\mathbf{B}^T$. Applying this to (64) with $\epsilon \sim N(0, \sigma^2\mathbf{I})$,
$$E[\hat{\beta}] = \beta + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T 0 = \beta \tag{65}$$
$$\mathrm{Var}(\hat{\beta}) = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\sigma^2\mathbf{I})\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} = (\mathbf{X}^T\mathbf{X})^{-1}\sigma^2 \tag{66}$$
$$\Rightarrow\; \hat{\beta} \sim N\big(\beta, (\mathbf{X}^T\mathbf{X})^{-1}\sigma^2\big), \tag{67}$$
and so we have shown Equation (3.10).
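The sampling distribution in (67) can also be checked by simulation. The R sketch below (illustrative only; the design matrix and σ are made up) holds X fixed, redraws the errors many times, and compares the empirical covariance of the repeated estimates with $\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$.

# Simulation check of beta_hat ~ N(beta, (X^T X)^{-1} sigma^2)  (sketch)
set.seed(8)
n <- 50; sigma <- 2
X <- cbind(1, rnorm(n), rnorm(n))       # fixed design with intercept
beta <- c(1, -2, 0.5)
XtXinv <- solve(t(X) %*% X)
B <- t(replicate(5000, {
  y <- X %*% beta + rnorm(n, 0, sigma)
  as.vector(XtXinv %*% t(X) %*% y)      # beta_hat for this draw
}))
round(colMeans(B), 3)                   # approximately beta
round(cov(B), 3)                        # approximately sigma^2 * (X^T X)^{-1}
round(sigma^2 * XtXinv, 3)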
3.3 Shrinkage Methods
On page 54, Algorithm 3.1 gives the Gram-Schmidt procedure for multiple regression. We will use the Gram-Schmidt procedure to show how it leads to the QR decomposition in Equation (3.31). We will then perform the QR decomposition on a small $3\times 3$ matrix in R and show how it is useful for describing least squares.
The QR decomposition is
$$\mathbf{X} = \mathbf{Q}\mathbf{R}. \tag{68}$$
If $\mathbf{X} \in \mathbb{R}^{n\times(p+1)}$, then $\mathbf{Q} \in \mathbb{R}^{n\times(p+1)}$ has orthonormal columns (i.e., $\mathbf{Q}^T\mathbf{Q} = \mathbf{I}$) and $\mathbf{R} \in \mathbb{R}^{(p+1)\times(p+1)}$ is an upper triangular matrix. It is important to remember that an "orthogonal" matrix in this sense has orthonormal column vectors, not merely orthogonal columns as the name suggests. Orthogonal matrices are often assumed to be square; however, that is not the case in the book.
Recall that the algorithm is as follows:
Table 3.1. Algorithm for Gram-Schmidt Orthogonalization.
1. Initialize $z_0 = x_0 = 1$ and $e_0 = z_0/\|z_0\|$.
2. For $j = 1, 2, \ldots, p$:
a. Compute the coefficients $\hat{\gamma}_{\ell j} = e_\ell^T x_j$ for $\ell = 0, \ldots, j-1$, the residual vector $z_j = x_j - \sum_{\ell=0}^{j-1}\hat{\gamma}_{\ell j}\,e_\ell$, and the unit vector $e_j = z_j/\|z_j\|$.
3. Regress $y$ on the residual $z_p$ to give the estimate of $\hat{\beta}_p$.
Suppose we have $\mathbf{X} \in \mathbb{R}^{n\times(p+1)}$ with columns $x_0, x_1, \ldots, x_p$. Then
$$z_0 = x_0 = 1, \qquad e_0 = z_0/\|z_0\|_2$$
$$z_1 = x_1 - (x_1\cdot e_0)e_0, \qquad e_1 = z_1/\|z_1\|_2$$
$$z_2 = x_2 - (x_2\cdot e_0)e_0 - (x_2\cdot e_1)e_1, \qquad e_2 = z_2/\|z_2\|_2$$
$$\vdots$$
$$z_p = x_p - (x_p\cdot e_0)e_0 - (x_p\cdot e_1)e_1 - \cdots - (x_p\cdot e_{p-1})e_{p-1}, \qquad e_p = z_p/\|z_p\|_2,$$
so that the resulting factorization is
$$\mathbf{X} = [\,x_0\;\; x_1\;\; \cdots\;\; x_p\,] = [\,e_0\;\; e_1\;\; \cdots\;\; e_p\,]\begin{bmatrix} x_0\cdot e_0 & x_1\cdot e_0 & \cdots & x_p\cdot e_0 \\ 0 & x_1\cdot e_1 & \cdots & x_p\cdot e_1 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & x_p\cdot e_p \end{bmatrix} = \mathbf{Q}\mathbf{R}.$$
For example, consider the matrix
$$\mathbf{X} = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 1 \\ 1 & 0 & 1 \end{bmatrix}, \qquad y = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}.$$
We will use this matrix to show that solving least squares by the QR decomposition is equal to solving the normal equations.
Table 3.2. R Code for Gram-Schmidt Orthogonalization.
# Create matrix columns and response
x0 <- c(1, 1, 1)
x1 <- c(1, 1, 0)
x2 <- c(0, 1, 1)
y <- c(1, 2, 3)
X <- data.frame(x0, x1, x2, y)
# Create empty matrices Q and R
Q <- R <- matrix(0, nrow = 3, ncol = 3)
# First iteration
z0 <- x0
e0 <- z0 / sqrt(sum(z0^2))
Q[, 1] <- e0
# Second iteration
z1 <- x1 - sum(x1 * e0) * e0
e1 <- z1 / sqrt(sum(z1^2))
Q[, 2] <- e1
# Third iteration
z2 <- x2 - sum(x2 * e0) * e0 - sum(x2 * e1) * e1
e2 <- z2 / sqrt(sum(z2^2))
Q[, 3] <- e2
# Fill in the upper triangular R matrix
R[1, 1] <- sum(x0 * e0)
R[1, 2] <- sum(x1 * e0)
R[1, 3] <- sum(x2 * e0)
R[2, 2] <- sum(x1 * e1)
R[2, 3] <- sum(x2 * e1)
R[3, 3] <- sum(x2 * e2)
# Check that Q %*% R reproduces X
Q %*% R
# Beta hat via the QR decomposition
solve(R) %*% t(Q) %*% y
# Check using linear regression (x0 is the column of ones, so drop lm's intercept)
lm(y ~ x0 + x1 + x2 - 1, data = X)
The solution for $\hat{\beta}$ under the QR decomposition is
$$\hat{\beta} = \mathbf{R}^{-1}\mathbf{Q}^T y, \tag{69}$$
which gives a coefficient vector identical to the one fit by the R function lm(),
$$\hat{\beta} = \begin{bmatrix} 2 \\ -1 \\ 1 \end{bmatrix}. \tag{70}$$
To go from Equation (3.31), given as
$$\mathbf{X} = \mathbf{Q}\mathbf{R}, \tag{71}$$
to Equation (3.32),
$$\hat{\beta} = \mathbf{R}^{-1}\mathbf{Q}^T y, \tag{72}$$
we can just use basic matrix operations. That is, with $\mathbf{X} = \mathbf{Q}\mathbf{R}$ and $\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T y$,
$$\hat{\beta} = (\mathbf{R}^T\mathbf{Q}^T\mathbf{Q}\mathbf{R})^{-1}\mathbf{R}^T\mathbf{Q}^T y = (\mathbf{R}^T\mathbf{R})^{-1}\mathbf{R}^T\mathbf{Q}^T y = \mathbf{R}^{-1}(\mathbf{R}^T)^{-1}\mathbf{R}^T\mathbf{Q}^T y = \mathbf{R}^{-1}\mathbf{Q}^T y, \tag{73}$$
and so we get Equation (3.32).
By decomposing $\mathbf{X}$ this way, we can save a lot of computation. Recall that the column space of the matrix $\mathbf{X}$ is the set of all possible linear combinations of its column vectors; more formally, for $\mathbf{X} \in \mathbb{R}^{n\times p}$ the column space is $C(\mathbf{X}) = \{v : \mathbf{X}a = v,\; a \in \mathbb{R}^p\}$. The columns of $\mathbf{Q}$ lie in the column space of $\mathbf{X}$, as seen from $\mathbf{X} = \mathbf{Q}\mathbf{R} \Rightarrow \mathbf{X}\mathbf{R}^{-1} = \mathbf{Q}$, and they provide an orthonormal basis for that column space. Recall that the geometric interpretation of the least squares fitted vector is the orthogonal projection of $y$ onto the column space of $\mathbf{X}$, via the projection matrix of $\mathbf{X}$ (also known as the hat matrix). With the QR decomposition we can simply project $y$ onto $\mathbf{Q}$ to get the fitted vector $\hat{y}$, which turns out to be
$$\mathbf{Q}(\mathbf{Q}^T\mathbf{Q})^{-1}\mathbf{Q}^T y = \mathbf{Q}\mathbf{Q}^T y, \tag{74}$$
so we save considerably on computation by never inverting $\mathbf{X}^T\mathbf{X}$. We can check that this is true using Equation (3.32):
$$\hat{\beta} = \mathbf{R}^{-1}\mathbf{Q}^T y \tag{75}$$
$$\Rightarrow\; \mathbf{X}\hat{\beta} = \mathbf{Q}\mathbf{R}\mathbf{R}^{-1}\mathbf{Q}^T y \tag{76}$$
$$= \mathbf{Q}\mathbf{Q}^T y, \tag{77}$$
which is the same as the projection. We will also show in Exercise 3.9 that the QR decomposition can be used to implement the forward selection algorithm efficiently.
On page 64, Equation (3.44) gives the solution for the ridge regression coefficients as
$$\hat{\beta}^{\,ridge} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T y. \tag{78}$$
We will derive this here so that we can then show other properties of ridge regression. Equation (3.43) shows that the loss function for ridge regression is
$$RSS(\lambda) = (y - \mathbf{X}\beta)^T(y - \mathbf{X}\beta) + \lambda\beta^T\beta \tag{79}$$
$$= y^Ty - 2\beta^T\mathbf{X}^Ty + \beta^T\mathbf{X}^T\mathbf{X}\beta + \lambda\beta^T\beta. \tag{80}$$
Setting the gradient to 0 and solving,
$$\nabla_\beta RSS(\lambda) = -2\mathbf{X}^Ty + 2\mathbf{X}^T\mathbf{X}\beta + 2\lambda\beta = 0 \tag{81}$$
$$\Rightarrow\; (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\beta = \mathbf{X}^Ty \tag{82}$$
$$\Rightarrow\; \hat{\beta}^{\,ridge} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^Ty, \tag{83}$$
and so we have shown Equation (3.44).
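A short R sketch (added here as an illustration; the data are simulated and the predictors are only approximately centered) implements (83) directly and confirms that λ = 0 recovers the least squares fit while larger λ shrinks the coefficients toward zero.

# Ridge regression via the closed form (78)/(83)  (illustrative sketch)
set.seed(9)
n <- 100
x1 <- rnorm(n); x2 <- x1 + 0.3 * rnorm(n)      # correlated predictors
y  <- 2 * x1 - x2 + rnorm(n)
X  <- cbind(x1, x2)
ridge <- function(X, y, lambda)
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
cbind(lambda0 = ridge(X, y, 0),                # matches lm(y ~ x1 + x2 - 1)
      lambda1 = ridge(X, y, 1),
      lambda100 = ridge(X, y, 100))            # coefficients shrink toward zero
coef(lm(y ~ x1 + x2 - 1))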
Now we will introduce the singular value decomposition of the centered matrix $\mathbf{X}$. The author does not define "centered," so we make explicit what is meant here: each column is normalized to have mean 0 and variance 1 (i.e., the columns are standardized). That is, we use the following algorithm.
Table 3.3. Algorithm for Centering (Standardizing) the Columns of $\mathbf{X}$.
1. Let $\mu_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$.
2. Replace each column $x_j$ of $\mathbf{X}$ with $x_j - \mu_j$.
3. Let $\sigma_j^2 = \frac{1}{n-1}\sum_{i=1}^{n} x_{ij}^2$ (computed on the centered column).
4. Replace each $x_j$ with $x_j/\sigma_j$.
Once we have done that, we can decompose the matrix $\mathbf{X}$ by its singular value decomposition,
$$\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T. \tag{84}$$
The actual computation of the decomposition is quite involved, so we will not derive it here. $\mathbf{U}$ and $\mathbf{V}$ are $n\times p$ and $p\times p$ orthogonal matrices, with the columns of $\mathbf{U}$ spanning the column space of $\mathbf{X}$ and the columns of $\mathbf{V}$ spanning the row space. $\mathbf{D}$ is a $p\times p$ diagonal matrix with diagonal entries $d_1 \ge d_2 \ge \cdots \ge d_p \ge 0$; if one or more $d_j = 0$, then $\mathbf{X}$ is singular.
We can use the singular value decomposition to represent the matrix $\mathbf{X}^T\mathbf{X}$:
$$\mathbf{X}^T\mathbf{X} = (\mathbf{U}\mathbf{D}\mathbf{V}^T)^T\mathbf{U}\mathbf{D}\mathbf{V}^T = \mathbf{V}\mathbf{D}\mathbf{U}^T\mathbf{U}\mathbf{D}\mathbf{V}^T = \mathbf{V}\mathbf{D}^2\mathbf{V}^T, \tag{85}$$
which is used on page 66 in Equation (3.48), stating that the eigen decomposition of $\mathbf{X}^T\mathbf{X}$ is
$$\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}^2\mathbf{V}^T. \tag{86}$$
To see that this is the eigen decomposition, recall that the sample covariance matrix is $\widehat{\mathrm{Cov}}(X) = \frac{\mathbf{X}^T\mathbf{X}}{N}$, since the matrix has been pre-centered. By the definition of covariance matrices this is symmetric, which implies that $\mathbf{X}^T\mathbf{X}$ is symmetric. One property of symmetric matrices we will use is that their eigenvectors can be taken to be orthonormal; we will not give the proof, but it can be found in any linear algebra textbook. Since the inverse of an orthogonal matrix is its transpose (for an orthogonal matrix $\mathbf{A}$, by definition $\mathbf{A}^T\mathbf{A} = \mathbf{I}$, and multiplying on the right by $\mathbf{A}^{-1}$ gives $\mathbf{A}^T = \mathbf{A}^{-1}$), we can write the eigenvector equation as
$$\mathbf{X}^T\mathbf{X}\mathbf{V} = \mathbf{V}\mathbf{D}^2 \tag{87}$$
$$\Rightarrow\; \mathbf{X}^T\mathbf{X}\mathbf{V}\mathbf{V}^{-1} = \mathbf{V}\mathbf{D}^2\mathbf{V}^{-1} \tag{88}$$
$$\Rightarrow\; \mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}^2\mathbf{V}^T, \tag{89}$$
so we get Equation (3.48). Here the columns of $\mathbf{V}$ are the eigenvectors of $\mathbf{X}^T\mathbf{X}$ and the diagonal entries of $\mathbf{D}^2$ are the eigenvalues.
What happens if we multiply $\mathbf{X}$ by a column $v_i$ of $\mathbf{V}$? Recall that the $d_i$'s in $\mathbf{D}$ are ordered $d_1 \ge d_2 \ge \cdots \ge d_p \ge 0$. Looking at the (sample) variance of $\mathbf{X}v_i$, as in Equation (3.49),
$$\mathrm{Var}(\mathbf{X}v_i) = v_i^T\,\mathrm{Var}(\mathbf{X})\,v_i \tag{90}$$
$$= v_i^T\Big(\frac{\mathbf{X}^T\mathbf{X}}{N}\Big)v_i \tag{91}$$
$$= v_i^T\mathbf{V}\mathbf{D}^2\mathbf{V}^Tv_i/N. \tag{92}$$
Recall that $\mathbf{V}$ is an orthogonal matrix, so $v_i\cdot v_j = 0$ for all $i \ne j$ and $v_i^Tv_i = 1$; hence $\mathbf{V}^Tv_i$ is the $i$th standard basis vector and
$$v_i^T\mathbf{V}\mathbf{D}^2\mathbf{V}^Tv_i/N = d_i^2/N, \tag{93}$$
which gives
$$\mathrm{Var}(\mathbf{X}v_i) = \frac{d_i^2}{N} \tag{94}$$
and completes Equation (3.49). This equation says that when we multiply the design matrix $\mathbf{X}$ by the eigenvectors in $\mathbf{V}$, the resulting variances are proportional to the eigenvalues, and these variances are sorted because the $d_i$'s in $\mathbf{D}$ are sorted. In fact, if we let $z_i = \mathbf{X}v_i$, then $z_i$ has the $i$th largest variance among all normalized linear combinations of the columns of $\mathbf{X}$ subject to being orthogonal to the earlier ones; we will prove this in the next section. Because of this nice property, the columns of $\mathbf{V}$ get a special name: they are the principal component directions of the variables in $\mathbf{X}$, and the $z_i$ are the principal components.
To tie this together, we will derive Equation (3.47) of ridge regression using the singular value decomposition. Showing this requires a number of matrix operations:
$$\mathbf{X}\hat{\beta}^{\,ridge} = \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^Ty$$
$$= \mathbf{U}\mathbf{D}\mathbf{V}^T(\mathbf{V}\mathbf{D}^2\mathbf{V}^T + \lambda\mathbf{I})^{-1}\mathbf{V}\mathbf{D}\mathbf{U}^Ty$$
$$= \mathbf{U}\mathbf{D}\mathbf{V}^T\big(\mathbf{V}(\mathbf{D}^2 + \lambda\mathbf{I})\mathbf{V}^T\big)^{-1}\mathbf{V}\mathbf{D}\mathbf{U}^Ty$$
$$= \mathbf{U}\mathbf{D}\mathbf{V}^T\mathbf{V}(\mathbf{D}^2 + \lambda\mathbf{I})^{-1}\mathbf{V}^T\mathbf{V}\mathbf{D}\mathbf{U}^Ty \tag{95}$$
$$= \mathbf{U}\mathbf{D}(\mathbf{D}^2 + \lambda\mathbf{I})^{-1}\mathbf{D}\mathbf{U}^Ty$$
$$= \sum_{j=1}^{p} u_j\,\frac{d_j^2}{d_j^2 + \lambda}\,u_j^Ty, \tag{96}$$
so we have shown Equation (3.47). Looking at the factor $\frac{d_j^2}{d_j^2 + \lambda}$, we can see what ridge regression does. When $\lambda = 0$ we get the least squares fitted values and the factor is $\frac{d_j^2}{d_j^2} = 1$; for $\lambda > 0$ ridge regression gives a factor $\frac{d_j^2}{d_j^2 + \lambda} < 1$. Recall what the $d_j$'s mean here: the $d_j^2$ are the eigenvalues of $\mathbf{X}^T\mathbf{X}$, and $d_j^2/N$ is the variance of $\mathbf{X}$ in the direction of its $j$th principal component $z_j = \mathbf{X}v_j$. Thus ridge regression shrinks most in the directions with smaller $d_j$, that is, in the principal component directions with smaller variance. To prove this, consider the factor
$$\frac{d_j^2}{d_j^2 + \lambda}. \tag{97}$$
(97)
Since ๐‘‘1 > ๐‘‘2 implies that ๐‘‘12 > ๐‘‘22 since the square function is monotonic for non-negative
values, then we need to show that
๐‘‘22
๐‘‘12 − ๐‘
๐‘‘12
=
<
.
๐‘‘22 + ๐œ† ๐‘‘12 − ๐‘ + ๐œ† ๐‘‘12 + ๐œ†
(98)
Where ๐‘ is some small constant 0 < ๐‘ ≤ ๐‘‘12 and ๐œ† is the parameter in ridge regression ๐œ† > 0.
By cross multiplying the function above, we get the following
(๐‘‘12 − ๐‘)(๐‘‘12 + ๐œ†) < ๐‘‘12 (๐‘‘12 − ๐‘ + ๐œ†)
42
๐‘‘14 + ๐œ†๐‘‘12 − ๐‘๐‘‘12 − ๐‘๐œ† < ๐‘‘14 − ๐‘๐‘‘12 + ๐œ†๐‘‘12
⇒ −๐‘๐œ† < 0
(99)
and since ๐œ†, ๐‘ > 0 this is last inequality is true and so the statement that smaller directions of the
principal components are shrunken the most holds.
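The shrinkage factors can be inspected directly from the SVD. The R sketch below (an added illustration using simulated standardized data) computes $d_j^2/(d_j^2 + \lambda)$ and verifies that $\mathbf{U}\mathbf{D}(\mathbf{D}^2 + \lambda\mathbf{I})^{-1}\mathbf{D}\mathbf{U}^Ty$ reproduces $\mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^Ty$.

# Ridge fitted values via the SVD (illustrative sketch)
set.seed(10)
n <- 60; p <- 4; lambda <- 5
X <- scale(matrix(rnorm(n * p), n, p))          # centered and standardized columns
y <- rnorm(n)
s <- svd(X)                                      # X = U D V^T
d <- s$d
d^2 / (d^2 + lambda)                             # shrinkage factor per principal component
fit_svd   <- s$u %*% diag(d^2 / (d^2 + lambda)) %*% t(s$u) %*% y
fit_ridge <- X %*% solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
max(abs(fit_svd - fit_ridge))                    # essentially zero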
3.4 Methods Using Derived Input Directions
On page 81, Equation (3.63), we see that the $m$th principal component direction $v_m$ solves
$$\max_\alpha \mathrm{Var}(\mathbf{X}\alpha) \quad \text{subject to } \|\alpha\| = 1,\; \alpha^T\mathbf{S}v_\ell = 0,\; \ell = 1, \ldots, m-1. \tag{100}$$
We will show that the solutions $\alpha$ are eigenvectors; that is, we will show $\alpha = v_m$. Recall from the text that the design matrix $\mathbf{X}$ is standardized, so we can write $\mathrm{Var}(\mathbf{X}\alpha) = \alpha^T\mathrm{Var}(\mathbf{X})\alpha = \alpha^T(\mathbf{X}^T\mathbf{X}/N)\alpha$, and maximizing this is equivalent to maximizing $\alpha^T\mathbf{X}^T\mathbf{X}\alpha$. A standard way of solving optimization problems with equality constraints is to fold the constraints into the objective function; this is known as the Lagrangian, and it can be found in most multivariable calculus books. Rewriting the above quantity in Lagrangian form, we get
$$L(\alpha, \lambda) = \alpha^T\mathbf{X}^T\mathbf{X}\alpha - \lambda(\alpha^T\alpha - 1), \tag{101}$$
where $\lambda$ is called the Lagrange multiplier. To find the optimal point, we set the gradient of the Lagrangian to zero and solve for $\alpha$:
$$\nabla_\alpha L(\alpha, \lambda) = 2\mathbf{X}^T\mathbf{X}\alpha - 2\lambda\alpha = 0. \tag{102}$$
Notice that this is the standard eigenvector equation $\mathbf{X}^T\mathbf{X}\alpha = \lambda\alpha$, so the only points that can maximize the objective are eigenvectors of $\mathbf{X}^T\mathbf{X}$, and we showed in the previous section that these are the columns of $\mathbf{V}$ from the SVD. Since $\mathbf{V}$ is an orthogonal matrix and $\mathbf{S} \propto \mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}^2\mathbf{V}^T$, the choice $\alpha = v_m$ also satisfies the second constraint ($\alpha^T\mathbf{S}v_\ell = 0$ for $\ell < m$), so we have solved Equation (3.63).
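To see the eigenvector characterization numerically, the following R sketch (illustrative only; the data are simulated) compares the sample variance of $\mathbf{X}v_1$, where $v_1$ is the leading eigenvector of $\mathbf{X}^T\mathbf{X}$, with the variance of $\mathbf{X}a$ for a few random unit vectors $a$; none of the random directions should beat $v_1$.

# The leading eigenvector of X^T X maximizes Var(X a)  (illustrative sketch)
set.seed(11)
X <- scale(matrix(rnorm(300 * 3), ncol = 3), scale = FALSE)   # centered columns
v1 <- eigen(t(X) %*% X)$vectors[, 1]
var(X %*% v1)                                   # variance along the first PC direction
replicate(5, {
  a <- rnorm(3); a <- a / sqrt(sum(a^2))        # a random unit vector
  var(X %*% a)                                  # never larger than the value above
})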
3.5 Problems and Solutions
Exercise 3.2. Given data on two variables ๐‘‹ and ๐‘Œ, consider fitting a cubic polynomial
regression model ๐‘“(๐‘‹) = ∑_{๐‘—=0}^{3} ๐›ฝ๐‘— ๐‘‹^๐‘— . In addition to plotting the fitted curve, you would like a
95% confidence band about the curve. Consider the following two approaches:
(1). At each point ๐‘ฅ0 , form a 95% confidence interval for the linear function ๐‘Ž๐‘‡ ๐›ฝ = ∑_{๐‘—=0}^{3} ๐›ฝ๐‘— ๐‘ฅ0^๐‘— .
Proof: For this problem, the data matrix ๐— is the ๐‘› × 4 matrix whose ๐‘–th row is (1, ๐‘ฅ๐‘– , ๐‘ฅ๐‘–^2 , ๐‘ฅ๐‘–^3 ),
and ๐›ฝฬ‚ is estimated in the usual least-squares way by setting up the normal equations. Then, to
generate confidence intervals around an estimated mean ๐‘“ฬ‚(๐‘ฅ0 ) = ๐ธ[๐‘ฆฬ‚
0 |๐‘ฅ0 ], use the Wald
confidence interval as follows
95% Confidence Interval(๐‘ฅ0 ) = ๐‘ฅ0๐‘‡ ๐›ฝฬ‚ ± 1.96 ⋅ √๐‘‰๐‘Ž๐‘Ÿ(๐‘ฅ0๐‘‡ ๐›ฝฬ‚ )
where the variance from Exercise 2.5 is as follows
๐‘‰๐‘Ž๐‘Ÿ(๐‘ฅ0๐‘‡ ๐›ฝฬ‚ ) = ๐‘ฅ0๐‘‡ ๐‘‰๐‘Ž๐‘Ÿ(๐›ฝฬ‚ )๐‘ฅ0 = ๐œŽ 2 ๐‘ฅ0๐‘‡ (๐— ๐‘‡ ๐—)−1 ๐‘ฅ0
44
then the interval is
95% Confidence Interval(๐‘ฅ0 ) = ๐‘ฅ0๐‘‡ ๐›ฝฬ‚ ± 1.96 ⋅ ๐œŽ√๐‘ฅ0๐‘‡ (๐— ๐‘‡ ๐—)−1 ๐‘ฅ0 โˆŽ
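A minimal R sketch of approach (1) on simulated data is given below; the data, the grid of ๐‘ฅ0 values, and the use of ๐œŽฬ‚ in place of ๐œŽ are choices made only for this illustration.
# R sketch (simulated data): pointwise 95% Wald band for a cubic polynomial fit
set.seed(1)
x <- runif(50, -2, 2)
y <- 1 + x - 0.5 * x^3 + rnorm(50)
X <- cbind(1, x, x^2, x^3)                                 # n x 4 design matrix
beta_hat <- solve(crossprod(X), crossprod(X, y))
sigma_hat <- sqrt(sum((y - X %*% beta_hat)^2) / (50 - 4))  # estimate of sigma
x0 <- seq(-2, 2, length.out = 100)
X0 <- cbind(1, x0, x0^2, x0^3)
se <- sigma_hat * sqrt(rowSums((X0 %*% solve(crossprod(X))) * X0))  # sqrt of x0' (X'X)^-1 x0
fit <- drop(X0 %*% beta_hat)
band <- cbind(lower = fit - 1.96 * se, upper = fit + 1.96 * se)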
(2). Form a 95% confidence set for ๐›ฝ as in (3.15), which in turn generates confidence
intervals for ๐‘“(๐‘ฅ0 ).
Proof: Equation (3.15) is as follows
C๐›ฝ = {๐›ฝ : (๐›ฝฬ‚ − ๐›ฝ)๐‘‡ ๐— ๐‘‡ ๐—(๐›ฝฬ‚ − ๐›ฝ) ≤ ๐œŽฬ‚ 2 ๐œ’^2_{๐‘+1}(1 − ๐›ผ)}
where C๐›ฝ is the confidence set and ๐œ’^2_{๐‘+1}(1 − ๐›ผ) is the (1 − ๐›ผ)-quantile of the chi-squared
distribution on (๐‘ + 1) degrees of freedom. Here ๐‘ + 1 = 4, so at ๐›ผ = 0.05 we have ๐œ’^2_{4}(0.95) ≈ 9.49 and the
confidence set for this particular ๐›ฝ is
C๐›ฝ = {๐›ฝ : (๐›ฝฬ‚ − ๐›ฝ)๐‘‡ (๐— ๐‘‡ ๐—)(๐›ฝฬ‚ − ๐›ฝ) ≤ 9.49 ⋅ ๐œŽฬ‚ 2 }
which in turn generates a confidence set for the conditional mean at ๐‘ฅ0 ,
{๐‘ฅ0๐‘‡ ๐›ฝ : ๐›ฝ ∈ C๐›ฝ }.
Exercise 3.3. (1). Prove the Gauss–Markov theorem: the least squares estimate of a
parameter ๐‘Ž๐‘‡ ๐›ฝ has variance no bigger than that of any other linear unbiased estimate of ๐‘Ž๐‘‡ ๐›ฝ
(Section 3.2.2).
Proof: Recall that the least squares estimator for ๐›ฝฬ‚ is
๐›ฝฬ‚ = (๐— ๐‘‡ ๐—)−1 ๐— ๐‘‡ ๐‘ฆ.
Suppose we have another unbiased linear estimator ๐›ฝฬƒ so that ๐›ฝฬƒ = ๐€๐‘ฆ and let ๐€ =
(๐— ๐‘‡ ๐—)−1 ๐— ๐‘‡ + ๐‚. Then we get the following
๐›ฝฬƒ = ((๐— ๐‘‡ ๐—)−1 ๐— ๐‘‡ + ๐‚)๐‘ฆ
= (๐— ๐‘‡ ๐—)−1 ๐— ๐‘‡ ๐‘ฆ + ๐‚๐‘ฆ.
Since we assume the linear model ๐‘ฆ = ๐—๐›ฝ + ๐œ€ with ๐ธ[๐œ€] = 0, the following holds: ๐ธ[๐‘ฆ] =
๐—๐›ฝ. Then, use this below to find the expectation of ๐›ฝฬƒ so that
๐ธ[๐›ฝฬƒ] = ๐ธ[(๐— ๐‘‡ ๐—)−1 ๐— ๐‘‡ ๐‘ฆ] + ๐ธ[๐‚๐‘ฆ]
= (๐— ๐‘‡ ๐—)−1 ๐— ๐‘‡ ๐—๐›ฝ + ๐‚๐—๐›ฝ
= (๐ˆ๐‘ + ๐‚๐—)๐›ฝ
then ๐›ฝฬƒ is unbiased if and only if ๐‚๐— = 0.
Then the variance of ๐›ฝฬƒ is as follows
๐‘‰๐‘Ž๐‘Ÿ(๐›ฝฬƒ) = ๐‘‰๐‘Ž๐‘Ÿ(((๐— ๐‘‡ ๐—)−1 ๐— ๐‘‡ + ๐‚)๐‘ฆ)
Let ๐ƒ = ((๐— ๐‘‡ ๐—)−1 ๐— ๐‘‡ + ๐‚), then the above is equal to
= ๐ƒ๐‘‰๐‘Ž๐‘Ÿ(๐‘ฆ)๐ƒ๐‘‡
= ๐œŽ 2 ⋅ ๐ƒ๐ƒ๐‘‡
= ๐œŽ 2 ⋅ ((๐— ๐‘‡ ๐—)−1 ๐— ๐‘‡ + ๐‚)(๐—(๐— ๐‘‡ ๐—)−1 + ๐‚ ๐‘‡ )
= ๐œŽ 2 ⋅ [(๐— ๐‘‡ ๐—)−1 ๐— ๐‘‡ ๐—(๐— ๐‘‡ ๐—)−1 + (๐— ๐‘‡ ๐—)−1 ๐— ๐‘‡ ๐‚ ๐‘‡ + ๐‚๐—(๐— ๐‘‡ ๐—)−1 + ๐‚๐‚ ๐‘‡ ]
And above we stated that ๐‚๐— = 0 so we remove the 2nd and 3rd terms to reduce it to the following
= ๐œŽ 2 ⋅ [(๐— ๐‘‡ ๐—)−1 ๐— ๐‘‡ ๐—(๐— ๐‘‡ ๐—)−1 + ๐‚๐‚ ๐‘‡ ]
= ๐‘‰๐‘Ž๐‘Ÿ(๐›ฝฬ‚ ) + ๐œŽ 2 [๐‚๐‚ ๐‘‡ ]
and ๐‚๐‚ ๐‘‡ is a positive semidefinite matrix since ๐‘Ž๐‘‡ ๐ถ๐ถ ๐‘‡ ๐‘Ž = (๐‘Ž๐‘‡ ๐ถ ⋅ ๐ถ ๐‘‡ ๐‘Ž) ≥ 0, so that
๐‘‰๐‘Ž๐‘Ÿ(๐›ฝฬƒ) > ๐‘‰๐‘Ž๐‘Ÿ(๐›ฝ) holds and we complete the proof.
Exercise 3.4. (1). Show how the vector of least squares coefficients can be obtained from
a single pass of the Gram–Schmidt procedure (Algorithm 3.1). Represent your solution in terms
of the QR decomposition of X.
Proof: After a single pass of the Gram-Schmidt process, we get the following
decomposition
๐— = ๐๐‘
where ๐ is an ๐‘ × (๐‘ + 1) matrix with orthonormal columns and ๐‘ is an (๐‘ + 1) × (๐‘ + 1) upper triangular
matrix. Then by least squares, we get the following equation
๐— ๐‘‡ ๐—๐›ฝ = ๐— ๐‘‡ ๐‘ฆ
⇒ (๐๐‘)๐‘‡ ๐๐‘๐›ฝ = (๐๐‘)๐‘‡ ๐‘ฆ
⇒
๐‘๐‘‡ ๐‘๐›ฝ = ๐‘๐‘‡ ๐๐‘‡ ๐‘ฆ
⇒
๐‘๐›ฝ = ๐๐‘‡ ๐‘ฆ
and we can solve the last equation by back-substitution since R is an upper-triangular matrix.
For example, writing out ๐‘๐›ฝ = ๐๐‘‡ ๐‘ฆ row by row,
[ ๐‘Ÿ00  ๐‘Ÿ01  โ‹ฏ  ๐‘Ÿ0๐‘ ] [ ๐›ฝ0 ]   [ ๐‘ž0๐‘‡ ๐‘ฆ ]
[ 0    ๐‘Ÿ11  โ‹ฏ  ๐‘Ÿ1๐‘ ] [ ๐›ฝ1 ] = [ ๐‘ž1๐‘‡ ๐‘ฆ ]
[ โ‹ฎ         โ‹ฑ  โ‹ฎ   ] [ โ‹ฎ  ]   [ โ‹ฎ    ]
[ 0    0    โ‹ฏ  ๐‘Ÿ๐‘๐‘ ] [ ๐›ฝ๐‘ ]   [ ๐‘ž๐‘๐‘‡ ๐‘ฆ ]
the last equation gives ๐›ฝฬ‚๐‘ = ๐‘ž๐‘๐‘‡ ๐‘ฆ/๐‘Ÿ๐‘๐‘ , and back-substitution then solves for the remaining coefficients
in the order ๐›ฝฬ‚๐‘−1 , … , ๐›ฝฬ‚0 .
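The sketch below illustrates this on simulated data with R's built-in QR routines; qr() uses Householder reflections rather than literal Gram-Schmidt, but the resulting ๐ and ๐‘ serve the same purpose, and backsolve() performs the back-substitution.
# R sketch (simulated data): least squares via the QR decomposition
set.seed(1)
X <- cbind(1, matrix(rnorm(100 * 3), 100, 3))
y <- rnorm(100)
qrX <- qr(X)
Q <- qr.Q(qrX)
R <- qr.R(qrX)
beta_qr <- backsolve(R, crossprod(Q, y))          # solve R beta = Q^T y by back-substitution
beta_ne <- solve(crossprod(X), crossprod(X, y))   # normal equations, for comparison
all.equal(c(beta_qr), c(beta_ne))                 # TRUE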
Exercise 3.5. (1). Consider the ridge regression problem (3.41). Show that this problem
is equivalent to the problem
๐‘
๐‘
ฬ‚๐‘
๐›ฝ
๐‘
2
= arg min
{∑[๐‘ฆ๐‘– − ๐›ฝ0๐‘ − ∑(๐‘ฅ๐‘–๐‘— − ๐‘ฅฬ…๐‘— )๐›ฝ๐‘—๐‘ ]2 + ๐œ† ∑ ๐›ฝ๐‘—๐‘ }
๐›ฝ๐‘
๐‘–=1
๐‘—=1
๐‘—=1
Give the correspondence between ๐›ฝ ๐‘ and the original ๐›ฝ in (3.41).
๐‘
๐‘
๐‘
๐›ฝฬ‚ ๐‘Ÿ๐‘–๐‘‘๐‘”๐‘’ = ๐‘Ž๐‘Ÿ๐‘” ๐‘š๐‘–๐‘› {∑[๐‘ฆ๐‘– − ๐›ฝ0 − ∑ ๐‘ฅ๐‘–๐‘— ๐›ฝ๐‘— ]2 + ๐œ† ∑ ๐›ฝ๐‘—2 }
๐›ฝ
๐‘–=1
๐‘—=1
(3.41)
๐‘—=1
Proof: For the first criterion above, we can expand out the middle summation so that we
get the following
∑_{๐‘—=1}^{๐‘} (๐‘ฅ๐‘–๐‘— − ๐‘ฅฬ…๐‘— )๐›ฝ๐‘—๐‘ = ∑_{๐‘—=1}^{๐‘} ๐‘ฅ๐‘–๐‘— ๐›ฝ๐‘—๐‘ − ∑_{๐‘—=1}^{๐‘} ๐‘ฅฬ…๐‘— ๐›ฝ๐‘—๐‘
and, matching terms with (3.41), we get the correspondence between ๐›ฝ0๐‘ and ๐›ฝ0 ,
๐›ฝ0๐‘ = ๐›ฝ0 + ∑_{๐‘—=1}^{๐‘} ๐‘ฅฬ…๐‘— ๐›ฝ๐‘— .
That is, ๐›ฝ0๐‘ is the intercept ๐›ฝ0 shifted by the term ∑ ๐‘ฅฬ…๐‘— ๐›ฝ๐‘— . With this substitution the two criteria are equal, and
thus all the other coefficients stay the same,
๐›ฝ1๐‘ = ๐›ฝ1 , ๐›ฝ2๐‘ = ๐›ฝ2 , … , ๐›ฝ๐‘๐‘ = ๐›ฝ๐‘ ,
so that ๐›ฝ ๐‘ has the same slopes as ๐›ฝ and we have shown the correspondence.
Exercise 3.6. Show that the ridge regression estimate is the mean (and mode) of the
posterior distribution, under a Gaussian prior ๐›ฝ ~ ๐‘(0, ๐œ๐‘ฐ), and Gaussian sampling
model ๐‘ฆ ~ ๐‘(๐‘ฟ๐›ฝ, ๐œŽ 2 ๐‘ฐ). Find the relationship between the regularization parameter ๐œ† in the
ridge formula, and the variances ๐œ and ๐œŽ 2 .
Proof: From Bayes rule, we can write the posterior distribution of ๐›ฝ as follows
Pr(๐›ฝ|๐ท) = Pr(๐ท|๐›ฝ) ⋅ Pr(๐›ฝ) / Pr(๐ท) ∝ Pr(๐ท|๐›ฝ) ⋅ Pr(๐›ฝ)
where Pr(๐ท|๐›ฝ) is the likelihood; that is, the probability of the data given ๐›ฝ. The last term states
that the posterior distribution is proportional to Pr(๐ท|๐›ฝ) ⋅ Pr(๐›ฝ): since the denominator is the
marginal density of ๐ท, which is constant in ๐›ฝ, we can leave it out when solving for ๐›ฝ.
Next, notice that the densities of the last two terms are
Pr(๐ท|๐›ฝ) ~ ๐‘(๐—๐›ฝ, ๐œŽ 2 ๐ˆ)
Pr(๐›ฝ) ~ ๐‘(0, ๐œ๐ˆ)
The first comes from the likelihood of the data under the regression assumptions and the second is
stated in the problem. Taking the log of the posterior (up to additive constants), we get
log(Pr(๐ท|๐›ฝ) ⋅ Pr(๐›ฝ)) = −(๐‘ฆ − ๐—๐›ฝ)๐‘‡ (๐‘ฆ − ๐—๐›ฝ)/(2๐œŽ 2 ) − ๐›ฝ ๐‘‡ ๐›ฝ/(2๐œ)
so that maximizing the posterior is equivalent to minimizing
(1/(2๐œŽ 2 )) ∑_{๐‘–=1}^{๐‘} [๐‘ฆ๐‘– − ๐›ฝ0 − ∑_{๐‘—=1}^{๐‘} ๐‘ฅ๐‘–๐‘— ๐›ฝ๐‘— ]^2 + (1/(2๐œ)) ∑_{๐‘—=1}^{๐‘} ๐›ฝ๐‘—2
and if we multiply this criterion by 2๐œŽ 2 , then we get the ridge criterion (3.41) with ๐œ† = ๐œŽ 2 /๐œ; thus
we have shown the relationship between ๐œ† and the two variance terms. Finally, by plugging
๐œ† into the above equation, we get the following,
๐›ฝฬ‚ ๐‘Ÿ๐‘–๐‘‘๐‘”๐‘’ = arg min_๐›ฝ ∑_{๐‘–=1}^{๐‘} [๐‘ฆ๐‘– − ๐›ฝ0 − ∑_{๐‘—=1}^{๐‘} ๐‘ฅ๐‘–๐‘— ๐›ฝ๐‘— ]^2 + ๐œ† ∑_{๐‘—=1}^{๐‘} ๐›ฝ๐‘—2 ,
which is the ridge regression form. Since we assumed our data were generated from a normal
distribution and we used a normal (conjugate) prior, we know that the posterior will also be normal.
This implies that the maximizing ๐›ฝ is the mode of the posterior distribution, and since the mean and
mode are equal for normal distributions (due to symmetry), the ridge estimate is both the posterior
mean and mode, which answers the problem.
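A minimal R check of this correspondence on simulated data is sketched below; the values of ๐œŽ 2 and ๐œ are arbitrary, the intercept is dropped, and ๐— is centered so that the closed forms line up exactly.
# R sketch (simulated data): posterior mean under a N(0, tau*I) prior = ridge with lambda = sigma^2 / tau
set.seed(1)
X <- scale(matrix(rnorm(100 * 3), 100, 3), center = TRUE, scale = FALSE)
sigma2 <- 1; tau <- 0.5
y <- X %*% c(1, -2, 0.5) + rnorm(100, sd = sqrt(sigma2))
lambda <- sigma2 / tau
ridge_est <- solve(crossprod(X) + lambda * diag(3), crossprod(X, y))
post_mean <- solve(crossprod(X) / sigma2 + diag(3) / tau, crossprod(X, y) / sigma2)
all.equal(c(ridge_est), c(post_mean))   # TRUE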
Exercise 3.9. Forward stepwise regression. Suppose we have the QR decomposition for
the ๐‘ × ๐‘ž matrix ๐‘ฟ1 in a multiple regression problem with response ๐‘ฆ, and we have an
additional ๐‘ − ๐‘ž predictors in the matrix ๐‘ฟ2 . Denote the current residual by ๐’“. We wish to
establish which one of these additional variables will reduce the residual sum-of-squares the
most when included with those in ๐‘ฟ1 . Describe an efficient procedure for doing this.
Proof: Recall from linear algebra that the nullspace of a matrix is ๐‘(๐‘‹) = {๐‘ฃ: ๐‘‹๐‘ฃ = 0}. The
residual vector ๐‘Ÿ from the current fit lives in the nullspace of ๐—1๐‘‡ since
๐—1๐‘‡ ๐‘Ÿ = ๐—1๐‘‡ (๐‘ฆ − ๐‘ฆฬ‚)
= ๐—1๐‘‡ (๐‘ฆ − ๐—1 (๐—1๐‘‡ ๐—1 )−1 ๐—1๐‘‡ ๐‘ฆ)
= ๐—1๐‘‡ ๐‘ฆ − ๐—1๐‘‡ ๐—1 (๐—1๐‘‡ ๐—1 )−1 ๐—1๐‘‡ ๐‘ฆ
= ๐—1๐‘‡ ๐‘ฆ − ๐—1๐‘‡ ๐‘ฆ = 0.
Since the QR decomposition is built by the Gram-Schmidt procedure, the columns of ๐1 form an
orthonormal basis for the column space of ๐—1 . The current fitted values ๐‘ฆฬ‚ are the projection of ๐‘ฆ onto
that subspace, and the residual ๐‘Ÿ is orthogonal to it. Since the residual sum-of-squares is
defined as ||๐‘ฆ − ๐‘ฆฬ‚||^2 , the squared distance between ๐‘ฆ and its projection, adding one of the remaining
๐‘ − ๐‘ž variables reduces it by the squared length of the projection of ๐‘Ÿ onto the part of that variable
that is orthogonal to ๐—1 . Writing ๐‘ฅ๐‘—∗ for ๐‘‹๐‘— orthogonalized against ๐—1 , and noting that ⟨๐‘ฅ๐‘—∗ , ๐‘Ÿ⟩ = ⟨๐‘‹๐‘— , ๐‘Ÿ⟩
because ๐‘Ÿ is orthogonal to ๐—1 , reducing the residual sum-of-squares the most amounts to finding
๐‘‹๐‘— = arg max_{๐‘‹๐‘— ∈ ๐—2 } ⟨๐‘‹๐‘— , ๐‘Ÿ⟩^2 / ⟨๐‘ฅ๐‘—∗ , ๐‘ฅ๐‘—∗ ⟩.
So that completes the first part of the problem.
For the second part, define a new matrix ๐— ∗ = [๐—1 ๐‘‹๐‘— ], that is, ๐—1 with the chosen column ๐‘‹๐‘—
appended. Rather than refitting from scratch, we can update the existing QR decomposition. Recall that
the Gram-Schmidt procedure orthogonalizes one column at a time, and from page 35 of this thesis the
new (last) column contributes
๐‘ง๐‘— = ๐‘ฅ๐‘— − (๐‘ฅ๐‘— ⋅ ๐‘’0 )๐‘’0 − (๐‘ฅ๐‘— ⋅ ๐‘’1 )๐‘’1 − โ‹ฏ − (๐‘ฅ๐‘— ⋅ ๐‘’๐‘ )๐‘’๐‘ ,    ๐‘’๐‘— = ๐‘ง๐‘— /||๐‘ง๐‘— ||2
and then we get the updated factorization ๐— ∗ = ๐๐‘, where ๐ = [๐‘’0 ๐‘’1 โ‹ฏ ๐‘’๐‘— ] has the orthonormal
columns produced by Gram-Schmidt and ๐‘ is the upper triangular matrix with entries ๐‘Ÿ๐‘˜๐‘™ = ๐‘ฅ๐‘™ ⋅ ๐‘’๐‘˜ for ๐‘˜ ≤ ๐‘™.
Then we can solve for the coefficients as in Exercise 3.4 by back-substitution,
๐‘๐›ฝ = ๐๐‘‡ ๐‘ฆ,
so that we have completed the problem. Solving the problem this way is much faster than solving for ๐›ฝฬ‚
using the normal equations with ๐— ∗ .
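A sketch of the selection step on simulated data is given below; the orthogonalization against ๐—1 is done with the ๐1 matrix that the existing QR decomposition already provides, and the data and dimensions are arbitrary choices for this illustration.
# R sketch (simulated data): pick the column of X2 that reduces the RSS the most
set.seed(1)
n <- 100
X1 <- cbind(1, matrix(rnorm(n * 3), n, 3))
X2 <- matrix(rnorm(n * 5), n, 5)
y <- X1 %*% c(1, 2, 0, -1) + 1.5 * X2[, 4] + rnorm(n)
Q1 <- qr.Q(qr(X1))
r   <- y - Q1 %*% crossprod(Q1, y)        # current residual
X2s <- X2 - Q1 %*% crossprod(Q1, X2)      # X2 orthogonalized against X1
drop_in_rss <- colSums(X2s * c(r))^2 / colSums(X2s^2)
which.max(drop_in_rss)                    # should select column 4, the truly relevant variable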
Exercise 3.10. Backward stepwise regression. Suppose we have the multiple regression
fit of y on ๐‘ฟ, along with the standard errors and Z-scores as in Table 3.2. We wish to establish
which variable, when dropped, will increase the residual sum-of-squares the least. How would
you do this?
Exercise 3.1 established that the ๐น statistic for dropping a single coefficient is equal to
the square of the corresponding ๐‘ง-score so that we get the following relationship
๐น = (๐‘…๐‘†๐‘†0 − ๐‘…๐‘†๐‘†1 ) / (๐‘…๐‘†๐‘†1 /(๐‘ − ๐‘ − 1)) = (๐‘…๐‘†๐‘†0 − ๐‘…๐‘†๐‘†1 )/๐œŽฬ‚ 2 = ๐‘ง๐‘—2
and then shuffling terms around, we get
⇒ ๐‘…๐‘†๐‘†0 − ๐‘…๐‘†๐‘†1 = ๐œŽฬ‚ 2 ๐‘ง๐‘—2
⇒ ๐‘…๐‘†๐‘†0 = ๐œŽฬ‚ 2 ๐‘ง๐‘—2 + ๐‘…๐‘†๐‘†1
which states that the smallest increase in the residual sum-of-squares corresponds to dropping the
variable with the smallest ๐‘ง-score in absolute value (i.e. the smallest ๐‘ง๐‘—2 ), and we have concluded the problem.
Exercise 3.13. Derive the expression (3.62), and show that ๐›ฝฬ‚ ๐‘๐‘๐‘Ÿ (๐‘) = ๐›ฝฬ‚ ๐‘™๐‘  .
Proof: Using the singular value decomposition of ๐— = ๐”๐ƒ๐• ๐‘‡ we can represent ๐›ฝฬ‚ ๐‘™๐‘  as
follows
๐— ๐‘‡ ๐—๐›ฝ = ๐— ๐‘‡ ๐‘ฆ
๐•๐ƒ๐‘‡ ๐”๐‘‡ ๐”๐ƒ๐• ๐‘‡ ๐›ฝ = ๐•๐ƒ๐‘‡ ๐”๐‘‡ ๐‘ฆ
๐•๐ƒ2 ๐• ๐‘‡ ๐›ฝ = ๐•๐ƒ๐”๐‘‡ ๐‘ฆ
where ๐”๐‘‡ ๐” = ๐ˆ since ๐” is an orthogonal matrix and ๐ƒ๐‘‡ = ๐ƒ because it is a diagonal matrix.
Then we can solve for ๐›ฝฬ‚ and get the following result
๐›ฝฬ‚ ๐‘™๐‘  = ๐•๐ƒ−1 ๐” ๐‘‡ ๐‘ฆ
Equation (3.62) on principal component regression has the form
๐›ฝฬ‚ ๐‘๐‘๐‘Ÿ (๐‘€) = ∑_{๐‘š=1}^{๐‘€} ๐œƒฬ‚๐‘š ๐‘ฃ๐‘š
where ๐œƒฬ‚๐‘š = ⟨๐‘ง๐‘š , ๐‘ฆ⟩/⟨๐‘ง๐‘š , ๐‘ง๐‘š ⟩, ๐‘ง๐‘š is the ๐‘šth principal component, that is, the ๐‘šth column of the matrix
๐™ = ๐—๐• = ๐”๐ƒ, and ๐‘ฃ๐‘š is the ๐‘šth column of the matrix ๐• from the singular value decomposition.
When ๐‘€ = ๐‘, the vector ๐œƒฬ‚ is the coefficient vector when regressing ๐‘ฆ onto the principal
components and is solved as follows
๐œƒฬ‚ = (๐™๐‘‡ ๐™)−1 ๐™๐‘‡ ๐‘ฆ = (๐ƒ๐‘‡ ๐”๐‘‡ ๐”๐ƒ)−1 ๐ƒ๐‘‡ ๐”๐‘‡ ๐‘ฆ = ๐ƒ−1 ๐”๐‘‡ ๐‘ฆ
and to get the vector ๐›ฝฬ‚ ๐‘๐‘๐‘Ÿ we do the following
๐›ฝฬ‚ ๐‘๐‘๐‘Ÿ = ๐•๐œƒฬ‚ = ๐•๐ƒ−1 ๐”๐‘‡ ๐‘ฆ
which is equivalent to that of least squares when ๐‘€ = ๐‘โˆŽ
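The equality ๐›ฝฬ‚ ๐‘๐‘๐‘Ÿ (๐‘) = ๐›ฝฬ‚ ๐‘™๐‘  can be checked numerically; a sketch on simulated, centered data follows.
# R sketch (simulated data): PCR with all M = p components equals least squares
set.seed(1)
X <- scale(matrix(rnorm(100 * 4), 100, 4), center = TRUE, scale = FALSE)
y <- X %*% c(1, 0, -2, 0.5) + rnorm(100)
y <- y - mean(y)
s <- svd(X)
theta    <- crossprod(s$u, y) / s$d      # theta_m = <z_m, y> / <z_m, z_m>, with Z = UD
beta_pcr <- s$v %*% theta                # V D^{-1} U^T y
beta_ls  <- solve(crossprod(X), crossprod(X, y))
all.equal(c(beta_pcr), c(beta_ls))       # TRUE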
Exercise 3.16. Derive the entries in Table 3.4, the explicit forms for estimators in the
orthogonal case.
Estimator              Formula
Best subset (size ๐‘€)   ๐›ฝฬ‚๐‘— ⋅ ๐ผ(|๐›ฝฬ‚๐‘— | ≥ |๐›ฝฬ‚๐‘€ |)
Ridge                  ๐›ฝฬ‚๐‘— /(1 + ๐œ†)
Lasso                  sign(๐›ฝฬ‚๐‘— )(|๐›ฝฬ‚๐‘— | − ๐œ†)+
Table 3.4: estimators of ๐›ฝ๐‘— in the case of orthonormal columns of ๐—. ๐‘€ and ๐œ† are constants chosen by
the corresponding techniques; sign denotes the sign of its argument (±1).
Proof:
For least-squares
๐›ฝฬ‚ ๐‘™๐‘  = (๐— ๐‘‡ ๐—)−1 ๐— ๐‘‡ ๐‘ฆ = ๐— ๐‘‡ ๐‘ฆ
For best-subset selection, with orthonormal columns the coefficient of each retained variable is unchanged
when other variables are dropped, so the best subset of size ๐‘€ simply keeps the ๐‘€ largest coefficients in
absolute value, giving ๐›ฝฬ‚๐‘— ⋅ ๐ผ(|๐›ฝฬ‚๐‘— | ≥ |๐›ฝฬ‚๐‘€ |), where ๐›ฝฬ‚๐‘€ is the ๐‘€th largest coefficient in absolute value; in
particular ๐›ฝฬ‚ ๐‘๐‘  (๐‘€ = ๐‘) = ๐— ๐‘‡ ๐‘ฆ = ๐›ฝฬ‚ ๐‘™๐‘  .
For ridge regression we get the following estimates
๐›ฝฬ‚ ๐‘Ÿ๐‘–๐‘‘๐‘”๐‘’ = (๐— ๐‘‡ ๐— + ๐œ†๐ˆ)−1 ๐— ๐‘‡ ๐‘ฆ = (๐ˆ + ๐œ†๐ˆ)−1 ๐— ๐‘‡ ๐‘ฆ = ๐›ฝฬ‚ ๐‘™๐‘  /(1 + ๐œ†)
For the lasso, consider the criterion in its Lagrangian form (with the conventional factor of one half on the
residual sum-of-squares, which is the form that leads to the entry in Table 3.4),
๐ฟ(๐›ฝ) = (1/2)(๐‘ฆ − ๐—๐›ฝ)๐‘‡ (๐‘ฆ − ๐—๐›ฝ) + ๐œ† ∑๐‘— |๐›ฝ๐‘— |.
For the nonzero coordinates the optimality condition is
−๐— ๐‘‡ ๐‘ฆ + ๐— ๐‘‡ ๐—๐›ฝ + ๐œ† ⋅ sign(๐›ฝ) = 0
and with ๐— ๐‘‡ ๐— = ๐ˆ and ๐›ฝฬ‚ ๐‘™๐‘  = ๐— ๐‘‡ ๐‘ฆ, solving coordinate-wise gives the soft-thresholding rule
๐›ฝฬ‚๐‘—๐‘™๐‘Ž๐‘ ๐‘ ๐‘œ = sign(๐›ฝฬ‚๐‘— )(|๐›ฝฬ‚๐‘— | − ๐œ†)+ โˆŽ
Exercise 3.17. Repeat the analysis of Table 3.3 on the spam data discussed in Chapter 1
where Table 3.3 compares the coefficient estimates of least-squares, best subset, ridge, lasso,
principal component regression, and partial least-squares.
This dataset has 57 variables with a binary response indicating spam or not spam. Fitting
each model and using cross validation to generate the mean squared errors, we get the following
table
             LS       Best Subset   Ridge      Lasso        PCR          PLS
Parameters                          ๐œ† = .08    ๐œ† = 0.0017   ncomp = 56   ncomp = 12
Test Error   0.334                  0.334      0.334        0.335        0.333
Std Error    0.0115                 0.0169     0.0182       0.0177       0.0158
Best subset regression did not complete as it took too long. The authors of the text mention that it
can work when the number of variables ๐‘ is up to 30 or 40, but it is not feasible here since the search
takes exponential time in ๐‘.
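For reference, a hedged sketch of one way such a comparison could be assembled in R is given below. The spam data layout (57 predictors plus a binary response in the last column, e.g. as loaded from the ElemStatLearn package) and the use of the glmnet and pls packages are assumptions of this sketch, not a record of the code actually used to produce the table above.
# R sketch (assumed data layout and packages): fit the competing models with CV
library(glmnet)   # ridge and lasso
library(pls)      # principal components and partial least squares regression
x <- as.matrix(spam[, -58])                 # assumed: 57 predictors
y <- as.numeric(as.factor(spam[, 58])) - 1  # assumed: binary response recoded to 0/1
ls_fit   <- lm(y ~ x)
ridge_cv <- cv.glmnet(x, y, alpha = 0)      # ridge
lasso_cv <- cv.glmnet(x, y, alpha = 1)      # lasso
pcr_fit  <- pcr(y ~ x, validation = "CV")
pls_fit  <- plsr(y ~ x, validation = "CV")
min(ridge_cv$cvm); min(lasso_cv$cvm)        # cross-validated mean squared errors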
Exercise 3.19. Show that ||๐›ฝฬ‚๐‘Ÿ๐‘–๐‘‘๐‘”๐‘’ || increases as its tuning parameter ๐œ† → 0. Does the
same property hold for the lasso and partial least squares estimates? For the latter, consider the
“tuning parameter” to be the successive steps in the algorithm.
Proof: The solution ๐›ฝฬ‚ ๐‘Ÿ๐‘–๐‘‘๐‘”๐‘’ can be written, as in the derivation of equation (3.47), as
๐›ฝฬ‚ ๐‘Ÿ๐‘–๐‘‘๐‘”๐‘’ = (๐— ๐‘‡ ๐— + ๐œ†๐ˆ)−1 ๐— ๐‘‡ ๐‘ฆ
and, using the singular value decomposition of the centered matrix ๐—,
= ๐•(๐ƒ2 + ๐œ†๐ˆ)−1 ๐ƒ๐”๐‘‡ ๐‘ฆ
so that
||๐›ฝฬ‚ ๐‘Ÿ๐‘–๐‘‘๐‘”๐‘’ ||^2 = ๐‘ฆ ๐‘‡ ๐”๐ƒ(๐ƒ2 + ๐œ†๐ˆ)−1 ๐• ๐‘‡ ๐•(๐ƒ2 + ๐œ†๐ˆ)−1 ๐ƒ๐”๐‘‡ ๐‘ฆ
= ๐‘ฆ ๐‘‡ ๐”๐ƒ(๐ƒ2 + ๐œ†๐ˆ)−2 ๐ƒ๐”๐‘‡ ๐‘ฆ
= ∑_{๐‘—=1}^{๐‘} (๐‘ข๐‘—๐‘‡ ๐‘ฆ)^2 ๐‘‘๐‘—2 /(๐‘‘๐‘—2 + ๐œ†)^2
and as ๐œ† → 0, each factor ๐‘‘๐‘—2 /(๐‘‘๐‘—2 + ๐œ†)^2 increases, and thus ||๐›ฝฬ‚ ๐‘Ÿ๐‘–๐‘‘๐‘”๐‘’ ||^2 increases and
we have concluded the exercise.
CHAPTER 4
LINEAR METHODS FOR CLASSIFICATION
4.1 Introduction
This chapter goes over topics including the classification task and linear discriminant
analysis.
4.2 Linear Discriminant Analysis
This chapter is an extension of Chapter 3 from a regression point of view to a
classification one. On page 113, we will derive the equations listed in the center of the page.
Notice that the second equation is
log |๐ถ๐‘œ๐‘ฃฬ‚๐‘˜ | = ∑๐‘™ log ๐‘‘๐‘˜๐‘™ . (103)
This is the log-determinant of the estimated covariance matrix and uses the property of determinants that
det(๐ถ๐‘œ๐‘ฃฬ‚๐‘˜ ) = ∏๐‘™ ๐‘‘๐‘˜๐‘™ . That is, the determinant is equal to the product of the eigenvalues, which
can be obtained from the matrix ๐ƒ of the eigen decomposition of the covariance matrix.
The next bullet point on the same page states that we can sphere the data with respect to
the common covariance matrix by applying
๐‘‹ ∗ ← ๐ƒ−1/2 ๐”๐‘‡ ๐‘‹. (104)
This occurs by first decomposing the covariance of ๐‘‹ by its eigen decomposition; that
is, ๐ถ๐‘œ๐‘ฃ(๐‘‹) = ๐”๐ƒ๐”๐‘‡ . Writing ๐ƒ = ๐ƒ1/2 ๐ƒ1/2 , we can factor the covariance as
๐ถ๐‘œ๐‘ฃ(๐‘‹) = ๐”๐ƒ1/2 ๐ƒ1/2 ๐”๐‘‡ = (๐”๐ƒ1/2 )(๐”๐ƒ1/2 )๐‘‡
so that ๐”๐ƒ1/2 acts as a square root of the covariance matrix, and inverting it gives
(๐”๐ƒ1/2 )−1 = ๐ƒ−1/2 ๐”−1 = ๐ƒ−1/2 ๐”๐‘‡ (105)
which is exactly the sphering transformation in (104).
By sphering the data, we change the covariance of ๐‘‹ ∗ to the identity matrix. This can
be shown as follows
๐ถ๐‘œ๐‘ฃ(๐‘‹ ∗ ) = ๐ถ๐‘œ๐‘ฃ(๐ƒ−1/2 ๐”๐‘‡ ๐‘‹) (106)
= ๐ƒ−1/2 ๐”๐‘‡ ๐ถ๐‘œ๐‘ฃ(๐‘‹)๐”๐ƒ−1/2
= ๐ƒ−1/2 ๐”๐‘‡ ๐”๐ƒ๐”๐‘‡ ๐”๐ƒ−1/2
= ๐ˆ (107)
so that we get the identity matrix as the covariance of ๐‘‹ ∗ and complete the proof.
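A short numerical sketch of sphering on simulated, correlated data is given below; the mixing matrix used to create the correlation is arbitrary.
# R sketch (simulated data): sphering X so the transformed data has identity covariance
set.seed(1)
A <- matrix(c(2, 1, 0, 0, 1, 1, 0, 0, 3), 3, 3)            # arbitrary mixing matrix
X <- matrix(rnorm(500 * 3), 500, 3) %*% A                  # correlated data
e <- eigen(cov(X))                                         # Cov(X) = U D U^T
Xc <- scale(X, center = TRUE, scale = FALSE)
X_star <- Xc %*% e$vectors %*% diag(1 / sqrt(e$values))    # rows transformed by D^{-1/2} U^T
round(cov(X_star), 6)                                      # identity matrix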
4.3 Problems and Solutions
Exercise 4.1. Show how to solve the generalized eigenvalue problem ๐‘š๐‘Ž๐‘ฅ ๐›ผ ๐‘‡ ๐‘ฉ๐›ผ subject
to ๐›ผ ๐‘‡ ๐‘พ๐›ผ = 1 by transforming to a standard eigenvalue problem where ๐‘ฉ is the between-class
covariance matrix and ๐‘พ is the within-class covariance matrix.
Proof: Since the constraint is an equality constraint, we can set the problem up in Lagrangian form and
solve it using Lagrange multipliers. The problem is of the form
max๐›ผ ๐›ผ ๐‘‡ ๐๐›ผ subject to ๐›ผ ๐‘‡ ๐–๐›ผ = 1.
Then, in Lagrangian form, this is
L(๐›ผ, ๐œ†) = ๐›ผ ๐‘‡ ๐๐›ผ − ๐œ†(๐›ผ ๐‘‡ ๐–๐›ผ − 1).
We take partial derivatives with respect to ๐›ผ and ๐œ† so that
๐œ•L(๐›ผ, ๐œ†)/๐œ•๐›ผ = 2๐๐›ผ − 2๐œ†๐–๐›ผ = 0 (1)
๐œ•L(๐›ผ, ๐œ†)/๐œ•๐œ† = −(๐›ผ ๐‘‡ ๐–๐›ผ − 1) = 0 (2)
And so for the first equation,
⇒ ๐๐›ผ = ๐œ†๐–๐›ผ
⇒ ๐– −1 ๐๐›ผ = ๐œ†๐›ผ.
Notice that this is a standard eigenvalue problem for the matrix ๐– −1 ๐. Multiplying ๐๐›ผ = ๐œ†๐–๐›ผ on the
left by ๐›ผ ๐‘‡ and using the constraint ๐›ผ ๐‘‡ ๐–๐›ผ = 1 gives ๐›ผ ๐‘‡ ๐๐›ผ = ๐œ†, so to maximize the original quantity we
take ๐›ผ to be the eigenvector of ๐– −1 ๐ corresponding to its largest eigenvalue ๐œ† (rescaled so that
๐›ผ ๐‘‡ ๐–๐›ผ = 1).
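A small numerical sketch of this reduction is given below; the matrices ๐– and ๐ are simulated stand-ins for within-class and between-class covariance matrices, built only so that ๐– is positive definite and ๐ is positive semidefinite.
# R sketch (simulated matrices): solve max a' B a subject to a' W a = 1 via eigen(W^{-1} B)
set.seed(1)
Z <- matrix(rnorm(60), 20, 3)
W <- crossprod(Z) / 20 + diag(3)            # stand-in within-class covariance (positive definite)
m <- rnorm(3)
B <- tcrossprod(m)                          # stand-in rank-one between-class covariance
a <- Re(eigen(solve(W) %*% B)$vectors[, 1]) # eigenvector with the largest eigenvalue (Re() guards against tiny imaginary parts)
a <- a / sqrt(c(t(a) %*% W %*% a))          # rescale so that a' W a = 1
c(objective = c(t(a) %*% B %*% a), constraint = c(t(a) %*% W %*% a))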
REFERENCES
[1] Agresti, Alan. (2012). Categorical Data Analysis, 3rd edition.
[2] Hastie, Tibshirani, & Friedman. (2009). The Elements of Statistical Learning, 2nd edition.
[3] Ng, Andrew. CS229 Lecture Notes on Learning Theory.
[4] Schmidt, Mark. (2005). Least Squares Optimization with L1-Norm Regularization.
[5] Pedregosa, et al. (2011). Scikit-learn: Machine Learning in Python. JMLR 12, pp. 2825-2830.
[6] Weatherwax, John & Epstein, David. (2013). A Solution Manual and Notes for: The
Elements of Statistical Learning by Jerome Friedman, Trevor Hastie, and Robert Tibshirani.
[7] Wolfram Alpha LLC. (2009). Wolfram|Alpha.
http://www.wolframalpha.com/input/?i=x^2%2By^2 (accessed October 20, 2014).