A GUIDE AND SOLUTION MANUAL TO THE ELEMENTS OF STATISTICAL LEARNING

by JAMES CHUANBING MA

Under the direction of WILLIAM MCCORMICK

ABSTRACT

This Master's thesis will provide R code and graphs that reproduce some of the figures in the book Elements of Statistical Learning. Selected topics are also outlined and summarized so that the material is more readable. Additionally, it covers some of the solutions to the problems for Chapters 2, 3, and 4.

INDEX WORDS: Elements of Statistical Learning, Solution Manual, Guide, ESL Guide

A GUIDE AND SOLUTION MANUAL TO THE ELEMENTS OF STATISTICAL LEARNING

By JAMES CHUANBING MA
B.S., Emory University, 2008

A Thesis Submitted to the Graduate Faculty of The University of Georgia in Partial Fulfillment of the Requirements for the Degree MASTER OF SCIENCE

ATHENS, GEORGIA
2014

© 2014 James Chuanbing Ma
All Rights Reserved

A GUIDE AND SOLUTION MANUAL TO THE ELEMENTS OF STATISTICAL LEARNING

by JAMES CHUANBING MA

Electronic Version Approved:
Julie Coffield, Interim Dean of the Graduate School
The University of Georgia
December 2014

Major Professor: William McCormick
Committee: Jaxk Reeves, Kim Love-Myers

TABLE OF CONTENTS

CHAPTER                                                      PAGE
1 INTRODUCTION ................................................ 1
2 OVERVIEW OF SUPERVISED LEARNING ............................. 3
   2.1 Introduction ........................................... 3
   2.2 Mathematical Notation .................................. 3
   2.3 Least Squares and Nearest Neighbors .................... 4
   2.4 Statistical Decision Theory ............................ 7
   2.5 Problems and Solutions ................................ 18
3 LINEAR METHODS FOR REGRESSION .............................. 32
   3.1 Introduction .......................................... 32
   3.2 Linear Regression Models and Least Squares ............ 32
   3.3 Shrinkage Methods ..................................... 33
   3.4 Methods Using Derived Input Directions ................ 43
   3.5 Problems and Solutions ................................ 44
4 LINEAR METHODS FOR CLASSIFICATION .......................... 57
   4.1 Introduction .......................................... 57
   4.2 Linear Discriminant Analysis .......................... 57
   4.3 Problems and Solutions ................................ 59
REFERENCES ................................................... 61

CHAPTER 1
INTRODUCTION

The Elements of Statistical Learning is a popular book on data mining and machine learning written by three statistics professors at Stanford. The book is intended for researchers in the field and for people who want to build robust machine learning libraries, and it is therefore inaccessible to many people who are new to the field. On Amazon, roughly 1 out of 5 reviewers find the book too difficult to read, with some going as far as to call the book a "disaster". There are many instances of the expressions "it is easy to show" or "the exercise is left to the reader", and often, for the novice reader, these problems are not so easy to show. My goal in writing this thesis is to provide clarity to the book for novice readers.
Therefore, my goal is not to reproduce the book, but to act as a supplement that covers parts of the book the authors gloss over. In particular, my contributions are as follows:

• Derive results that are glossed over in the book ("it is easy to show").
• Provide solutions to exercises that can be understood by the novice reader.
• Provide R code that the reader can copy and paste.

With that said, there is a level requirement for reading this thesis. Specifically, I assume that the reader has taken courses in calculus, probability, linear algebra, and linear regression. I will assume that the reader is comfortable with calculus topics such as gradients, derivatives, and vector spaces. For probability, an understanding of random variables, probability mass functions, cumulative distribution functions, expectation, variance and covariance, and basic multivariate distributions is required. For linear algebra, the reader should have a solid understanding of basic matrix operations, inverses, norms, determinants, eigenvalues and eigenvectors, and definiteness of matrices. And for linear regression, the reader should be familiar with sum-of-squared errors, least squares estimation of parameters, and general hypothesis testing.

This thesis is an introduction and covers Chapters 2 (Overview of Supervised Learning), 3 (Linear Regression), and 4 (Classification). An updated copy with Chapters 7 (Model Validation), 8 (Model Inference), 9 (Additive Models), and 10 (Boosted Trees) is intended by mid-January. This manual is in no way complete and will be an ongoing project after I graduate. Please refer to the site below if you are interested in new updates or in contributing to the project. Suggestions and comments for improvement are always welcome!

http://eslsolutionmanual.weebly.com

CHAPTER 2
OVERVIEW OF SUPERVISED LEARNING

2.1 Introduction

This section goes over mathematical notation, least squares and nearest neighbors, statistical decision theory, and the bias-variance decomposition.

2.2 Mathematical Notation

The mathematical notation adopted in this guide is identical to the one used in the book and is summarized below.

• Matrices are written in bold: $\mathbf{X} \in \mathbb{R}^{N \times p}$ is a matrix with $N$ rows and $p$ columns.
• We denote the $j$th column of the matrix $\mathbf{X}$ as $\mathbf{x}_j$, so that the matrix can be written column by column as $\mathbf{X} = [\,\mathbf{x}_1 \;\; \mathbf{x}_2 \;\; \cdots \;\; \mathbf{x}_p\,]$.
• We denote the $i$th row of the matrix $\mathbf{X}$ as $\mathbf{x}_i^T$, so that the matrix can also be written as the stacked rows $\mathbf{X} = [\,\mathbf{x}_1^T;\ \mathbf{x}_2^T;\ \ldots;\ \mathbf{x}_N^T\,]$.
• Generic (random) vectors are capitalized, $X$, and observed vectors are in lowercase, $x$.
• The $i$th observation of a vector $x$ is denoted $x_i$, so that $x = (x_1, x_2, \ldots, x_N)^T$.
• We denote by $Y$ a quantitative output, which takes values in the real numbers.
• We denote by $G$ a qualitative output.

In the text, there are many terms that are not explained, so we will try to define them here. We denote $x_i$ as the input variable (also called the features) and $y_i$ as the output variable (also called the target) that we are trying to predict. A training example is the pair $(x_i, y_i)$, and the training set is a list of $N$ training examples $\{(x_i, y_i) : i = 1, 2, \ldots, N\}$, often denoted $T$ in the text. Given a training set, our goal is to learn a function (also called a model) $f: X \to Y$ so that our function is "good" at mapping the inputs to the corresponding outputs. We will define what "good" means in Section 2.4. Supervised learning refers to learning such functions from labeled data.

2.3 Least Squares and Nearest Neighbors

On page 12, in Equation (2.6), the author provides the unique solution for the coefficient vector $\beta$ as follows: $\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T y$.
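Before walking through the derivation of this formula, it is worth seeing it in action. The short R sketch below is our own illustration on simulated data (the variables x1, x2, and y are made up for this example); it computes the closed-form solution directly and checks that it matches the coefficients returned by lm().

# A quick numerical check of the closed form (our own simulated example)
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)
X <- cbind(1, x1, x2)                              # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y       # (X^T X)^{-1} X^T y
t(beta_hat)
coef(lm(y ~ x1 + x2))                              # agrees with beta_hat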
4 (1) Recall that in linear regression, we find the solution to the parameter vector ๐ฝ by minimizing the sum-of-squared-errors written as follows ๐ ๐ 2 ๐ ๐๐(๐ฝ) = ∑ (๐ฆ๐ − ๐ฝ0 − ∑ ๐ฝ๐ ๐ฅ๐๐ ) . ๐=1 (2) ๐=1 Now define the vector ๐ฝ = 〈๐ฝ0 , ๐ฝ1 , … , ๐ฝ๐ 〉 so that ๐ฝ ∈ ๐ ๐+1 . It is important to note that we have included the intercept term in the vector and thus the corresponding observed values ๐ฅ๐ have an implicit 1 as its first element so that we get ๐ฅ๐ = 〈๐ฅ๐1 , ๐ฅ๐2 , … , ๐ฅ๐๐ 〉 → 〈1, ๐ฅ๐1 , ๐ฅ๐2 , … , ๐ฅ๐๐ 〉 (3) after including the intercept term. Then we can rewrite the above quantity in vector form as follows ๐ ๐ ๐๐(๐ฝ) = ∑(๐ฆ๐ − ๐ฅ๐๐ ๐ฝ)2 . (4) ๐=1 To get the above quantity in matrix form, we will introduce the design matrix ๐ ∈ ๐ ๐×(๐+1) that contains the training input vectors ๐ฅ๐ along the rows: ๐= − − [− 5 ๐ฅ1๐ ๐ฅ2๐ โฎ ๐ฅ๐๐ − − −] and define the vector ๐ฆ ∈ ๐ ๐ to be the vector with the training labels: ๐ฆ1 ๐ฆ2 ๐ฆ = [ โฎ ]. ๐ฆ๐ Then we can easily show that ๐ฆ1 ๐ฅ1๐ ๐ฝ ๐ ๐ฆ2 ๐ฆ − ๐๐ฝ = [ โฎ ] − ๐ฅ2 ๐ฝ โฎ ๐ ๐ฆ๐ [ ๐ฅ๐ ๐ฝ ] ๐ฆ1 − ๐ฅ1๐ ๐ฝ ๐ = ๐ฆ2 − ๐ฅ2 ๐ฝ โฎ [ ๐ฆ๐ − ๐ฅ๐๐ ๐ฝ ] so that we can rewrite the residual-sum-of-squared errors in matrix form as ๐ ๐๐(๐ฝ) = (๐ฆ − ๐๐ฝ)๐ (๐ฆ − ๐๐ฝ) = ๐ฆ ๐ ๐ฆ − 2๐ฝ ๐ ๐ ๐ ๐ฆ + ๐ฝ ๐ ๐ ๐ ๐๐ฝ (5) (6) Now, to minimize the function, set the derivative to zero ๐๐ ๐๐(๐ฝ) = −2๐ ๐ ๐ฆ + 2๐ ๐ ๐๐ฝ = 0 ๐๐ฝ (7) ⇒ ๐ ๐ ๐๐ฝ = ๐ ๐ ๐ฆ (8) ⇒ ๐ฝฬ = (๐ T ๐)−1 ๐ ๐ ๐ฆ (9) and if ๐ is full rank, we get the unique solution in ๐ฝ shown in (1). 6 It is important to point out that for linear regression, our model here is ๐(๐ฅ๐ ) = ๐ฅ๐๐ ๐ฝ so that the prediction at an arbitrary point ๐ฅ0 is ๐ฆฬ0 = ๐ฅ0๐ ๐ฝ. On page 14, Equation (2.8) states that the k-nearest neighbors model has the form ๐ฬ(๐ฅ) = 1 ๐ ∑ ๐ฆ๐ . (10) ๐ฅ๐ ∈๐๐ (๐ฅ) Where ๐๐ (๐ฅ) is the neighborhood of x defined by the k closest points ๐ฅ๐ in the training sample. The word closest implies a metric and here it is the Euclidean distance. More formally, define the ๐ท = {๐(๐ฅ๐ , ๐ฅ): ๐ฅ๐ ∈ ๐} where ๐(๐ฅ๐ , ๐ฅ) is any metric where for Euclidean distance it is ๐(๐ฅ๐ , ๐ฅ) = ||๐ฅ๐ − ๐ฅ||2. Then the neighborhood is ๐๐ (๐ฅ) = {๐ฅ๐ : ๐(๐ฅ๐ , ๐ฅ) ≤ ๐๐ where dk is the ๐th smallest element of ๐ท, ๐ฅ๐ ∈ ๐}. 2.4 Statistical Decision Theory On page 18, Equation (2.9) of the book defines the squared error loss function as ๐ฟ(๐, ๐(๐)) = (๐ − ๐(๐)) 2 (11) which gives us the following expected prediction error 2 ๐ธ๐๐ธ(๐) = ๐ธ [(๐ − ๐(๐)) ] = ∫ ∫ [๐ฆ − ๐(๐ฅ)]2 Pr(๐ฆ, ๐ฅ)๐๐ฆ๐๐ฅ (12) (13) Now, we will fill in the steps that are skipped in the book. Recall that we can factor the joint density as Pr(๐, ๐) = Pr(๐|๐) Pr(๐). 7 Then = ∫ ∫[๐ฆ − ๐(๐ฅ)]2 Pr(๐ฆ|๐ฅ) Pr(๐ฅ) ๐๐ฆ๐๐ฅ ๐ฅ ๐ฆ ๐ธ๐๐ธ(๐) = ๐ธ๐ ๐ธ๐|๐ ([๐ − ๐(๐)]2 |๐). (14) Notice that by conditioning on ๐, we have freed the dependency of the function ๐ on ๐ and since the quantity [๐ − ๐]2 is convex, there is a unique solution. We can now minimize to solve for ๐ ๐(๐ฅ) = arg min ๐ธ๐|๐ ([๐ − ๐]2 |๐ = ๐ฅ) (15) ๐ ∫ [๐ − ๐]2 Pr(๐ฆ|๐ฅ) ๐๐ฆ = 0 ๐๐ (16) ๐ [๐ฆ − ๐]2 Pr(๐ฆ|๐) ๐๐ฆ = 0 ๐๐ (17) ๐ ⇒ =∫ = 2∫ ๐ฆ๐๐(๐ฆ|๐ฅ)๐๐ฆ = 2๐ ∫ ๐๐(๐ฆ|๐ฅ) ๐๐ฆ = 0 (18) ⇒ 2๐ธ[๐|๐] = 2๐ (19) ⇒ ๐ = ๐ธ[๐|๐ = ๐ฅ]. (20) Thus we get the Equation (2.13) from the book. Lets continue down this path and work out the problem from page 20, Equation (2.18) the author replaces the squared loss function with the absolute loss function: ๐ธ[|๐ − ๐(๐)|]. The equation from the book is as follows ๐ฬ(๐ฅ) = median(๐|๐ = ๐ฅ) 8 (21) then the expected prediction error for absolute loss is almost identical to that of squared loss. 
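Before working through the derivation, a short simulation (our own illustration, not part of the book's argument) shows both facts numerically: over a finite sample, the mean minimizes the average squared loss while the median minimizes the average absolute loss.

# A quick check (our own): the mean minimizes squared loss, the median minimizes absolute loss
set.seed(1)
y <- rexp(1000)                          # a skewed sample, so the mean and median differ
f <- seq(0, 3, by = 0.001)               # candidate constant predictions
sq_loss  <- sapply(f, function(c) mean((y - c)^2))
abs_loss <- sapply(f, function(c) mean(abs(y - c)))
c(f[which.min(sq_loss)], mean(y))        # the squared-loss minimizer sits at the sample mean
c(f[which.min(abs_loss)], median(y))     # the absolute-loss minimizer sits at the sample median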
If you become confused with this example, refer back to the squared loss derivation for the first few steps. We can write the expected prediction error for absolute loss as follows ๐ธ๐๐ธ(๐) = ๐ธ[|๐ − ๐(๐)|] (22) = ∫ ∫ |๐ฆ − ๐(๐ฅ)|Pr(๐ฆ, ๐ฅ)๐๐ฆ๐๐ฅ (23) then recall that by factoring joint density, we can rewrite the above quantity as ๐ธ๐๐ธ(๐) = ๐ธ๐ ๐ธ๐|๐ (|๐ − ๐(๐)||๐). (24) and we free the dependency of f on x again so that it suffices to minimize the conditional density of y ๐(๐ฅ) = arg min ๐ธ๐|๐ (|๐ − ๐||๐ = ๐ฅ) (25) ๐ ∫ |๐ − ๐|Pr(๐ฆ|๐) ๐๐ฆ = 0 ๐๐ (26) ๐ = From here, the integration is a little sophisticated and requires a branch of analysis known as measure theory. For this reason, we will provide an approximation that does not address the smaller details. By using the law of large numbers, we get the following ๐ ๐ 1 1 ∫ |๐ − ๐| Pr(๐ฆ|๐)๐๐ฆ = ๐ฟ๐๐๐→∞ ∑ |๐๐ − ๐| ≈ ∑ |๐๐ − ๐| ๐ ๐ ๐=1 (27) ๐=1 and so that we can get an approximation that converges when n is large. For this example, we will use the approximation, the last term in the above equation. Notice here that the absolute function is piecewise 9 ๐๐ − ๐, |๐๐ − ๐| = {๐ − ๐๐ , 0, ๐๐ − ๐ > 0 ๐๐ − ๐ < 0 ๐๐ = ๐ (28) so that taking the derivative is not continuous at zero −1, ๐ |๐ − ๐| = { 1, ๐๐ ๐ 0, ๐๐ − ๐ > 0 ๐๐ − ๐ < 0 . ๐๐ = ๐ (29) Now, we can introduce a new function, the sign function to make the definition clearer 1, ๐ฅ>0 ๐ฅ<0 ๐ ๐๐๐(๐ฅ) = { −1, 0, ๐๐กโ๐๐๐ค๐๐ ๐ (30) so that substituting it back in we get ๐ ๐ 1 ≈ ∑ |๐๐ − ๐| = 0 ๐๐ ๐ (31) ๐=1 ๐ 1 = ∑ −๐ ๐๐๐(๐๐ − ๐) = 0 ๐ (32) ๐=1 ๐ = ∑ ๐ ๐๐๐(๐๐ − ๐) = 0. (33) ๐=1 At what value of f does the above quantity hold? It holds when there is an equal number of positive and negative values; that is, where ๐๐๐๐(๐๐ − ๐ > 0) = ๐๐๐๐(๐๐ − ๐ < 0). 10 (34) The value of f where that is true is the median. Recall that the median can be found by sorting a finite list of numbers from lowest value to highest value and picking the middle one. When there is an odd number of observations, there is a single number that divides the set (e.g. the median of {2,3,4,5,6} is 4). When there is an even number, there is a range of values that can divide the set (e.g. the median of {2,4,6,8} is the any value in ๐ ∈ (4,6)). In conclusion, we have shown that ๐ฬ(๐ฅ) = median(๐|๐ = ๐ฅ). On page 24, Equation (2.25) is nested in Equation (2.27) so we will derive Equation (2.27). The derivation is a little tedious and algebraic but it is an important one. The expected (squared) prediction error at an arbitrary point ๐ฅ0 is ๐ธ๐๐ธ(๐ฅ0 ) = ๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ (๐ฆ0 − ๐ฆฬ0 )2 . (35) The T here is the training set that we defined earlier, ๐ฆ0 |๐ฅ0 and ๐ฆ0 are identical quantities that represent a random variable conditioned on ๐ฅ0 . Additionally, ๐ฆฬ0 is our prediction at ๐ฅ0 , that is ๐ฆฬ0 = ๐ฬ(๐ฅ0 ) where ๐ฬ is our model estimate of the true model ๐ fitted on the training data. We assume the model ๐ = ๐(๐) + ๐ (36) where ๐ ~ ๐(0, ๐ 2 ) and independently distributed. To start the proof, we use a little trick by inserting the true function ๐(๐ฅ0 ) as follows ๐ธ๐๐ธ(๐ฅ0 ) = ๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ (๐ฆ0 − ๐ฆฬ0 )2 = ๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ ((๐ฆ0 − ๐(๐ฅ0 )) + (๐(๐ฅ0 ) − ๐ฆฬ0 ))2 . 11 (37) Notice that by subtracting and adding ๐(๐ฅ0 ), the value of the function does not change. It is also important to point out here that the model ๐ฬ(๐ฅ0 ) = ๐ฆฬ0 is fitted on the training set by our definition in the introduction. Thus, the only term in the above function that depends on T is ๐ฆฬ0 . 
We then continue with the problem and factor out the squared term as 2 = ๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ [(๐ฆ0 − ๐(๐ฅ0 )) ] + 2๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ [(๐ฆ − ๐(๐ฅ0 )) ⋅ (๐(๐ฅ0 ) − ๐ฆฬ0 )] +๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ [(๐(๐ฅ0 ) − ๐ฆฬ0 )2 ]. (38) 2 The first quantity ๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ [(๐ฆ0 − ๐(๐ฅ0 )) ] is independent of the training set so we can reduce it to 2 2 ๐ธ๐ฆ0 |๐ฅ0 [(๐ฆ0 − ๐(๐ฅ0 )) ] = ∫(๐ฆ0 − ๐(๐ฅ0 )) Pr(๐ฆ0 |๐ฅ0 ) ๐๐ฆ (39) where ๐ธ[๐ฆ0 ] = ๐(๐ฅ0 ), the true mean. Then notice that this is just the function for the conditional variance that we specified as ๐ 2 . For the second quantity, we can factor out the middle term: ๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ [(๐ฆ0 − ๐(๐ฅ0 )) ⋅ (๐(๐ฅ0 ) − ๐ฆฬ0 )] = ๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ [๐ฆ0 ๐(๐ฅ0 ) − ๐ฆ0 ๐ฆฬ0 − ๐(๐ฅ0 )2 + ๐(๐ฅ0 )๐ฆฬ0 ] (40) and by linearity of expectation, = ๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ [๐ฆ0 ๐(๐ฅ0 )] − ๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ [๐ฆ0 ๐ฆฬ0 ] − ๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ [๐(๐ฅ0 )2 ] + ๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ [๐(๐ฅ0 )๐ฆฬ0 ]. (41) We stated that only ๐ฆฬ0 is dependent on T. Additionally, notice that ๐(๐ฅ0 ) = ๐ธ[๐ฆ0 ] so it is a constant term so that we can reduce the above quantity to = ๐(๐ฅ0 )๐ธ๐ฆ0 |๐ฅ0 [๐ฆ0 ] − ๐ธ๐ฆ0 |๐ฅ0 [๐ฆ0 ๐ธ๐ [๐ฆฬ0 ]] − ๐(๐ฅ0 )2 + ๐(๐ฅ0 )๐ธ๐ฆ0|๐ฅ0 ๐ธ๐ [๐ฆฬ0 ]. 12 (42) Notice that for the first term, the relationship ๐ธ๐ฆ0 |๐ฅ0 [๐ฆ0 ] = ∫ ๐ฆ0 Pr(y0 |๐ฅ0 ) ๐๐ฆ = ๐(๐ฅ0 ). (43) For the second term, the relationship is ๐ธ๐ฆ0 |๐ฅ0 [๐ฆ0 ๐ธ๐ [๐ฆฬ0 ]] = ๐ธ๐ [๐ฆฬ0 ]๐ธ๐ฆ0 |๐ฅ0 [๐ฆ0 ] = ๐ธ๐ [๐ฆฬ0 ]๐(๐ฅ0 ) (44) and the last term, we get ๐(๐ฅ0 )๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ [๐ฆฬ0 ] = ๐(๐ฅ0 )๐ธ๐ [๐ฆฬ0 ] (45) and thus adding all of these terms together, the first and third quantities cancel, the second and last quantities cancel so that the entire middle term equals 0. Now, we are left with the last term ๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ [(๐(๐ฅ0 ) − ๐ฆฬ0 )2 ] which we can factor out as follows = ๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ [๐(๐ฅ0 )2 ] − 2๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ [๐(๐ฅ0 )๐ฆฬ0 ] + ๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ [๐ฆฬ02 ]. (46) We are going to use another property here, which is that ๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ [(๐(๐ฅ0 ) − ๐ธ๐ [๐ฆฬ0 ])2 ] = ๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ [๐(๐ฅ0 )2 ] − 2๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ [๐(๐ฅ0 )๐ฆฬ0 ] + ๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ [๐ฆฬ0 ]2 (47) and then substitute it back into the original equation to get = ๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ [(๐(๐ฅ0 ) − ๐ธ๐ [๐ฆฬ0 ])2 ] − ๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ [๐ฆฬ02 ] + ๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ [๐ฆฬ02 ] and we can reduce it as we did in the previous quantities as 13 (48) = [(๐(๐ฅ0 ) − ๐ธ๐ [๐ฆฬ0 ])2 ] + ๐ธ๐ [๐ฆฬ02 ] − ๐ธ๐ [๐ฆฬ02 ] (49) 2 = ๐ต๐๐๐ (๐ฬ) + ๐๐๐(๐ฬ). (50) Recall that ๐ฆฬ0 = ๐ฬ(๐ฅ0 ) and so we use them interchangeably. The first quantity is called the squared bias of an estimator (not to be confused with the vernacular bias meaning being partial to something). Our estimator here is our approximating function ๐ฬ and the bias is the difference between the expected value of our estimator and the true function. Often times, it is desirable to have an unbiased estimator but, as we will soon see, this is not always the case. The variance arises from our basic probability class ๐๐๐(๐ฆฬ0 ) = ๐ธ[๐ฆฬ02 ] − ๐ธ[๐ฆฬ0 ]2 and tells us how much our function varies across different training sets. Notice also that the third quantity above is also the same one in Equation (2.5) so that we have solved for the ๐๐๐ธ(๐ฬ) here. So recapping the expected squared prediction error at ๐ฅ0 equals 2 ๐ธ๐๐ธ(๐ฅ0 ) = ๐ 2 + ๐ต๐๐๐ (๐ฬ) + ๐๐๐(๐ฬ) (51) where ๐ 2 is the irreducible error. This is a famous result in statistics and is also known as the bias-variance decomposition. This completes Equations (2.25) and (2.27). 
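To make the decomposition concrete, the following R sketch (our own simulation, not taken from the book) repeatedly draws training sets from a known model Y = f(X) + ε with f(x) = sin(2x), fits a deliberately biased linear model, and checks that the simulated expected prediction error at a point x0 is approximately σ² + Bias² + Var.

# Simulation check of EPE(x0) = sigma^2 + Bias^2 + Var (our own illustration)
set.seed(1)
f <- function(x) sin(2 * x)        # the true regression function (assumed for this demo)
sigma <- 0.5                       # standard deviation of the irreducible noise
x0 <- 1                            # the prediction point
M <- 2000                          # number of simulated training sets
yhat0 <- numeric(M)
err <- numeric(M)
for (m in 1:M) {
  x <- runif(30, -2, 2)                         # a fresh training set of size 30
  y <- f(x) + rnorm(30, sd = sigma)
  fit <- lm(y ~ x)                              # a deliberately biased (linear) fit
  yhat0[m] <- predict(fit, data.frame(x = x0))  # prediction at x0 from this training set
  err[m] <- (f(x0) + rnorm(1, sd = sigma) - yhat0[m])^2   # squared error on a new y0
}
mean(err)                                              # simulated EPE(x0)
sigma^2 + (mean(yhat0) - f(x0))^2 + var(yhat0)         # sigma^2 + Bias^2 + Var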
On page 31, Equation (2.35) gives us the equation for the likelihood of the data under the least squares model as ๐ ๐ 1 2 ๐ฟ(๐) = − log(2๐) − ๐ log ๐ − 2 ∑(๐ฆ๐ − ๐๐ (๐ฅ๐ )) . 2 2๐ ๐=1 We will derive that here. Recall that for least squares, we assume the model ๐ = ๐๐ ๐ฝ + ๐ 14 (52) where ๐ ~ ๐(0, ๐ 2 ). Notice that the density of ๐๐ is Pr(๐๐ ) = 1 ๐√2๐ exp (− ๐๐ 2 ) 2๐ 2 (53) exp (− (๐ฆ๐ − ๐ฅ๐๐ ๐ฝ)2 ) 2๐ 2 (54) then ๐๐ = ๐ฆ๐ − ๐ฅ๐ ๐ ๐ฝ so that we get the density Pr(๐ฆ๐ |๐ฅ๐ , ๐ฝ) = 1 ๐√2๐ where here we assume that ๐ฅ๐ and ๐ฝ are fixed and ๐ฆ๐ is random. We can also write the distribution of ๐ฆ๐ |๐ฅ๐ ~ ๐(๐ฅ๐๐ ๐ฝ, ๐ 2 ). Now suppose we have our training data, T, and we want to estimate the parameters. As stated in the text, we would like to find the value of ๐ฝ that maximizes the probability of the data. When we write the quantity as Pr(๐ฆ|๐, ๐ฝ), this assumes that ๐ฝ is fixed. Instead, we will call this the likelihood and rewrite it ๐ฟ(๐) = ๐(๐ฆ|๐, ๐ฝ) (55) so that now ๐ฝ is a function of the likelihood. Since we assume that our training examples our drawn independently, we can write the likelihood of the data as ๐ ๐ฟ(๐ฝ) = ∏ ๐(๐ฆ๐ |๐ฅ๐ , ๐ฝ) (56) ๐=1 ๐ (๐ฆ๐ − ๐ฅ๐๐ ๐ฝ)2 =∏ exp (− ) 2๐ 2 ๐√2๐ 1 (57) ๐=1 Maximizing any monotonic transformation of the likelihood function is also the maximum. That is, given a monotonic function ๐(๐ฅ), we have that for the likelihood ๐ฟ(๐ฅ1 ) > ๐ฟ(๐ฅ2 ) for any ๐ฅ1 , ๐ฅ2 in the domain of ๐ฟ implies that ๐(๐ฅ1 ) > ๐(๐ฅ2 ). A common transformation used in 15 statistics is the log transformation since it turns the products of the likelihood into sums for the log likelihood. By applying the log transformation, we get ๐ (๐ฆ๐ − ๐ฅ๐๐ ๐ฝ)2 log(๐ฟ(๐ฝ)) = โ(๐ฝ) = ∑ log ( ๐๐ฅ๐ (− )) 2๐ 2 √2๐๐ 1 (58) ๐=1 1 ๐ 1 = ๐ log − 2 ∑(๐ฆ๐ − ๐ฅ๐๐ ๐ฝ)2 √2๐๐ 2๐ (59) ๐=1 ๐ ๐ 1 = − log 2๐ − ๐ log ๐ − 2 ∑(๐ฆ๐ − ๐ฅ๐๐ ๐ฝ)2 2 2๐ (60) ๐=1 which gives us Equation (2.35) from the book and maximizing the above quantity is the same as minimizing the quantity ∑๐๐=1(๐ฆ๐ − ๐ฅ๐๐ ๐ฝ)2since it is again a monotonic transformation and multiplying by (-1) turns a maximization problem into a minimization. Notice the surprising fact here; we get the least squares loss function from linear regression from maximum likelihood estimation. To conclude this section, we illustrate the bias-variance tradeoff by reproducing Figure 2.11 from the book using nearest neighbors. 16 Figure 2.1 Bias-Variance Tradeoff. The left side of Figure 2.1 represents a large value of ๐ using nearest neighbors which has low variance in the test set but a high bias. The right side shows that as ๐ → 1, the prediction error on the training set falls to zero but that the error on the testing set is high and thus the model fails to generalize well. Table 2.1. R Code for Figure 2.1. 
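# Added note: no random seed is set in the code below, so the two curves will differ
# slightly from run to run; calling set.seed() before the rnorm() calls makes Figure 2.1
# reproducible. Note also that the data frames passed to knn.reg() still contain the
# response column y, so as written y is also used as one of the nearest-neighbor coordinates.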
library(FNN) # For nearest neighbor library(lattice) library(latticeExtra) library(gridExtra) # Color palette # http://www.cookbook-r.com/Graphs/Colors_%28ggplot2%29/ cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7") # Define mean squared error mse=function(x,y){return(mean((y-x)^2))} # Generate random data x1=rnorm(1000) x2=x1+rnorm(1000) x3=rnorm(1000) 17 y=x1+rnorm(1000) data=data.frame(x1,x2,x3,y) # Train/test (50/50) split train=data[1:500,] test=data[500:1000,] # Instantiate error vectors train_err=NULL test_err=NULL # Find train/test error for varying values of k for(i in 15:1) { # Nearest Neighbor model nearest_train <- knn.reg(train=train, test=train, y=y, k = i) nearest_test <- knn.reg(train=train, test=test, y=y, k = i) # MSE train_err=append(train_err,mse(nearest_train$pred,train$y)) test_err=append(test_err,mse(nearest_test$pred,test$y)) } # Plot a1=xyplot(train_err~1:15, xlab=list(label='Model Complexity',cex=0.7), ylab=list(label='Prediction Error',cex=0.7), main=list(label='Training and Test Error',cex=0.75), scales = list(x = list(draw = FALSE),y=list(draw=FALSE)), panel=function(...){ panel.lines(train_err,col=cbPalette[6]) }) a2=xyplot(test_err~1:15, panel=function(...){ panel.lines(test_err,col=cbPalette[7]) }) grid.arrange(a1+a2,ncol=1) 2.5 Problems and Solutions Exercise 2.1. Suppose each of K-classes has an associated target ๐ก๐ , which is a vector of all zeros, except a one in the kth position. Show that classifying to the largest element of ๐ฆฬ amounts to choosing the closest target,๐๐๐๐ ||๐ก๐ − ๐ฆฬ||, if the elements of ๐ฆฬ sum to one. 18 Proof: This is a classification problem defined in the book. Suppose we have a set ๐ with K-elements (or classes) that hold the standard basis in ๐ ๐พ . That is, ๐ = {(1,0, … ,0)๐ , (0,1, … ,0)๐ … , (0,0, … ,1)๐ } and ๐ก๐ ∈ ๐. In this problem, ๐ฆฬ is a ๐พdimensional vector with the element ๐ฆฬ๐ being the probability that Pr(๐ฆ๐ = ๐ก๐ ). It is not clear what model ๐ฆฬ is generated from but presumably it is regression since ๐ฆฬ is continuous. We will not place any assumptions on the model. To solve the equation, we can write it as ๐พ min||๐ก๐ − ๐ฆฬ|| = min ∑(๐ก๐,๐ − ๐ฆ๐ ) ๐ ๐ 2 ๐=1 ๐พ 2 = min ∑ ๐ก๐,๐ − 2๐ก๐,๐ ๐ฆ๐ + ๐ฆ๐2 ๐ ๐=1 Then notice here that for the first term, when ๐ = ๐, the quantity equals 1 else it is 0. Thus, 2 ∑๐ ๐ก๐,๐ = 1 for all values of ๐. Likewise, the last term ∑๐ ๐ฆ๐2 is independent of ๐ so that it is constant with respect to ๐. Finally, the middle term ∑๐ −2๐ก๐,๐ ๐ฆ๐ = −2๐ฆ๐ when ๐ = ๐ and is 0 otherwise. Note that is also varies across different values of ๐ so that it is a function of ๐. Then, we can rewrite the above function as a function of only the middle term as follows ๐พ = min ∑ −2๐ก๐,๐ ๐ฆ๐ . ๐ ๐=1 And as we stated above, this is only non-zero at = min −2๐ก๐ ๐ฆ๐ ๐ 19 ⇔ min −๐ฆ๐ ๐ and multiplying the above quantity by (−1), we can change the min to a max problem as follows = max ๐ฆ๐ ๐ and so we have shown that classifying the largest element ๐ฆฬ amounts to choosing the closest target. Exercise 2.2. Show how to compute the Bayes decision boundary for the simulation example in Figure 2.5. Proof: For this example, the data has two classes and each was generated by a separate ๐ผ mixture of Gaussians. That is, our generating density is ๐(๐๐ , 5) is a weighted sum of 10 Gaussians generated from ๐((0,1)๐ , ๐ผ). The generation process is described in the table below. They are identical except for the first step. Class 1 Class 2 1. 
10 means generated from a bivariate Gaussian ๐((0,1)๐ , ๐ผ). 2. 100 Samples selected as follows a. For each observation, ๐๐ 1 was selected with probability 10. b. Then a sample was generated from the bivariate ๐ผ Gaussian ๐(๐๐ , 5). 20 1. 10 means generated from a bivariate Gaussian ๐((1,0)๐ , ๐ผ). 2. 100 Samples selected as follows a. For each observation, ๐๐ 1 was selected with probability 10. b. Then a sample was generated from the bivariate ๐ผ Gaussian ๐(๐๐ , 5). Recall that the Bayes classifier says that we classify to the most probable class using the conditional distribution Pr(๐บ|๐). Hence, the decision boundary is the set of points that partitions the vector space into two sets; one for each class. On the decision boundary itself, the output label is ambiguous. Thus, the optimal Bayes decision boundary is defined as ๐ต๐๐ข๐๐๐๐๐ฆ = {๐ฅ: max Pr(๐|๐ = ๐ฅ) = max Pr(๐|๐ = ๐ฅ)}. ๐∈๐บ ๐∈๐บ That is, it is the set of points where the most probable class is tied between two or more classes. In our example, there are only two classes so that ๐๐๐๐(๐บ) = 2. Hence, we can rewrite the above quantity as ๐ต๐๐ข๐๐๐๐๐ฆ = {๐ฅ: Pr(๐|๐ = ๐ฅ) = Pr(๐|๐ = ๐ฅ)} = {๐ฅ: Pr(๐|๐ = ๐ฅ) = 1}. Pr(๐|๐ = ๐ฅ) We can then rewrite the above quantity by using Bayes rule as follows Pr(๐|๐ = ๐ฅ) Pr(๐ = ๐ฅ|๐)Pr(๐) /Pr(๐ = ๐ฅ) Pr(๐ = ๐ฅ|๐) Pr(๐) = = =1 Pr(๐|๐ = ๐ฅ) Pr(๐ = ๐ฅ|๐) Pr(๐) /Pr(๐ = ๐ฅ) Pr(๐ = ๐ฅ|๐) Pr(๐) because we have 100 points in each class, Pr(๐) = Pr(๐) so we can remove those from the above equation. Then the boundary is {๐ฅ: Pr(๐ = ๐ฅ|๐) = Pr(๐ = ๐ฅ|๐)} which in this example, since we know the generating density to be Gaussian, we can rewrite 10 Pr(๐ = ๐ฅ|๐) = ∏ ๐=1 1 5√2๐ 21 exp (− (๐ฅ − ๐๐ )2 ) 2 ⋅ 25 and since we know the log transformation is monotonic and preserves the ordering, we can log it and write it as 10 (๐ฅ − ๐๐ )2 log(Pr(๐ = ๐ฅ|๐)) = ∑ log ( )− . 2 ⋅ 25 5√2๐ 1 ๐=1 Now, equating class ๐ and ๐ to get the decision boundary, we get the following 10 10 (๐ฅ − ๐๐ )2 (๐ฅ − ๐๐ )2 1 ๐ต๐๐ข๐๐๐๐๐ฆ = {๐ฅ: ∑ log ( )− = ∑ log ( )− } 2 ⋅ 25 2 ⋅ 25 5√2๐ 5√2๐ 1 ๐=1 ๐=1 10 10 = {๐ฅ: ∑(๐ฅ − ๐๐ )2 = ∑(๐ฅ − ๐๐ )2 } ๐=1 ๐=1 The exact boundary for figure 2.5 would depend on our generated means. Exercise 2.4. The edge effect problem discussed on page 23 is not peculiar to uniform sampling from bounded domains. Consider inputs drawn from a spherical multinormal distribution ๐ ~ ๐(0, ๐ผ๐ ). The squared distance from any sample point to the origin has a ๐๐2 distribution with mean ๐. Consider a prediction point ๐ฅ0 drawn from this distribution, and let ๐ = ๐ฅ0 /||๐ฅ0 || be an associated unit vector. Let ๐ง๐ = ๐๐ ๐ฅ๐ be the projection of each of the training points on this direction. 1. Show that the ๐ง๐ are distributed ๐(0,1). 2. Show that the target point has expected squared distance ๐ from the origin. Hence for ๐ =10, a randomly drawn test point is about 3.1 standard deviations from the origin, while all the training points are on average one standard deviation along direction ๐. So most prediction points see themselves as lying on the edge of the training set. 22 Proof: There is a well-known property that states that a linear combination of mutually independent normal random variables is itself normal. We will not derive that proof but it can be easily found on the internet or in a probability textbook. That is, if ๐ ๐ง = ∑ ๐๐ ๐ฅ๐ ๐=1 where each ๐ฅ๐ ~ ๐(0,1) then the mean is ๐ ๐ธ[๐ง] = ∑ ๐๐ ๐ฅ๐ ๐=1 and the variance is ๐ ๐๐๐(๐ง) = ∑ ๐๐2 ๐๐๐(๐ฅ๐ ) ๐=1 and ๐ง๐ ~ ๐(๐ธ[๐ง], ๐๐๐(๐ง)). 
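Before applying this property to the two parts of the exercise, a quick simulation (our own check, using p = 10 as in the exercise) illustrates both claims numerically: the projections a^T x_i behave like draws from N(0,1), while a test point drawn from the same distribution has squared distance from the origin with mean p, i.e., it lies about sqrt(10) ≈ 3.16 standard deviations from the origin.

# Simulation check for Exercise 2.4 (our own illustration), with p = 10
set.seed(1)
p <- 10; N <- 1000
X <- matrix(rnorm(N * p), nrow = N)           # training points x_i ~ N(0, I_p)
x0 <- rnorm(p)                                # a prediction point from the same distribution
a <- x0 / sqrt(sum(x0^2))                     # the unit vector a = x0 / ||x0||
z <- as.vector(X %*% a)                       # projections z_i = a^T x_i
c(mean(z), var(z))                            # approximately 0 and 1, i.e. z_i ~ N(0, 1)
mean(rowSums(matrix(rnorm(5000 * p), ncol = p)^2))   # mean squared distance to origin ~ p
sqrt(p)                                       # a typical test point is ~3.16 sd from the origin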
For (1), ๐ง๐ = ๐๐ ๐ฅ๐ so that the expectation is ๐ธ[๐ง๐ ] = ๐ธ[๐๐ ๐ฅ๐ ] = ๐๐ ๐ธ[๐ฅ๐ ] = ๐๐ 0 by linearity since ๐ is constant and 0 is the ๐ × 1-dimensional vector of zeros. Notice here that ๐ is constant although it is randomly drawn since once we draw it, we are using the same value and hence it is no longer random. For the variance, ๐ ๐ ๐๐๐(๐ ๐ฅ๐ ) = ๐ ๐๐๐(๐ฅ๐ )๐ = ๐ ๐ ๐ผ๐ ๐ = ||๐||22 = ๐ฅ0๐ ๐ฅ0 ||๐ฅ0 ||22 = =1 ||๐ฅ0 ||22 ||๐ฅ0 ||22 by the property of variance where constant terms are squared. This leaves that the resulting distribution ๐ง๐ to have mean 0 and variance 1. 23 Using that property, ๐ง๐ = ๐๐ ๐ฅ states that ๐ง๐ is a linear combination of normal distributions implies that ๐ง๐ is itself normally distributed. We have shown that ๐ง๐ ~ ๐(0,1) and so that this completes the problem. For (2), it is not clear from the problem what the “target” point is, but presumably it is ๐ since it can be shown ๐ has a squared expected distance ๐. Since ๐ is a ๐ × 1-dimensional random vector generated from ๐(0, ๐๐ ), then the squared distance of ๐ can be written conveniently in vector form as ๐ ๐ ๐ = ∑๐๐=1 ๐ฅ๐2 . Notice that the covariance between ๐ฅ๐ and ๐ฅ๐ is 0 for all ๐, ๐ but this is not imply the independence required by the chi-squared distribution. However, since the multivariate distribution is spherical, we will assume that the author implies that they are independently drawn. Then assuming independence, each ๐ฅ๐ has mean 0 and variance of 1 and ๐ ๐ฅ๐๐ ๐ฅ๐ = ∑ ๐ฅ๐2 ~ ๐๐2 ๐=1 and so is distributed ๐๐2 with mean ๐. Exercise 2.5. Derive Equation (2.27) where the expected prediction error at a point ๐ฅ0 is as follows ๐ธ๐๐ธ(๐ฅ0 ) = ๐ธ๐ฆ0 |๐ฅ0 ๐ธ๐ (๐ฆ0 − ๐ฆฬ0 )2 = ๐๐๐(๐ฆ0 |๐ฅ0 ) + ๐ธ๐ [๐ฆฬ0 − ๐ธ๐ ๐ฆฬ0 ]2 + [๐ธ๐ ๐ฆฬ0 − ๐ฅ0๐ ๐ฝ]2 = ๐๐๐(๐ฆ0 |๐ฅ0 ) + ๐๐๐๐ (๐ฆฬ0 ) + ๐ต๐๐๐ 2 (๐ฆฬ0 ) = ๐ 2 + ๐ธ๐ ๐ฅ0๐ (๐ T ๐)−1 ๐ฅ0 ๐ 2 + 02 24 (2.27) Proof : On page 11 of this guide, we showed the derivation of lines 1-3. For the last line 4, we assume that the relationship between ๐ and ๐ is linear, ๐ = ๐๐ ๐ฝ + ๐ where ๐ ~ ๐(0, ๐ 2 ) and the model is fit by least squares. Then the first term ๐๐๐(๐ฆ0 |๐ฅ0 ) is the conditional variance and since is distributed ๐(0, ๐ 2 ) , it equals ๐ 2 . The least squares is unbiased under for the linear assumption so the last term is 0. This leaves us with the second term which we can write as follows ๐๐๐๐ (๐ฆฬ0 ) = ๐๐๐๐ (๐ฅ0๐ ๐ฝฬ ) = ๐ฅ0๐ ๐๐๐๐ (๐ฝฬ )๐ฅ0 . The multivariate covariance is a generalization of the scalar case for squaring constant terms under variance properties. That is, we used the property of covariance which is ๐ถ๐๐ฃ(๐๐ฅ + ๐) = ๐๐ถ๐๐ฃ(๐ฅ)๐๐ for ๐ฅ ∈ ๐ ๐ . Keep in mind that we are assuming the ๐ฅ๐ ’s are held fixed and ๐ฆ is random. Now, we only need to find the ๐๐๐๐ (๐ฝฬ ). Recall that ๐ฝฬ = (๐ ๐ ๐)−1 ๐ ๐ ๐ฆ and ๐ = ๐๐ฝ + ๐ then this implies that ๐ฝฬ = (๐ ๐ ๐)−1 ๐ ๐ (๐๐ฝ + ๐) ⇒ ๐ฝฬ = (๐ ๐ ๐)−1 ๐ ๐ ๐๐ฝ + (๐ ๐ ๐)−1 ๐ ๐ ๐ 25 by doing some matrix operations. Notice that the left term (๐ ๐ ๐)−1 ๐ ๐ ๐๐ฝ is non-random and so this leaves that the variance is ๐๐๐๐ (๐ฝฬ ) = ๐๐๐((๐ ๐ ๐)−1 ๐ ๐ ๐) = (๐ ๐ ๐)−1 ๐ ๐ ๐๐๐๐ (๐)๐(๐ ๐ ๐)−1 Since ๐๐๐๐ (๐) = ๐ 2 = (๐ ๐ ๐)−1 ๐ ๐ ๐(๐ ๐ ๐)−1 ๐ 2 = (๐ ๐ ๐)−1 ๐ 2 then this implies that ๐๐๐๐ (๐ฆฬ0 ) = ๐ฅ0๐ (๐ ๐ ๐)−1 ๐ฅ0๐ ๐ 2 and since none of the remaining variables are random with respect to the training data, this is equivalent to = ๐ธ๐ ๐ฅ0๐ (๐ ๐ ๐)−1 ๐ฅ0๐ ๐ 2 which concludes the problem. 2. 
Derive equation (2.28), making use of the cyclic property of the trace operator [๐ก๐๐๐๐(๐ด๐ต) = ๐ก๐๐๐๐(๐ต๐ด)], and its linearity (which allows us to interchange the order of trace and expectation). ๐ธ๐ฅ0 ๐ธ๐๐ธ(๐ฅ0 ) ~ ๐ธ๐ฅ0 ๐ฅ0๐ ๐ถ๐๐ฃ(๐)−1 ๐ฅ0 ๐ 2 /๐ + ๐ 2 = trace[๐ถ๐๐ฃ(๐)−1 ๐ถ๐๐ฃ(๐ฅ0 )]๐ 2 /๐ + ๐ 2 26 ๐ = ๐2 ( ) + ๐2 ๐ (2.28) which is the formula for expected prediction error. Our assumption here is that the matrix ๐ has mean 0 along the columns. Then, by the definition of covariance matrices, the covariance of a random vector ๐ is ๐ถ๐๐ฃ(๐) = ๐ธ[(๐ − ๐)(๐ − ๐)๐ ] = ๐ธ[๐๐ ๐ ] − ๐๐ ๐ where ๐ = 0 here since our matrix is centered. Then we get our sample covariance matrix to be ฬ (๐) = ๐ถ๐๐ฃ ๐๐ ๐ . ๐ To see this, recall that one way to view matrix products is ๐ ๐ ๐ = N − − [− ๐ฑ1๐ ๐ฑ 2๐ โฎ ๐ฑ ๐๐ ๐ฑ1๐ ๐ฑ1 ๐ ๐ฑ 2๐ ๐ฑ1 = ๐ โฎ ๐ ๐ฑ ๐ ๐ฑ1 [ ๐ − − −] | [๐ฑ 1 | ๐ฑ1๐ ๐ฑ 2 ๐ ๐ฑ 2๐ ๐ฑ 2 ๐ โฎ ๐ ๐ฑ ๐ ๐ฑ1 ๐ | ๐ฑ2 | โฏ | ๐ฑ ๐ ] /๐ | ๐ฑ1๐ ๐ฑ ๐ โฏ ๐ ๐ฑ 2๐ ๐ฑ ๐ โฏ ๐ โฑ โฎ ๐ ๐ฑ๐ ๐ฑ๐ โฏ ๐ ] ฬ (๐1 , ๐1 ) ๐ถ๐๐ฃ ฬ (๐1 , ๐2 ) โฏ ๐ถ๐๐ฃ ฬ (๐1 , ๐๐ ) ๐ถ๐๐ฃ ฬ ฬ ฬ = ๐ถ๐๐ฃ (๐2 , ๐1 ) ๐ถ๐๐ฃ(๐2 , ๐2 ) โฏ ๐ถ๐๐ฃ(๐2, ๐๐ ) โฎ โฎ โฑ โฎ ฬ ฬ ฬ ๐ถ๐๐ฃ (๐ , ๐ ) ๐ถ๐๐ฃ (๐ , ๐ ) โฏ ๐ถ๐๐ฃ (๐ [ ๐ 1 ๐ 2 ๐ , ๐๐ )] So that in the first line, if ๐ is large and assuming ๐ธ(๐) = 0, then ๐ ๐ ๐ → ๐๐ถ๐๐ฃ(๐). 27 For the second line, ๐ธ๐ฅ0 ๐ฅ0๐ ๐ถ๐๐ฃ(๐)−1 ๐ฅ0 is scalar, thus we can use the linearity property where the expectation distributes over the trace sum as follows ๐ธ๐ฅ0 trace[๐ฅ0๐ ๐ถ๐๐ฃ(๐)−1 ๐ฅ0 ] = trace[๐ธ๐ฅ0 ๐ฅ0๐ ๐ถ๐๐ฃ(๐)−1 ๐ฅ0 ] and then use the cyclical property of the trace operator to rearrange terms to get = trace[๐ธ๐ฅ0 ๐ฅ0 ๐ฅ0๐ ๐ถ๐๐ฃ(๐)−1 ] and thus trace[๐ธ๐ฅ0 [๐ฅ0 ๐ฅ0๐ ]๐ถ๐๐ฃ(๐)−1 ] = trace[๐ถ๐๐ฃ(๐) ⋅ ๐ถ๐๐ฃ(๐)−1 ] = trace[๐๐ ] =๐ ๐ which gives us that trace[๐ถ๐๐ฃ(๐)−1 ๐ถ๐๐ฃ(๐ฅ0 )]๐ 2 /๐ = ๐ 2 ( ) and thus we have finished the ๐ problem. It is important to understand the motivation behind the problem – that is, for linear regression models, the expected squared prediction error grows by a factor of ๐ so that it is relatively small when ๐ is large. Exercise 2.7. Suppose we have a sample of N pairs ๐ฅ๐ , ๐ฆ๐ drawn i.i.d. from the distribution characterized as follows: ๐ฅ๐ ~ โ(๐ฅ), ๐กโ๐ ๐๐๐ ๐๐๐ ๐๐๐๐ ๐๐ก๐ฆ ๐ฆ๐ = ๐(๐ฅ๐ ) + ๐๐ , ๐ ๐๐ ๐กโ๐ ๐๐๐๐๐๐ ๐ ๐๐๐ ๐๐ข๐๐๐ก๐๐๐ ๐๐ ~ (0, ๐ 2 ), (๐๐๐๐ ๐ง๐๐๐, ๐ฃ๐๐๐๐๐๐๐ ๐ 2 ) 28 We construct an estimator for f linear in the ๐ฆ๐ , ๐ ๐ฬ(๐ฅ0 ) = ∑ ๐๐ (๐ฅ0 ; ๐)๐ฆ๐ , ๐=1 where the weights ๐๐ (๐ฅ0 ; ๐) do not depend on the ๐ฆ๐ but do depend on the entire training sequence of ๐ฅ๐ denoted here by X. (a). Show that linear regression and k-nearest-neighbor regression are members of this class of estimators. Describe explicitly the weights ๐๐ (๐ฅ0 ; ๐) in each of these cases. For linear regression, the coefficient vector ๐ฝฬ = 〈๐ฝฬ0 , … , ๐ฝฬ๐ 〉 depends on the entire training set ๐ through its derivation ๐ฝฬ = (๐ ๐ ๐)−1 ๐ ๐ ๐ฆ since each training example is a row in the design matrix ๐= − − [− ๐ฑ1๐ ๐ฑ 2๐ โฎ ๐ฑ ๐๐ − − . −] Then, to make a prediction at a point ๐ฅ0 , we would need to solve the following ๐ฬ(๐ฅ0 ) = ๐ฅ0๐ ๐ฝฬ = ๐ฅ0๐ (๐ ๐ ๐)−1 ๐ ๐ ๐ฆ. Define [(๐ ๐ ๐)−1 ๐ ๐ ]๐๐ to be the jth row of (๐ ๐ ๐)−1 ๐ ๐ . Then the weight vector is ๐๐ (๐ฅ0 ; ๐) = ๐ฅ01 [(๐ ๐ ๐)−1 ๐ ๐ ]1๐ + ๐ฅ02 [(๐ ๐ ๐)−1 ๐ ๐ ]๐2 + โฏ + ๐ฅ0๐ [(๐ ๐ ๐)−1 ๐ ๐ ]๐๐ 29 ๐ = ∑(๐ฅ0๐ (๐ ๐ ๐)−1๐ ๐ )๐ ๐=1 so that we have shown that the weight vector depends on the entire training set but not on ๐ฆ๐ . For k-nearest-neighbor, the function is as follows ๐ฬ(๐ฅ) = 1 ๐ ∑ ๐ฆ๐ . 
๐ฅ๐ ∈๐๐ (๐ฅ) Then the weight is an indicator function ๐๐ (๐ฅ0 ; ๐) = 1[๐ฅ๐ ∈ ๐๐ (๐ฅ)] where recall we defined a set ๐ท = {๐(๐ฅ๐ , ๐ฅ0 ): ๐ฅ๐ ∈ ๐} where ๐(๐ฅ0 , ๐ฅ๐ ) is any metric. We also defined ๐๐ (๐ฅ0 ) = {๐ฅ๐ : ๐(๐ฅ๐ , ๐ฅ0 ) ≤ ๐๐ where dk is the ๐th smallest element of ๐ท, ๐ฅ๐ ∈ ๐}. Notice here that to construct the set ๐ท for any new point ๐ฅ0 , we have to search through the entire training set and thus the weight depends on the entire set X. (b). Decompose the conditional mean-squared error 2 ๐ธ๐ฆ|๐ฅ (๐(๐ฅ0 ) − ๐ฬ(๐ฅ0 )) into a conditional squared bias and a conditional variance component. Let ๐, ๐ represent the entire training sequence of ๐ฆ๐ . Proof: For readability, assume the following: ๐ = ๐(๐ฅ0 ) and ๐ฬ(๐ฅ0 ) = ๐ฬ, and recall that only ๐ฬ depends on the training set. Also, refer back to page 11 of the thesis for clarity, since we solved the same problem there. Then 2 ๐ธ๐ฆ|๐ฅ (๐ − ๐ฬ) = ๐ 2 − 2๐๐ธ๐ฆ|๐ฅ (๐ฬ) + ๐ธ๐ฆ|๐ฅ (๐ฬ 2 ) 30 2 2 = ๐ธ๐ฆ|๐ฅ (๐ − ๐ธ๐ฆ|๐ฅ (๐ฬ)) + ๐ธ๐ฆ|๐ฅ (๐ฬ 2 ) − ๐ธ๐ฆ|๐ฅ (๐ฬ) 2 = ๐ต๐๐๐ ๐ฆ|๐ฅ (๐ฬ) + ๐๐๐๐ฆ|๐ฅ (๐ฬ) (c). Decompose the unconditional mean-squared error. Proof: Using the same notation above where ๐ = ๐(๐ฅ0 ) and ๐ฬ(๐ฅ0 ) = ๐ฬ, then again, only ๐ฬ depends on the training data and we get 2 ๐ธ๐ฆ,๐ฅ (๐ − ๐ฬ) = ๐ 2 − 2๐๐ธ๐ฆ,๐ฅ (๐ฬ) + ๐ธ๐ฆ,๐ฅ (๐ฬ 2 ) 2 2 = ๐ธ๐ฆ,๐ฅ (๐ − ๐ธ๐ฆ,๐ฅ (๐ฬ)) + ๐ธ๐ฆ,๐ฅ (๐ฬ 2 ) − ๐ธ๐ฆ,๐ฅ (๐ฬ) 2 = ๐ต๐๐๐ ๐ฆ,๐ฅ (๐ฬ) + ๐๐๐๐ฆ,๐ฅ (๐ฬ) 31 CHAPTER 3 LINEAR METHODS FOR REGRESSION 3.1 Introduction This section goes over linear regression models, subset selection, shrinkage methods, and methods using derived input directions. 3.2 Linear Regression Models and Least Squares On page 47, Equation (3.11) states that linear regression model ๐ = ๐๐ฝ + ๐ where the error ๐ ~ ๐(0, ๐ 2 ) implies that ๐ฝฬ ~ ๐(๐ฝ, (๐ ๐ ๐)−1 ๐ 2 ). We will show that this statement holds. Consider the following estimate of ๐ฝฬ ๐ฝฬ = (๐ ๐ ๐)−1 ๐ ๐ ๐ฆ (61) where ๐ฆ is defined above, then this implies that ๐ฝฬ = (๐ ๐ ๐)−1 ๐ ๐ (๐๐ฝ + ๐) (62) = (๐ ๐ ๐)−1 ๐ ๐ ๐๐ฝ + (๐ ๐ ๐)−1 ๐ ๐ ๐ (63) = ๐ฝ + (๐ ๐ ๐)−1 ๐ ๐ ๐ (64) 32 Then we can use the property that a linear transformation of a multivariate normal random vector also has a multivariate normal distribution. We won’t prove this fact here but it states that given ๐~๐(๐, ๐) and suppose we have a linear transformation of ๐ ๐ฆ = ๐ด + ๐๐ then ๐ also has a multivariate normal distribution with mean ๐ธ[๐ฆ] = ๐ด + ๐๐ and covariance matrix ๐๐๐[๐ฆ] = ๐๐๐ ๐ . Substituting this above, this implies that ๐ฝฬ ~ ๐(๐ฝ, (๐ ๐ ๐)−1 ๐ ๐ ๐ 2 ) ๐ธ[๐ฝฬ ] = ๐ฝ + (๐ ๐ ๐)−1 ๐ ๐ 0 =๐ฝ (65) ๐๐๐(๐ฝฬ ) = (๐ ๐ ๐)−1 ๐ ๐ ๐(๐ ๐ ๐)−1 ๐ 2 = (๐ ๐ ๐)−1 ๐ 2 (66) ⇒ ๐ฝฬ ~ ๐(๐ฝ, (๐ ๐ ๐)−1 ๐ 2 ) (67) and so we have shown Equation 3.10. 3.3 Shrinkage Methods On page 54, Algorithm 3.1 has the Gram-Schmidt procedure for multiple regression. We will use the Gram-Schmidt procedure to show how this leads to the QR decomposition Equation 33 (3.31). Then we will perform the QR decomposition on a small 3 × 3 matrix in R and show how it is beneficial in describing least squares. The QR decomposition is ๐ = ๐๐. (68) If ๐ ∈ ๐ ๐×๐+1 then ๐ ∈ ๐ ๐×๐+1 and is an orthogonal matrix (i.e.๐๐ ๐ = ๐ผ) and ๐ ∈ ๐ ๐×๐ where ๐ is an upper triangular matrix. It is important to remember that an orthogonal matrix is a matrix with orthogonal unit column vectors rather than just orthogonal columns as the name suggests. Often times orthogonal matrices are assumed to be square, however, it is not the case in the book. Recall the algorithm is as follows: Table 3.1. Algorithm for Gram-Schmidt Orthogonalization. 
๐ง 1. We initialize ๐ง0 = ๐ฅ0 = 1, ๐0 = ||๐ง0 || 0 2. For ๐ = 1,2, … , ๐ a. Dot product ๐ฅ๐ and ๐0 , ๐1 , … , ๐๐−1 to produce coefficients ๐พฬ๐๐ = ๐๐ ๐ฅ๐ ,โ = 0, … , ๐ − ๐−1 1 and residual vector ๐ง๐ = ๐ฅ๐ − ∑๐=1 ๐พฬ๐๐ ๐๐ 3. Regress ๐ฆ on the residual ๐ง๐ to give the estimate for ๐ฝฬ๐ . Suppose we have ๐ ∈ ๐ ๐×๐+1 as follows | ๐ฅ ๐=[ 0 | | ๐ฅ1 | 34 … | … ๐ฅ๐ ] … | then ๐ง ๐ง0 = ๐ฅ0 = 1 ๐0 = ||๐ง 0|| ๐ง1 = ๐ฅ1 − (๐ฅ1 ⋅ ๐0 )๐0 ๐1 = ||๐ง 1|| ๐ง2 = ๐ฅ2 − (๐ฅ2 ⋅ ๐0 )๐0 − (๐ฅ2 ⋅ ๐1 )๐1 ๐2 = ||๐ง 2|| ๐ง๐ = ๐ฅ๐ − (๐ฅ๐ ⋅ ๐0 )๐0 − (๐ฅ๐ ⋅ ๐1 )๐1 − โฏ − (๐ฅ๐ ⋅ ๐๐−1 )๐๐−1 ๐๐ = 0 2 ๐ง 1 2 ๐ง 2 2 ๐ง๐ ||๐ง๐ ||2 and so that the resulting matrix is | ๐ฅ ๐=[ 0 | | ๐ฅ1 | … … … | | ๐ฅ๐ ] = [๐0 | | | ๐1 | ๐ฅ0 ⋅ ๐0 … | 0 … ๐๐ ] [ โฎ … | 0 ๐ฅ1 ⋅ ๐0 ๐ฅ1 ⋅ ๐1 โฎ 0 … … โฑ … ๐ฅ๐ ⋅ ๐0 ๐ฅ๐ ⋅ ๐1 ] = ๐๐ โฎ ๐ฅ๐ ⋅ ๐๐ For example, consider the matrix 1 1 ๐ = [1 1 1 0 0 1 1] , ๐ฆ = [2] 1 3 We will use this matrix to show that solving least squares by the QR decomposition is equal to the normal equation. Table 3.2. R Code for Gram-Schmidt Orthogonalization. # Create Matrix x0=c(1,1,1) x1=c(1,1,0) x2=c(0,1,1) y=c(1,2,3) X=data.frame(x0,x1,x2,y) 35 # Create empty matrices Q and R Q=R=matrix(0,nrow=3,ncol=3) # Perform Operations # First iteration z0=x0 e0=z0/(sqrt(z0%*%z0)) Q[,1]=e0 # Second iteration z1=x1-x1%*%e0*e0 e1=z1/(sqrt(z1%*%z1)) Q[,2]=e1 # Third iteration z2=x2-x2%*%e0*e0-x2%*%e1*e1 e2=z2/(sqrt(z2%*%z2)) Q[,3]=e2 # Fill in R matrix R[1,1]=x0%*%e0 R[1,2]=x1%*%e0 R[1,3]=x2%*%e0 R[2,2]=x1%*%e1 R[2,3]=x2%*%e1 R[3,3]=x2%*%e2 # Check the matrix Q%*%R # Beta Hat solve(R)%*%t(Q)%*%y # Check using linear regression lm(y~x0+x1+x2-1,data=X) Where the solution for ๐ฝฬ under the QR Decomposition is ๐ฝฬ = ๐−1 ๐๐ ๐ฆ (69) and gives the coefficient vector identical to the one fit by the R function lm() 2 ๐ฝฬ = [−1]. −1 In order to solve from Equation (3.31) given as 36 (70) ๐ = ๐๐ (71) ๐ฝฬ = ๐−1 ๐๐ ๐ฆ (72) to equation (3.32) we can just use basic matrix operations. That is ๐ = ๐๐, and ๐ฝฬ = (๐ ๐ ๐)−1 ๐ ๐ ๐ฆ then, ๐ฝฬ = (๐๐ ๐๐ ๐๐)−1 ๐๐ ๐๐ ๐ฆ = (๐๐ ๐)−1 ๐๐ ๐๐ ๐ฆ = ๐−1 ๐๐ −1 ๐๐ ๐๐ ๐ฆ = ๐−1 ๐๐ ๐ฆ (73) and so we get Equation (3.32). By decomposing ๐ this way, we can save a lot on computation. Recall that the column space of the matrix ๐ is the set of all possible linear combinations of its column vectors. More formally, suppose ๐ ∈ ๐ ๐×๐ , then the column space of ๐ is ๐ถ(๐) = {๐ฃ: ๐๐ = ๐ฃ, ∀๐ ∈ ๐ ๐ }. Q is in the column space of ๐, as seen by ๐ = ๐๐ ⇒ ๐๐−1 = ๐, and it provides an orthonormal basis in the column space of ๐. Recall that the geometric interpretation of least squares fitted vector was to orthogonally project ๐ฆ into the column space of ๐ by using the projection matrix of ๐ (also known as hat matrix). Now with the QR decomposition, we can just project ๐ฆ onto ๐ to get the fitted vector ๐ฆฬ which turns out to be ๐(๐๐ ๐)−1 ๐๐ ๐ฆ = ๐๐๐ ๐ฆ 37 (74) and so we can save considerably on computation by not inverting the matrix (๐ ๐ ๐)−1. We can check this is true by using Equation (3.32) ๐ฝฬ = ๐๐ ๐๐ ๐ฆ (75) ⇒ ๐๐ฝฬ = ๐๐๐๐ ๐๐ ๐ฆ (76) = ๐๐๐ ๐ฆ (77) which is the same as the projection. We will also show in Exercise 3.9 that we can use the QR decomposition to implement the forward selection algorithm efficiently. On page 64, Equation (3.44) gives the solution to the ridge regression coefficient as ๐ฝฬ ๐๐๐๐๐ = (๐ ๐ ๐ + ๐๐ผ)−1 ๐ ๐ ๐ฆ. (78) We will derive this here so that we can show other properties of ridge regression. 
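Before the derivation, the closed form itself is easy to try numerically. The sketch below is our own example on simulated, standardized inputs; it evaluates (3.44) directly for a few values of λ and shows the coefficients shrinking toward zero as λ grows.

# Evaluating the ridge closed form (3.44) on simulated, standardized inputs (our own example)
set.seed(1)
n <- 100; p <- 3
X <- scale(matrix(rnorm(n * p), nrow = n))            # standardized inputs
y <- as.vector(X %*% c(2, -1, 0.5) + rnorm(n))
y <- y - mean(y)                                      # centered response, so no intercept
ridge_beta <- function(lambda) solve(t(X) %*% X + lambda * diag(p)) %*% t(X) %*% y
round(cbind(ridge_beta(0), ridge_beta(10), ridge_beta(100)), 3)
# lambda = 0 reproduces ordinary least squares; larger lambda shrinks every coefficient toward 0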
Equation (3.43) shows that the loss function for ridge regression is as follows ๐ ๐๐(๐) = (๐ฆ − ๐๐ฝ)๐ (๐ฆ − ๐๐ฝ) + ๐๐ฝ ๐ ๐ฝ = ๐ฆ ๐ ๐ฆ − 2๐ฝ ๐ ๐ ๐ ๐ฆ + ๐ฝ ๐ ๐ ๐ ๐๐ฝ + ๐๐ฝ ๐ ๐ฝ (79) (80) and then set the gradient to 0 and solve ∇๐ฝ ๐ ๐๐(๐) = −2๐ ๐ ๐ฆ + 2๐ ๐ ๐๐ฝ + 2๐๐ฝ = 0 (81) ⇒ (๐ ๐ ๐ + ๐๐ผ)๐ฝ = ๐ ๐ ๐ฆ (82) ⇒ ๐ฝฬ ๐๐๐๐๐ = (๐ ๐ ๐ + ๐๐ผ)−1 ๐ ๐ ๐ฆ (83) and so we have shown Equation (3.44). 38 Now, we will introduce the singular value decomposition of the centered matrix ๐. The author does not define centered so we will make explicit here what he means. Centering a matrix means that we normalize the columns to have mean 0 and variance 1. That is we use the following algorithm, Table 3.2. Algorithm for Centering. 1 1. Let ๐๐ = ๐ ∑๐๐=1 ๐ฅ๐๐ 2. Replace each ๐ฅ๐ ∈ ๐ with ๐ฅ๐ − ๐๐ 1 3. Let ๐๐2 = ๐−1 ∑๐๐=1(๐ฅ๐๐ ) 2 4. Replace each ๐ฅ๐ with ๐ฅ๐ /๐๐ . Once we do that, we can decompose matrix ๐ to its singular value decomposition ๐ = ๐๐๐ ๐ . (84) The actual decomposition is quite complicated so we will not go over derive it here. Here, ๐ and ๐ are ๐ × ๐ and ๐ × ๐ orthogonal matrices with the columns of ๐ spanning the column space of ๐ and the columns of ๐ spanning the row space. ๐ is a ๐ × ๐ diagonal matrix, with diagonal entries ๐1 ≥ ๐2 ≥ โฏ ≥ ๐๐ ≥ 0 and if one or more ๐๐ = 0, then ๐ is singular. We can use the singular value decomposition to represent the matrix ๐ ๐ ๐. ๐ ๐ ๐ = (๐๐๐ ๐ )๐ ๐๐๐ = ๐๐๐๐ ๐๐๐ ๐ = ๐๐๐ ๐ ๐ 39 (85) and use this as in page 66, Equation (3.48) which states that the eigen decomposition of ๐ ๐ ๐ is ๐ ๐ ๐ = ๐๐2 ๐ ๐ . (86) ฬ (๐) = To show that this is the eigen decomposition, recall that the sample covariance matrix ๐ถ๐๐ฃ ๐๐๐ ๐๐๐ since the matrix has been pre-centered. Then by the definition of covariance matrices, ๐ is ๐ symmetric which implies that ๐ ๐ ๐ is symmetric. One property of symmetric matrices we will use is that the eigen vectors of symmetric matrices are orthonormal. We will not go over the proof but it can be found in a linear algebra textbook. Since the inverse of an orthogonal matrix is its transpose (for an orthogonal matrix ๐, by definition ๐๐ ๐ = ๐ then taking the inverse, ๐๐ ๐๐−1 = ๐−1 implies that ๐๐ = ๐−1), we can write it in eigen form as ๐ ๐ ๐๐ = ๐๐2 (87) ⇒ ๐ ๐ ๐๐๐ −1 = ๐๐2 ๐ −1 (88) = ๐ ๐ ๐ = ๐๐2 ๐ ๐ (89) so we get Equation 3.48. Here, the matrix ๐ holds the eigenvectors of ๐ ๐ ๐ and the matrix ๐2 are the eigenvalues. What happens if we multiply ๐ and ๐? Recall that the ๐๐ ’s in ๐ are ordered ๐1 ≥ ๐2 ≥ โฏ ≥ ๐๐ ≥ 0 and we multiply ๐๐ฃ๐ as in Equation (3.49). Then we can look at its variance as follows ๐๐๐(๐๐ฃ๐ ) = ๐ฃ๐๐ ๐๐๐(๐)๐ฃ๐ (90) ๐๐ ๐ ) ๐ฃ๐ ๐ (91) = ๐ฃ๐๐ ( 40 = ๐ฃ๐๐ ๐๐2 ๐ ๐ ๐ฃ๐ /๐ (92) recall that ๐ is an orthogonal matrix which implies that (๐ฃ๐ ⋅ ๐ฃ๐ ) = 0, ∀๐ ≠ ๐, we get = ๐ฃ๐๐ ๐ฃ๐ ๐๐2 ๐ฃ๐๐ก ๐ฃ๐ /๐ (93) and since ๐ has orthonormal columns, ๐ฃ๐๐ ๐ฃ๐ = 1 and the above implies ๐๐๐(๐๐ฃ๐ ) = ๐๐2 ๐ (94) which completes Equation (3.49). This equation is stating that by taking the eigenvectors of matrix ๐ and multiplying it by the design matrix ๐, the variance is proportional to the eigenvalues. This variance is sorted since the ๐๐ ’s of ๐ are sorted. In fact, let ๐ง๐ = ๐๐ฃ๐ , then ๐ง๐ has the ๐th largest variance among all normalized linear combinations of the columns of ๐ subject to being orthogonal to the earlier ones. We will prove this in the next section. Since we have this nice property, the matrix ๐ gets a special name and is called the principal components of the variables in ๐. 
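The following R sketch (our own illustration on simulated, standardized data) confirms these relationships numerically: the eigenvalues of X^T X equal the squared singular values of X, and the sample variance of the first derived variable z_1 = X v_1 equals d_1^2 / N.

# Numerical check that X^T X = V D^2 V^T and Var(X v_1) = d_1^2 / N (our own illustration)
set.seed(1)
N <- 200; p <- 4
X <- scale(matrix(rnorm(N * p), nrow = N))    # centered and standardized inputs
s <- svd(X)                                   # X = U D V^T
e <- eigen(t(X) %*% X)                        # eigen decomposition of X^T X
round(s$d^2 - e$values, 6)                    # squared singular values = eigenvalues (all ~0)
z1 <- as.vector(X %*% s$v[, 1])               # first principal component z_1 = X v_1
sum(z1^2) / N                                 # variance of z_1 (mean is exactly 0) ...
s$d[1]^2 / N                                  # ... equals d_1^2 / N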
To tie this together, we will derive Equation (3.47) of ridge regression using the singular value decomposition. Showing this requires a lot of matrix operations and can be seen as follows ๐๐ฝฬ ๐๐๐๐๐ = ๐(๐ ๐ ๐ + ๐๐)−1 ๐ ๐ ๐ฆ = ๐๐๐ ๐ ((๐๐๐ ๐ )−1 ๐๐๐ ๐ + ๐๐)−1 (๐๐๐ ๐ )๐ ๐ฆ = ๐๐๐ ๐ (๐๐2 ๐ ๐ + ๐๐)−1 ๐๐๐๐ ๐ฆ = ๐๐๐ ๐ ๐ ๐ −1 (๐−2 )๐ −1 ๐๐๐๐ ๐ฆ + ๐๐๐ ๐ (๐−1 ๐ผ)๐๐๐๐ ๐ฆ 41 (95) = ๐๐(๐−2 )๐๐๐ ๐ฆ + ๐๐(๐−1 ๐)๐๐๐ ๐ฆ = ๐๐(๐2 + ๐๐)−1 ๐๐๐ ๐ฆ ๐ = ∑ ๐ข๐ ๐=1 ๐๐2 ๐ข๐ ๐ฆ ๐๐2 + ๐ ๐ (96) ๐2 ๐ so that we have shown equation (3.47). By looking at the middle term ๐2 +๐ we can see what ๐ ridge regression does. When ๐ = 0, we get the least squares estimates ๐๐2 ๐๐2 = 1. Ridge regression ๐2 ๐ gives a value ๐2 +๐ < 1. Recall what the ๐๐ ’s mean here; they are the eigenvalues of ๐ ๐ ๐ and we ๐ they are the variance of ๐ in the direction of its principal components ๐ง๐ = ๐๐ฃ๐ . Thus for ridge regression, smaller values of ๐๐ and thus the smaller variances in the direction of its principal components are shrunken most. To prove this, we have the original equation ๐๐2 ๐๐2 + ๐ . (97) Since ๐1 > ๐2 implies that ๐12 > ๐22 since the square function is monotonic for non-negative values, then we need to show that ๐22 ๐12 − ๐ ๐12 = < . ๐22 + ๐ ๐12 − ๐ + ๐ ๐12 + ๐ (98) Where ๐ is some small constant 0 < ๐ ≤ ๐12 and ๐ is the parameter in ridge regression ๐ > 0. By cross multiplying the function above, we get the following (๐12 − ๐)(๐12 + ๐) < ๐12 (๐12 − ๐ + ๐) 42 ๐14 + ๐๐12 − ๐๐12 − ๐๐ < ๐14 − ๐๐12 + ๐๐12 ⇒ −๐๐ < 0 (99) and since ๐, ๐ > 0 this is last inequality is true and so the statement that smaller directions of the principal components are shrunken the most holds. 3.4 Methods Using Derived Input Directions On page 81, Equation (3.63) we see that the mth principal component direction ๐ฃ๐ solves max ๐๐๐(๐๐ผ) ๐ผ subject to ||๐ผ|| = 1, ๐ผ ๐ ๐๐ฃโ = 0, โ = 1, … , ๐ − 1. (100) We will show that ๐ผ here are the eigenvalues; that is, we will show ๐ = ๐ฃ๐ . Recall from the text that our design matrix ๐ is standardized so that that we can rewrite the function ๐๐๐(๐๐ผ ๐ ) = ๐๐ ๐๐๐(๐)๐ = ๐๐ ๐ ๐ ๐๐. A standard way of solving optimization problems with equality constraints is by including the constraints in the objective function; this is known as the Lagrangian and it can often be found in a calculus book. So rewriting the above quantity in Lagrangian form, we get the following ๐ฟ(๐, ๐) = ๐๐ ๐ ๐ ๐๐ − ๐๐๐ ๐ (101) where ๐ is called the Lagrange multiplier. Then, to find the optimal point, we have to set the gradient of the Lagrangian to zero and solve for ๐ฅ. ∇๐ ๐ฟ(๐, ๐) = 2๐ ๐ ๐๐ − 2๐๐ = 0. 43 (102) Notice here that we get the standard eigen form ๐ ๐ ๐๐ = ๐๐. This shows that the only points which can maximize the functions are the eigenvectors of ๐ ๐ ๐ and we showed in the previous section that this is the matrix ๐ under the ๐๐๐ท decomposition. Since ๐ is an orthogonal matrix, we know that the second constraint (๐ผ ๐ ๐๐ฃโ = 0) holds so we have solved Equation (3.63). 3.5 Problems and Solutions Exercise 3.2. Given data on two variables ๐ and ๐, consider fitting a cubic polynomial regression model ๐(๐) = ∑3๐=0 ๐ฝ๐ ๐๐ . In addition to plotting the fitted curve, you would like a 95% confidence band about the curve. Consider the following two approaches: (1). At each point ๐ฅ0 , form a 95% confidence interval for the linear function at ๐๐ ๐ฝ = ∑3๐=0 ๐ฝ๐ ๐ฅ0๐ . 
Proof: For this problem, the data matrix is ๐ × 4 matrix as follows 1 ๐ = [โฎ 1 ๐ฅ๐1 โฎ ๐ฅ๐1 2 ๐ฅ๐1 โฎ 2 ๐ฅ๐1 3 ๐ฅ๐1 โฎ ] 3 ๐ฅ๐1 and ๐ฝฬ is estimated in the usual least-squares way by setting up the normal equations. Then, to generate confidence intervals around an estimated mean ๐ฬ(๐ฅ0 ) = ๐ธ[๐ฆฬ 0 |๐ฅ0 ], use the Wald confidence interval as follows 95% Confidence Interval(๐ฅ0 ) = ๐ฅ0๐ ๐ฝฬ ± 1.96 ⋅ √๐๐๐(๐ฅ0๐ ๐ฝฬ ) where the variance from Exercise 2.5 is as follows ๐๐๐(๐ฅ0๐ ๐ฝฬ ) = ๐ฅ0๐ ๐๐๐(๐ฝฬ )๐ฅ0 = ๐ 2 ๐ฅ0๐ (๐ ๐ ๐)−1 ๐ฅ0 44 then the interval is 95% Confidence Interval(๐ฅ0 ) = ๐ฅ0๐ ๐ฝฬ ± 1.96 ⋅ ๐√๐ฅ0๐ (๐ ๐ ๐)−1 ๐ฅ0 โ (2). Form a 95% confidence set for ๐ฝ as in (3.15), which in turn generates confidence intervals for ๐(๐ฅ0 ). Proof: Equation (3.15) is as follows ๐ 2 (1−๐) Cβ = {๐ฝ|(๐ฝฬ − ๐ฝ) ๐ ๐ ๐(๐ฝฬ − ๐ฝ) ≤ ๐ฬ 2 ๐๐+1 } 2 where C๐ฝ is the confidence set and ๐๐+1 (1−๐) is the (1 − ๐)-percentile of the chi-squared distribution on (p + 1)-degrees of freedom. Then, at ๐ผ = 0.05, χ2 = 3.84 so that the interval for this particular set ๐ฝ is Cβ = {๐ฝ|(๐ฝฬ ๐ − ๐ฝ)(๐ ๐ ๐)(๐ฝฬ − ๐ฝ) ≤ ๐ฬ 2 ⋅ 3.84 } which implies that the interval for the conditional mean is ⇒ ๐ถ๐ฅ0๐ ๐ฝฬ = {๐ฅ0๐ ๐ฝฬ |๐ฝ ∈ Cβ }. Exercise 3.3. (1). Prove the Gauss–Markov theorem: the least squares estimate of a parameter ๐๐ ๐ฝ has variance no bigger than that of any other linear unbiased estimate of ๐๐ ๐ฝ (Section 3.2.2). Proof: Recall that the least squares estimator for ๐ฝฬ is ๐ฝฬ = (๐ ๐ ๐)−1 ๐ ๐ ๐ฆ. Suppose we have another unbiased linear estimator ๐ฝฬ so that ๐ฝฬ = ๐๐ฆ and let ๐ = (๐ ๐ ๐)−1 ๐ ๐ + ๐. Then we get the following 45 ๐ฝฬ = ((๐ ๐ ๐)−1 ๐ ๐ + ๐)๐ฆ = (๐ ๐ ๐)−1 ๐ ๐ ๐ฆ + ๐๐ฆ. Since least squares is unbiased and we assume the model ๐ = ๐ ๐ ๐ฝ, the following holds ๐ธ[๐ฆ] = ๐ ๐ ๐ฝ. Then, use this below to find the expectation on ๐ฝฬ so that ๐ธ[๐ฝฬ] = ๐ธ[(๐ ๐ ๐)−1 ๐ ๐ ๐ฆ] + ๐ธ[๐๐ฆ] = (๐ ๐ ๐)−1 ๐ ๐ ๐๐ฝ + ๐๐๐ฝ = (๐๐ + ๐๐)๐ฝ then ๐ฝฬ is unbiased if and only if ๐๐ = 0. Then the variance of ๐ฝฬ is as follows ๐๐๐(๐ฝฬ) = ๐๐๐(((๐ ๐ ๐)−1 ๐ ๐ + ๐)๐ฆ) Let ๐ = ((๐ ๐ ๐)−1 ๐ ๐ + ๐), then the above is equal to = ๐๐๐๐(๐ฆ)๐๐ = ๐ 2 ⋅ ๐๐๐ = ๐ 2 ⋅ ((๐ ๐ ๐)−1 ๐ ๐ + ๐)(๐(๐ ๐ ๐)−1 + ๐ ๐ ) = ๐ 2 ⋅ [(๐ ๐ ๐)−1 ๐ ๐ ๐(๐ ๐ ๐)−1 + (๐ ๐ ๐)−1 ๐ ๐ ๐ ๐ + ๐๐(๐ ๐ ๐)−1 + ๐๐ ๐ ] And above we stated that ๐๐ = 0 so we remove the 2nd and 3rd terms to reduce it to the following = ๐ 2 ⋅ [(๐ ๐ ๐)−1 ๐ ๐ ๐(๐ ๐ ๐)−1 + ๐๐ ๐ ] 46 = ๐๐๐(๐ฝฬ ) + ๐ 2 [๐๐ ๐ ] and ๐๐ ๐ is a positive semidefinite matrix since ๐๐ ๐ถ๐ถ ๐ ๐ = (๐๐ ๐ถ ⋅ ๐ถ ๐ ๐) ≥ 0, so that ๐๐๐(๐ฝฬ) > ๐๐๐(๐ฝ) holds and we complete the proof. Exercise 3.4. (1). Show how the vector of least squares coefficients can be obtained from a single pass of the Gram–Schmidt procedure (Algorithm 3.1). Represent your solution in terms of the QR decomposition of X. Proof: After a single pass of the Gram-Schmidt process, we get the following decomposition ๐ = ๐๐ where Q is an ๐ × (๐ + 1) orthogonal matrix and ๐ is an (๐ + 1) × (๐ + 1) upper triangular matrix equal. Then by least squares, we get the following equation ๐ ๐ ๐๐ฝ = ๐ ๐ ๐ฆ ⇒ (๐๐)๐ ๐๐๐ฝ = (๐๐)๐ ๐ฆ ⇒ ๐๐ ๐๐ฝ = ๐๐ ๐๐ ๐ฆ ⇒ ๐๐ฝ = ๐๐ ๐ฆ and we can solve the last equation by back-substitution since R is an upper-triangular matrix. 47 For example, ๐00 0 [ โฎ 0 ๐01 ๐11 โฑ 0 โฏ ๐0๐ ๐ฝ0 ๐0๐ ๐ฆ โฏ ๐1๐ ๐ฝ1 ๐๐ ๐ฆ ][ ] = 1 โฑ โฎ โฎ โฎ ๐ โฏ ๐๐๐ ๐ฝ๐ [๐๐ ๐ฆ] ๐๐ฆ ๐๐ the last element is ๐ฝฬ๐ = ๐ , and solving the next elements in order by back-substitution ๐๐ ๐ฝฬ๐ , ๐ฝฬ๐−1 , … ๐ฝฬ0 . Exercise 3.5. (1). Consider the ridge regression problem (3.41). 
Show that this problem is equivalent to the problem ๐ ๐ ฬ๐ ๐ฝ ๐ 2 = arg min {∑[๐ฆ๐ − ๐ฝ0๐ − ∑(๐ฅ๐๐ − ๐ฅฬ ๐ )๐ฝ๐๐ ]2 + ๐ ∑ ๐ฝ๐๐ } ๐ฝ๐ ๐=1 ๐=1 ๐=1 Give the correspondence between ๐ฝ ๐ and the original ๐ฝ in (3.41). ๐ ๐ ๐ ๐ฝฬ ๐๐๐๐๐ = ๐๐๐ ๐๐๐ {∑[๐ฆ๐ − ๐ฝ0 − ∑ ๐ฅ๐๐ ๐ฝ๐ ]2 + ๐ ∑ ๐ฝ๐2 } ๐ฝ ๐=1 ๐=1 (3.41) ๐=1 Proof: For the first equation above, we can expand out the middle summation so that we get the following ๐ ๐ ๐ ∑(๐ฅ๐๐ − ๐ฅฬ ๐ )๐ฝ๐๐ = ∑ ๐ฅ๐๐ ๐ฝ๐๐ − ∑ ๐ฅฬ ๐ ๐ฝ๐๐ ๐=1 ๐=1 ๐=1 and we get the correspondence between ๐ฝ0๐ and ๐ฝ0 ๐ ๐ฝ0 = ๐ฝ0๐ + ∑ ๐ฅฬ ๐ ๐ฝ๐๐ . ๐=1 48 That is, ๐ฝ0 is equivalent to ๐ฝ0๐ except it is shifted. At this point, the two functions are equal and thus all other coefficients will stay the same ๐ฝ1๐ = ๐ฝ1 , ๐ฝ2๐ = ๐ฝ2 … , ๐ฝ๐๐ = ๐ฝ๐ so that ๐ฝ ๐ has the same slopes as the ๐ฝ’s so we have shown the correspondence. Exercise 3.6. Show that the ridge regression estimate is the mean (and mode) of the posterior distribution, under a Gaussian prior ๐ฝ ~ ๐(0, ๐๐ฐ), and Gaussian sampling model ๐ฆ ~ ๐(๐ฟ๐ฝ, ๐ 2 ๐ฐ). Find the relationship between the regularization parameter ๐ in the ridge formula, and the variances ๐ and ๐ 2 . Proof: From Bayes rule, we can write the posterior distribution of ๐ฝ as follows Pr(๐ฝ|๐ท) = Pr(๐ท|๐ฝ) ⋅ Pr(๐ฝ) ∝ Pr(๐ท|๐ฝ) ⋅ Pr(๐ฝ) Pr(๐ท) where ๐(๐ท|๐ฝ) is the likelihood; that is, the probability of the data given ๐ฝ. The last term states that the posterior distribution is proportional to Pr(๐ท|๐ฝ) ⋅ Pr(๐ฝ). Since the denominator is the marginal density of ๐ท, this will be constant for all ๐ฝ, we can leave it out when solving for ๐ฝ. Next, notice that each density of the last two terms are Pr(๐ท|๐ฝ) ~ ๐(๐ ๐ ๐ฝ, ๐ 2 ๐ผ) Pr(๐ฝ) ~ ๐(0, ๐๐) The first comes from the likelihood of the data under regression assumptions and the second is stated in the problem. The maximum likelihood with respect to ๐ฝ under the log transformation, we get the following log(๐๐(๐ท|๐ฝ) ⋅ ๐๐(๐ฝ)) = − (๐ฆ − ๐๐ฝ)๐ (๐ฆ − ๐๐ฝ) ๐ฝ ๐ ๐ฝ − 2๐ 2 2๐ 49 ๐ ๐ ๐ ๐=1 ๐=1 ๐=1 1 1 = 2 ∑[๐ฆ๐ − ๐ฝ0 − ∑ ๐ฅ๐๐ ๐ฝ๐ ]2 + ∑ ๐ฝ๐2 2๐ 2๐ ๐2 and if we multiply the equation by 2๐ 2 , then we get the ridge equation (3.41) with ๐ = ๐ thus we have showed the relationships between ๐ and the two variance terms. Finally, by plugging ๐ into the above equation, we get the following, ๐ ๐ ๐ ๐ฝฬ ๐๐๐๐๐ = arg max ∑[๐ฆ๐ − ๐ฝ0 − ∑ ๐ฅ๐๐ ๐ฝ๐ ]2 + ๐ ∑ ๐ฝ๐2 . ๐ฝ ๐=1 ๐=1 ๐=1 which is the ridge regression form. Since we assumed our data were generated from a normal distribution and we used a normal prior (conjugate prior), from statistics, we know that the posterior will be normal. This implies that the maximum ๐ฝ under the posterior distribution gives the mean. Since the mean and mode are equal for normal distributions (due to symmetry) we have shown the relationship between them and answered the problem. Exercise 3.9. Forward stepwise regression. Suppose we have the QR decomposition for the ๐ × ๐ matrix ๐ฟ1 in a multiple regression problem with response ๐ฆ, and we have an additional ๐ − ๐ predictors in the matrix ๐ฟ2 . Denote the current residual by ๐. We wish to establish which one of these additional variables will reduce the residual-sum-of squares the most when included with those in ๐ฟ1 . Describe an efficient procedure for doing this. Proof: Recall from linear algebra that the nullspace is ๐(๐) = {๐ฃ: ๐๐ฃ = 0}. This implies that the residual vector r lives in the nullspace of ๐ ๐ since ๐ ๐ ๐ = ๐ ๐ (๐ฆ − ๐ฆฬ) 50 = ๐ ๐ (๐ฆ − ๐(๐ ๐ ๐)−1 ๐ ๐ ๐ฆ) = ๐ ๐ ๐ฆ − ๐ ๐ ๐(๐ ๐ ๐)−1 ๐ ๐ ๐ฆ = ๐ ๐ ๐ฆ − ๐ ๐ ๐ฆ = 0. 
Since the QR decomposition uses the Gram-Schmidt procedure, we have seen that this means that this forms an orthonormal basis for ๐. This implies that the current fitted values have been projected onto the subspace spanned by ๐1 and so that the remaining ๐ − ๐ variables in ๐ 2 (assuming linear independence) live in the nullspace of ๐1 . Since the residual sum-of-squares is defined as ||๐ฆ − ๐ฆฬ||22 which is the squared distance between ๐ฆ and its projected vector, we would like to find the vector ๐ฃ ∈ ๐ ๐−๐ where such that we decrease the length of the residual vector. This means that reducing the residual sum-of-square error the most amounts to finding the largest projection of r onto the nullspace. That is, we need to find the following 〈๐๐ , ๐〉 ๐๐ ∈๐ 2 〈๐๐ , ๐๐ 〉 ๐๐ = arg max So that completes the first part of the problem. For the second part, define a new matrix ๐ ∗ that is ๐1 with an appended vector ๐๐ as follows ∗ ๐ =[ ๐1 | ๐๐ ] | we can use the QR decomposition for a more efficient fit. Recall that the QR decomposition fits each row iteratively and from page 35 of this thesis, we have the last row 51 ๐ง๐ = ๐ฅ๐ − (๐ฅ๐ ⋅ ๐0 )๐0 − (๐ฅ๐ ⋅ ๐1 )๐1 − โฏ − (๐ฅ๐ ⋅ ๐๐ )๐๐ ๐๐ = ๐ง๐ ||๐ง๐ ||2 and then we get the QR matrix by | ๐ = [๐ฅ0 | | ๐ฅ1 | … | | … ๐ฅ๐ ] = [๐0 … | | | ๐1 | ๐ฅ0 ⋅ ๐0 … | 0 … ๐๐ ] [ โฎ … | 0 ๐ฅ1 ⋅ ๐0 ๐ฅ1 ⋅ ๐1 โฎ 0 … … โฑ … ๐ฅ๐ ⋅ ๐0 ๐ฅ๐ ⋅ ๐1 ] โฎ ๐ฅ๐ ⋅ ๐๐ And then we can solve for the coefficients as in Exercise 3.4 by back substitution ๐๐ฝ = ๐๐ ๐ฆ so that we have completed the problem. Solving this problem using the QR decomposition is much faster solving for ๐ฝฬ using the normal equations and ๐ ∗ . Exercise 3.10. Backward stepwise regression. Suppose we have the multiple regression fit of y on ๐ฟ, along with the standard errors and Z-scores as in Table 3.2. We wish to establish which variable, when dropped, will increase the residual sum-of-squares the least. How would you do this? Exercise 3.1 established that the ๐น statistic for dropping a single coefficient is equal to the square of the corresponding ๐ง-score so that we get the following relationship ๐น= ๐ ๐๐0 − ๐ ๐๐1 ๐ ๐๐0 − ๐ ๐๐1 = = ๐ง๐2 ๐ ๐๐1 ๐ฬ 2 ๐−๐−1 and then shuffling terms around, we get ⇒ ๐ ๐๐0 − ๐ ๐๐1 = ๐ฬ 2 ๐ง๐2 52 ⇒ ๐ ๐๐0 = ๐ฬ 2 ๐ง๐2 + ๐ ๐๐1 which states that the smallest increase in the residual sum-of-squares corresponds to dropping the smallest ๐ง-score (i.e. smallest ๐ง๐2 value) and we have concluded the problem. Exercise 3.13. Derive the expression (3.62), and show that ๐ฝฬ ๐๐๐ (๐) = ๐ฝฬ ๐๐ . Proof: Using the singular value decomposition of ๐ = ๐๐๐ ๐ we can represent ๐ฝฬ ๐๐ as follows ๐ ๐ ๐๐ฝ = ๐ ๐ ๐ฆ ๐๐๐ ๐๐ ๐๐๐ ๐ ๐ฝ = ๐๐๐ ๐๐ ๐ฆ ๐๐2 ๐ ๐ ๐ฝ = ๐๐๐๐ ๐ฆ where ๐๐ ๐ = ๐ since ๐ is an orthogonal matrix and ๐๐ = ๐ because it is a diagonal matrix. Then we can solve for ๐ฝฬ and get the following result ๐ฝฬ ๐๐ = ๐๐−1 ๐ ๐ ๐ฆ Equation (3.62) on principal component regression has the form ๐ ๐ฝฬ ๐๐๐ (๐) = ∑ ๐ฬ๐ ๐ฃ๐ ๐=1 where ๐ฬ๐ = 〈๐ง๐ , ๐ฆ〉/〈๐ง๐ , ๐ง๐ 〉 and where ๐ง๐ are the principal components and is the ๐th column in the matrix ๐ = ๐๐ = ๐๐. ๐ฃ๐ is the column in the singular value decomposition matrix ๐. When ๐ = ๐, the vector ๐ฬ is the coefficient vector when regressing ๐ฆ onto the principal components and is solved as follows 53 ๐ฬ = (๐๐ ๐)−1 ๐๐ ๐ฆ = (๐๐ ๐๐ ๐๐)−1 ๐๐ ๐๐ ๐ฆ = ๐−1 ๐๐ ๐ฆ and to get the vector ๐ฝฬ ๐๐๐ we do the following ๐ฝฬ ๐๐๐ = ๐๐ฬ = ๐๐−1 ๐๐ ๐ฆ which is equivalent to that of least squares when ๐ = ๐โ Exercise 3.16. 
Exercise 3.16. Derive the entries in Table 3.4, the explicit forms for estimators in the orthogonal case.

Table 3.4. Estimators of β_j in the case of orthonormal columns of X. M and λ are constants chosen by the corresponding techniques; sign denotes the sign of its argument (±1), and x_+ denotes the positive part of x.

  Estimator               Formula
  Best subset (size M)    β̂_j · I(|β̂_j| ≥ |β̂_(M)|)
  Ridge                   β̂_j / (1 + λ)
  Lasso                   sign(β̂_j)(|β̂_j| − λ)_+

Proof: Throughout, X has orthonormal columns, so X^T X = I. For least squares,

  β̂^{ls} = (X^T X)^{-1} X^T y = X^T y.

For best-subset selection of size M, the orthonormality of the columns means that dropping any subset of variables leaves the coefficients of the retained variables unchanged, so each retained coefficient is identical to its least-squares value. The subset of size M with the smallest residual sum-of-squares therefore consists of the M coefficients largest in absolute value, which gives the entry β̂_j · I(|β̂_j| ≥ |β̂_(M)|), where β̂_(M) is the Mth largest coefficient in absolute value.

For ridge regression we get

  β̂^{ridge} = (X^T X + λI)^{-1} X^T y = (I + λI)^{-1} X^T y = β̂^{ls} / (1 + λ).

For the lasso, the criterion is

  L(β) = (y − Xβ)^T (y − Xβ) + λ Σ_j |β_j|,

and, using X^T X = I, its subgradient with respect to β_j is

  ∂L(β)/∂β_j = −2 β̂_j^{ls} + 2 β_j + λ · sign(β_j).

Setting this to zero and solving coordinate by coordinate gives the soft-thresholding rule

  β̂_j^{lasso} = sign(β̂_j^{ls}) ( |β̂_j^{ls}| − λ/2 )_+,

where the positive part covers the case |β̂_j^{ls}| ≤ λ/2, in which the minimizer is exactly zero. Rescaling λ (equivalently, writing the squared-error term with a factor of one half) gives the entry in Table 3.4, sign(β̂_j)(|β̂_j| − λ)_+. ∎

Exercise 3.17. Repeat the analysis of Table 3.3 on the spam data discussed in Chapter 1, where Table 3.3 compares the coefficient estimates of least squares, best subset, ridge, lasso, principal component regression, and partial least squares. This dataset has 57 variables with a binary response indicating spam or not spam. Fitting each model and using cross-validation to choose the tuning parameters and estimate the mean squared prediction errors, we get the following table.

              LS        Best Subset   Ridge       Lasso        PCR          PLS
  Parameters  —         —             λ = 0.08    λ = 0.0017   ncomp = 56   ncomp = 12
  Test Error  0.334     —             0.334       0.334        0.335        0.333
  Std Error   0.0169    —             0.0115      0.0182       0.0177       0.0158

Best subset regression did not complete because it took too long. The author mentions that best subset selection can work for up to roughly 30 or 40 variables, but it is not feasible here because the search is exponential in the number of variables.

Exercise 3.19. Show that ||β̂^{ridge}|| increases as its tuning parameter λ → 0. Does the same property hold for the lasso and partial least squares estimates? For the latter, consider the "tuning parameter" to be the successive steps in the algorithm.

Proof: The ridge solution can be found using equation (3.47),

  β̂^{ridge} = (X^T X + λI)^{-1} X^T y,

and substituting the singular value decomposition X = UDV^T of the centered input matrix gives

  β̂^{ridge} = V (D² + λI)^{-1} D U^T y,

so that

  ||β̂^{ridge}||_2² = y^T U D (D² + λI)^{-1} V^T V (D² + λI)^{-1} D U^T y
                    = y^T U D (D² + λI)^{-2} D U^T y
                    = Σ_{j=1}^p [ d_j² / (d_j² + λ)² ] (u_j^T y)².

As λ → 0, each ratio d_j² / (d_j² + λ)² increases, so ||β̂^{ridge}||_2² increases, and we have concluded the exercise.

CHAPTER 4

LINEAR METHODS FOR CLASSIFICATION

4.1 Introduction

This section goes over topics including the classification task and linear discriminant analysis.

4.2 Linear Discriminant Analysis

This chapter extends Chapter 3 from a regression point of view to a classification one. On page 113, we will derive the equations listed in the center of the page. Notice that the second equation is

  log |Σ̂_k| = Σ_l log d_{kl}.                                   (103)

This is the log-determinant of the class covariance matrix and uses the property of determinants that det(Σ̂_k) = Π_l d_{kl}; that is, the determinant equals the product of the eigenvalues, which can be read off the diagonal matrix D_k of the eigendecomposition Σ̂_k = U_k D_k U_k^T of the covariance matrix.

The next bullet point on the same page states that we can sphere the data with respect to the common covariance matrix by applying

  X* ← D^{-1/2} U^T X.                                          (104)

This works by first decomposing the common covariance of X by its eigendecomposition; that is, Cov(X) = U D U^T. We can construct the required inverse square root from this decomposition by handling one factor at a time: writing D = D^{1/2} D^{1/2}, we have

  Cov(X) = U D^{1/2} D^{1/2} U^T = (U D^{1/2})(U D^{1/2})^T,

and inverting one factor gives the sphering matrix

  (U D^{1/2})^{-1} = D^{-1/2} U^{-1} = D^{-1/2} U^T,             (105)

using the orthogonality of U. Sphering the data with this matrix makes the covariance of X* the identity, which can be shown as follows:

  Cov(X*) = Cov(D^{-1/2} U^T X)                                  (106)
          = D^{-1/2} U^T Cov(X) U D^{-1/2}
          = D^{-1/2} U^T (U D U^T) U D^{-1/2}
          = I,                                                   (107)

so the covariance of the sphered data X* is the identity matrix, which completes the proof.
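The R sketch below illustrates the sphering step (104) on simulated data (the mixing matrix and variable names are arbitrary choices of mine): it estimates Cov(X) by the sample covariance, forms D^{-1/2} U^T from its eigendecomposition, and confirms that the transformed data have identity covariance, as in (106)-(107).

  # Sphere simulated correlated data and check that the covariance becomes the identity.
  set.seed(1)
  n <- 1000; p <- 3
  A <- matrix(rnorm(p * p), p, p)                  # arbitrary mixing matrix
  X <- matrix(rnorm(n * p), n, p) %*% A            # correlated inputs
  e <- eigen(cov(X))                               # Cov(X) = U D U^T
  W <- diag(1 / sqrt(e$values)) %*% t(e$vectors)   # sphering matrix D^{-1/2} U^T
  X.star <- X %*% t(W)                             # sphered data (each row transformed)
  round(cov(X.star), 6)                            # identity matrix up to rounding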
4.3 Problems and Solutions

Exercise 4.1. Show how to solve the generalized eigenvalue problem max_a a^T B a subject to a^T W a = 1 by transforming it to a standard eigenvalue problem, where B is the between-class covariance matrix and W is the within-class covariance matrix.

Proof: Since the constraint is an equality constraint, we can set the problem up in Lagrangian form and solve it with a Lagrange multiplier. The problem is

  max_a a^T B a   subject to   a^T W a = 1,

and its Lagrangian is

  L(a, λ) = a^T B a + λ (a^T W a − 1).

Taking partial derivatives with respect to a and λ and setting them to zero gives

  ∂L(a, λ)/∂a = 2 B a + 2 λ W a = 0,          (1)
  ∂L(a, λ)/∂λ = a^T W a − 1 = 0.              (2)

From the first equation,

  B a = −λ W a
  ⇒ W^{-1} B a = (−λ) a,

which is a standard eigenvalue problem: a must be an eigenvector of W^{-1}B with eigenvalue −λ. At any such stationary point the objective is a^T B a = −λ a^T W a = −λ, so to maximize the original quantity we take a to be the eigenvector of W^{-1}B corresponding to its largest eigenvalue, scaled so that a^T W a = 1.

REFERENCES

[1] Agresti, A. (2012). Categorical Data Analysis, 3rd edition.

[2] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd edition.

[3] Ng, A. CS229 Lecture Notes on Learning Theory.

[4] Schmidt, M. (2005). Least Squares Optimization with L1-Norm Regularization.

[5] Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.

[6] Weatherwax, J. & Epstein, D. (2013). A Solution Manual and Notes for: The Elements of Statistical Learning by Jerome Friedman, Trevor Hastie, and Robert Tibshirani.

[7] Wolfram Alpha LLC. (2009). Wolfram|Alpha. http://www.wolframalpha.com/input/?i=x^2%2By^2 (accessed October 20, 2014).