Author(s): Kerby Shedden, Ph.D., 2010 License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution Share Alike 3.0 License: http://creativecommons.org/licenses/by-sa/3.0/ We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your ability to use, share, and adapt it. The citation key on the following slide provides information about how you may share and adapt this material. Copyright holders of content included in this material should contact open.michigan@umich.edu with any questions, corrections, or clarification regarding the use of content. For more information about how to cite these materials visit http://open.umich.edu/privacy-and-terms-use. Any medical information in this material is intended to inform and educate and is not a tool for self-diagnosis or a replacement for medical evaluation, advice, diagnosis or treatment by a healthcare professional. Please speak to your physician if you have questions about your medical condition. Viewer discretion is advised: Some medical content is graphic and may not be suitable for all viewers. 1 / 22 Prediction Kerby Shedden Department of Statistics, University of Michigan October 28, 2015 2 / 22 Prediction analysis In a prediction analysis, we are interested in fitting a model fθ̂ to capture the mean relationship between indpendent variables X and a dependent variable Y , and then using fθ̂ to make predictions on an independent data set. It is helpful to think in terms of training data (Y , X ) that are used to fit the model, so θ̂ = θ̂(Y , X ), and testing data (Y ∗ , X ∗ ) on which predictions are made, or on which the model is evaluated. 3 / 22 Quantifying prediction error Prediction analysis focuses on prediction errors, for example through the mean squared prediction error (MSPE): E kY ∗ − fθ̂ (X ∗ )k2 /n∗ , where n∗ is the size of the testing set. Prediction analysis does not usually focus on properties of the parameter estimates themselves, e.g. bias E θ̂ − θ, or parameter MSE E (θ̂ − θ)2 . 4 / 22 MSPE for OLS analysis The mean squared prediction error for OLS regression is easy to derive. The testing data follows Y ∗ = X ∗ β + ∗ . Let Ŷ ∗ = X ∗ β̂ denote the predicted values in the test set. Then E kY ∗ − Ŷ ∗ k2 = E kX ∗ β + ∗ − X ∗ β̂k2 = E kX ∗ (β − β̂)k2 + E k∗ k2 = (β − β̂)0 (X ∗0 X ∗ )(β − β̂) + n∗ σ 2 = tr X ∗0 X ∗ · E (β − β̂)(β − β̂)0 + n∗ σ 2 = tr X ∗0 X ∗ · Σβ̂ + n∗ σ 2 , where Σβ̂ is the covariance matrix of β̂. Note the requirement for Ŷ ∗ and Y ∗ to be independent (given X and X ∗ ). 5 / 22 MSPE for OLS analysis The MSPE for OLS is tr (X ∗0 X ∗ /n∗ ) · Σβ̂ + σ 2 . If X is the training set design matrix, then Σβ̂ = σ 2 (X 0 X )−1 , so if X = X ∗ , then E kY ∗ − Ŷ k2 = σ 2 (p + 1 + n∗ ), and the MSPE in this case is σ 2 (p + 1)/n∗ + σ 2 = σ 2 (p + 1)/n + σ 2 . 6 / 22 MSPE for OLS analysis More generally, suppose X 0 X /n = X ∗0 X ∗ /n∗ . Then Σβ̂ = σ 2 (X 0 X )−1 = σ 2 n∗ (X ∗0 X ∗ )−1 /n. Thus the MSPE is tr (X ∗0 X ∗ /n∗ ) · Σβ̂ + σ 2 = σ 2 (p + 1)/n + σ 2 . 7 / 22 Ridge regression Ridge regression uses the minimizer of a penalized squared error loss function to estimate the regression coefficients: β̂ ≡ argminβ kY − X βk2 + λβ 0 Dβ. Typically D is a diagonal matrix with 0 in the 1,1 position and ones on the rest of the diagonal. In this case, β 0 Dβ = X βj2 . j≥1 This makes most sense when the covariates have been standardized, so it is reasonable to penalize the βj equally. 8 / 22 Ridge regression Ridge regression is a compromise between fitting the data as well as possible (by making kY − X βk2 small), while not allowing any one fitted coefficient to get very large (which causes β 0 Dβ to get large). 9 / 22 Ridge regression and colinearity Suppose X1 and X2 are strongly positively associated, and the population slopes are β1 and β2 . Fits of the form (β1 + γ)X1 + (β2 − γ)X2 = EY + γ(X1 − X2 ) have similar MSE values as γ varies, since X1 − X2 is small when X1 and X2 are strongly positively associated. In other words, OLS can’t easily distinguish among these fits. For example, if X1 ≈ X2 , then 3X1 + 3X2 , 4X1 + 2X2 , 5X1 + X2 , etc. all have very similar MSE values. 10 / 22 Ridge regression and colinearity For large λ, ridge regression favors the fits that minimize (β1 + γ)2 + (β2 − γ)2 . This expression is minimized at γ = (β2 − β1 )/2, giving the fit (β1 + β2 )X1 /2 + (β1 + β2 )X2 /2. ⇒ Ridge regression favors coefficient estimates for which strongly positively correlated covariates have similar estimated effects. 11 / 22 Calculation of ridge regression estimates For a given value λ > 0, ridge regression is no more difficult computationally than ordinary least squares, since ∂ kY − X βk2 + λβ 0 Dβ = −2X 0 Y + 2X 0 X β + 2λDβ, ∂β so the ridge estimate β̂ solves the system of linear equations (X 0 X + λD)β = X 0 Y . This equation can have a unique solution even when X 0 X is singular. Thus one application of ridging is to produce regression estimates for singular design matrices. 12 / 22 Ridge regression bias and variance Ridge regression estimates are biased, but may be less variable than OLS estimates. If X 0 X is non-singular, the ridge estimator can be written β̂λ = (X 0 X + λD)−1 X 0 Y = (I + λ(X 0 X )−1 D)−1 (X 0 X )−1 X 0 Y = (I + λ(X 0 X )−1 D)−1 (X 0 X )−1 X 0 (X β + ) = (I + λ(X 0 X )−1 D)−1 β + (I + λ(X 0 X )−1 D)−1 (X 0 X )−1 X 0 . Thus the bias is E β̂λ − β = ((I + λ(X 0 X )−1 D)−1 − I )β 13 / 22 Ridge regression bias and variance The variance of the ridge regression estimates is varβ̂λ = σ 2 (I + λ(X 0 X )−1 D)−1 (X 0 X )−1 (I + λ(X 0 X )−1 D)−T . 14 / 22 Ridge regression bias and variance Next we will show that varβ̂ ≥ varβ̂λ , in the sense that varβ̂ − varβ̂λ is a non-negative definite matrix. First let M = λ(X 0 X )−1 D, and note that v 0 (varβ̂ − varβ̂λ )v ∝ = = = “ ” v 0 (X 0 X )−1 − (I + M)−1 (X 0 X )−1 (I + M)−T v “ ” u 0 (I + M)(X 0 X )−1 (I + M)0 − (X 0 X )−1 u “ ” u 0 M(X 0 X )−1 + (X 0 X )−1 M 0 + M(X 0 X )−1 M 0 u “ u 0 2λ(X 0 X )−1 D(X 0 X )−1 + ” λ2 (X 0 X )−1 D(X 0 X )−1 D(X 0 X )−1 u where u = (I + M)−T v . We can conclude that for any fixed vector θ, var(θ0 β̂λ ) ≤ var(θ0 β̂). 15 / 22 Ridge regression effective degrees of freedom Like OLS, the fitted values under ridge regression are linear functions of the observed values Ŷλ = X (X 0 X + λD)−1 X 0 Y In OLS regression, the degrees of freedom is the number of free parameters in the model, which is equal to the trace of the projection matrix P that satisfies Ŷ = PY . Fitted values in ridge regression are not a projection of Y , but the matrix X (X 0 X + λD)−1 X 0 . plays an analogous role to P. 16 / 22 Ridge regression effective degrees of freedom The effective degrees of freedom for ridge regression is defined as EDFλ = tr X (X 0 X + λD)−1 X 0 . The trace can be easily computed using the identity trace X (X 0 X + λD)−1 X 0 = trace (X 0 X + λD)−1 X 0 X . 17 / 22 Ridge regression effective degrees of freedom EDFλ is monotonically decreasing in λ. To see this we will use the following fact about matrix derivatives ∂tr(A−1 B)/∂A = −A−T B 0 A−T . By the chain rule, letting A = X 0 X + λD, we have −1 ∂tr A X ∂tr A−1 X 0 X ∂Aij · X X /∂λ = ∂Aij ∂λ ij X = − [A−T (X 0 X )A−T ]ij · Dij 0 ij X = − [A−T (X 0 X )A−T ]ii · Dii i ≤ 0. 18 / 22 Ridge regression effective degrees of freedom EDFλ equals rank(X ) when λ = 0. To see what happens as λ → ∞, we can apply the Sherman-Morrison-Woodbury identity (A + UCV )−1 = A−1 − A−1 U C −1 + VA−1 U −1 VA−1 . Let G = X 0 X , and write D = FF 0 , where F has independent columns (usually F will be p + 1 × p as we do not penalize the intercept). 19 / 22 Ridge regression effective degrees of freedom Applying the SMW identity and letting λ → ∞ we get tr (G + λD)−1 G G −1 − G −1 F (I /λ + F 0 G −1 F )−1 F 0 G −1 G = tr Ip+1 − G −1 F (I /λ + F 0 G −1 F )−1 F 0 → trIp+1 − tr G −1 F (F 0 G −1 F )−1 F 0 → trIp+1 − tr (F 0 G −1 F )−1 F 0 G −1 F = tr = p + 1 − rank(F ). Therefore in the usual case where F has rank p, EDFλ converges to 1 as λ grows large, reflecting the fact that all coefficients other than the intercept are forced to zero. 20 / 22 Ridge regression and the SVD Suppose we are fitting a ridge regression with D = I , and we factor X = USV 0 using the singular value decomposition (SVD), so that U and V are orthogonal matrices, and S is a diagonal matrix with non-negative diagonal elements. The fitted coefficients are β̂λ = (X 0 X + λI )−1 X 0 Y = (VS 2 V 0 + λVV 0 )−1 VSU 0 Y = V (S 2 + λI )−1 SU 0 Y Note that for OLS (λ = 0), we get β̂ = VS −1 U 0 Y . The effect of ridging is to replace S −1 in this expression with (S 2 + λI )−1 S, which are uniformly smaller values when λ > 0. 21 / 22 Ridge regression tuning parameter There are various ways to set the ridge parameter λ. Cross-validation can be used to estimate the MSPE for any particular value of λ. Then this estimated MSPE could be minimized by checking its value at a finite set of λ values. Generalized cross validation, which minimizes the following over λ, is a simpler, and more commonly used approach. GCV(λ) = kY − Ŷλ k2 . (n − EDFλ )2 22 / 22