Linear methods for regression
Hege Leite Størvold
Tuesday 12.04.05

Linear regression models
- Assumes that the regression function E(Y|X) is linear:
  $$E(Y\mid X) = \beta_0 + \sum_{j=1}^{p}\beta_j X_j = f(X)$$
- Linear models are old tools, but…
  - still very useful
  - simple
  - allow an easy interpretation of the regressors' effects
  - very wide in scope, since the X_j's can be any functions of other variables (quantitative or qualitative)
  - useful to understand, because most other methods are generalizations of them.

Matrix notation
- X is the n x (p+1) matrix of input vectors, y is the n-vector of outputs (labels), and beta is the (p+1)-vector of parameters:
  $$X = \begin{pmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & & & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}, \quad \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}$$

Least squares estimation
- The linear regression model has the form
  $$f(X) = \beta_0 + \sum_{j=1}^{p} X_j\beta_j$$
  where the beta_j's are unknown parameters or coefficients.
- Typically we have a set of training data (x_1, y_1), …, (x_n, y_n) from which we want to estimate the parameters beta.
- The most popular estimation method is least squares.

Linear regression and least squares
- Least squares: find the solution beta-hat by minimizing the residual sum of squares (RSS):
  $$RSS(\boldsymbol{\beta}) = \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 = (\mathbf{y} - X\boldsymbol{\beta})^T(\mathbf{y} - X\boldsymbol{\beta})$$
- This is a reasonable criterion when…
  - the training samples are random, independent draws, OR
  - the y_i's are conditionally independent given x_i.

Geometrical view of least squares
- Simply find the best linear fit to the data; e_i is the residual of observation i.
- (Figure: best linear fit with one covariate and with two covariates.)

Solving least squares
- Derivative of a quadratic product:
  $$\frac{d}{dx}\big[(Ax + b)^T C (Dx + e)\big] = A^T C (Dx + e) + D^T C^T (Ax + b)$$
- Then
  $$\frac{\partial RSS}{\partial \boldsymbol{\beta}} = \frac{\partial}{\partial \boldsymbol{\beta}}(X\boldsymbol{\beta} - \mathbf{y})^T I_N (X\boldsymbol{\beta} - \mathbf{y}) = X^T I_N (X\boldsymbol{\beta} - \mathbf{y}) + X^T I_N^T (X\boldsymbol{\beta} - \mathbf{y}) = -2X^T(\mathbf{y} - X\boldsymbol{\beta})$$
- Setting the first derivative to zero:
  $$X^T\mathbf{y} - X^T X\boldsymbol{\beta} = 0 \;\Rightarrow\; X^T X\boldsymbol{\beta} = X^T\mathbf{y} \;\Rightarrow\; \boldsymbol{\beta} = (X^T X)^{-1} X^T\mathbf{y}$$

The normal equations
- Assuming that X^T X is non-singular, the normal equations give the unique least squares solution:
  $$\hat{\boldsymbol{\beta}}^{OLS} = (X^T X)^{-1} X^T\mathbf{y}$$

Least squares predictions
- The predictions are
  $$\hat{\mathbf{y}} = X\hat{\boldsymbol{\beta}}^{OLS} = X(X^T X)^{-1} X^T\mathbf{y}$$
- When X^T X is singular, the least squares coefficients are no longer uniquely defined. Some kind of alternative strategy is needed to obtain a solution:
  - recoding and/or dropping redundant columns
  - filtering
  - controlling the fit by regularization.

Geometrical interpretation of least squares estimates
- The predicted outcomes y-hat are the orthogonal projection of y onto the column space of X (a subspace of R^n).
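To make the normal equations concrete, here is a minimal NumPy sketch; the simulated data, noise level and variable names are illustrative assumptions, not part of the lecture:

```python
import numpy as np

# Illustrative data: n = 100 observations, p = 3 covariates (all values assumed)
rng = np.random.default_rng(0)
n, p = 100, 3
beta_true = np.array([2.0, 0.5, -1.0, 0.3])                  # intercept + p slopes
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])    # n x (p+1) design matrix
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Normal equations: X^T X beta = X^T y
# (solving the linear system is preferred over forming the explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ beta_hat                 # least squares predictions
rss = np.sum((y - y_hat) ** 2)       # residual sum of squares
print(beta_hat, rss)
```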
Properties of least squares estimators
- If the Y_i are independent, X is fixed and Var(Y_i) = sigma^2 is constant, then
  $$E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}, \qquad \mathrm{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (X^T X)^{-1}, \qquad E(\hat{\sigma}^2) = \sigma^2 \;\text{ with }\; \hat{\sigma}^2 = \frac{1}{n - p - 1}\sum_{i=1}^{n}\big(y_i - \hat{f}(x_i)\big)^2$$
- If, in addition, Y_i = f(X_i) + epsilon with epsilon ~ N(0, sigma^2), then
  $$\hat{\boldsymbol{\beta}} \sim N\big(\boldsymbol{\beta},\, (X^T X)^{-1}\sigma^2\big) \qquad\text{and}\qquad (n - p - 1)\hat{\sigma}^2 \sim \sigma^2\chi^2_{n-p-1}$$

Properties of the least squares solution
- To test the hypothesis that a particular coefficient beta_j = 0 we calculate
  $$z_j = \frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{v_j}}$$
  where v_j is the j'th diagonal element of (X^T X)^{-1}.
- Under the null hypothesis that beta_j = 0, z_j is distributed as t_{n-p-1}, and hence a large absolute value of z_j will reject the null hypothesis.
- A (1 - 2*alpha) confidence interval for beta_j can be formed as
  $$\big(\hat{\beta}_j - z^{(1-\alpha)}\sqrt{v_j}\,\hat{\sigma},\;\; \hat{\beta}_j + z^{(1-\alpha)}\sqrt{v_j}\,\hat{\sigma}\big)$$
- An F-test can be used to test the nullity of a whole vector of parameters.

Significance of many parameters
- We may want to test many features at once.
- Compare a reduced model M_0 with m regressors to a larger alternative (full) model M_A with k regressors that contains M_0 (m < k), using the F statistic
  $$F = \frac{(RSS_0 - RSS_A)/(k - m)}{RSS_A/(n - k - 1)} \sim F_{k-m,\; n-k-1}$$
  where RSS_0 is the residual sum of squares of the reduced model and RSS_A that of the full model.

Example: Prostate cancer
- Response: level of prostate-specific antigen.
- Regressors: 8 clinical measures on men about to receive a prostatectomy.
- Results from the linear fit:

  Term        Coefficient   Std. error   Z score
  Intercept         2.48         0.09      27.66
  lcavol            0.68         0.13       5.37
  lweight           0.30         0.11       2.75
  age              -0.14         0.10      -1.40
  lbph              0.21         0.10       2.06
  svi               0.31         0.12       2.47
  lcp              -0.29         0.15      -1.87
  gleason          -0.02         0.15      -0.15
  pgg45             0.27         0.15       1.74

Gram-Schmidt procedure
1) Initialize z_0 = x_0 = 1.
2) For j = 1 to p:
   - For k = 0 to j-1, regress x_j on the z_k's (univariate least squares estimates):
     $$\hat{\gamma}_{kj} = \frac{\langle \mathbf{z}_k, \mathbf{x}_j\rangle}{\langle \mathbf{z}_k, \mathbf{z}_k\rangle}$$
   - Then compute the next residual:
     $$\mathbf{z}_j = \mathbf{x}_j - \sum_{k=0}^{j-1}\hat{\gamma}_{kj}\mathbf{z}_k$$
3) Let Z = [z_0 z_1 … z_p] and let Gamma be upper triangular with entries gamma-hat_kj. Then
   $$X = Z\Gamma = ZD^{-1}D\Gamma = QR$$
   where D is diagonal with D_jj = ||z_j||.
- This is an O(Np^2) technique for multiple regression.

QR decomposition of X
- Computing beta-hat = (X^T X)^{-1} X^T y directly has poor numerical properties.
- Decompose X = QR, where
  - Q is an N x (p+1) orthogonal matrix (Q^T Q = I_{p+1})
  - R is a (p+1) x (p+1) upper triangular matrix.
- Then
  $$\hat{\boldsymbol{\beta}} = (R^T Q^T Q R)^{-1} R^T Q^T\mathbf{y} = (R^T R)^{-1} R^T Q^T\mathbf{y} = R^{-1} R^{-T} R^T Q^T\mathbf{y} = R^{-1} Q^T\mathbf{y}$$
- In practice: 1) compute Q^T y; 2) solve R beta-hat = Q^T y by back-substitution.

Multiple outputs
- Suppose we want to predict multiple outputs Y_1, Y_2, …, Y_K from multiple inputs X_1, X_2, …, X_p.
- Assume a linear model for each output:
  $$Y_k = \beta_{0k} + \sum_{j=1}^{p} X_j\beta_{jk} + \varepsilon_k = f_k(X) + \varepsilon_k$$
- With n training cases the model can be written in matrix notation as Y = XB + E, where
  - Y is the n x K response matrix, with ik-entry y_ik
  - X is the n x (p+1) input matrix
  - B is the (p+1) x K matrix of parameters
  - E is the n x K matrix of errors.

Multiple outputs cont.
- A straightforward generalization of the univariate loss function is
  $$RSS(B) = \sum_{k=1}^{K}\sum_{i=1}^{n}\big(y_{ik} - f_k(\mathbf{x}_i)\big)^2 = \mathrm{tr}\big[(Y - XB)^T(Y - XB)\big]$$
- The least squares estimates have the same form as before:
  $$\hat{B} = (X^T X)^{-1} X^T Y$$
- The coefficients for the k'th outcome are just the least squares estimates of the single-output regression of y_k on x_0, x_1, …, x_p.
- If the errors epsilon = (epsilon_1, …, epsilon_K) are correlated, a modified model might be more appropriate (details in the textbook).

Why least squares?
- Gauss-Markov theorem: the least squares estimates have the smallest variance among all linear unbiased estimates.
- However, there may exist a biased estimator with lower mean squared error:
  $$MSE(\hat{\boldsymbol{\beta}}) = E(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^2 = \mathrm{Var}(\hat{\boldsymbol{\beta}}) + \big[E(\hat{\boldsymbol{\beta}}) - \boldsymbol{\beta}\big]^2$$
  where the last (squared bias) term is zero for least squares.
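Before moving on to selection and shrinkage, here is a minimal NumPy sketch of the QR route to the least squares solution together with the coefficient z-scores described above; the function name and the full-column-rank assumption are mine, not from the lecture:

```python
import numpy as np

def ols_qr(X, y):
    """Least squares via X = QR, plus coefficient z-scores.
    Assumes X is n x (p+1) with an intercept column and full column rank."""
    n, p1 = X.shape
    Q, R = np.linalg.qr(X)                     # reduced QR: Q is n x (p+1), R is (p+1) x (p+1)
    beta_hat = np.linalg.solve(R, Q.T @ y)     # solve R beta = Q^T y by back-substitution
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p1)      # unbiased variance estimate, df = n - p - 1
    R_inv = np.linalg.solve(R, np.eye(p1))
    v = np.sum(R_inv**2, axis=1)               # diagonal of (X^T X)^{-1} = R^{-1} R^{-T}
    z = beta_hat / np.sqrt(sigma2_hat * v)     # z_j = beta_j / (sigma_hat * sqrt(v_j))
    return beta_hat, z
```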
Subset selection and coefficient shrinkage
- Two reasons why we are not happy with least squares:
  - Prediction accuracy: LS estimates often provide predictions with low bias but high variance.
  - Interpretation: when the number of regressors is too high, the model is difficult to interpret. One seeks a smaller set of regressors with the strongest effects.
- We will consider a number of approaches to variable selection and coefficient shrinkage.

Subset selection and shrinkage: motivation
- Bias-variance trade-off:
  - Goal: choose a model to minimize the error
    $$MSE(\hat{\beta}) = \mathrm{Var}(\hat{\beta}) + \mathrm{bias}^2(\hat{\beta})$$
  - Method: sacrifice a little bit of bias to reduce the variance.
- Better interpretation: find the strongest factors in the input space.

Subset selection
- Goal: eliminate unnecessary variables from the model.
- We will consider three approaches:
  - Best subset regression: choose the subset of size k that gives the lowest RSS.
  - Forward stepwise selection: sequentially add the feature with the largest F-ratio.
  - Backward stepwise selection: remove the features with small F-ratios.
- The stepwise methods are greedy techniques and are not guaranteed to find the best model.

Best subset regression
- For each k in {0, 1, 2, …, p}, find the subset of size k that gives the smallest RSS.
- The leaps-and-bounds procedure works for p <= 40.
- How to choose k? Choose the model that minimizes prediction error (not a topic here).
- When p is large, searching through all subsets is not feasible. Can we seek a good path through the subsets instead?

Forward stepwise selection
- Method:
  - Start with the intercept-only model.
  - Sequentially include the variable that most improves RSS(beta), based on the F statistic
    $$F = \frac{RSS(\tilde{\beta}) - RSS(\hat{\beta})}{RSS(\hat{\beta})/(n - k - 2)}$$
    where beta-tilde is the current model with k variables and beta-hat is the model with the candidate variable added.
  - Stop when no new variable improves the fit significantly.

Backward stepwise selection
- Method:
  - Start with the full model.
  - Sequentially delete the predictor that produces the smallest value of the F statistic, i.e. increases RSS(beta) the least.
  - Stop when every predictor remaining in the model produces a significant value of the F statistic.
- Hybrids between forward and backward stepwise selection exist.

Subset selection, summarized
- Produces a model that is interpretable and possibly has lower prediction error.
- Forces some dimensions of X to zero, thus probably decreasing Var(beta-hat); recall
  $$\hat{\boldsymbol{\beta}} \sim N\big(\boldsymbol{\beta},\, (X^T X)^{-1}\sigma^2\big)$$
- The optimal subset must be chosen to minimize prediction error (model selection: not a topic here).

Shrinkage methods
- Use additional penalties/constraints to shrink the coefficients.
- Shrinkage methods are more continuous than stepwise selection and therefore do not suffer as much from variability.
- Two examples:
  - Ridge regression: minimize least squares subject to $\sum_{j=1}^{p}\beta_j^2 \le s$.
  - The lasso: minimize least squares subject to $\sum_{j=1}^{p}|\beta_j| \le s$.

Ridge regression
- Shrinks the regression coefficients by imposing a penalty on their size.
- The complexity parameter lambda controls the amount of shrinkage:
  $$\hat{\boldsymbol{\beta}}^{ridge} = \arg\min_{\boldsymbol{\beta}}\Big\{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\Big\}$$
- Equivalently,
  $$\hat{\boldsymbol{\beta}}^{ridge} = \arg\min_{\boldsymbol{\beta}}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 \quad\text{subject to}\quad \sum_{j=1}^{p}\beta_j^2 \le s$$
- There is a one-to-one correspondence between s and lambda.

Properties of ridge regression
- Solution in matrix notation:
  $$\hat{\boldsymbol{\beta}}^{ridge} = (X^T X + \lambda I)^{-1} X^T\mathbf{y}$$
- Adding lambda > 0 to the diagonal of X^T X before inversion makes the problem nonsingular even if X is not of full rank.
- The size constraint prevents the coefficient estimates of highly correlated variables from showing high variance.
- The quadratic penalty makes the ridge solution a linear function of y.
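A minimal sketch of the closed-form ridge solution above, under the standard assumption (mine, not stated on the slide) that the inputs are standardized and the response centered so the unpenalized intercept can be left out:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimate (X^T X + lambda I)^{-1} X^T y.
    Assumes the columns of X are standardized and y is centered,
    so no intercept column is included (the intercept is not penalized)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Illustrative use on standardized inputs Xs and centered response yc (names assumed):
# beta_ridge = ridge_fit(Xs, yc, lam=1.0)
```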
Properties of ridge regression cont.
- Ridge regression can also be motivated through Bayesian statistics by choosing an appropriate prior for beta.
- It does no automatic variable selection.
- The ridge existence theorem states that there exists a lambda > 0 such that
  $$MSE(\hat{\boldsymbol{\beta}}^{ridge}) < MSE(\hat{\boldsymbol{\beta}}^{OLS})$$
- The optimal complexity parameter must be estimated.

Example
- Complexity parameter of the model: the effective degrees of freedom.
- The parameters are continuously shrunken towards zero.

Singular value decomposition (SVD)
- The SVD of the matrix X in R^{n x p} has the form
  $$X = UDV^T$$
  where U in R^{n x p} and V in R^{p x p} are orthogonal matrices and D = diag(d_1, …, d_r), with d_1 >= d_2 >= … >= d_r > 0 the non-zero singular values of X.
- r <= min(n, p) is the rank of X.
- The right singular vectors v_i are called the principal component directions of X.

Linear regression by SVD
- A general solution to y = X beta can be written as
  $$\hat{\boldsymbol{\beta}} = \sum_{i=1}^{p}\omega_i\,\frac{\mathbf{u}_i^T\mathbf{y}}{d_i}\,\mathbf{v}_i$$
- The filter factors omega_i determine the extent of shrinking (0 <= omega_i <= 1) or stretching (omega_i > 1) along the singular directions u_i.
- For the OLS solution omega_i = 1, i = 1, …, p, i.e. all directions u_i contribute equally.

Ridge regression by SVD
- In ridge regression the filter factors are given by
  $$\omega_i = \frac{d_i^2}{d_i^2 + \lambda}, \qquad i = 1, \dots, p$$
- Ridge shrinks the OLS estimator in every direction, depending on lambda and the corresponding d_i.
- The directions with low variance (small singular values) are the directions shrunken the most by ridge.
- Assumption: y varies most in the directions of high variance of the inputs.

The lasso
- A shrinkage method like ridge, but with important differences.
- The lasso estimate:
  $$\hat{\boldsymbol{\beta}}^{lasso} = \arg\min_{\boldsymbol{\beta}}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 \quad\text{subject to}\quad \sum_{j=1}^{p}|\beta_j| \le s$$
- The L1 penalty makes the solution nonlinear in y; quadratic programming is needed to compute the solutions.
- Sufficient shrinkage will cause some coefficients to be exactly zero, so the lasso also acts as a subset selection method.

Example
- Coefficients plotted against the shrinkage factor $s = t / \sum_{j=1}^{p}|\hat{\beta}_j|$.
- Note that the lasso profiles hit zero, while those for ridge do not.

A unifying view
- We can view these linear regression techniques under a common framework:
  $$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}}\Big\{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|^q\Big\}$$
- The penalty lambda introduces bias, and q indicates the form of a prior distribution on beta:
  - lambda = 0: least squares
  - lambda > 0, q = 0: subset selection (the penalty counts the number of nonzero parameters)
  - lambda > 0, q = 1: the lasso
  - lambda > 0, q = 2: ridge regression.

Methods using derived input directions
- Goal: use linear combinations of the inputs as inputs in the regression. Two such methods:
  - Principal components regression (PCR): regress on M < p principal components of X.
  - Partial least squares (PLS): regress on M < p directions of X weighted by y.
- The methods differ in the way the linear combinations of the input variables are constructed.

PCR
- Use the linear combinations z_m = X v_m as new features; v_j is the principal component direction (column of V) corresponding to the j'th largest element of D, i.e. the directions of maximal sample variance.
- For some M <= p, form the derived input vectors [z_1 … z_M] = [X v_1 … X v_M].
- Regressing y on z_1, …, z_M gives
  $$\hat{\theta}_m = \langle \mathbf{z}_m, \mathbf{y}\rangle / \langle \mathbf{z}_m, \mathbf{z}_m\rangle$$
  and, expressed in terms of the original inputs, the solution
  $$\hat{\boldsymbol{\beta}}^{PCR}(M) = \sum_{m=1}^{M}\hat{\theta}_m\mathbf{v}_m$$

PCR continued
- The m'th principal component direction v_m solves
  $$\max_{\|\alpha\| = 1} \mathrm{Var}(X\alpha) \quad\text{subject to}\quad \mathbf{v}_l^T S\alpha = 0,\; l = 1, \dots, m-1$$
- The filter factors become omega_i = 1 for i <= M and omega_i = 0 for i > M, i.e. PCR discards the p - M smallest-eigenvalue components from the OLS solution.
- If M = p, PCR gives the OLS solution.

Comparison of PCR and ridge
- (Figure: shrinkage and truncation patterns of the filter factors as a function of the principal component index.)
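The filter-factor view above translates directly into code. The sketch below (function names and the full-rank/centering assumptions are mine) computes the general SVD solution and the OLS, ridge and PCR filter factors:

```python
import numpy as np

def svd_shrinkage_solution(X, y, w):
    """beta_hat = sum_i w_i * (u_i^T y / d_i) * v_i for given filter factors w.
    Assumes X (n x p) is centered/standardized and has full column rank p."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(d) V^T
    theta = (U.T @ y) / d                              # u_i^T y / d_i
    return Vt.T @ (w * theta)

def ols_factors(d):
    return np.ones_like(d)                 # omega_i = 1 in every direction

def ridge_factors(d, lam):
    return d**2 / (d**2 + lam)             # omega_i = d_i^2 / (d_i^2 + lambda)

def pcr_factors(d, M):
    w = np.zeros_like(d)                   # keep the M largest singular values,
    w[:M] = 1.0                            # discard the rest (d is sorted decreasingly)
    return w
```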
PLS
- Idea: find directions that have high variance and high correlation with y.
- In the construction of each z_m, the inputs are weighted by the strength of their univariate effect on y.
- Pseudo-algorithm:
  0. Set $x_j^{(0)} = (x_j - \bar{x}_j)/\sqrt{\mathrm{var}(x_j)}$ and $\hat{\mathbf{y}}^{(0)} = \bar{y}\,\mathbf{1}$.
  For m = 1, …, p:
  1. Find the m'th PLS component: $\mathbf{z}_m = X^{(m-1)}\hat{\varphi}_m$, where $\hat{\varphi}_m = X^{(m-1)T}\mathbf{y}$.
  2. Regress y on z_m: $\hat{\theta}_m = \langle \mathbf{z}_m, \mathbf{y}\rangle / \langle \mathbf{z}_m, \mathbf{z}_m\rangle$.
  3. Update the fit: $\hat{\mathbf{y}}^{(m)} = \hat{\mathbf{y}}^{(m-1)} + \hat{\theta}_m\mathbf{z}_m$.
  4. Orthogonalize each $x_j^{(m-1)}$ with respect to z_m: $x_j^{(m)} = x_j^{(m-1)} - \frac{\langle \mathbf{z}_m, x_j^{(m-1)}\rangle}{\langle \mathbf{z}_m, \mathbf{z}_m\rangle}\mathbf{z}_m$.
- PLS solution: $\hat{\beta}_j^{PLS}(m) = \sum_{l=1}^{m}\hat{\varphi}_{lj}\hat{\theta}_l$.

PLS cont.
- PLS is a nonlinear function of y, because y is used to construct the linear components.
- As with PCR, M = p gives the OLS estimate, while M < p directions produce a reduced regression.
- The m'th PLS direction is found by using the $\hat{\varphi}_m$ that maximizes the covariance between the input and output variables:
  $$\max_{\|\alpha\| = 1} \mathrm{Corr}^2(\mathbf{y}, X\alpha)\,\mathrm{Var}(X\alpha) \quad\text{subject to}\quad \hat{\varphi}_l^T S\alpha = 0,\; l = 1, \dots, m-1$$
  where S is the sample covariance matrix of the x_j.

PLS cont.
- The filter factors for PLS become
  $$\omega_i = 1 - \prod_{j=1}^{q}\Big(1 - \frac{d_i^2}{\theta_j}\Big)$$
  where theta_1 >= … >= theta_q are the Ritz values (not defined here).
- Note that some omega_i > 1, but it can be shown that PLS shrinks the OLS solution:
  $$\|\hat{\boldsymbol{\beta}}^{PLS}\|_2 \le \|\hat{\boldsymbol{\beta}}^{OLS}\|_2$$
- It can also be shown that the sequence of PLS components for m = 1, 2, …, p is the conjugate gradient sequence for computing the OLS solution.

Comparison of the methods
- Consider the general solution
  $$\hat{\boldsymbol{\beta}} = \sum_{i=1}^{p}\omega_i\,\frac{\mathbf{u}_i^T\mathbf{y}}{d_i}\,\mathbf{v}_i$$
- Ridge shrinks all directions, but shrinks the low-variance directions most: $\omega_i = d_i^2/(d_i^2 + \lambda)$.
- PCR leaves the M high-variance directions alone and discards the rest: $\omega_i = 1$ for i = 1, …, M and $\omega_i = 0$ for i = M+1, …, p.
- PLS tends to shrink the low-variance directions, but can inflate some of the higher-variance directions: $\omega_i = 1 - \prod_{j=1}^{q}(1 - d_i^2/\theta_j)$.

Comparison of the methods cont.
- Consider an example with two correlated inputs x_1 and x_2, rho = 0.5, and true regression coefficients beta_1 = 4 and beta_2 = 2.
- (Figure: coefficient profiles in the (beta_1, beta_2) plane for PCR, PLS, ridge, least squares, best subset and the lasso as their tuning parameters are varied.)

Example: Prostate cancer

  Term          LS    Best subset    Ridge    Lasso      PCR      PLS
  Intercept   2.480      2.495       2.467    2.477    2.513    2.452
  lcavol      0.680      0.740       0.389    0.545    0.544    0.440
  lweight     0.305      0.367       0.238    0.237    0.337    0.351
  age        -0.141                 -0.029            -0.152   -0.017
  lbph        0.210                  0.159    0.098    0.213    0.248
  svi         0.305                  0.217    0.165    0.315    0.252
  lcp        -0.288                  0.026            -0.053    0.078
  gleason    -0.021                  0.042             0.230    0.003
  pgg45       0.267                  0.123    0.059   -0.053    0.080
  Test err.   0.586      0.574       0.540    0.491    0.527    0.636
  Std. err.   0.184      0.156       0.168    0.152    0.122    0.172
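A minimal NumPy sketch of the PLS pseudo-algorithm above, assuming (my assumption, as in the algorithm's step 0) that the columns of X are already standardized and y is centered; the function and variable names are mine:

```python
import numpy as np

def pls1_fit(X, y, M):
    """Partial least squares with a single response, following the pseudo-algorithm above.
    Assumes X (n x p) has standardized columns, y is centered, and M <= p.
    Returns the fitted values after M components."""
    n, p = X.shape
    Xm = X.copy()                      # X^{(0)}
    y_hat = np.zeros(n)                # yhat^{(0)}; the mean term is 0 because y is centered
    for m in range(M):
        phi = Xm.T @ y                 # phi_mj = <x_j^{(m-1)}, y>
        z = Xm @ phi                   # m'th PLS component z_m
        theta = (z @ y) / (z @ z)      # regress y on z_m
        y_hat += theta * z             # update the fit
        # orthogonalize each column of X^{(m-1)} with respect to z_m
        Xm = Xm - np.outer(z, Xm.T @ z) / (z @ z)
    return y_hat
```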