Chapter 3: Linear Methods for Regression
The Elements of Statistical Learning
Aaron Smalter

Chapter Outline
- 3.1 Introduction
- 3.2 Linear Regression Models and Least Squares
- 3.3 Multiple Regression from Simple Univariate Regression
- 3.4 Subset Selection and Coefficient Shrinkage
- 3.5 Computational Considerations

3.1 Introduction
- A linear regression model assumes the regression function E(Y|X) is linear in the inputs X1, ..., Xp.
- Simple, precomputer model.
- Can outperform nonlinear models when there are few training cases, a low signal-to-noise ratio, or sparse data.
- Can be applied to transformations of the inputs.

3.2 Linear Regression Models and Least Squares
- Vector of inputs: X = (X1, X2, ..., Xp).
- Predict a real-valued output Y.

3.2 Linear Regression Models and Least Squares (cont'd)
- Inputs can be derived from various sources:
  - Quantitative inputs
  - Transformations of inputs (log, square, square root)
  - Basis expansions (X2 = X1^2, X3 = X1^3)
  - Numeric coding (map one multi-level qualitative input into several Xj's)
  - Interactions between variables (X3 = X1 * X2)

3.2 Linear Regression Models and Least Squares (cont'd)
- A set of training data is used to estimate the parameters: (x1, y1), ..., (xN, yN).
- Each xi is a vector of feature measurements, xi = (xi1, xi2, ..., xip)^T.

3.2 Linear Regression Models and Least Squares (cont'd)
- Least squares: pick the set of coefficients B = (B0, B1, ..., Bp)^T
- in order to minimize the residual sum of squares (RSS):
  RSS(B) = Σ_i (yi − B0 − Σ_j xij Bj)^2

3.2 Linear Regression Models and Least Squares (cont'd)
- How do we pick B so that we minimize the RSS?
- Let X be the N x (p+1) matrix whose rows are the input vectors (with a 1 in the first position) and whose columns are the feature vectors.
- Let y be the N x 1 vector of outputs.

3.2 Linear Regression Models and Least Squares (cont'd)
- Rewrite the RSS as RSS(B) = (y − XB)^T (y − XB).
- Differentiate with respect to B: dRSS/dB = −2 X^T (y − XB).
- Set the first derivative to zero (assume X has full column rank, so X^T X is positive definite): X^T (y − XB) = 0.
- Obtain the unique solution: B^ = (X^T X)^(-1) X^T y.
- Fitted values are: y^ = X B^ = X (X^T X)^(-1) X^T y.

3.2.2 The Gauss-Markov Theorem
- Asserts: the least squares estimates have the smallest variance among all linear unbiased estimates.
- Consider estimating any linear combination of the parameters, θ = a^T B.

3.2.2 The Gauss-Markov Theorem (cont'd)
- The least squares estimate is θ^ = a^T B^ = a^T (X^T X)^(-1) X^T y.
- If X is fixed, this is a linear function c0^T y of the response vector y.
- If the linear model is correct, a^T B^ is unbiased.
- Gauss-Markov Theorem: if c^T y is any other linear estimator that is unbiased for a^T B, then Var(a^T B^) <= Var(c^T y).

3.2.2 The Gauss-Markov Theorem (cont'd)
- However, an unbiased estimator is not always best.
- A biased estimator may exist with better mean squared error, trading a little bias for a larger reduction in variance.
- In reality, most models are approximations of the truth and therefore biased; we need to pick a model that balances bias and variance.

3.3 Multiple Regression from Simple Univariate Regression
- A linear model with p > 1 inputs is a multiple linear regression model.
- Suppose we first have the univariate model with no intercept (p = 1): Y = XB + ε.
- The least squares estimate and residuals are B^ = Σ_i xi yi / Σ_i xi^2 and ri = yi − xi B^.

3.3 Multiple Regression from Simple Univariate Reg. (cont'd)
- With inner-product notation we can rewrite this as B^ = <x, y> / <x, x> and r = y − x B^.
- We will use univariate regression to build up multiple least squares regression.

3.3 Multiple Regression from Simple Univariate Reg. (cont'd)
- Suppose the inputs x1, x2, ..., xp are orthogonal (<xj, xk> = 0 for all j != k).
- Then the multiple least squares estimates Bj^ are equal to the univariate estimates <xj, y> / <xj, xj>.
- When inputs are orthogonal, they have no effect on each other's parameter estimates.

3.3 Multiple Regression from Simple Univariate Reg. (cont'd)
- Observed data are not usually orthogonal, so we must make them so.
- Given an intercept and a single input x, the least squares coefficient of x has the form B1^ = <x − x̄·1, y> / <x − x̄·1, x − x̄·1>, where x̄ = Σ_i xi / N and 1 is the vector of N ones.
- This is the result of two simple regression applications (see the sketch below):
  1. Regress x on 1 to produce the residual z = x − x̄·1.
  2. Regress y on the residual z to give the coefficient B1^.
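The least squares algebra above can be checked numerically. Below is a minimal NumPy sketch (not part of the slides; the data and variable names are made up for illustration) that fits a simple regression with an intercept two ways: directly via the normal equations B^ = (X^T X)^(-1) X^T y, and via the two-step procedure just described (regress x on 1, then regress y on the residual z).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x = rng.normal(size=N)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=N)

# Direct least squares: B^ = (X^T X)^(-1) X^T y, with X = [1, x]
X = np.column_stack([np.ones(N), x])
beta_direct = np.linalg.solve(X.T @ X, X.T @ y)

# Two-step route:
# 1. regress x on 1: the residual is z = x - x_bar * 1
z = x - x.mean()
# 2. regress y on z with no intercept: B1^ = <z, y> / <z, z>
b1 = z @ y / (z @ z)
b0 = y.mean() - b1 * x.mean()   # intercept recovered from the means

print(beta_direct)   # approximately [2.0, 3.0]
print(b0, b1)        # matches the direct solution
```

Both routes give the same slope, which is why the second step alone recovers B1^ once x has been orthogonalized against the intercept.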
3.3 Multiple Regression from Simple Univariate Reg. (cont'd)
- In this procedure, the phrase "regress b on a" (or "b is regressed on a") refers to:
  - a simple univariate regression of b on a with no intercept, producing the coefficient γ^ = <a, b> / <a, a>,
  - and the residual vector b − γ^ a.
- We say b is adjusted for a, or is "orthogonalized" with respect to a.

3.3 Multiple Regression from Simple Univariate Reg. (cont'd)
- We can generalize this approach to the case of p inputs (the Gram-Schmidt procedure, or regression by successive orthogonalization).
- The result of the algorithm is Bp^ = <zp, y> / <zp, zp>, where zp is the residual of xp after it has been regressed on the earlier orthogonalized inputs z0, z1, ..., z(p−1).

3.3.1 Multiple Outputs
- Suppose we have multiple outputs Y1, Y2, ..., YK,
- predicted from the inputs X0, X1, X2, ..., Xp.
- Assume a linear model for each output: Yk = B0k + Σ_j Xj Bjk + εk.
- In matrix notation, Y = XB + E, where:
  - Y is the N x K response matrix, with ik entry yik
  - X is the N x (p+1) input matrix
  - B is the (p+1) x K parameter matrix
  - E is the N x K matrix of errors

3.3.1 Multiple Outputs (cont'd)
- Generalization of the univariate loss function: RSS(B) = Σ_k Σ_i (yik − fk(xi))^2 = tr[(Y − XB)^T (Y − XB)].
- The least squares estimate has the same form: B^ = (X^T X)^(-1) X^T Y.
- Multiple outputs do not affect each other's least squares estimates.

3.4 Subset Selection and Coefficient Shrinkage
- As mentioned, unbiased estimators are not always best.
- Two reasons why we may not be satisfied with the least squares estimates:
  1. Prediction accuracy: low bias but large variance; can be improved by shrinking some coefficients or setting them to zero.
  2. Interpretation: with a large number of predictors the signal gets lost in the noise; we want to find the subset with the strongest effects.

3.4.1 Subset Selection
- Improve estimator performance by retaining only a subset of the variables.
- Use least squares to estimate the coefficients of the remaining inputs.
- Several strategies:
  - Best subset regression
  - Forward stepwise selection
  - Backward stepwise selection
  - Hybrid stepwise selection

3.4.1 Subset Selection (cont'd)
- Best subset regression
- For each k in {0, 1, 2, ..., p}, find the subset of size k that gives the smallest residual sum of squares.
- Leaps and bounds procedure (Furnival and Wilson, 1974).
- Feasible for p up to about 30-40.
- Typically choose the k that minimizes an estimate of expected prediction error.

3.4.1 Subset Selection (cont'd)
- Forward stepwise selection (see the sketch below)
- Searching all subsets is time consuming; instead, find a good path through them.
- Start with the intercept, then sequentially add the predictor that most improves the fit.
- "Improved fit" is judged by the F-statistic: add the predictor that gives the largest value of F.
- Stop adding when no predictor gives a significantly greater F value.

3.4.1 Subset Selection (cont'd)
- Backward stepwise selection
- Similar to the previous procedure, but starts with the full model and sequentially deletes predictors.
- Drop the predictor giving the smallest F value.
- Stop when every predictor remaining in the model would, if dropped, produce a significant F value (a significant loss of fit).

3.4.1 Subset Selection (cont'd)
- Hybrid stepwise selection
- Consider both forward and backward moves at each iteration, and make the "best" move.
- Needs a parameter to set the threshold for choosing between the "add" and "drop" operations.
- The stopping rule (for all stepwise methods) yields only a local optimum; it is not guaranteed to find the best model.
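As a concrete illustration of forward stepwise selection, here is a minimal NumPy sketch (not from the slides). It greedily adds the predictor with the largest F statistic and stops when no candidate exceeds a cutoff; the cutoff f_threshold, the helper names, and the use of np.linalg.lstsq are illustrative choices, not prescriptions from the text.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the least squares fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def forward_stepwise(X, y, f_threshold=4.0):
    """Greedy forward selection: start from the intercept, repeatedly add the
    predictor with the largest F statistic, stop when no F exceeds the cutoff."""
    N, p = X.shape
    selected = []                       # indices of chosen predictors
    current = np.ones((N, 1))           # intercept-only model
    rss_cur = rss(current, y)
    while len(selected) < p:
        best_f, best_j = -np.inf, None
        for j in range(p):
            if j in selected:
                continue
            cand = np.column_stack([current, X[:, j]])
            rss_new = rss(cand, y)
            dof = N - cand.shape[1]     # residual degrees of freedom of the larger model
            f = (rss_cur - rss_new) / (rss_new / dof)
            if f > best_f:
                best_f, best_j = f, j
        if best_f < f_threshold:        # no candidate improves the fit "significantly"
            break
        selected.append(best_j)
        current = np.column_stack([current, X[:, best_j]])
        rss_cur = rss(current, y)
    return selected
```

Backward stepwise selection would mirror this loop, starting from the full model and repeatedly dropping the predictor with the smallest F value.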
3.4.3 Shrinkage Methods
- Subset selection is a discrete process (variables are either retained or dropped), so it may give high variance.
- Shrinkage methods are similar, but use a continuous process that avoids this high variance.
- We examine two methods: ridge regression and the lasso.

3.4.3 Shrinkage Methods (cont'd)
- Ridge regression shrinks the regression coefficients by imposing a penalty on their size.
- The ridge coefficients minimize a penalized residual sum of squares:
  argmin_B { Σ_i (yi − B0 − Σ_j xij Bj)^2 + λ Σ_j Bj^2 },
  or equivalently minimize the RSS subject to Σ_j Bj^2 <= s,
  where s (equivalently λ) is a complexity parameter controlling the amount of shrinkage.
- Mitigates the high variance produced when many input variables are correlated.

3.4.3 Shrinkage Methods (cont'd)
- Ridge regression can be reparameterized using centered inputs, replacing each xij with xij − x̄j.
- Estimate the intercept by B0^ = ȳ = (1/N) Σ_i yi, and estimate the remaining coefficients by a ridge regression without an intercept.
- The ridge regression solutions are then given by B^ = (X^T X + λI)^(-1) X^T y, where X is now the centered input matrix.

3.4.3 Shrinkage Methods (cont'd)
- The singular value decomposition (SVD) of the centered input matrix X has the form X = U D V^T.
- U and V are N x p and p x p orthogonal matrices:
  - The columns of U span the column space of X.
  - The columns of V span the row space of X.
- D is a p x p diagonal matrix with d1 >= d2 >= ... >= dp >= 0 (the singular values).

3.4.3 Shrinkage Methods (cont'd)
- Using the SVD, the least squares fitted vector X B^ = X (X^T X)^(-1) X^T y can be rewritten as U U^T y.
- The ridge fitted vector is then Σ_j uj [dj^2 / (dj^2 + λ)] <uj, y>.
- A greater amount of shrinkage is applied to the directions with smaller dj^2 (see the sketches below).

3.4.3 Shrinkage Methods (cont'd)
- The SVD of the centered matrix X expresses the principal components of the variables in X.
- The eigendecomposition of X^T X is X^T X = V D^2 V^T.
- The eigenvectors vj are the principal component directions of X.
- The first principal component direction has the largest sample variance; the last one has the minimum variance.

3.4.3 Shrinkage Methods (cont'd)
- Ridge regression implicitly assumes that the response varies most in the directions of high variance of the inputs.
- It therefore shrinks the coefficients of the low-variance components more than those of the high-variance components.

3.4.3 Shrinkage Methods (cont'd)
- The lasso shrinks coefficients, like ridge regression, but has some differences.
- The estimate is defined by argmin_B Σ_i (yi − B0 − Σ_j xij Bj)^2 subject to Σ_j |Bj| <= t.
- Again, we reparameterize using centered inputs and fit the model without an intercept.
- The ridge penalty Σ_j Bj^2 is replaced by the lasso penalty Σ_j |Bj|.

3.4.3 Shrinkage Methods (cont'd)
- The lasso solution is computed using quadratic programming.
- The lasso is a kind of continuous subset selection:
  - Making t very small will set some coefficients exactly to zero.
  - If t is larger than t0 = Σ_j |Bj^|, the sum of the absolute least squares coefficients, the lasso estimates are the same as the least squares estimates.
  - If t is chosen as half of this threshold, the least squares coefficients are shrunk by about 50% on average.
- Choose t to minimize an estimate of prediction error.
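To make the ridge formulas concrete, here is a small NumPy sketch (assumptions: X and y are already centered, and λ is written lam; none of the function names come from the slides). It computes the ridge coefficients from the closed form (X^T X + λI)^(-1) X^T y and reproduces the same fitted values through the SVD, exposing the shrinkage factors dj^2 / (dj^2 + λ).

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients for centered X and y (no intercept):
    B^ = (X^T X + lam * I)^(-1) X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def ridge_fitted_via_svd(X, y, lam):
    """Same fitted values written through the SVD X = U D V^T:
    fitted = sum_j u_j * [d_j^2 / (d_j^2 + lam)] * <u_j, y>."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    shrink = d**2 / (d**2 + lam)        # shrinkage factor per principal direction
    return U @ (shrink * (U.T @ y))

# Tiny check on synthetic centered data (illustrative only)
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5)); X -= X.mean(axis=0)
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 3.0]) + rng.normal(size=50)
y = y - y.mean()
lam = 10.0
assert np.allclose(X @ ridge_fit(X, y, lam), ridge_fitted_via_svd(X, y, lam))
```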
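The slides note that the lasso, in its bound form Σ|Bj| <= t, can be computed by quadratic programming. An equivalent and now more common route, shown in this illustrative sketch, solves the Lagrangian form with penalty λ Σ|Bj| by coordinate descent with soft-thresholding; this is a substitute technique, not the method described in the slides. Raising λ (like shrinking t) drives some coefficients exactly to zero.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize 0.5 * ||y - X b||^2 + lam * sum_j |b_j| for centered X and y
    by cycling through coordinate-wise soft-threshold updates."""
    p = X.shape[1]
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)              # x_j^T x_j for each column
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]   # partial residual excluding x_j
            b[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return b
```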
3.4.4 Methods Using Derived Input Directions
- Sometimes we have many inputs and instead want a smaller number of derived inputs that are linear combinations of the original inputs.
- Different methods for constructing the linear combinations:
  - Principal components regression
  - Partial least squares

3.4.4 Methods Using Derived Input Directions (cont'd)
- Principal components regression (PCR)
- Constructs the linear combinations using the principal components zm = X vm.
- The regression is a sum of univariate regressions: y^(M) = ȳ·1 + θ1^ z1 + ... + θM^ zM, where θm^ = <zm, y> / <zm, zm>.
- The zm are each linear combinations of the original xj.
- Similar to ridge regression, but instead of shrinking the principal components we discard the p − M components with the smallest eigenvalues.

3.4.4 Methods Using Derived Input Directions (cont'd)
- Partial least squares (PLS)
- The linear combinations use y in addition to X.
- Begin by computing the univariate regression coefficient φ1j^ = <xj, y> of y on each xj.
- Construct the derived input z1 = Σ_j φ1j^ xj.
- In the construction of each zm, the inputs are weighted by the strength of their univariate effect on y.
- The outcome y is regressed on z1, then the inputs are orthogonalized with respect to z1.
- Iterate until M <= p directions have been obtained.

3.4.6 Multiple Outcome Shrinkage and Selection
- The least squares estimates in a multi-output linear model are simply the collection of individual least squares estimates for each output.
- To apply selection and shrinkage, we could apply a univariate technique individually to each outcome, or simultaneously to all outcomes.

3.4.6 Multiple Outcome Shrinkage and Selection (cont'd)
- Canonical correlation analysis (CCA)
- A data reduction technique that combines the responses; similar to PCA.
- Finds a sequence of uncorrelated linear combinations X vm of the inputs and a corresponding sequence of uncorrelated linear combinations Y um of the responses
- that successively maximizes the correlations Corr^2(Y um, X vm).

3.4.6 Multiple Outcome Shrinkage and Selection (cont'd)
- Reduced-rank regression
- Formalizes the CCA approach using a regression model that explicitly pools information across outputs.
- Solve the restricted multivariate regression problem: minimize, over coefficient matrices B of rank m, Σ_i (yi − B^T xi)^T C^(-1) (yi − B^T xi), where C is the error covariance matrix.

Questions