Lecture 4. Linear Models for Regression Outline Linear Regression Least Square Solution Subset Least Square subset selection/forward/backward Penalized Least Square: Ridge Regression LASSO Elastic Nets (LASSO+Ridge) Linear Methods for Regression Input (FEATURES) Vector: (p-dimensional) X = X1, X2, …, Xp Real Valued OUTPUT: Y Joint Distribution of (Y,X ) Function: Regression Function E(Y |X ) = f(X) Training Data : (x1, y1), (x2, y2), …, (xN, yN) for estimation of input-output relation f. Linear Model p f ( X) 0 X j j j 1 f(x): Regression function or a good approximation LINEAR in Unknown Parameters(weights, coefficients) 0 , 1 ,..., p Features Quantitative inputs Any arbitrary but known function of measured attributes Transformations of quantitative attributes: g(x), e.g., log, square, square-root etc. Basis expansions: e.g., a polynomial approximation of f as a function of X1 (Taylor Series expansion with unknown coefficients) X 2 X , X 3 X ,...X k X , 2 1 3 1 k 1 Features (Cont.) Qualitative (categorical) input G Dummy Codes: For an attribute with k categories, may use k codes j = 1,2, …, k, as indicators of the category (level) used. Together, this collection of inputs represents the effect of G through X k j j This is a set of level-dependent constants, since only one of the Xj equals one and others are zero j 1 Features(cont) Interactions: 2nd or higher-order Interactions of some features, e.g., X 3 X1 * X 2 , X 4 X1 * X 2 * X 3 Feature vector for the ith case in training set (Example) xi (xi1, xi2, ...,xip )T Generalized Linear Models: Basis Expansion Wide variety of flexible models Model for f is a linear expansion of basis functions K f ( x) k hk ( x) k 1 Dictionary: Prescribed basis functions Other Basis Functions Polynomial basis of degree s (Smooth functions Cs). Fourie Series (Band-limited functions, a compact subspace of C∞) Splines Piecewise polynomials of degree K between the knots, joined with continuity of degree K-1 at the knots (Sobolev Spaces). Wavelets (Besov Spaces) Radial Basis Functions: Symmetric p-dim kernels located at particular centroids f(|x-y|) Gaussian Kernel at each centroids And more … -- Curse of Dimensionality: p could be equal to or much larger than n. Method of Least Squares T ( , ,..., ) 0 1 p Find coefficients that minimize Residual Sum of Squares, RSS() = N N ( y f ( x )) ( y 2 i 1 i i i 1 i p 0 xij j ) 2 j 1 RSS denotes the empirical risk over the training set. It doesn’t assure the predictive performance over all inputs of interest. Min RSS Criterion Statistically Reasonable provided Examples in the Training Set Large # of independent random draws from the inputs population for which prediction is desirable. Given inputs (x1, x2, …, xN), the outputs (y1, y2, …, yN) conditionally independent In principle, predictive performance over the set of future input vectors should be examined. Gaussian Noise: the Least Squares method equivalent to Max Likelihood Min RSS() over in R(p+1), a quadratic function of . Optimal Solution: Take the derivatives with respect to elements of ,and set them equal to zero. Optimal Solution The Hession (2nd derivative) of the criterion function is given by XTX. The optimal solution satisfies the normal equations XT(Y-X ) = 0 or (XTX) XT Y ö ( X T X )1 X T Y Yö X ( X T X )1 X T Y HY , H X ( X T X )1 X T For an unique solution, the matrix XTX must full rank. Projection When the matrix XTX is full rank. the estimated response for the training set: Yö X ( X T X )1 X T Y HY , H X ( X T X )1 X T H: Projection (Hat) matrix HY: Orthogonal Projection of Y on the space spanned by the columns of X Note: the projection is linear in Y Geometrical Insight Simple Univariate Regression One Variable with no intercept LS estimate Y X N ö 1 xi yi N 1 xi2 x, y x, x inner product x, y 1 xi yi xT y N = cosine (angle between vectors x and y), a measure of similarity between y and x Residuals: projection on normal space r y xö Definition: “Regress b on a” Simple regression of response b and input a, with no intercept Estimate Residual “b adjusted for a” “b orthogonalized with respect to a” b öa ö a, b / a, a Multiple Regression Multiple Regression:p>1 LS estimates different from simple univariate regression estimates, unless columns of input matrix X orthogonal, If xi , x j 0, for all i j then öj x j , y / x j , x j These estimates are uncorrelated, and Var(öp ) 2 / xp , xp Orthogonal inputs occur sometimes in balanced, designed experiments (experimental design). Observational studies, will almost never have orthogonal inputs Must “orthogonalize” them in order to have similar interpretation Use Gram-Schmidt procedure to obtain an orthogonal basis for multiple regression Multiple Regression Estimates: Sequence of Simple Regressions Regression by Successive Orthogonalization: Initialize z0 x0 1 For, j=1, 2, …, p, Regress x j on z0 , z1, , z j 1 to produce coefficients ölj zl , x j / zl , zl and residual vectors z j x j k 1 ökj zk 1 Regress y on the residual vector z p for the estimate j Instead of using x1 and x2, take x1 and z as features zp , y ö p zp , zp Multiple Regression = Gram-Schmidt Orthogonalization Procedure The vector zp is the residual of the multiple regression of xp on all other inputs x0 , x1, , xp1 Successive z’s in the above algorithm are orthogonal and form an orthogonal basis for the columns space of X. The least squares projection onto this subspace is the usual yö By re-arranging the order of these variables, any input can be labeled as the pth variable. If x j is highly correlated with other variables, the residuals are quite small, and the coefficient has high variance. Statistics Properties of LS Model Uncorrelated noise: Mean zero, Variance 2 1 2 Then Var ( ) ( X ' X ) Noise estimation N 1 2 ( yi yöi )2 N p 1 i 1 Model d.f. = p+1 (dimension of the model space) To Draw inferences on parameter estimates, we need assumptions on noise: If assume: then, ˆ ~ N ( ,(T )1 2 ) ( N p 1)ˆ 2 ~ 2 N2 p1 Gauss-Markov Theorem (The Gauss-Markov Theorem) If we have any linear estimator that is unbiased for aT β, that is, E(cT y)= aT β,then T ˆ var(a ) var(c y) T It says that, for inputs in row space of X, LS estimate have Minimum variance among all unbiased estimates. Bias-Variance Tradeoff Mean square error of an estimator = variance + bias Least square estimator achieves the minimal variance among all unbiased estimators There are biased estimators to further reduce variance: Stein’s estimator, Shrinkage/Thresholding (LASSO, etc.) More complicated a model is, more variance but less bias, need a trade-off Hypothesis Test • Single Parameter test: βj=0, T-statistics ˆ j zj ~ t N p 1 ˆ v j where vj is the j-th diagonal element of V = (XTX)-1 ˆ z1v1/ 2 ˆ , e.g. z1-0.0025=1.96 Confidence interval j • Group parameter: 0, p1 p0 , F-statistics for nested models ( RSS0 RSS1 ) ( p1 p 0 ) F RSS1 ( N p 1) Example R command: lm(Y ~ x1 + x2 + … +xp) Rank Deficiency For an input, not in the row space of X, the estimate may change with X : rank deficient Normal equations has infinitely many solutions the solution used. Hat matrix H, and the projection are unique. For an input in the row space of X, unique LS estimate. Yö X How to generalize to inputs outside the training set? Penalized methods (!) Reasons for Alternatives to LS Estimates Prediction accuracy LS estimates have low bias but high variance when inputs are highly correlated Larger ESPE Prediction accuracy can sometimes be improved by shrinking or setting some coefficients to zero. Small bias in estimates may yield a large decrease in variance Bias/var tradeoff may provide better predictive ability Better interpretation With a large number of input variables, like to determine a smaller subset that exhibit the strongest effects. Many tools to achieve these objectives Subset selection Penalized regression constrained optimization Best Subset Selection Method Algorithm: leaps & bounds Find the best subset corresponding to the smallest RSS for each size k {0,1, 2, , p} For each fixed size k, can also find a specified number of subsets close to the best For each fixed subset, obtain LS estimates Feasible for p ~ 40. Choice of optimal k based on model selection criteria to be discussed later Other Subset Selection Procedures Larger p, Classical Forward selection (step-up), Backward elimination (step down) Hybrid forward-backward (step-wise) methods Given a model, these methods only provide local controls for variable selection or deletion Which current variable is least effective (candidate for deletion) Which variable not in the model is most effective (candidate for inclusion) Do not attempt to find the best subset of a given size Not too popular in current practice Forward Stagewise Selection (Incremental) Forward stagewise Standardize the input variables Note: Penalized Regression • Instead of directly minimize the Residual Sum Square, The penalized regression usually take the form: where J(f) is the penalization term, usually penalize on the smoothness or complexity of the function f λ is chosen by cross-validation. Model Assessment and Selection If we are in data-rich situation, split data into three parts: training, validation, and testing. Train See chapter 7.1 for details Validation Test Cross Validation When sample size not sufficiently large, Cross Validation is a way to estimate the out of sample estimation error (or classification rate). Randomly split Available Data Training Split many times and get error2, …, errorm ,then average over all error to get an estimate Test error1 Ridge Regression (Tikhonov Regularization) Ridge regression shrinks coefficients by imposing a penalty on their size Min a penalized RSS öridge arg min {RSS j 1 j2 } p Here 0 is complexity parameter, that controls the amount of shrinkage Larger its value, greater the amount of shrinkage Coefficients are shrunk towards zero Choice of penalty term based on cross validation Prostate Cancer Example Ridge Regression (cont) Equivalent problem Min RSSp subject to 2 s j j 1 Lagrangian multiplier 1-1 correspondence between s and With many correlated variables, LS estimates can become unstable and exhibit high variance and high correlations A widely large positive coeff on one variable can be cancelled by a large negative coeff on another Imposing a size constraint on the coefficients, this phenomena is prevented from occurring Ridge solutions are not invariant under scaling of inputs Normally standardize the inputs before solving the optimization problem Since the penalty term does not include the bias term, estimate intercept by the mean of response y. Ridge Regression (cont) The Ridge criterion RSS ( ) ( y X )T ( y X ) T ö ridge ( X X+I) X Y T -1 T Shrinkage: For orthogonal inputs, ridge: scaled version of LS estimates öridge ö,0 1 Centered input matrix X T SVD of X: X UDV U and V are orthogonal matrices Columns of U span column space of X Columns of V span row space of X D: a diagonal matrix of singular values Eigen decomposition of d1 d2 Ridge is mean or mode of posterior distribution of under a normal prior dp 0 X T X VD2V T The Eigen vectors v j : principal components directions of X (Karhunen-Loeve direction) Ridge Regression and Principal Components First PC direction :v1 Among all normalized linear combinations of columns of X, z1 Xv1 u1d1 the has largest sample variance Derived variable, z1 is first PC of X. Subsequent PC z j have max variance subject to being orthogonal to earlier ones. Last PC has min variance Effective Degree of Freedmon Ridge Regression (Summary) Ridge Regression penalized the complexity of a linear model by the sum squares of the coefficients It is equivalent to minimize RRS given the constraints The matrix (XTX+ I) is always invertable. The penalization parameter controls how simple “you” want the model to be. Prostate Cancer Example Ridge Regression (Summary) Solutions are not sparse in the coefficient space. - ’s are not zero almost all the time. The computation complexity is O(p3) when inversing the matrix XTX+ I. Prostate Cancer Example Least Absolute Shrinkage and Selection Operator (LASSO) Penalized RSS with L1-norm penalty, or subject to p constraint j 1 j t Shrinks like Ridge with L2norm penalty, but LASSO coefficients hit zero, as the penalty increases. LASSO as Penalized Regression • Instead of directly minimize the Residual Sum Square, The penalized regression usually take the form: where p J( f ) f 1 : i i1 p f (X) 0 X j j j 1 LASSO(cont) The computation is a quadratic programming problem. We can obtain the solution path, piece-wise linear. Coefficients are non-linear in response y (they are linear in y in Ridge Regression) Regularization parameter is chosen by cross validation. LASSO and Ridge Contour of RRS in the space of ’s Generalize to L-q norm as penalty Minimize RSS subject to constraints on the l-q norm Equivalent to Min RSS j 1 j p q Bridge regression with ridge and LASSO as special cases (q=1, smallest value for convex region) For q=0, best subset regression For 0<q<1, it is not convex! Contours of constant values of L-q norms Why non-convex norms? LASSO is biased: 1 2 y 2 ˆ S (y) sign(y)max(y ,0) ˆ ) E(y), y 0 E( Nonconvex Penalty is necessary for unbiased estimator 1 2 y J() 2 J() (y ) / 0, y Elastic Net as a compromise between Ridge and LASSO (Zou and Hastie 2005) The Group LASSO Group LASSO Group norm l1-l2 (also l1-l∞) Every group of variable are simultaneously selected or dropped Methods using Derived Directions Principal Components Regression Partial Least Squares Principal Components Regression Principal Components Regression Z1 (M<p) Motivation: leading X2 Z2 eigen-vectors describe most of the variability in X X1 Principal Components Regression Zi and Zj are orthogonal now. The dimension is reduced. High correlation between independent variables are eliminated. Noises in X’s are taken off (hopefully). Computation: PCA + Regression Partial Least Square Partial Least Squares (PLS) Uses inputs as well as response y to form the directions Zm Seeks directions that have high variance and have high correlation with response y Popular in Chemometrics If original inputs are orthogonal, finds the LS after one step. Subsequent steps have no effect. Since the derived inputs use y, the estimates are non-linear functions of the response, when the inputs are not orthogonal The coefficients for original variables tend to shrink as fewer PLS directions are used Choice of M can be made via cross validation Partial Least Square Algorithm PCR vs. PLS Principal Component Regression choose directions: Partial Least Square has m-th direction: variance tends to dominate, whence PLS is close to Ridge Ridge, PCR and PLS The solution path of different methods in a two variable (corr(X1,X2)=ρ, β=(4,2)) Regression case. Comparisons on Estimated Prediction Errors (Prostate Cancer Example) 0.574 (0.156) 0.491 (0.152) 0.540 (0.168) 0.527 (0.122) 0.636 (0.172) Least Square: Test error: 0.586 Sd of error:(0.184) LASSO and Forward Stagewise Diabetes Data LASSO and Forward Stagewise Least Angel Regression (LARS) Efron, Hastie, Johnstone, and Tibshirani (2003) Recall: Forward Stagewise Selection (Incremental) Forward stagewise Standardize the input variables Note: LAR directions and Example y Relationship between those three • Lasso and forward stagewise can be thought of as restricted version of LARS • For Lasso: Start with LARS; If a coefficient crosses zero, stop. Drop that predictor, recompute the best direction and continue. It gives the LASSO path. • For Stagewise, Start with LARS; select the most correlated direction at each stage, go that direction with a small step. • There are other related methods: •Orthogonal Matching Pursuit •Linearized Bregman Iteration Homework Project I Keyword Pricing (regression) Homework Project II Click Prediction (classification): two subproblems click/impression click/bidding Data Directory: /data/ipinyou/ Files: bid.20130301.txt: Bidding log file, 1.2M rows, 470MB imp.20130301.txt: Impression log, 0.8M rows, 360MB clk.20130301.txt: Click log file, 796 rows, 330KB data.zip: compressed files above (Password: ipinyou2013) dsp_bidding_data_format.pdf: format file Region&citys.txt: Region and City code Questions: dsp-competition@ipinyou.com Homework Project II Data Input by R: bid <read.table("/Users/Liaohairen/DSP/bid.20130301.txt", sep='\t', comment.char='') imp <read.table("/Users/Liaohairen/DSP/imp.20130301.txt", sep='\t', comment.char='’) R read.table by default uses '#' ascomment character, that is , it has the comment.char = '#' parameter, but the user-agent field may have '#' character. To read correctly, turning off of interpretation of comments by setting comment.char='' is needed. Homework Project III Heart Operation Effect Prediction (classification) Note: Large amount missing values