Lecture 4, 5 – Linear Regression
Rice ELEC 697
Farinaz Koushanfar
Fall 2006

Summary
• The simple linear regression model
• Confidence intervals
• Multiple linear regression
• Model selection and shrinkage methods
• Homework 0

Preliminaries
• Data streams X and Y, forming the measurement tuples (x_1, y_1), …, (x_N, y_N)
• x_i is the predictor (regressor, covariate, feature, independent variable)
• y_i is the response (dependent variable, outcome)
• Denote the regression function by f(x) = E(Y | x)
• The linear regression model assumes a specific linear form: f(x) = β_0 + β_1 x

Linear Regression – More
• The β's are unknown parameters (coefficients)
• The X's can come from different sources:
– Quantitative inputs
– Transformations of quantitative inputs
– Basis expansions, e.g. X_2 = X_1^2, X_3 = X_1^3
– Numeric or "dummy" coding of the levels of qualitative inputs
– Interactions between variables, e.g. X_3 = X_1 · X_2
• No matter the source of the X's, the model is linear in the parameters

Fitting by Least Squares
• Given (x_1, y_1), …, (x_N, y_N), estimate the β's!
• x_i = (x_{i1}, x_{i2}, …, x_{ip})^T is the vector of feature measurements for the i-th case
• Least squares: minimize RSS(β) = Σ_{i=1}^N (y_i − β_0 − Σ_{j=1}^p x_{ij} β_j)^2

How to Find Least Squares Fits?
• X is N × (p+1), y is the N-vector of outputs
• RSS(β) = (y − Xβ)^T (y − Xβ)
• ∂RSS/∂β = −2 X^T (y − Xβ),  ∂^2RSS/∂β∂β^T = 2 X^T X
• If X has full column rank, then X^T X is positive definite; set X^T (y − Xβ) = 0
• β̂ = (X^T X)^{-1} X^T y,  ŷ = Xβ̂ = X (X^T X)^{-1} X^T y = H y, with H the "hat matrix"

Geometrical Representation
• Least squares estimates in R^N: minimize RSS(β) = ||y − Xβ||^2 by projecting y onto the subspace spanned by the columns of X
• The residual vector y − ŷ is orthogonal to this subspace
• H (the hat matrix, a.k.a. projection matrix) performs this projection

Non-full Rank Case!
• What if the columns of X are linearly dependent?
• X^T X is singular, and the β's are not uniquely defined
• ŷ is still the projection of y onto the column space of X
• There is just more than one way to express that projection
• Resolve by dropping redundant columns of X
• Most software packages detect this and automatically implement some strategy to remove them

Sampling Properties of β̂
• Assume the y_i's are uncorrelated with variance Var(y_i) = σ^2, and the x_i's are fixed
• β̂ = (X^T X)^{-1} X^T y,  Var(β̂) = (X^T X)^{-1} σ^2
• We use an unbiased estimator for σ^2:  σ̂^2 = (1/(N − p − 1)) Σ_{i=1}^N (y_i − ŷ_i)^2
• Assume Y is linear in X_1, …, X_p and the deviations are additive, ε ~ N(0, σ^2)
• We can then show that β̂ ~ N(β, (X^T X)^{-1} σ^2)

Confidence Intervals
• Assume that our population parameter of interest is the population mean
• What is the meaning of a 95% confidence interval in this situation?
• (wrong) There is a 95% chance that the confidence interval contains the population mean
– Any particular confidence interval either contains the population mean or it doesn't; a single interval shouldn't be interpreted as a probability
• (correct) If samples of the same size are drawn repeatedly from a population, and a confidence interval is calculated from each sample, then 95% of these intervals should contain the population mean (see the simulation sketch below)

But Before We Proceed…
• Let's talk some real probability and statistics and refresh your memories about a number of useful distributions!
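To make the repeated-sampling interpretation of a confidence interval concrete, here is a minimal simulation sketch (not from the lecture; the population parameters, sample size, and number of repetitions are illustrative assumptions). It repeatedly draws samples, builds an approximate 95% interval for the mean from each, and counts how often the interval covers the true mean.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 5.0, 2.0       # true population mean and std (assumed for illustration)
n, reps = 30, 10_000       # sample size and number of repeated samples (assumed)

covered = 0
for _ in range(reps):
    sample = rng.normal(mu, sigma, size=n)
    xbar = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)            # estimated standard error of the mean
    lo, hi = xbar - 1.96 * se, xbar + 1.96 * se     # approximate 95% CI (normal quantile)
    covered += (lo <= mu <= hi)

# Each individual interval either covers mu or it doesn't, but the long-run
# fraction of intervals that do should be close to 0.95.
print(f"fraction of intervals covering the true mean: {covered / reps:.3f}")
```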
Chi-square Distribution
• For Z_1, Z_2, …, Z_ν independent, each Z_i ~ N(0, 1)
• Set χ^2 = Z_1^2 + Z_2^2 + … + Z_ν^2, with ν > 0
• The distribution of χ^2 is chi-square with ν "degrees of freedom" (DF)
• χ^2_ν(x) denotes the value of the distribution function at x
• χ^2_{p,ν} denotes the p-th quantile of the distribution
• Usually we use look-up tables to find the values
• For the variance estimate σ̂^2 = (1/(N − p − 1)) Σ_{i=1}^N (y_i − ŷ_i)^2 we have
(N − p − 1) σ̂^2 / σ^2 ~ χ^2_{N−p−1}
• This is a χ^2 distribution with (N − p − 1) DF

Example – Chi-square Distribution

t-distribution
• Let Z and χ^2_ν be independent random variables; assume further that
– Z has the standard normal distribution
– χ^2_ν has the chi-square distribution with ν degrees of freedom
– Define t as: t = Z / sqrt(χ^2_ν / ν)
• The distribution of t is referred to as the t-distribution (Student's t-distribution) with ν "degrees of freedom"
• The value of the corresponding distribution function is denoted by t_ν(x) and its p-th quantile is denoted by t_{p,ν} (0 ≤ p ≤ 1)

Example – t-distribution

Tail of t-distribution vs. Normal

F-distribution
• Let χ^2_{ν1}, χ^2_{ν2} be independent random variables; assume further that
– χ^2_{ν1} has the chi-square distribution with ν1 DF
– χ^2_{ν2} has the chi-square distribution with ν2 DF
• Define F as: F = (χ^2_{ν1} / ν1) / (χ^2_{ν2} / ν2)
• This is the F distribution with ν1 DF in the numerator and ν2 DF in the denominator
• F_{ν1,ν2}(x): distribution function at x; the p-th quantile is F_{p,ν1,ν2}
• F is always positive, and F_{ν1,ν2}(−∞) = F_{ν1,ν2}(0) = 0, so the quantiles are > 0

Example – F-distribution

Confidence Interval (CI)
• A CI is an estimated range of values which is likely to include an unknown population parameter
• The range is calculated from a given set of sample data
• If independent samples are taken repeatedly from the same population, and a CI is calculated for each sample, a certain percentage (the confidence level) of the intervals will include the unknown population parameter

Hypothesis Testing
• (N − p − 1) σ̂^2 / σ^2 ~ χ^2_{N−p−1}  (χ^2 distribution, N − p − 1 DF)
• β̂ and σ̂^2 are statistically independent
• Hypothesis: is β_j = 0?
• Standardized coefficient or Z-score: z_j = β̂_j / (σ̂ √v_j)
• v_j: the j-th diagonal element of (X^T X)^{-1}
• If β_j = 0, z_j is distributed as t_{N−p−1} (t distribution, N − p − 1 DF)
• Enter the t-table for N − p − 1 DF, choose the significance level (α) required, and find the tabulated value
• If |z_j| exceeds the tabulated value, then β_j is significantly different from zero and the null hypothesis is rejected! (See the sketch below.)
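The following is a minimal sketch of the coefficient Z-score test above on simulated data (the data-generating coefficients, noise level, and sample size are assumptions made for illustration, not part of the lecture). It fits β̂ = (X^T X)^{-1} X^T y, estimates σ̂^2, and compares each z_j with the t_{N−p−1} critical value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # design matrix with intercept
beta_true = np.array([1.0, 2.0, 0.0, -0.5])                  # assumed coefficients; one is truly 0
y = X @ beta_true + rng.normal(scale=1.0, size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                       # least squares estimate
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (N - p - 1)           # unbiased estimate of sigma^2

v = np.diag(XtX_inv)                               # v_j: j-th diagonal of (X^T X)^{-1}
z = beta_hat / np.sqrt(sigma2_hat * v)             # Z-scores (t statistics)

t_crit = stats.t.ppf(0.975, df=N - p - 1)          # two-sided test at significance level 0.05
for j, (b, zj) in enumerate(zip(beta_hat, z)):
    print(f"beta_{j}: estimate={b: .3f}, z={zj: .2f}, reject H0? {abs(zj) > t_crit}")
```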
Test the Significance of a Group
• Simultaneously test the significance of a group of β's
• F-statistic: F = [(RSS_0 − RSS_1) / (p_1 − p_0)] / [RSS_1 / (N − p_1 − 1)]
• RSS_1 is for the bigger model with p_1 + 1 parameters
• RSS_0 is for the nested smaller model with p_0 + 1 parameters, having p_1 − p_0 parameters constrained to be zero
• The F-stat measures the change in RSS per additional parameter, normalized by an estimate of σ^2
• If the smaller model is correct, the F-stat ~ F_{p_1−p_0, N−p_1−1} (see the sketch below)

Confidence Intervals (CI)
• Recall that β̂ ~ N(β, (X^T X)^{-1} σ^2)
• Isolate β_j and get the (1 − 2α) CI for β_j:
(β̂_j − z^{(1−α)} √v_j σ̂,  β̂_j + z^{(1−α)} √v_j σ̂)
• where z^{(1−α)} is the (1 − α) percentile of the normal distribution
• Similarly, obtain a confidence set for the entire vector β:
C_β = {β | (β̂ − β)^T X^T X (β̂ − β) ≤ σ̂^2 χ^2_{p+1}^{(1−α)}}
• where χ^2_ℓ^{(1−α)} is the (1 − α) percentile of the χ^2 distribution with ℓ DF

The Gauss-Markov Theorem
• The least squares estimates of the parameters have the smallest variance among all linear unbiased estimates
• The restriction to unbiased estimation is not always the best
• Focus on a linear combination θ = a^T β of the parameters
• For example, the prediction f(x_0) = x_0^T β
• For a fixed x_0 this is linear in β; consider competing estimators that are linear in the data, β̃ = Cy

The Gauss-Markov Theorem – Proof
• β̃ = Cy: since the new estimator is unbiased,
E(Cy | X) = E(CXβ + Cε | X) = CXβ = β  ⇒  CX = I
• Var(β̃ | X) = C Var(y | X) C^T = σ^2 CC^T
• Need to show Var(β̃ | X) ≥ Var(β̂_ols | X)
• Define D = C − (X^T X)^{-1} X^T, or Dy = β̃ − β̂_ols
• I = CX = (D + (X^T X)^{-1} X^T) X = DX + (X^T X)^{-1} X^T X  ⇒  DX = 0
• Var(β̃ | X) = σ^2 CC^T = σ^2 [D + (X^T X)^{-1} X^T][D^T + X (X^T X)^{-1}] = σ^2 [DD^T + (X^T X)^{-1}] since DX = 0, and DD^T is positive semidefinite
• Hence Var(β̃ | X) ≥ Var(β̂_ols | X), with equality only for D = 0, where β̃ = β̂_ols

Multiple Outputs
• Multiple outputs Y_1, Y_2, …, Y_K from multiple inputs X_0, X_1, X_2, …, X_p
• For each output: Y_k = β_{0k} + Σ_{j=1}^p X_j β_{jk} + ε_k = f_k(X) + ε_k
• In matrix form Y = XB + E  (Y is N×K, X is N×(p+1), B is (p+1)×K, E is N×K)
• RSS(B) = Σ_{k=1}^K Σ_{i=1}^N (y_{ik} − f_k(x_i))^2 = tr[(Y − XB)^T (Y − XB)]
• The least squares estimates: B̂ = (X^T X)^{-1} X^T Y
• Note: if the errors are correlated, Cov(ε) = Σ, the criterion becomes the weighted sum
RSS(B; Σ) = Σ_{i=1}^N (y_i − f(x_i))^T Σ^{-1} (y_i − f(x_i))

The Bias-Variance Trade-off
• A good measure of the quality of an estimator f*(x) is the MSE
• Let f_0(x) be the true value of f(x) at the point x. Then:
• MSE[f*(x)] = E[(f*(x) − f_0(x))^2], which can be written as
• MSE[f*(x)] = Var[f*(x)] + [E f*(x) − f_0(x)]^2
• This is variance plus squared bias
• Typically, when bias is low, variance is high and vice versa; choosing estimators often involves a trade-off between bias and variance

Linear Methods for Regression
• If the linear model is correct for a given problem, then the OLS prediction is unbiased and has the lowest variance among all linear unbiased estimators
• But there can be (and often exist) biased estimators with smaller MSE
• Generally, by regularizing (shrinking, dampening, controlling) the estimator in some way, its variance will be reduced; if the corresponding increase in bias is small, this will be worthwhile
• Examples of regularization: subset selection (forward, backward, all subsets), ridge regression, the lasso
• In reality, models are almost never correct, so there is an additional model bias between the closest member of the linear model class and the truth

Model Selection
• Often we prefer a restricted estimate because of its reduced estimation variance

Subset Selection and Coefficient Shrinkage
• There are two reasons why we are often not satisfied with the least squares estimates:
• Prediction accuracy: it can sometimes be improved by trading a little bias to reduce the variance of the predicted values
• Interpretation: we would like to determine a smaller subset of predictors that exhibit the strongest effects; to get the "big picture" we sacrifice some of the small details
• Let us describe a number of approaches to variable selection and coefficient shrinkage…
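Before turning to specific selection methods, here is a minimal sketch of the nested-model F-test defined above, on simulated data (the data-generating coefficients, sample size, and which predictors are dropped are assumptions made for illustration). The same F-ratio drives the stepwise procedures described next.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
N = 120
X = np.column_stack([np.ones(N), rng.normal(size=(N, 4))])   # intercept + 4 predictors
beta = np.array([0.5, 1.0, -1.0, 0.0, 0.0])                  # last two coefficients truly zero
y = X @ beta + rng.normal(size=N)

def rss(Xm, y):
    """Residual sum of squares of the least-squares fit of y on Xm."""
    b = np.linalg.lstsq(Xm, y, rcond=None)[0]
    r = y - Xm @ b
    return r @ r

p1 = X.shape[1] - 1             # larger model: p1 = 4 predictors (+ intercept)
p0 = 2                          # smaller nested model: first 2 predictors (+ intercept)
rss1 = rss(X, y)
rss0 = rss(X[:, :p0 + 1], y)

F = ((rss0 - rss1) / (p1 - p0)) / (rss1 / (N - p1 - 1))
p_value = 1 - stats.f.cdf(F, p1 - p0, N - p1 - 1)
print(f"F = {F:.3f}, p-value = {p_value:.3f}")   # a large p-value favors the smaller model
```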
Subset Selection – Linear Model
• Best subset selection: finds the subset of size k that gives the smallest RSS
– Leaps and bounds procedure (Furnival and Wilson, 1974) – essentially an exhaustive search over all possible combinations!
• Let's take another look at the prostate cancer example:
• Correlation between the level of prostate-specific antigen (PSA) and clinical predictors
• We use log(PSA), lpsa, as the response variable
• Denote: lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45

Prostate Cancer Data

All the Subset Models for PC Example

Subset Selection – Linear Model
• Exhaustive search (best subset selection) becomes infeasible for large p
• Forward stepwise selection: start from the intercept and sequentially add variables
• With β̂ the fit for the current k inputs and β̃ the fit with one more (k + 1) input,
F = [RSS(β̂) − RSS(β̃)] / [RSS(β̃) / (N − k − 2)]
• Typically, add the predictor producing the largest value of F, stopping when no predictor can produce an F-ratio greater than the 90th-95th percentile of the F_{1, N−k−2} distribution

Subset Selection – Linear Model
• Backward stepwise selection: start with the full model and sequentially delete predictors
• Again, the F-ratio is typically used as the stopping criterion
• Drop the predictor producing the smallest F at each stage, stopping when every predictor remaining in the model produces a value of F greater than the 90th-95th percentile when dropped
• There are also hybrid strategies that simultaneously consider both forward and backward moves
• Note: this is only a local search; we are not considering all combinations, just sequential additions and deletions of variables

K-fold Cross-Validation
• Primary method for estimating a tuning parameter λ (such as the subset size)
• Divide the data into K roughly equal parts (typically K = 5 or 10)
• For each k = 1, 2, …, K, fit the model with parameter λ to the other K − 1 parts, giving β̂^{−k}(λ), and compute its error in predicting the k-th part:
E_k(λ) = Σ_{i in k-th part} (y_i − x_i^T β̂^{−k}(λ))^2
• This gives the cross-validation error CV(λ) = (1/K) Σ_{k=1}^K E_k(λ)
• Do this for many values of λ, and choose the λ that makes CV(λ) smallest

K-fold Cross-Validation Notations
• In our variable-subsets example, λ is the subset size
• β̂^{−k}(λ) are the coefficients for the best subset of size λ, found from the training set that leaves out the k-th part of the data
• E_k(λ) is the estimated test error for this best subset
• From the K cross-validation training sets, the K test error estimates are averaged to give CV(λ)
• Note that different subsets of size λ will (probably) be found from each of the K cross-validation training sets. That doesn't matter: the focus is on the subset size, not the actual subset

The Bootstrap Approach
• The bootstrap works by sampling N times with replacement from the training set to form a "bootstrap" data set. The model is then estimated on the bootstrap data set, and predictions are made for the original training set
• This process is repeated many times and the results are averaged
• The bootstrap is most useful for estimating standard errors of predictions
• Modified versions of the bootstrap can also be used to estimate prediction error
• It sometimes produces better estimates than cross-validation (a topic of current research)

Shrinkage Methods – Ridge Regression
• Because subset selection is a discrete process, it often produces a high-variance model
• Shrinkage methods are more continuous
• Ridge regression: the ridge coefficients minimize a penalized RSS:
β̂^{ridge} = argmin_β { Σ_{i=1}^N (y_i − β_0 − Σ_{j=1}^p x_{ij} β_j)^2 + λ Σ_{j=1}^p β_j^2 }
• Equivalently: minimize the RSS subject to Σ_{j=1}^p β_j^2 ≤ t
• The parameter λ > 0 penalizes each β_j in proportion to its size β_j^2. The solution is
β̂^{ridge} = (X^T X + λ I)^{-1} X^T y
• where I is the identity matrix. This is a biased estimator that for some value of λ > 0 may have smaller mean squared error than the least squares estimator. Note that λ = 0 gives the least squares estimator, and as λ → ∞, β̂^{ridge} → 0 (see the sketch below)
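A minimal numpy sketch of the closed-form ridge solution above (the simulated data and the grid of λ values are illustrative assumptions): it shows λ = 0 recovering least squares and the coefficient norm shrinking toward zero as λ grows.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 50, 10
X = rng.normal(size=(N, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(size=N)

# Center y and standardize X, as is customary before ridge (intercept handled separately).
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()

def ridge(Xc, yc, lam):
    """Closed-form ridge solution beta_hat = (X^T X + lam * I)^{-1} X^T y."""
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

for lam in [0.0, 1.0, 10.0, 100.0]:
    b = ridge(Xc, yc, lam)
    print(f"lambda = {lam:6.1f}   ||beta_hat|| = {np.linalg.norm(b):.3f}")
# lambda = 0 recovers ordinary least squares; the coefficients shrink toward zero as lambda grows.
```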
Prostate Cancer Example (Cont'd)

Shrinkage Methods – The Lasso
• The lasso is a shrinkage method like ridge, but it acts in a nonlinear manner on the outcome y
• The lasso is defined by:
β̂^{lasso} = argmin_β Σ_{i=1}^N (y_i − β_0 − Σ_{j=1}^p x_{ij} β_j)^2  subject to  Σ_{j=1}^p |β_j| ≤ t
• Notice that the ridge penalty Σ_j β_j^2 is replaced by Σ_j |β_j|
• This makes the solutions nonlinear in y, and a quadratic programming algorithm is used to compute them
• Because of the nature of the constraint, if t is chosen small enough the lasso will set some coefficients exactly to zero. Thus the lasso does a kind of continuous model selection

Shrinkage Methods
• The parameter t should be adaptively chosen to minimize an estimate of expected prediction error, using, say, cross-validation
• Ridge vs. lasso: if the inputs are orthogonal, ridge multiplies the least squares coefficients by a constant < 1, while the lasso translates them towards zero by a constant, truncating at zero

Prostate Cancer Example (Cont'd)

Lagrange Multiplier – Concept
• Solving an optimization problem – finding a min or max, e.g., min f(P)
• No closed form, subject to constraints (e.g. g(P) = 0)
• With no constraints, you would set grad(f(P)) = 0
• With constraints, define F(P, λ) = f(P) − λ g(P)
• Now set the gradient of F to 0: grad(F(P, λ)) = 0
• One more dimension: the partial derivative w.r.t. λ set to 0 gives g(P) = 0
• Thus, grad(F) = 0 automatically satisfies our constraint
• Add a new Lagrange multiplier (λ) for each constraint

Lagrange Multiplier – Geometrical Example
• In 2D, assume the objective function is min f(x, y)
• Constraint: g(x, y) − c = 0
• Traverse along g(x, y) − c = 0
• It may intersect the contours f(x, y) = d_n at many points
• Only where a contour of f touches g = c tangentially, but does not cross it, do we find the min
• For more info, see "Lagrange Multipliers without Permanent Scarring" (by Dan Klein)
From Wikipedia: drawn in green is the locus of points satisfying the constraint g(x, y) = c. Drawn in blue are contours of f. Arrows represent the gradient, which points in a direction normal to the contour.

Principal Component Analysis (PCA) Intro – Eigenvalues, Eigenvectors
• A square matrix can be interpreted as a transformation – rotation, stretching, shearing, etc.
• Eigenvectors of a transformation are vectors that are either left intact or simply multiplied by a scalar factor after the transformation
• An eigenvector's eigenvalue is the scale factor by which it has been multiplied
• E.g., matrix A, eigenvector u, eigenvalue λ
• By definition, Au = λu
– To find them, set Au − λu = 0, i.e. det(A − λI) = 0, and solve the resulting system of linear equations

PCA Intro – Singular Value Decomposition (SVD)
• The covariance matrix of a matrix X (N×p, p ≤ N) is a p×p (square) matrix whose elements are the covariances between column pairs of X
• Singular value decomposition: the matrix X can be written as X = UDV^T
• U and V are N×p and p×p orthogonal matrices:
– The columns of U span the column space of X, with U^T U = I_{p×p}
– The columns of V span the row space, with V^T V = I_{p×p}
– D is a p×p diagonal matrix, with diagonal entries d_1 ≥ d_2 ≥ … ≥ d_p ≥ 0 called the singular values of X
• The SVD always exists, and is unique up to signs

PCA Intro – SVD, Correlation Matrix
• The SVD finds the eigenvalues and eigenvectors of XX^T and X^T X
• The eigenvectors of X^T X make up the columns of V; the eigenvectors of XX^T make up the columns of U
• X^T X = VD^T U^T UDV^T = VD^2 V^T, so V holds the eigenvectors of the covariance matrix
• The singular values (the diagonal entries of D) are the square roots of the eigenvalues of X^T X and are arranged in descending order
• The singular values are always real numbers; if the matrix X is real, then U and V are also real (see the sketch below)
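A minimal sketch of the SVD computations above (the simulated correlated inputs are an assumption for illustration): it computes X = UDV^T for a centered X, forms the principal component scores Z = XV = UD, and checks that X^T X = VD^2 V^T.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 200, 5
X = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))   # inputs with correlated columns
Xc = X - X.mean(axis=0)                                  # center before PCA

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)        # X = U D V^T (thin SVD)
V = Vt.T

# Principal component directions are the columns of V; the scores are Z = X V = U D.
Z = Xc @ V
print("singular values d_j:       ", np.round(d, 3))
print("d_j^2 / N:                 ", np.round(d**2 / N, 3))
print("sample variance of each PC:", np.round(Z.var(axis=0), 3))   # matches d_j^2 / N

# Sanity check: X^T X = V D^2 V^T (eigendecomposition of the unnormalized covariance).
assert np.allclose(Xc.T @ Xc, V @ np.diag(d**2) @ Vt)
```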
PCA and Dimension Reduction
• The eigenvectors v_j are the principal component directions of X
• The first principal component z_1 = Xv_1 = u_1 d_1 has the largest sample variance among all normalized linear combinations of the columns of X: var(z_1) = d_1^2 / N
• Subsequent z_j's have maximal variance d_j^2 / N subject to the constraint of being orthogonal to the earlier ones
• If we have a large number of correlated inputs, we can produce a small number of linear combinations Z_m, m = 1, …, M (M < p) of the original inputs X_j, and then use the Z_m's as regression inputs

Methods Using Derived Input Directions
• Principal component regression (PCA regression)
• Let D_q be D with all but the first q diagonal elements set to zero. Then X̂_q = U D_q V^T, and X̂_q minimizes ||X − X̃|| over all matrices X̃ with rank(X̃) ≤ q
• Write q_{(j)} for the ordered principal component directions, ordered from largest to smallest value of d_j^2
• Principal component regression computes the derived input columns z_j = X q_{(j)} and then regresses y on z_1, z_2, …, z_J for some J ≤ p
• Since the z_j's are orthogonal, this regression is just a sum of univariate regressions:
ŷ^{pcr} = ȳ + Σ_{j=1}^J θ̂_j z_j, where θ̂_j is the univariate regression coefficient of y on z_j

PCA vs. Ridge Regression
• Principal components regression is very similar to ridge regression: both operate on the principal components of the input matrix
• Ridge regression shrinks the coefficients of the principal components, with relatively more shrinkage applied to the smaller components than the larger ones; principal components regression discards the p − J smallest-eigenvalue components

Partial Least Squares
• This technique also constructs a set of linear combinations of the x_j's for regression, but unlike principal components regression it uses y (in addition to X) for this construction
• We assume that y is centered and begin by computing the univariate regression coefficient γ̂_j of y on each x_j
• From these we construct the derived input z_1 = Σ_j γ̂_j x_j, which is the first partial least squares direction
• The outcome y is regressed on z_1, giving coefficient θ̂_1, and then we orthogonalize y, x_1, …, x_p with respect to z_1: r_1 = y − θ̂_1 z_1 and x_l^* = x_l − θ̂_l z_1, where θ̂_l is the coefficient from regressing x_l on z_1
• We continue this process until J directions are obtained

Partial Least Squares (Cont'd)
• In this manner, partial least squares produces a sequence of derived inputs or directions z_1, z_2, …, z_J
• As with principal components regression, if we continue on to construct J = p new directions we get back the ordinary least squares estimates; use of J < p directions produces a reduced regression
• Notice that in the construction of each z_j, the inputs are weighted by the strength of their univariate effect on y

Example – Cancer Data (Cont'd)

Ridge vs. PCA vs. PLS vs. Lasso
• Recent studies have shown that ridge and PCR outperform PLS in prediction, and they are simpler to understand
• The lasso outperforms ridge when there is a moderate number of sizable effects rather than many small effects; it also produces more interpretable models
• These are still topics of ongoing research

PCA – Singular Value Decomposition (SVD)
• The SVD of the N×p matrix X has the form X = UDV^T
• U and V are N×p and p×p orthogonal matrices:
– The columns of U span the column space of X
– The columns of V span the row space
– D is a p×p diagonal matrix, with diagonal entries d_1 ≥ d_2 ≥ … ≥ d_p ≥ 0 called the singular values of X
• The SVD always exists, and is unique up to signs
• The columns of V are the principal component directions, and z_j = Xv_j = u_j d_j (see the PCR sketch below)
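A minimal sketch of principal component regression as described above (the simulated data and the choice J = 3 are illustrative assumptions): because the derived inputs z_j are orthogonal with z_j^T z_j = d_j^2, the fit reduces to J univariate regressions of y on the z_j.

```python
import numpy as np

rng = np.random.default_rng(5)
N, p, J = 150, 8, 3          # keep the first J principal components (illustrative choice)
X = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))   # correlated inputs
y = X @ rng.normal(size=p) + rng.normal(size=N)

Xc = X - X.mean(axis=0)      # center inputs and response
yc = y - y.mean()

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T[:, :J]                          # derived inputs z_j = X v_j, j = 1..J

# Orthogonal z_j's: each univariate coefficient is z_j^T y / (z_j^T z_j), with z_j^T z_j = d_j^2.
theta = (Z.T @ yc) / (d[:J] ** 2)
y_hat = y.mean() + Z @ theta                  # PCR fit: y_bar + sum_j theta_j z_j

beta_pcr = Vt.T[:, :J] @ theta                # coefficients mapped back to the original inputs
print("PCR coefficients:", np.round(beta_pcr, 3))
print("training MSE:", np.round(np.mean((y - y_hat) ** 2), 3))
```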