Multiple Linear Regression
University of California, San Diego
Instructor: Ery Arias-Castro
http://math.ucsd.edu/~eariasca/teaching.html

Passenger car mileage

Consider the carmpg dataset taken from here. Variables:
make.model – Make and model
vol – Cubic feet of cab space
hp – Engine horsepower
mpg – Average miles per gallon
sp – Top speed (mph)
wt – Vehicle weight (100 lb)
Goal: Predict a car's gas consumption based on its characteristics.
Graphics: pairwise scatterplots and possibly individual boxplots.
To [R].

Scatterplot highlights

hp, sp and wt seem correlated.
vol and wt seem correlated.
Highly correlated predictors can lead to a misinterpretation of the fit.
mpg seems polynomial in hp and sp.
mpg versus wt shows a fan shape, indicating unequal variances.

Least squares regression

We fit a linear model:
$\text{mpg} = \beta_0 + \beta_1\,\text{vol} + \beta_2\,\text{hp} + \beta_3\,\text{sp} + \beta_4\,\text{wt}$
With data $\{(x_i, y_i) : i = 1, \dots, n\}$, where $x_i = (x_{i,1}, \dots, x_{i,p})$, we fit:
$y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} + \epsilon_i$
Standard assumptions: the measurement errors are i.i.d. normal with mean zero, i.e.
$\epsilon_1, \dots, \epsilon_n \overset{\text{iid}}{\sim} N(0, \sigma^2)$
The least squares fit minimizes the error sum of squares
$SSE(\beta_0, \beta_1, \dots, \beta_p) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i,1} - \cdots - \beta_p x_{i,p})^2$
Under the standard assumptions, this fit corresponds to the MLE.

Fitted values, residuals and standard error

The fitted (predicted) values are defined as
$\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_{i,1} + \cdots + \hat\beta_p x_{i,p}$
The residuals are defined as
$e_i = y_i - \hat{y}_i$
Our estimate for $\sigma^2$ is the mean squared error of the fit
$\hat\sigma^2 = \frac{1}{n-p-1} \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{1}{n-p-1} \sum_{i=1}^n e_i^2$

Matrix interpretation

Let $X$ be the $n \times (p+1)$ matrix with row vectors $\{(1, x_i) : i = 1, \dots, n\}$.
Define $y = (y_1, \dots, y_n)$, $\epsilon = (\epsilon_1, \dots, \epsilon_n)$ and $\beta = (\beta_0, \beta_1, \dots, \beta_p)$. The model is
$y = X\beta + \epsilon$
The least squares estimate $\hat\beta$ minimizes $\|y - X\beta\|^2$. If the columns of $X$ are linearly independent, then
$\hat\beta = (X^T X)^{-1} X^T y$
Note that the estimate is linear in the response.
Define the hat matrix $H = X(X^T X)^{-1} X^T$. $H$ is the orthogonal projection onto $\operatorname{span}(X)$.
The fitted values may be expressed as $\hat{y} = X(X^T X)^{-1} X^T y = Hy$.
The residuals may be expressed as $e = y - \hat{y} = (I - H)y$.

Distributions

Assume that the $\epsilon_i$'s are i.i.d. normal with mean zero. Define $M = (X^T X)^{-1} \in \mathbb{R}^{(p+1)\times(p+1)}$.
$\hat\beta = (\hat\beta_0, \dots, \hat\beta_p)$ has the multivariate normal distribution with mean $\beta$ (unbiased) and covariance matrix $\sigma^2 M$. I.e. the $\hat\beta_j$'s are jointly normal with
$E\,\hat\beta_j = \beta_j, \quad \operatorname{cov}(\hat\beta_j, \hat\beta_k) = \sigma^2 M_{jk}$
For $\hat\sigma^2$, we have
$\hat\sigma^2 \sim \frac{\sigma^2}{n-p-1}\, \chi^2_{n-p-1}$
$\hat\beta$ and $\hat\sigma^2$ are independent.

t-Tests

Consequently,
$\frac{\hat\beta_j - \beta_j}{\sqrt{\hat\sigma^2 M_{jj}}} \sim t_{n-p-1}$
In general, if $c = (c_0, c_1, \dots, c_p)^T \in \mathbb{R}^{p+1}$, then
$\frac{c^T \hat\beta - c^T \beta}{\widehat{\operatorname{se}}(c^T \hat\beta)} \sim t_{n-p-1}, \quad \text{where } \widehat{\operatorname{se}}(c^T \hat\beta) = \hat\sigma \sqrt{c^T M c}$
In particular, the following is a level-$(1-\alpha)$ confidence interval for $\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p = \beta^T x$, where $x = (1, x_1, \dots, x_p)$:
$\hat\beta^T x \pm t^{\alpha/2}_{n-p-1} \, \widehat{\operatorname{se}}(\hat\beta^T x)$
To [R].
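A rough R sketch of the steps above. The file name and the values used for prediction are hypothetical placeholders; only the column names (vol, hp, sp, wt, mpg) come from the dataset description.

```r
# Read the data; "carmpg.csv" is a placeholder for wherever the dataset is stored.
carmpg <- read.csv("carmpg.csv")

# Pairwise scatterplots of the predictors and the response.
pairs(carmpg[, c("vol", "hp", "sp", "wt", "mpg")])

# Least squares fit of mpg on vol, hp, sp and wt.
fit <- lm(mpg ~ vol + hp + sp + wt, data = carmpg)

summary(fit)   # estimates, t-tests, residual standard error (sigma-hat), R^2
confint(fit)   # marginal confidence intervals for the beta_j's

# Level-0.95 confidence interval for beta^T x at a hypothetical new car.
newcar <- data.frame(vol = 100, hp = 100, sp = 110, wt = 30)
predict(fit, newdata = newcar, interval = "confidence", level = 0.95)
```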
Confidence bands / regions

Using the fact that
$\frac{(\hat\beta - \beta)^T X^T X (\hat\beta - \beta)}{(p+1)\hat\sigma^2} \sim F_{p+1,\,n-p-1}$
the following defines a level-$(1-\alpha)$ confidence region (an ellipsoid) for $\beta$:
$(\hat\beta - \beta)^T X^T X (\hat\beta - \beta) \le (p+1)\,\hat\sigma^2\, F^{\alpha}_{p+1,\,n-p-1}$
Using the same fact, Scheffe's S-method gives that
$c^T \hat\beta \pm \big((p+1) F^{\alpha}_{p+1,\,n-p-1}\big)^{1/2} \, \widehat{\operatorname{se}}(c^T \hat\beta)$
is a level-$(1-\alpha)$ confidence interval for $c^T \beta$ simultaneously for all $c \in \mathbb{R}^{p+1}$.
As a special case, we obtain the following confidence band:
$\hat\beta^T x \pm \big((p+1) F^{\alpha}_{p+1,\,n-p-1}\big)^{1/2} \, \widehat{\operatorname{se}}(\hat\beta^T x)$

Comparing models

Say we want to test
$H_0: \beta_1 = \cdots = \beta_p = 0$ versus $H_1: \beta_j \ne 0$ for some $j = 1, \dots, p$.
Same as
$H_0: y_i = \beta_0 + \epsilon_i$ versus $H_1: y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} + \epsilon_i$.
The model under $H_0$ is a submodel of the model under $H_1$.

Analysis of variance

The residual sum of squares under $H_0$ is
$SSY = \sum_{i=1}^n (y_i - \bar{y})^2$
It has $n-1$ degrees of freedom.
The residual sum of squares under $H_1$ is
$SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2$
It has $n-p-1$ degrees of freedom.
The sum of squares due to regression is
$SS_{\text{reg}} = SSY - SSE = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$
It has $p$ degrees of freedom.
The ANOVA F-test rejects for large
$F = \frac{SS_{\text{reg}}/p}{SSE/(n-p-1)}$
Under $H_0$, $F \sim F_{p,\,n-p-1}$.

Coefficient of (multiple) determination

Defined as
$R^2 = 1 - \frac{SSE}{SSY}$
Interpreted as the fraction of the variation in $y$ explained by $x$.
$R^2$ always increases when variables are added.
The adjusted $R^2$ incorporates the number of predictors:
$R^2_a = 1 - \frac{SSE/(n-p-1)}{SSY/(n-1)}$

Testing whether a subset of the variables are zero

Consider a subset of variables $J \subset \{1, \dots, p\}$. We want to test
$H_0: \beta_j = 0$ for all $j \in J$ versus $H_1: \beta_j \ne 0$ for some $j \in J$.
Same as
$H_0: y_i = \beta_0 + \sum_{j \notin J} \beta_j x_{i,j} + \epsilon_i$ versus $H_1: y_i = \beta_0 + \sum_{j=1}^p \beta_j x_{i,j} + \epsilon_i$.
The model under $H_0$ is a submodel of the model under $H_1$.

Analysis of Variance

Let $SSE_J$ be the residual sum of squares for the model under $H_0$, and $SSE$ the residual sum of squares for the model under $H_1$.
The ANOVA F-test rejects for large
$F = \frac{(SSE_J - SSE)/|J|}{SSE/(n-p-1)}$
Under the null, $F \sim F_{|J|,\,n-p-1}$.
In particular, testing $\beta_j = 0$ versus $\beta_j \ne 0$ using the F-test above is equivalent to using the t-test described earlier.

Testing whether a given set of linear combinations are zero

Consider a full-rank $q$-by-$(p+1)$ matrix $A$. We want to test
$H_0: A\beta = 0$ versus $H_1: A\beta \ne 0$.
Assuming $A = (A_1, A_2)$ with $A_2$ $q$-by-$q$ invertible, and partitioning $X = (X_1, X_2)$ and $\beta = (\beta_1, \beta_2)$ accordingly, we are comparing the two models
$H_0: y = (X_1 - X_2 A_2^{-1} A_1)\beta_1 + \epsilon$ versus $H_1: y = X\beta + \epsilon$.
The model under $H_0$ is a submodel of the model under $H_1$.

Analysis of Variance

Let $SSE_A$ be the residual sum of squares for the model under $H_0$, and $SSE$ the residual sum of squares for the model under $H_1$.
The ANOVA F-test rejects for large
$F = \frac{(SSE_A - SSE)/\operatorname{rank}(A)}{SSE/(n-p-1)}$
Under the null, $F \sim F_{\operatorname{rank}(A),\,n-p-1}$.
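The nested-model F-tests above can be run in R with anova(); a sketch, continuing with the fit of mpg on vol, hp, sp and wt from the earlier sketch (the choice of the subset J = {vol, wt} is purely illustrative):

```r
# Full model and the submodel obtained by setting the coefficients in J = {vol, wt} to zero.
fit.full    <- lm(mpg ~ vol + hp + sp + wt, data = carmpg)
fit.reduced <- lm(mpg ~ hp + sp, data = carmpg)

# Partial F-test: F = ((SSE_J - SSE)/|J|) / (SSE/(n - p - 1)).
anova(fit.reduced, fit.full)

# Overall F-test of beta_1 = ... = beta_p = 0 (also reported by summary(fit.full)).
anova(lm(mpg ~ 1, data = carmpg), fit.full)
```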
Spreading

Spreading the variables often equalizes the leverage of the data points and improves the fit, both numerically (e.g. $R^2$) and visually.
A typical situation is an accumulation of data points near 0, in which case applying a logarithm or a square root helps spread the data more evenly.
Example: mammals dataset.

Standardized variables

Standardizing the variables removes the units and makes comparing the magnitudes of the coefficients easier.
One way to do so is to make all the variables mean 0 and unit length:
$y \leftarrow \frac{y - \bar{y}}{\sqrt{SSY}} \quad \text{and} \quad x_j \leftarrow \frac{x_j - \bar{x}_j}{\sqrt{SSX_j}}$
An intercept is not needed anymore.
Standardizing the variables changes the estimates $\hat\beta_j$, but not $\hat\sigma^2$, the multiple $R^2$ or the p-values of the tests we saw.

Checking assumptions

Can we rely on those p-values? In other words, are the assumptions correct?
1. Mean zero, i.e. model accuracy
2. Equal variances, i.e. homoscedasticity
3. Independence
4. Normally distributed

Residuals

We look at the residuals as substitutes for the errors. Remember that $e = (e_1, \dots, e_n)$, with
$e = y - \hat{y} = y - X\hat\beta = (I - H)y$
Under the standard assumptions, $e$ is multivariate normal with mean zero and covariance $\sigma^2(I - H)$. In particular,
$\operatorname{var}(e_i) = \sigma^2 (1 - h_{ii}) \ \forall i, \qquad \operatorname{cov}(e_i, e_j) = -\sigma^2 h_{ij} \ \forall i \ne j$

Standardized residuals

In practice, the residuals are often corrected for unequal variance.
Internally studentized residuals (what R uses):
$r_i = \frac{e_i}{\hat\sigma \sqrt{1 - h_{ii}}}$
Note that $\frac{r_i^2}{n-p-1} \sim \text{Beta}\big(\tfrac{1}{2}, \tfrac{n-p-2}{2}\big)$, so that $r_i \mathrel{\dot\sim} t_{n-p-1}$.
Externally studentized residuals:
$t_i = \frac{e_i}{\hat\sigma_{(i)} \sqrt{1 - h_{ii}}}$
where $\hat\sigma_{(i)}$ comes from the fit with the $i$th observation removed. Note that $t_i \sim t_{n-p-2}$.

Checking model accuracy

With several predictors, the residuals-vs-fitted-values plot may be misleading. Instead, we look at partial residual plots, which focus on one variable at a time.
Say we want to look at the influence of predictor $X_j = (x_{1,j}, \dots, x_{n,j})$ (the $j$th column of $X$) on the response $y$.
Added variable plots take into account the influence of the other predictors.
1. Regress $y$ on all the predictors excluding $X_j$, obtaining residuals $e^{(j)}$.
2. Regress $X_j$ on all the predictors excluding $X_j$, obtaining residuals $X^{(j)}$.
3. Plot $e^{(j)}$ versus $X^{(j)}$.
Component plus residual plots are another alternative.
1. Plot $e + \hat\beta_j X_j$ versus $X_j$.

Checking homoscedasticity

Partial residual plots may help assess whether the errors have equal variance or not.
A fan shape in the plot is an indication that the errors do not have equal variances.

Checking normality

A q-q plot of the standardized residuals is used.

Checking for independence

Checking for independence often requires defining a dependency structure explicitly and testing against that. (Otherwise we would need a number of replicates of the data, i.e. multiple samples of the same size.)
If dependency is expected, then it is preferable to fit some time series model (for the errors). In that context, we could use something like the Durbin-Watson test to assess whether the errors are autocorrelated.

Outliers

Outliers are points that are unusual compared to the bulk of the observations.
An outlier in predictor is a data point $(x_i, y_i)$ such that $x_i$ is away from the bulk of the predictor vectors. Such a point has high leverage.
An outlier in response is a data point $(x_i, y_i)$ such that $y_i$ is away from the trend implied by the other observations.
A point that is both an outlier in predictor and in response is said to be influential.

Detecting outliers

To detect outliers in predictor, plotting the hat values $h_{ii}$ may be useful. ($h_{ii} > 2p/n$ is considered suspect.)
To detect outliers in response, plotting the residuals may be useful.

Leave-one-out diagnostics

If $\hat\beta_{(i)}$ is the least squares coefficient vector computed with the $i$th case deleted, then
$\hat\beta - \hat\beta_{(i)} = \frac{(X^T X)^{-1} x_i\, e_i}{1 - h_{ii}}$
Define $\hat{y}_{(i)} = X \hat\beta_{(i)}$. In particular, $\hat{y}_{(i)i}$ is the value at $x_i$ predicted by the model fitted without the observation $(x_i, y_i)$.

Cook's distances

Cook's distances:
$D_i = \frac{\|\hat{y} - \hat{y}_{(i)}\|^2}{(p+1)\hat\sigma^2}$ (change in fitted values)
$\quad\; = \frac{(\hat\beta - \hat\beta_{(i)})^T X^T X (\hat\beta - \hat\beta_{(i)})}{(p+1)\hat\sigma^2}$ (change in coefficients)
$\quad\; = \frac{r_i^2\, h_{ii}}{(p+1)(1 - h_{ii})}$ (combination of residuals and hat values)
$D_i > F^{(.05)}_{p+1,\,n-p-1}$ is considered suspect.
To [R].

DFBETAS and DFFITS

DFBETAS:
$\text{DFBETAS}_{j(i)} = \frac{\hat\beta_j - \hat\beta_{j(i)}}{\sqrt{\hat\sigma^2_{(i)}\, (X^T X)^{-1}_{jj}}}$
$|\text{DFBETAS}_{j(i)}| > 2/\sqrt{n}$ is considered suspect.
DFFITS:
$\text{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{(i)i}}{\sqrt{\hat\sigma^2_{(i)}\, h_{ii}}} = t_i \sqrt{\frac{h_{ii}}{1 - h_{ii}}}$
$|\text{DFFITS}_i| > 2\sqrt{p/n}$ is considered suspect.
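In R, the leverage, residual and leave-one-out diagnostics above are built in; a sketch, continuing with the earlier fit. The cutoffs are the rules of thumb quoted above, and $F^{(.05)}$ is read here as the upper 5% point of the F distribution (an assumption about the intended quantile):

```r
h  <- hatvalues(fit)        # leverages h_ii
r  <- rstandard(fit)        # internally studentized residuals r_i
ti <- rstudent(fit)         # externally studentized residuals t_i
D  <- cooks.distance(fit)   # Cook's distances D_i

n <- nrow(carmpg)
p <- length(coef(fit)) - 1  # number of predictors

which(h > 2 * p / n)                         # outliers in predictor (high leverage)
which(D > qf(0.95, p + 1, n - p - 1))        # suspect Cook's distances
which(abs(dffits(fit)) > 2 * sqrt(p / n))    # suspect DFFITS
which(abs(dfbetas(fit)) > 2 / sqrt(n), arr.ind = TRUE)  # suspect DFBETAS

qqnorm(r); qqline(r)        # q-q plot of the standardized residuals (normality check)
```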
Multicollinearity

When the predictors are nearly linearly dependent, several issues arise:
1. Interpretation is difficult.
2. The estimates have large variances:
$\operatorname{var}(\hat\beta_j) = \frac{\sigma^2}{(1 - Q_j^2)\, SSX_j}$
where $Q_j^2$ is the $R^2$ when regressing $X_j$ on the $X_k$, $k \ne j$.
3. The fit is numerically unstable since $X^T X$ is almost singular.

Correlation between predictors

To detect multicollinearity, we look for large correlations between predictors. For that we compute the correlation matrix of the predictors, with coefficients $\operatorname{cor}(X_j, X_k)$.
To [R].

Variance Inflation Factors

Alternatively, one can look at the variance inflation factors:
$\text{VIF}_j = \frac{1}{1 - Q_j^2}$
If the predictors are standardized to unit norm, $X^T X$ is a correlation matrix and we have
$\text{VIF} = \operatorname{diag}\big((X^T X)^{-1}\big)$
$\text{VIF}_j > 10$ (same as $Q_j^2 > 90\%$) is considered suspect.
To [R].

Condition Indices

Another option is to examine the condition indices of $X^T X$:
$\kappa_j = \frac{\lambda_{\max}}{\lambda_j}$
where the $\lambda_j$'s are the eigenvalues of $X^T X$.
$\kappa_j > 1000$ is considered suspect.
$\lambda_{\max}/\lambda_{\min}$ is called the condition number of the matrix $X^T X$.
To [R].

Dealing with questionable assumptions

If some of the assumptions are doubtful, it is important to correct the model, for otherwise we cannot rely on our results (estimates and tests).

Dealing with curvature

When the model accuracy is questionable, we can:
1. Transform the variables and/or the response, which often involves guessing. This is particularly useful to spread the variables.
2. Augment the model by adding other terms, perhaps with interactions. This is often used in conjunction with model selection so as to avoid overfitting.

Dealing with heteroscedasticity

When homoscedasticity is questionable, we can:
1. Apply a variance stabilizing transformation to the response.
2. Fit the model by weighted least squares.
Both involve guessing at the dependency of the variance as a function of the variables.

Variance stabilizing transformations

Here are a few common transformations on the response to stabilize the variance:
$\sigma^2 \propto E(y_i)$ (Poisson): change to $\sqrt{y}$
$\sigma^2 \propto E(y_i)(1 - E(y_i))$ (binomial): change to $\sin^{-1}(\sqrt{y})$
$\sigma^2 \propto E(y_i)^2$: change to $\log(y)$
$\sigma^2 \propto E(y_i)^3$: change to $1/\sqrt{y}$
$\sigma^2 \propto E(y_i)^4$: change to $1/y$
In general, we want a transformation $\psi$ such that $\psi'(E y_i)^2 \operatorname{var}(y_i) \propto 1$.

Weighted least squares

Suppose
$y_i = \beta_0 + \beta^T x_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma_i^2)$
Maximum likelihood estimation of $\beta_0, \beta$ corresponds to the weighted least squares solution that minimizes
$SSE(\beta_0, \beta) = \sum_{i=1}^n w_i (y_i - \beta_0 - \beta^T x_i)^2, \quad w_i = \frac{1}{\sigma_i^2}$
To determine appropriate weights, different approaches are used:
– Guess how $\sigma_i^2$ varies with $x_i$, i.e. the shape of $\operatorname{var}(y_i \mid x_i)$.
– Assume a parametric model for the weights and use maximum likelihood estimation, solved by iteratively reweighted least squares.
Using weights is helpful in situations where some observations are more reliable than others, mostly because of variability.

Generalized least squares

Weighted least squares assumes that the errors are uncorrelated. Generalized least squares assumes a more general form for the covariance of the errors, namely $\sigma^2 V$, where $V$ is usually known.
Assuming both the design matrix $X$ and the covariance matrix $V$ are full rank, the maximum likelihood estimate is
$\hat\beta = (X^T V^{-1} X)^{-1} X^T V^{-1} y$
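A sketch of weighted least squares in R, continuing with the same data. The weights below (guessing that the error variance grows with wt) and the log transformation of the response are purely illustrative choices, not recommendations from the slides:

```r
# Weighted least squares: minimize sum_i w_i (y_i - beta_0 - beta^T x_i)^2.
# The weights are a guess at 1/sigma_i^2, used only for illustration.
w <- 1 / carmpg$wt^2
fit.wls <- lm(mpg ~ vol + hp + sp + wt, data = carmpg, weights = w)
summary(fit.wls)

# Alternative: a variance stabilizing transformation of the response,
# e.g. log(mpg) if var(y_i) appears to grow like E(y_i)^2.
fit.log <- lm(log(mpg) ~ vol + hp + sp + wt, data = carmpg)
summary(fit.log)
```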
Dealing with outliers

We have identified an outlier. Was it recorded properly?
– No: correct it or remove it from the fit.
– Yes: decide whether to include it in the fit.
Genuine outliers carry information, so simply discarding them amounts to losing that information. However, strong outliers can compromise the quality of the overall fit.
A possible option is to model outliers and non-outliers separately.

Dealing with Multicollinearity

If interpretation is the main goal, drop a few redundant predictors.
If prediction is the main goal, use a model selection procedure.
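One possible model selection procedure in R is stepwise search by AIC with step(); a sketch (the choice of AIC and of a stepwise search is just one option among several):

```r
# Stepwise selection by AIC, starting from the full fit and searching in both directions.
fit.full <- lm(mpg ~ vol + hp + sp + wt, data = carmpg)
fit.sel  <- step(fit.full, direction = "both", trace = 0)
summary(fit.sel)
```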