Applied Linear Regression
CSTAT Workshop, March 16, 2007
Vince Melfi

References
• "Applied Linear Regression," Third Edition, by Sanford Weisberg.
• "Linear Models with R," by Julian Faraway.
• Countless other books on linear regression, statistical software, etc.

Statistical Packages
• Minitab (we'll use this today)
• SPSS
• SAS
• R
• S-Plus
• JMP
• Etc.!!

Outline
I. Simple linear regression review
II. Multiple regression: Adding predictors
III. Inference in regression
IV. Regression diagnostics
V. Model selection

Savings Rate Data
• Data on savings rate and other variables for 50 countries. We want to explore the effect of these variables on savings rate.
• SaveRate: aggregate personal savings divided by disposable personal income. (Response variable.)
• pop>75: percent of the population over 75 years old. (One of the predictors.)

I. Simple Linear Regression Review

[Figure: Scatterplot of SaveRate vs. pop>75.]

Regression Output
The regression equation is
SaveRate = 7.152 + 1.099 pop>75          (the fitted model)

S = 4.29409   R-Sq = 10.0%   R-Sq(adj) = 8.1%          (R-Sq is the coefficient of determination)

Analysis of Variance          (tests the model)
Source       DF   SS        MS        F      P
Regression    1    98.545   98.5454   5.34   0.025
Error        48   885.083   18.4392
Total        49   983.628

Importance of Plots
• Four data sets.
• All have
  – regression line Y = 3 + 0.5 X
  – R² = 66.7%
  – S = 1.24
  – the same t statistics, etc.
• Without looking at plots, the four data sets would seem similar.

[Figures: Fitted line plots for the four data sets (Importance of Plots 1-4): y1 = 3.000 + 0.5001 x1, y2 = 3.001 + 0.5000 x1, y3 = 3.002 + 0.4997 x1, y4 = 3.002 + 0.4999 x2, each with S ≈ 1.24, R-Sq ≈ 66.7%, R-Sq(adj) ≈ 63%.]

The model
• Yi = β0 + β1 xi + ei, for i = 1, 2, …, n.
• The "errors" e1, e2, …, en are assumed to be independent.
• Usually e1, e2, …, en are assumed to have the same standard deviation, σ.
• Often e1, e2, …, en are assumed to be normally distributed.

Least Squares
• The regression line (line of best fit) is based on "least squares."
• The regression line is the line that minimizes the sum of the squared deviations from the data.
• The least squares line has certain optimality properties.
• The least squares line is written Ŷi = β̂0 + β̂1 Xi.

Residuals
• The residuals represent the difference between the data and the least squares line: êi = Yi − Ŷi.

[Figure: Scatterplot of Y vs. X with the fitted line; the residuals are the vertical distances from the points to the line.]

Checking assumptions
• Residuals are the main tool for checking model assumptions, including linearity and constant variance.
• Plotting the residuals versus the fitted values is always a good idea, to check linearity and constant variance.
• Histograms and Q-Q plots (normal probability plots) of residuals help to check the normality assumption.
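A minimal R sketch of this fit and these residual checks. It assumes the workshop's savings data match R's built-in LifeCycleSavings data (50 countries, with sr the savings rate and pop75 the percent of the population over 75); treat that correspondence, and the variable names, as assumptions for the illustration.

    # Assumed: LifeCycleSavings stands in for the workshop's savings data
    data(LifeCycleSavings)

    fit <- lm(sr ~ pop75, data = LifeCycleSavings)   # SaveRate vs. pop>75
    summary(fit)                      # coefficients, S, R-squared, t and F tests

    res <- residuals(fit)
    plot(fitted(fit), res, xlab = "Fitted value", ylab = "Residual")
    abline(h = 0, lty = 2)            # look for curvature or changing spread
    hist(res, main = "Histogram of residuals")
    qqnorm(res); qqline(res)          # normal probability (Q-Q) plot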
"Four in one" plot from Minitab

[Figure: Residual Plots for SaveRate, four panels: normal probability plot of the residuals, residuals versus fitted values, histogram of the residuals, and residuals versus observation order.]

Coefficient of determination (R²)
• Residual sum of squares (also called the sum of squares for error): RSS = SSE = Σᵢ êᵢ²
• Total sum of squares: SST = TSS = Σᵢ (yᵢ − ȳ)²
• Coefficient of determination: R² = (TSS − RSS)/TSS
(The sums run over i = 1, …, n.)

R²
• The coefficient of determination, R², measures the proportion of the variability in Y that is explained by the linear relationship with X.
• It is also the square of the Pearson correlation coefficient between X and Y.

II. Multiple regression: Adding predictors

Adding a predictor
• Recall: the fitted model was SaveRate = 7.152 + 1.099 pop>75 (the p-value for the test of whether pop>75 is significant was 0.025).
• Another predictor: DPI (per-capita disposable income).
• Fitted model: SaveRate = 8.57 + 0.000996 DPI (p-value for DPI: 0.124).

Adding a predictor (2)
• The model with both pop>75 and DPI is SaveRate = 7.06 + 1.30 pop>75 − 0.00034 DPI.
• The p-values are 0.100 for pop>75 and 0.738 for DPI.
• The sign of the coefficient of DPI has changed!
• pop>75 was significant alone, but neither it nor DPI is significant when both are in the model!

Adding a predictor (3)
• What happened? The predictors pop>75 and DPI are highly correlated.

[Figure: Fitted line plot of pop>75 versus DPI: pop>75 = 1.158 + 0.001025 DPI, with R-Sq = 61.9%.]

Added variable plots and partial correlation
1. The residuals from a fit of SaveRate versus pop>75 give the variability in SaveRate that is not explained by pop>75.
2. The residuals from a fit of DPI versus pop>75 give the variability in DPI that is not explained by pop>75.
3. A fit of the residuals from (1) versus the residuals from (2) gives the relationship between SaveRate and DPI after adjusting for pop>75. This is called an "added variable plot."
4. The correlation between the residuals from (1) and the residuals from (2) is the "partial correlation" between SaveRate and DPI, adjusted for pop>75.

Added variable plot

[Figure: Fitted line plot of the residuals from (1) versus the residuals from (2): RESSRvspop>75 = 0.0000 − 0.000341 RESDPIvspop>75, with R-Sq = 0.2%.]

• Note that the slope, −0.000341, is the same as the coefficient of DPI in the two-predictor model.

Scatterplot matrices (Matrix Plots)
• With one predictor X, a scatterplot of Y vs. X is very informative.
• With more than one predictor, scatterplots of Y vs. each of the predictors, and of each of the predictors vs. each other, are needed.
• A scatterplot matrix (or matrix plot) is just an organized display of all of these plots.

[Figure: Matrix plot of SaveRate, pop<15, pop>75, DPI, and changeDPI: all pairwise scatterplots of the response and the four predictors.]
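A sketch of these displays in R, again assuming the LifeCycleSavings data stand in for the workshop data (sr, pop15, pop75, dpi, ddpi correspond to SaveRate, pop<15, pop>75, DPI, changeDPI):

    # Scatterplot matrix of the response and all candidate predictors
    pairs(LifeCycleSavings)

    # Added variable plot for DPI, adjusting for pop>75:
    # residuals of sr ~ pop75 plotted against residuals of dpi ~ pop75
    r.sr  <- residuals(lm(sr  ~ pop75, data = LifeCycleSavings))
    r.dpi <- residuals(lm(dpi ~ pop75, data = LifeCycleSavings))
    plot(r.dpi, r.sr, xlab = "Residuals of dpi on pop75",
         ylab = "Residuals of sr on pop75")
    abline(lm(r.sr ~ r.dpi))   # slope equals the DPI coefficient in the two-predictor fit
    cor(r.sr, r.dpi)           # partial correlation of SaveRate and DPI, adjusted for pop>75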
Changes in R²
• Consider adding a predictor X2 to a model that already contains the predictor X1.
• Let R²₁ be the R² value for the fit of Y vs. X1, and let R²₂ be the R² value for the fit of Y vs. X2.

Changes in R² (2)
• The R² value for the multiple regression fit of Y versus X1 and X2 is always at least as large as R²₁ and R²₂.
• The R² value for the multiple regression fit may be
  – less than R²₁ + R²₂ (if the two predictors explain the same variation),
  – equal to R²₁ + R²₂ (if the two predictors measure different things),
  – more than R²₁ + R²₂ (e.g., the response is the area of a rectangle and the two predictors are its length and width).

Multiple regression model
• Response variable Y; predictors X1, X2, …, Xp.
• Yi = β0 + β1 X1i + β2 X2i + … + βp Xpi + ei
• The same assumptions are made on the errors ei (independence, constant variance, normality).

III. Inference in Regression

Inference in regression
• Most inference procedures assume independence, constant variance, and normality of the errors.
• Most are "robust" to departures from normality, meaning that the p-values, confidence levels, etc. are approximately correct even if normality does not hold.
• In general, techniques like the bootstrap can be used when normality is suspect.

New data set
• Response variable:
  – Fuel = per-capita fuel consumption (times 1000)
• Predictors:
  – Dlic = proportion of the population who are licensed drivers (times 1000)
  – Tax = gasoline tax rate
  – Income = per-person income, in thousands of dollars
  – logMiles = base-2 log of federal-aid highway miles in the state

t tests

Regression Analysis: Fuel versus Tax, Dlic, Income, logMiles
The regression equation is
Fuel = 154 − 4.23 Tax + 0.472 Dlic − 6.14 Income + 18.5 logMiles

Predictor    Coef      SE Coef    T       P
Constant     154.2     194.9       0.79   0.433
Tax           -4.228     2.030    -2.08   0.043
Dlic           0.4719    0.1285    3.67   0.001
Income        -6.135     2.194    -2.80   0.008
logMiles      18.545     6.472     2.87   0.006

(The T column gives the t statistics and the P column the corresponding p-values.)

t tests (2)
• Each t statistic tests the hypothesis that a particular slope parameter is zero.
• The formula is t = (coefficient estimate)/(standard error).
• The degrees of freedom are n − (p + 1).
• The p-values given are for the two-sided alternative.
• This is just like simple linear regression.

F tests
• General structure:
  – Ha: the larger model.
  – H0: a smaller model, obtained by setting some parameters in the large model to zero, or equal to each other, or equal to a constant.
  – RSS_AH = residual sum of squares after fitting the large (alternative hypothesis) model.
  – RSS_NH = residual sum of squares after fitting the smaller (null hypothesis) model.
  – df_NH and df_AH are the corresponding degrees of freedom.

F tests (2)
• Test statistic:

  F = [(RSS_NH − RSS_AH)/(df_NH − df_AH)] / [RSS_AH/df_AH]

• Null distribution: the F distribution with df_NH − df_AH numerator and df_AH denominator degrees of freedom.

F test example
• Can the "economic" variables Tax and Income be dropped from the model with all four predictors?
• The AH model includes all four predictors.
• The NH model includes only Dlic and logMiles.
• Fit both models and get the RSS and df values (see the R sketch below).
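In R this comparison is a single anova() call on two nested lm fits. This is a sketch, assuming the fuel data are available as a data frame named fuel with columns Fuel, Tax, Dlic, Income, and logMiles (the data frame and its construction are assumptions here; the variables are those described above, from Weisberg's fuel consumption example):

    fit.AH <- lm(Fuel ~ Tax + Dlic + Income + logMiles, data = fuel)  # alternative (full) model
    fit.NH <- lm(Fuel ~ Dlic + logMiles, data = fuel)                 # null (reduced) model

    # anova() reports F = ((RSS_NH - RSS_AH)/(df_NH - df_AH)) / (RSS_AH/df_AH)
    # and the p-value from the F(df_NH - df_AH, df_AH) distribution
    anova(fit.NH, fit.AH)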
F test example (2)
• RSS_AH = 193700, df_AH = 46; RSS_NH = 243006, df_NH = 48.

  F = [(243006 − 193700)/(48 − 46)] / (193700/46) ≈ 5.85

• The p-value is the area to the right of 5.85 under an F(2, 46) distribution, approximately 0.0054.
• There is pretty strong evidence that removing both Tax and Income is unwise.

Another F test example
• Question: Does it make sense that the two "economic" predictors should have the same coefficient?
• Ha: Y = β0 + β1 Tax + β2 Dlic + β3 Income + β4 logMiles + error
• H0: Y = β0 + β1 Tax + β2 Dlic + β1 Income + β4 logMiles + error
• Note that H0 can be written Y = β0 + β1 (Tax + Income) + β2 Dlic + β4 logMiles + error.

Another F test example (2)
• Fit the full model (AH).
• Create a new predictor "TI" by adding Tax and Income, and fit a model with TI, Dlic, and logMiles (NH).

  F = [(195487 − 193700)/(47 − 46)] / (193700/46) ≈ 0.424

• The p-value is the area to the right of 0.424 under an F(1, 46) distribution, approximately 0.518.
• This suggests that the simpler model with the same coefficient for Tax and Income fits well.

Removing one predictor
• We have two ways to test whether one predictor can be removed from the model:
  – the t test
  – the F test
• The tests are equivalent, in the sense that t² = F and the p-values are identical.

Confidence regions
• Confidence intervals for one parameter use the familiar t interval.
• For example, a 95% confidence interval for the coefficient of Income in the context of the full (four-predictor) model is

  −6.135 ± (2.013)(2.194) = −6.135 ± 4.417,

  where −6.135 and 2.194 are the coefficient estimate and its standard error from the Minitab output, and 2.013 is the 97.5th percentile of the t distribution with 46 df.

Joint confidence regions
• Joint confidence regions for two or more parameters are more complex, and use the F distribution in place of the t distribution.
• Minitab (and SPSS, and …) can't draw these easily.
• Below is a joint confidence region for the coefficients of Dlic and Tax, drawn in R.

[Figure: Joint confidence region for the Dlic and Tax coefficients, with dotted lines indicating the individual confidence intervals for each; the point (0, 0) is marked.]

Prediction
• Given a new set of predictor values x1, x2, …, xp, what is the predicted response?
• It's easy to answer this: just plug the new predictor values into the fitted regression model:

  Ŷ = β̂0 + β̂1 x1 + β̂2 x2 + … + β̂p xp

• But how do we assess the uncertainty in the prediction? How do we form a confidence interval?

Predicted Values for New Observations

New Obs   Fit      SE Fit   95% CI              95% PI
1         613.39   12.44    (588.34, 638.44)    (480.39, 746.39)

Values of Predictors for New Observations

New Obs   Dlic   Income   logMiles   Tax
1         900    28.0     15.0       17.0

• The 95% CI is a confidence interval for the average fuel consumption of states with Dlic = 900, Income = 28, logMiles = 15, and Tax = 17.
• The 95% PI is a prediction interval for the fuel consumption of a single state with those predictor values.
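In R the same two intervals come from predict(). A sketch, continuing from the hypothetical fit.AH in the earlier sketch:

    new.state <- data.frame(Dlic = 900, Income = 28, logMiles = 15, Tax = 17)

    # Confidence interval for the mean response at these predictor values
    predict(fit.AH, newdata = new.state, interval = "confidence", level = 0.95)

    # Prediction interval for a single new state (wider, since it also includes
    # the variability of an individual observation around the mean)
    predict(fit.AH, newdata = new.state, interval = "prediction", level = 0.95)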
IV. Regression Diagnostics

Diagnostics
• We want to look for points that have a large influence on the fitted model.
• We want to look for evidence that one or more model assumptions are untrue.
• Tools:
  – Residuals
  – Leverage
  – Influence and Cook's distance

Leverage
• A point whose predictor values are far from the "typical" predictor values has high leverage.
• For a high-leverage point, the fitted value Ŷi will tend to be close to the data value Yi.
• A rule of thumb: any point with leverage larger than 2(p + 1)/n is interesting.
• Most statistical packages can compute leverages.

[Figure: Scatterplot of y3 vs. x1 with each point labeled by its leverage. Points near the middle of the x range have leverage around 0.09 to 0.13; the points at the extremes have leverage above 0.23.]

[Figure: Leverage vs. index for the savings data. Libya has by far the largest leverage, followed by the United States, Japan, and Ireland.]

Influential Observations
• A data point is influential if it has a large effect on the fitted model.
• Put another way, an observation is influential if the fitted model changes a lot when the observation is deleted.
• Cook's distance is a measure of the influence of an observation.
• It may make sense to refit the model after removing a few of the most influential observations.

[Figure: Scatterplot of y3 vs. x1 with each point labeled by its Cook's distance (a measure of influence), annotated "high leverage, low influence" and "high influence"; the most influential point has Cook's distance ≈ 1.39.]

[Figure: Cook's distance vs. index for the savings data. Libya has the largest Cook's distance, followed by Japan and Zambia.]
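These quantities are easy to compute in R with hatvalues() and cooks.distance(). A sketch for the savings data, assuming (as the plots above suggest) that the model behind them regresses the savings rate on all four predictors:

    # Savings regression on all four predictors (assumed model for the plots above)
    fit.all <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)

    lev <- hatvalues(fit.all)         # leverages
    cd  <- cooks.distance(fit.all)    # Cook's distances

    p <- 4; n <- nrow(LifeCycleSavings)
    plot(lev, type = "h", ylab = "Leverage")
    abline(h = 2 * (p + 1) / n, lty = 2)   # rule-of-thumb cutoff, 2(p+1)/n = 0.2 here
    head(sort(lev, decreasing = TRUE))     # the largest leverages, labeled by country

    plot(cd, type = "h", ylab = "Cook's distance")
    head(sort(cd, decreasing = TRUE))      # the most influential countries

    # Refit without the most influential observation to see how much the fit changes
    update(fit.all, subset = rownames(LifeCycleSavings) != "Libya")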
V. Model Selection

Model Selection
• Question: with a large number of potential predictors, how do we choose the predictors to include in the model?
• We want good prediction, but also parsimony: Occam's razor.
• The choice can also be thought of as a bias-variance tradeoff.

Model Selection Example
• Data on all 50 states, from the 1970s:
  – Life.Exp = life expectancy (the response)
  – Population (in thousands)
  – Income = per-capita income
  – Illiteracy (in percent of the population)
  – Murder = murder rate per 100,000
  – HS.Grad = high school graduates (in percent of the population)
  – Frost = mean number of days with minimum temperature below 32°F
  – Area = land area in square miles

Forward Selection
• Choose a cutoff α.
• Start with no predictors.
• At each step, add the predictor with the lowest p-value less than α.
• Continue until there are no unused predictors with p-values less than α.

Stepwise Regression: Life.Exp versus Population, Income, ...
Forward selection.  Alpha-to-Enter: 0.25
Response is Life.Exp on 7 predictors, with N = 50

Step              1        2        3        4
Constant      72.97    70.30    71.04    71.03

Murder       -0.284   -0.237   -0.283   -0.300
T-Value       -8.66    -6.72    -7.71    -8.20
P-Value       0.000    0.000    0.000    0.000

HS.Grad                0.044    0.050    0.047
T-Value                 2.72     3.29     3.14
P-Value                0.009    0.002    0.003

Frost                         -0.0069  -0.0059
T-Value                        -2.82    -2.46
P-Value                        0.007    0.018

Population                             0.00005
T-Value                                   2.00
P-Value                                  0.052

S             0.847    0.796    0.743    0.720
R-Sq          60.97    66.28    71.27    73.60
R-Sq(adj)     60.16    64.85    69.39    71.26
Mallows Cp     16.1      9.7      3.7      2.0

Variations on FS
• Backward elimination:
  – Choose a cutoff α.
  – Start with all predictors in the model.
  – At each step, eliminate the predictor with the highest p-value that is greater than α.
  – Continue until no remaining predictor has a p-value greater than α.
• Stepwise: allow addition or elimination at each step (a hybrid of forward selection and backward elimination).

All subsets
• Fit all possible models.
• Based on a "goodness" criterion, choose the model that fits best.
• Goodness criteria include AIC, BIC, adjusted R², and Mallows' Cp.
• Some of the criteria are described next.

Notation
• RSS* = residual sum of squares for the current model.
• p* = number of terms (including the intercept) in the current model.
• n = number of observations.
• s² = RSS/(n − (p + 1)) = the estimate of σ² from the model with all predictors and an intercept term.

Goodness criteria
• Smaller is better for AIC, BIC, and Cp*; larger is better for adjusted R².
• AIC = n log(RSS*/n) + 2p*
• BIC = n log(RSS*/n) + p* log(n)
• Cp* = RSS*/s² + 2p* − n
• adjR² = 1 − (1 − R²)(n − 1)/(n − (p + 1))

Best Subsets Regression: Life.Exp versus Population, Income, ...
Response is Life.Exp

Vars   R-Sq   R-Sq(adj)   Mallows Cp   S         Variables in the model
1      61.0   60.2        16.1         0.84732   Murder
2      66.3   64.8         9.7         0.79587   Murder, HS.Grad
3      71.3   69.4         3.7         0.74267   Murder, HS.Grad, Frost
4      73.6   71.3         2.0         0.71969   Population, Murder, HS.Grad, Frost
5      73.6   70.6         4.0         0.72773   Population, Income, Murder, HS.Grad, Frost
6      73.6   69.9         6.0         0.73608   Population, Income, Illiteracy, Murder, HS.Grad, Frost
7      73.6   69.2         8.0         0.74478   all seven predictors

Model selection can overstate significance
• Generate Y and X1, X2, …, X50, all independent and standard normal.
• So none of the predictors are related to the response.
• Fit the full model and look at the overall F test.
• Then use model selection to choose a "good" smaller model, and look at its overall F test.
• (An R sketch of this simulation appears at the end of these notes.)

The full model
• Results from fitting the model with all 50 predictors.
• Note that the overall F test is not significant.

S = 0.915237   R-Sq = 57.6%   R-Sq(adj) = 14.3%

Analysis of Variance
Source           DF   SS        MS       F      P
Regression       50   55.7093   1.1142   1.33   0.160
Residual Error   49   41.0453   0.8377
Total            99   96.7546

The "good" small model
• Run forward selection with α = 0.05.
• The predictors x38, x41, and x24 are chosen.
• Fit that three-predictor model. Now the overall F test is highly significant.

Analysis of Variance
Source           DF   SS        MS       F      P
Regression        3   20.9038   6.9679   8.82   0.000
Residual Error   96   75.8508   0.7901
Total            99   96.7546

What's left?
• Weighted least squares
• Tests for lack of fit
• Transformations of the response and predictors
• Analysis of covariance
• Etc.
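To close, here is a minimal R sketch of the "model selection can overstate significance" simulation described above. The forward selection is done with a simple add1() loop using an F-test p-value cutoff of 0.05; which (and how many) noise predictors get selected varies from run to run, so the x38, x41, x24 on the slide are just one realization.

    set.seed(1)                                # any seed; results vary by realization
    n <- 100; k <- 50
    X <- matrix(rnorm(n * k), n, k, dimnames = list(NULL, paste0("x", 1:k)))
    dat <- data.frame(y = rnorm(n), X)         # response unrelated to every predictor

    # Full model: the overall F test is (correctly) not significant
    summary(lm(y ~ ., data = dat))

    # Forward selection by p-value, alpha = 0.05
    current <- lm(y ~ 1, data = dat)
    remaining <- colnames(X)
    repeat {
      if (length(remaining) == 0) break
      adds <- add1(current,
                   scope = as.formula(paste("~", paste(remaining, collapse = "+"))),
                   test = "F")
      pvals <- adds[[ncol(adds)]][-1]          # p-value column; drop the <none> row
      if (all(is.na(pvals)) || min(pvals, na.rm = TRUE) > 0.05) break
      best <- rownames(adds)[-1][which.min(pvals)]
      current <- update(current, as.formula(paste(". ~ . +", best)))
      remaining <- setdiff(remaining, best)
    }

    # The selected model's overall F test typically looks "significant",
    # even though the data are pure noise
    summary(current)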