Chapter 13: Multiple Regression

Chapter Contents:
• Bivariate or Multivariate?
• Multiple Regression
• Assessing Overall Fit
• Predictor Significance
• Confidence Intervals for Y
• Binary Predictors
• Tests for Nonlinearity and Interaction
• Multicollinearity
• Violations of Assumptions
• Other Regression Topics

McGraw-Hill/Irwin   Copyright © 2009 by The McGraw-Hill Companies, Inc.

Multiple Regression: Bivariate or Multivariate?
• Multiple regression extends bivariate regression to include more than one independent variable.
• Limitations of bivariate regression:
  - it is often simplistic;
  - estimates are biased if relevant predictors are omitted;
  - lack of fit does not show that X is unrelated to Y.

Multiple Regression: Regression Terminology
• Y is the response variable and is assumed to be related to the k predictors (X1, X2, …, Xk) by a linear equation called the population regression model:
  Y = β0 + β1X1 + β2X2 + … + βkXk + ε
• The fitted regression equation is:
  ŷ = b0 + b1X1 + b2X2 + … + bkXk

Multiple Regression: Data Format
• The n observed values of the response variable Y and its proposed predictors X1, X2, …, Xk are presented in the form of an n × k data matrix.

Multiple Regression: Illustration: Home Prices
• Consider data on the selling price of a home (Y, the response variable) and three potential explanatory variables:
  X1 = SqFt, X2 = LotSize, X3 = Baths
• Intuitively, each predictor should be positively related to price, suggesting a regression model such as
  Price = β0 + β1 SqFt + β2 LotSize + β3 Baths + ε

Multiple Regression: Logic of Variable Selection
• State the a priori hypotheses about the sign of each coefficient in the model.

Multiple Regression: Fitted Regressions
• Use Excel, MegaStat, MINITAB, or any other statistical package.
• For n = 30 home sales, the package reports the fitted regression and its statistics of fit: R² is the coefficient of determination and SE is the standard error of the regression.

Regression Modeling: Common Misconceptions about Fit
• A common mistake is to assume that the model with the best fit is preferred.
• Principle of Occam's Razor: when two explanations are otherwise equivalent, we prefer the simpler, more parsimonious one.

Regression Modeling: Four Criteria for Regression Assessment
• Logic: Is there an a priori reason to expect a causal relationship between the predictors and the response variable?
• Fit: Does the overall regression show a significant relationship between the predictors and the response variable?
• Parsimony: Does each predictor contribute significantly to the explanation? Are some predictors not worth the trouble?
• Stability: Are the predictors related to one another so strongly that regression estimates become erratic?

Assessing Overall Fit: F Test for Significance
• For a regression with k predictors, the hypotheses to be tested are
  H0: All the true coefficients are zero (β1 = β2 = … = βk = 0)
  H1: At least one of the coefficients is nonzero.
• The ANOVA table decomposes the variation of the response variable around its mean into explained variation (SSR, due to the regression) and unexplained variation (SSE, the error sum of squares): SST = SSR + SSE.
• For a k-predictor model, the ANOVA calculations are summarized as:

  Source       Sum of Squares   df          Mean Square              F
  Regression   SSR              k           MSR = SSR/k              F = MSR/MSE
  Error        SSE              n - k - 1   MSE = SSE/(n - k - 1)
  Total        SST              n - 1

• The ANOVA calculations for the home price data appear in the fitted regression output.

Assessing Overall Fit: Coefficient of Determination (R²)
• R², the coefficient of determination, is a common measure of overall fit. It can be calculated either of two ways:
  R² = SSR/SST = 1 - SSE/SST
• For example, R² for the home price data appears in the fitted output; a computational sketch follows.
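The overall-fit quantities above (the decomposition SST = SSR + SSE, the F statistic, and R²) can be reproduced in any statistics package. Below is a minimal Python sketch using statsmodels; the data are simulated stand-ins for the slides' 30-sale home-price example (the textbook's actual data are not reproduced here), so the printed values are illustrative only.

```python
# Minimal sketch: fit a 3-predictor regression and reproduce the overall-fit
# statistics. Data are simulated, not the textbook's home-price sample.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n, k = 30, 3
sqft    = rng.uniform(1500, 3500, n)           # hypothetical square footage
lotsize = rng.uniform(5, 12, n)                # hypothetical lot size
baths   = rng.integers(1, 4, n).astype(float)  # hypothetical number of baths
price   = 40 + 0.08 * sqft + 6 * lotsize + 15 * baths + rng.normal(0, 25, n)

X = sm.add_constant(np.column_stack([sqft, lotsize, baths]))  # adds the intercept column
fit = sm.OLS(price, X).fit()

# ANOVA decomposition: statsmodels calls SSE "ssr" (sum of squared residuals)
# and SSR "ess" (explained sum of squares).
sse, ssr, sst = fit.ssr, fit.ess, fit.centered_tss
F = (ssr / k) / (sse / (n - k - 1))            # F = MSR / MSE
print("R^2 =", ssr / sst, "=", fit.rsquared)
print("F   =", F, "=", fit.fvalue, "  p-value =", fit.f_pvalue)
```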
Assessing Overall Fit: Adjusted R²
• It is generally possible to raise R² simply by including additional predictors. The adjusted coefficient of determination penalizes the inclusion of useless predictors.
• For n observations and k predictors,
  R²adj = 1 - (1 - R²)(n - 1)/(n - k - 1)
• The adjusted R² for the home price data appears in the fitted output.

Assessing Overall Fit: How Many Predictors?
• Limit the number of predictors based on the sample size. When n/k is small, R² no longer gives a reliable indication of fit. Suggested rules are:
  Evans' Rule (conservative): n/k ≥ 10 (at least 10 observations per predictor)
  Doane's Rule (relaxed): n/k ≥ 5 (at least 5 observations per predictor)

Predictor Significance
• Test each fitted coefficient to see whether it is significantly different from zero. The hypothesis test for predictor Xj is
  H0: βj = 0 versus H1: βj ≠ 0
• If we cannot reject the hypothesis that a coefficient is zero, then the corresponding predictor does not contribute to the prediction of Y.

Predictor Significance: Test Statistic
• The test statistic for the coefficient of predictor Xj is
  tj = bj / sbj   (n - k - 1 degrees of freedom),
  where sbj is the standard error of the fitted coefficient bj.
• Find the critical value of t for a chosen level of significance α from Appendix D. For this two-tailed test, reject H0 if |tj| exceeds the critical value or if the p-value < α.
• The 95% confidence interval for coefficient βj is bj ± t.025 sbj.

Confidence Intervals for Y: Standard Error
• The standard error of the regression (SE) is another important measure of fit. For n observations and k predictors,
  SE = sqrt(SSE / (n - k - 1))
• If all predictions were perfect, SE would be 0.

Confidence Intervals for Y: Quick Prediction Interval for Y
• The t-values for 95% confidence are typically near 2 (as long as n is not too small), and so:
• An approximate 95% confidence interval for the conditional mean of Y is ŷ ± 2 SE/√n.
• An approximate 95% prediction interval for an individual Y value is ŷ ± 2 SE.

Binary Predictors: What Is a Binary Predictor?
• A binary predictor has two values (usually 0 and 1) that denote the presence or absence of a condition. For example, for n graduates from an MBA program:
  Employed = 1, Unemployed = 0
• These variables are also called dummy or indicator variables.
• For readability, name the binary variable for the characteristic that corresponds to a value of 1.

Binary Predictors: Effects of a Binary Predictor
• A binary predictor is sometimes called a shift variable because it shifts the regression plane up or down.
• Suppose X1 is a binary predictor that can take on only the values 0 or 1. Its contribution to the regression is either b1 or nothing, so the intercept is either b0 (when X1 = 0) or b0 + b1 (when X1 = 1).
• The slope does not change; only the intercept is shifted (see Figure 13.8).

Binary Predictors: Testing a Binary for Significance
• In multiple regression, binary predictors require no special treatment. They are tested like any other predictor, using a t test, as in the sketch below.
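A minimal sketch of a binary (shift) predictor, assuming simulated data and a hypothetical dummy variable Pool (1 = the home has a pool, 0 = it does not): the fitted dummy coefficient shifts the intercept, and its significance is judged with an ordinary t test.

```python
# Sketch: a 0/1 dummy shifts the intercept by b_pool and is tested with a t test.
# Data and the "Pool" variable are simulated/hypothetical, not from the textbook.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 30
sqft  = rng.uniform(1500, 3500, n)
pool  = rng.integers(0, 2, n)                  # 1 = has a pool, 0 = no pool
price = 50 + 0.09 * sqft + 20 * pool + rng.normal(0, 25, n)

X = sm.add_constant(np.column_stack([sqft, pool]))
fit = sm.OLS(price, X).fit()
b = fit.params                                 # [intercept, b_sqft, b_pool]

print("Intercept when Pool = 0:", b[0])
print("Intercept when Pool = 1:", b[0] + b[2]) # same slope on SqFt, shifted intercept
print("t statistic for Pool:", fit.tvalues[2], "  p-value:", fit.pvalues[2])
print("95% CI for the Pool coefficient:", fit.conf_int()[2])
```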
Binary Predictors: More Than One Binary
• More than one binary predictor is needed when the number of categories to be coded exceeds two. For example, to model GPA by class level, each category becomes a binary variable:
  Freshman = 1 if a freshman, 0 otherwise
  Sophomore = 1 if a sophomore, 0 otherwise
  Junior = 1 if a junior, 0 otherwise
  Senior = 1 if a senior, 0 otherwise
  Masters = 1 if a master's candidate, 0 otherwise
  Doctoral = 1 if a PhD candidate, 0 otherwise
• If there are c mutually exclusive and collectively exhaustive categories, then only c - 1 binaries are needed to code each observation. Any one of the categories can be omitted, because the remaining c - 1 binary values uniquely determine the omitted one (see Table 13.6).

Binary Predictors: What If I Forget to Exclude One Binary?
• Including all c binaries for c categories would introduce a serious problem for the regression estimation: one column in the X data matrix would be a perfect linear combination of the other column(s). The least squares estimation would fail because the data matrix would be singular (i.e., it would have no inverse).

Binary Predictors: Regional Binaries
• Binaries are commonly used to code regions. For example:
  Midwest = 1 if in the Midwest, 0 otherwise
  Neast = 1 if in the Northeast, 0 otherwise
  Seast = 1 if in the Southeast, 0 otherwise
  West = 1 if in the West, 0 otherwise

Tests for Nonlinearity and Interaction: Tests for Nonlinearity
• Sometimes the effect of a predictor is nonlinear. To test any predictor for nonlinearity, include its square in the regression; for example,
  Y = β0 + β1X1 + β2X1² + β3X2 + β4X2² + ε
• If the linear model is the correct one, the coefficients of the squared predictors (β2 and β4) will not differ significantly from zero. Otherwise, a quadratic relationship exists between Y and the respective predictor variable (see Figure 13.11).

Tests for Nonlinearity and Interaction: Tests for Interaction
• Test for interaction between two predictors by including their product in the regression:
  Y = β0 + β1X1 + β2X2 + β3X1X2 + ε
• If we reject H0: β3 = 0, we conclude that there is a significant interaction between X1 and X2.
• Interaction effects require careful interpretation and cost one degree of freedom per interaction.

Multicollinearity: What Is Multicollinearity?
• Multicollinearity occurs when the independent variables X1, X2, …, Xm are intercorrelated instead of being independent. Collinearity refers to correlation between just two predictors.
• The degree of multicollinearity is the real concern.

Multicollinearity: Variance Inflation
• Multicollinearity induces variance inflation when predictors are strongly intercorrelated. This results in wider confidence intervals for the true coefficients β1, β2, …, βm and makes the t statistics less reliable.
• The separate contribution of each predictor in "explaining" the response variable becomes difficult to identify.

Multicollinearity: Correlation Matrix
• To check whether two predictors are correlated (collinearity), inspect the correlation matrix using Excel, MegaStat, or MINITAB (see, for example, Table 13.10).
• Quick Rule: a sample correlation whose absolute value exceeds 2/√n probably differs significantly from zero in a two-tailed test at α = .05. This rule applies to samples that are not too small (say, 20 or more).

Multicollinearity: Variance Inflation Factor (VIF)
• Matrix scatter plots and the correlation matrix show only correlations between pairs of predictors; the variance inflation factor (VIF) is a more comprehensive test for multicollinearity.
• For a given predictor j, the VIF is defined as
  VIFj = 1 / (1 - Rj²),
  where Rj² is the coefficient of determination when predictor j is regressed against all the other predictors. (A computational sketch follows.)
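A short sketch of the VIF calculation on simulated data: each predictor is regressed on the other predictors to obtain Rj², and the hand calculation is checked against statsmodels' built-in variance_inflation_factor. Two of the predictors are deliberately constructed to be collinear, so their VIFs come out inflated.

```python
# Sketch: VIF_j = 1 / (1 - R_j^2), computed by regressing predictor j on the others.
# Simulated data; x1 and x2 are deliberately collinear.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # strongly correlated with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

Xc = sm.add_constant(X)                    # constant in column 0
for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    r2_j = sm.OLS(X[:, j], sm.add_constant(others)).fit().rsquared
    vif_manual = 1.0 / (1.0 - r2_j)
    vif_builtin = variance_inflation_factor(Xc, j + 1)   # j + 1 skips the constant
    print(f"predictor {j + 1}: VIF = {vif_manual:.1f} (built-in: {vif_builtin:.1f})")
```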
Multicollinearity: Rules of Thumb
• There is no limit on the magnitude of the VIF. A VIF of 10 says that the other predictors "explain" 90% of the variation in predictor j, which indicates that predictor j is strongly related to the other predictors.
• A large VIF is not necessarily indicative of instability in the least squares estimates; it is a warning to consider whether predictor j really belongs in the model.

Multicollinearity: Are Coefficients Stable?
• Evidence of instability is when X1 and X2 have a high pairwise correlation with Y, yet one or both predictors have insignificant t statistics in the fitted multiple regression, and/or when X1 and X2 are positively correlated with Y, yet one has a negative slope in the multiple regression.
• As a test, try dropping a collinear predictor from the regression and see what happens to the fitted coefficients in the re-estimated model. If they do not change much, multicollinearity is not a concern. If dropping the predictor causes sharp changes in one or more of the remaining coefficients, the multicollinearity may be causing instability.

Violations of Assumptions
• The least squares method makes several assumptions about the (unobservable) random errors εi. Clues about these errors may be found in the residuals ei.
  Assumption 1: The errors are normally distributed.
  Assumption 2: The errors have constant variance (i.e., they are homoscedastic).
  Assumption 3: The errors are independent (i.e., they are nonautocorrelated).

Violations of Assumptions: Non-Normal Errors
• Except when there are major outliers, non-normal residuals are usually considered a mild violation. The regression coefficients and variance estimates remain unbiased and consistent.
• Confidence intervals for the parameters may be unreliable, since they are based on the normality assumption, but they are generally acceptable with a large sample size (e.g., n > 30) and no outliers.
• To test
  H0: Errors are normally distributed
  H1: Errors are not normally distributed,
  create a histogram of the residuals (plain or standardized) to visually reveal any outliers or serious asymmetry. A normal probability plot also provides a visual check for normality.

Violations of Assumptions: Nonconstant Variance (Heteroscedasticity)
• If the error variance is constant, the errors are homoscedastic; if it is nonconstant, the errors are heteroscedastic. This violation is potentially serious.
• The least squares parameter estimates remain unbiased and consistent, but the estimated variances are biased (understated) and not efficient, resulting in overstated t statistics and artificially narrow confidence intervals.
• The hypotheses are
  H0: Errors have constant variance (homoscedastic)
  H1: Errors have nonconstant variance (heteroscedastic).
• Constant variance can be checked visually by examining scatter plots of the residuals against each predictor. Ideally there will be no pattern (see Figure 13.19 and the sketch below).
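The visual checks just described can be sketched in Python with matplotlib and scipy (simulated data, with the errors deliberately made heteroscedastic so the last plot shows a fan shape): a histogram and a normal probability plot of the residuals address the normality assumption, and a residuals-versus-fitted plot addresses constant variance.

```python
# Sketch of the visual residual diagnostics: histogram, normal probability plot,
# and residuals vs. fitted values. Simulated data with deliberately growing error variance.
import numpy as np
import statsmodels.api as sm
import scipy.stats as st
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
n = 60
x = rng.uniform(0, 10, n)
y = 5 + 2 * x + rng.normal(0, 1 + 0.3 * x, n)      # error spread grows with x

fit = sm.OLS(y, sm.add_constant(x)).fit()
resid = fit.resid

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
axes[0].hist(resid, bins=12)                        # look for outliers / asymmetry
axes[0].set_title("Histogram of residuals")
st.probplot(resid, plot=axes[1])                    # points near the line suggest normality
axes[1].set_title("Normal probability plot")
axes[2].scatter(fit.fittedvalues, resid)            # a fan shape suggests heteroscedasticity
axes[2].axhline(0, color="gray")
axes[2].set_title("Residuals vs. fitted values")
plt.tight_layout()
plt.show()
```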
Violations of Assumptions: Autocorrelation
• Autocorrelation is a pattern of nonindependent errors, violating the assumption that each error is independent of its predecessor. It is chiefly a problem with time series data.
• Autocorrelated errors result in biased estimated variances, which lead to narrow confidence intervals and inflated t statistics; the model's fit may be overstated.
• To test the hypotheses
  H0: Errors are nonautocorrelated
  H1: Errors are autocorrelated,
  we use the observable residuals e1, e2, …, en and the Durbin-Watson test statistic
  DW = Σ (et - et-1)² / Σ et²   (numerator summed over t = 2, …, n; denominator over t = 1, …, n).
• The DW statistic lies between 0 and 4. When H0 is true (no autocorrelation), DW will be near 2. DW < 2 suggests positive autocorrelation; DW > 2 suggests negative autocorrelation.
• Ignore the DW statistic for cross-sectional data.

Violations of Assumptions: Unusual Observations
• An observation may be unusual
  1. because the fitted model's prediction is poor (unusual residual), or
  2. because one or more predictors may be having a large influence on the regression estimates (unusual leverage).
• To check for unusual residuals, simply inspect the residuals to find instances where the model does not predict well.
• To check for unusual leverage, look at the leverage statistic (how far each observation is from the mean(s) of the predictors). For n observations and k predictors, look for observations whose leverage exceeds 2(k + 1)/n. Both checks are illustrated in the sketch below.
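Both statistics can be sketched numerically on simulated data: the Durbin-Watson statistic computed directly from the residuals and checked against statsmodels, and the leverage values compared against the 2(k + 1)/n rule. The leverage calculation via the hat matrix diagonal is the standard one; the slides do not give it explicitly.

```python
# Sketch: Durbin-Watson from the residuals, and leverage via the hat matrix diagonal.
# Simulated data; any flagged observations are illustrative only.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(11)
n, k = 40, 2
X = sm.add_constant(rng.normal(size=(n, k)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

fit = sm.OLS(y, X).fit()
e = fit.resid

# Durbin-Watson: sum of squared successive differences over sum of squared residuals
dw_manual = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print("DW =", dw_manual, "=", durbin_watson(e))     # near 2 means no autocorrelation

# Leverage: diagonal of the hat matrix H = X (X'X)^-1 X'
hat = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
threshold = 2 * (k + 1) / n
print("High-leverage observations:", np.where(hat > threshold)[0])
```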
Other Regression Topics: Outliers: Causes and Cures
• An outlier may be due to an error in recording the data; if so, the observation should be deleted.
• It is reasonable to discard an observation on the grounds that it represents a different population than the other observations.
• An outlier may also be an observation that has been influenced by an unspecified "lurking" variable that should have been controlled for but wasn't.

Other Regression Topics: Missing Predictors
• Unspecified "lurking" variables cause inaccurate predictions from the fitted regression.
• Try to identify the lurking variable and formulate a multiple regression model that includes both predictors.

Other Regression Topics: Ill-Conditioned Data
• All variables in the regression should be of the same general order of magnitude. Do not mix very large data values with very small data values.
• To avoid mixing magnitudes, adjust the decimal point in the offending variables, and be consistent throughout each data column. The decimal adjustments for different data columns need not be the same.

Other Regression Topics: Significance in Large Samples
• Statistical significance may not imply practical importance. Almost anything can be made significant with a large enough sample.

Other Regression Topics: Model Specification Errors
• A misspecified model occurs when you estimate a linear model where a nonlinear model is actually required, or when a relevant predictor is omitted.
• To detect misspecification:
  - plot the residuals against the estimated Y (there should be no discernible pattern);
  - plot the residuals against the actual Y (there should be no discernible pattern);
  - plot the fitted Y against the actual Y (the points should follow a 45° line).

Other Regression Topics: Missing Data
• Discard a variable if many of its data values are missing.
• If a Y value is missing, discard the observation to be conservative.
• If an X value is missing, other options are to use the mean of that X data column for the missing values, or to use a regression procedure to "fit" the missing X value from the complete observations.

Other Regression Topics: Binary Dependent Variable
• When the response variable Y is binary (0, 1), the least squares estimation method is no longer appropriate. Use logit or probit regression methods instead.

Other Regression Topics: Stepwise and Best Subsets Regression
• The stepwise regression procedure finds the best-fitting model using 1, 2, 3, …, k predictors. This procedure is appropriate only when there is no theoretical model that specifies which predictors should be used.
• Best subsets regression fits all possible combinations of predictors; a brief sketch follows.
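A brief sketch of a best-subsets search on simulated data with four hypothetical candidate predictors (X1 through X4, not from the textbook): every combination of predictors is fitted and the subsets are compared by adjusted R², echoing the parsimony criterion from earlier in the chapter.

```python
# Sketch: best-subsets search comparing every predictor combination by adjusted R^2.
# Simulated data; only X1 and X3 actually drive y here.
import itertools
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 50
X_all = rng.normal(size=(n, 4))                  # four candidate predictors
y = 1 + 2 * X_all[:, 0] - 1.5 * X_all[:, 2] + rng.normal(size=n)
names = ["X1", "X2", "X3", "X4"]

results = []
for r in range(1, len(names) + 1):
    for cols in itertools.combinations(range(len(names)), r):
        fit = sm.OLS(y, sm.add_constant(X_all[:, list(cols)])).fit()
        results.append((fit.rsquared_adj, [names[c] for c in cols]))

# The subset with the highest adjusted R^2 (ideally the parsimonious true model)
best = max(results, key=lambda t: t[0])
print("Best subset by adjusted R^2:", best[1], f"(adj R^2 = {best[0]:.3f})")
```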