OPRE504 Chapter Study Guide
Chapter 15: Multiple Regression
Chaodong Han, OPRE504 Data Analysis and Decisions, Class Handout

The Multiple Regression Model:
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$
where $b_0$ is the intercept and each $b_j$ is the estimated coefficient (slope) of its corresponding predictor $x_j$.

Residual: $e = y - \hat{y}$
Degrees of Freedom: $df = n - k - 1$ (n is the number of observations, k is the number of predictors)
Standard Deviation of Residuals: $s_e = \sqrt{\dfrac{\sum (y - \hat{y})^2}{n - k - 1}}$

Interpretation of $b_1$ (or $b_2$, ..., $b_k$): When the values of all other predictors are held constant, a one-unit change in $x_1$ is associated with a change of $b_1$ units in y.

I. Assumptions and Conditions

1. Check Linearity Conditions:
Pre-model check: scatterplots of y against each of the predictors are reasonably straight (no bend observed).
Post-model check: a scatterplot of residuals against the predicted values should show no obvious pattern, in particular no bend.
Violations of Linearity Conditions: [figure not reproduced]

2. Check Independence Conditions:
Error terms associated with individual observations should be independent of each other.
Rule of thumb: Random samples ensure independence. Without randomization, the generalization of regression models is limited to the data under analysis.
Check 1: a scatterplot of residuals against predicted values should show no trends or clumping.
Check 2: individual plots of residuals against each predictor should show no trends or clumping; pay special attention to serial correlation in time series data.
Violations of Independence Assumptions: [figure not reproduced]

3. Check Equal Variance Assumption (Homoscedasticity):
Variability of error terms should be the same (constant) for all values of each predictor.
Check 1: a scatterplot of residuals against the predicted values shows consistent spread.
Check 2: boxplots of y against each predictor x should show consistent spread.
Homoscedasticity vs. Heteroscedasticity: [figure not reproduced]

4. Check Normality Assumption:
Error terms around the regression model at any specific values of the x-predictors should follow a Normal or nearly Normal distribution. Check the normality of the residuals and of the individual variables, and identify outliers, using a normal probability plot (e.g., in DDXL).

Visit the Following Links for More Details:
Testing the Assumptions of Linear Regression
http://www.duke.edu/~rnau/testing.htm
Testing Assumptions of Linear Regression Using SPSS
http://www.utexas.edu/courses/schwab/sw388r6_fall_2006/SolvingProblems/Homework%20Problems%20-Simple%20Linear%20Regression%20-%20Testing%20%20Assumptions.ppt
Regression Diagnostics Using SPSS
http://www.ats.ucla.edu/stat/spss/webbooks/reg/chapter2/spssreg2.htm
Testing Linear Assumptions Using SAS
http://www2.sas.com/proceedings/sugi22/STATS/PAPER267.PDF
Check Linear Regression Assumptions for Ph.D. Studies
http://courses.unt.edu/yeatts/6200-Multivariate%20Stats/Lectures-Tests/Test%202/Week-11diagnostics-solutions.pdf
Assumptions of Linear Regression
http://www.statisticssolutions.com/methods-chapter/statistical-tests/assumptions-of-linearregression/

II. Hypothesis Tests and Interpretations in Multiple Regression

Given the underlying population model:
$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k$

Model Hypotheses:
H0: $\beta_1 = \beta_2 = \dots = \beta_k = 0$ (the model predicts no better than using the grand mean)
Ha: at least one $\beta_j$ is not 0.

F-test for the Model:
An F-test with numerator degrees of freedom k and denominator degrees of freedom n - k - 1 (k = number of predictors; the 1 accounts for the intercept).
$F(k,\ n-k-1) = \dfrac{SSR/k}{SSE/(n-k-1)} = \dfrac{MSR}{MSE}$

Since $R^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST}$, the F statistic can also be written in terms of $R^2$:

$F(k,\ n-k-1) = \dfrac{SSR/k}{SSE/(n-k-1)} = \dfrac{SSR/(SST \cdot k)}{SSE/(SST \cdot (n-k-1))} = \dfrac{\dfrac{SSR}{SST} \cdot \dfrac{1}{k}}{\dfrac{SSE}{SST} \cdot \dfrac{1}{n-k-1}} = \dfrac{R^2 / k}{(1 - R^2)/(n-k-1)}$

T-tests for Individual Coefficients at Desired Alpha Levels:
$t = \dfrac{b_i}{SE(b_i)}$, compared against the critical value $t^*_{n-k-1,\ \alpha}$.

If one predictor is not significant, it does not necessarily mean that this predictor has no linear relationship to the dependent variable y; rather, it means that this predictor adds no statistically significant explanation of y after controlling for all other predictors.

Confidence Intervals for Each Slope (Coefficient):
$b_i \pm t^*_{n-k-1} \times SE(b_i)$, obtained using statistical software.

R2 and Adjusted R2:
$R^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST}$
$R^2_{adj} = 1 - (1 - R^2)\dfrac{n-1}{n-k-1} = 1 - \dfrac{SSE/(n-k-1)}{SST/(n-1)}$

Multicollinearity Issue:
When two independent variables are highly correlated, multicollinearity is a serious concern.

Variance Inflation Factor (VIF) Test:
$VIF_j = \dfrac{1}{1 - R_j^2}$, where $R_j^2$ is obtained by using the jth predictor as the dependent variable and all other predictors as independent variables.

[Chapter 15, Exercises 22, 25 and 26, Sharpe 2011, pp. 510-511]
Here is a dataset containing monthly revenue of Wal-Mart Corp., relating that revenue to the Total U.S. Retail Sales, the Personal Consumption Index, and the Consumer Price Index.
Date        Wal-Mart Revenue  CPI      Personal Consumption  Retail Sales Index  December
11/28/2003  14.764            552.7    7868495               301337              0
12/30/2003  23.106            552.1    7885264               357704              1
1/30/2004   12.131            554.9    7977730               281463              0
2/27/2004   13.628            557.9    8005878               282445              0
3/31/2004   16.722            561.5    8070480               319107              0
4/29/2004   13.98             563.2    8086579               315278              0
5/28/2004   14.388            566.4    8196516               328499              0
6/30/2004   18.111            568.2    8161271               321151              0
7/27/2004   13.764            567.5    8235349               328025              0
8/27/2004   14.296            567.6    8246121               326280              0
9/30/2004   17.169            568.7    8313670               313444              0
10/29/2004  13.915            571.9    8371605               319639              0
11/29/2004  15.739            572.2    8410820               324067              0
12/31/2004  26.177            570.1    8462026               386918              1
1/21/2005   13.17             571.2    8469443               293027              0
2/24/2005   15.139            574.5    8520687               294892              0
3/30/2005   18.683            579      8568959               338969              0
4/29/2005   14.829            582.9    8654352               335626              0
5/25/2005   15.697            582.4    8644646               345400              0
6/28/2005   20.23             582.6    8724753               351068              0
7/28/2005   15.26             585.2    8833907               351887              0
8/26/2005   15.709            588.2    8825450               355897              0
9/30/2005   18.618            595.4    8882536               333652              0
10/31/2005  15.397            596.7    8911627               336662              0
11/28/2005  17.384            592      8916377               344441              0
12/30/2005  27.92             589.4    8955472               406510              1
1/27/2006   14.555            593.9    9034368               322222              0
2/23/2006   18.684            595.2    9079246               318184              0
3/31/2006   16.639            598.6    9123848               366989              0
4/28/2006   20.17             603.5    9175181               357334              0
5/25/2006   16.901            606.5    9238576               380085              0
6/30/2006   21.47             607.8    9270505               373279              0
7/28/2006   16.542            609.6    9338876               368611              0
8/29/2006   16.98             610.9    9352650               382600              0
9/28/2006   20.091            607.9    9348494               352686              0
10/20/2006  16.583            604.6    9376027               354740              0
11/24/2006  18.761            603.6    9410758               363468              0
12/29/2006  28.795            604.5    9478531               424946              1
1/26/2007   20.473            606.348  9540335               332797              0

a) Research Question
Is Wal-Mart revenue closely associated with the general state of the U.S. economy?

b) State Hypotheses
Based on economic theory and reasoning, we can formulate the following hypotheses:
H1: Wal-Mart revenue is positively associated with the Total U.S. Retail Sales.
H2: Wal-Mart revenue is positively associated with the Personal Consumption Index.
H3: Wal-Mart revenue is negatively associated with the Consumer Price Index.

c) The Regression Model
Revenue = β0 + β1 Total Retail Sales + β2 Personal Consumption Index + β3 Consumer Price Index

d) Descriptive Statistics
In DDXL: Charts and Plots – Normal Probability Plot
[Normal probability plots: Wal-Mart Revenue, Retail Sales Index, Personal Consumption, CPI]
Summary: There appear to be four outlier values for Wal-Mart Revenue (23.106 in 12/2003; 26.177 in 12/2004; 27.92 in 12/2005; 28.795 in 12/2006). The other variables appear to be approximately normal.

e) Correlation Table (include only the independent variables used in the final regression)
Excel – Data – Data Analysis – Correlation (highlight all independent variables)

                      CPI   Personal Consumption  Retail Sales Index
CPI                   1.00
Personal Consumption  0.98  1.00
Retail Sales Index    0.63  0.64                  1.00

Summary: CPI and Personal Consumption are highly correlated (r = 0.98), which could cause a multicollinearity problem in the multiple regression. The regression results should therefore be interpreted with caution. Several methods can be used to address multicollinearity.

f) Check Regression Assumptions

1. Linearity (check scatterplots of y vs. each x using DDXL)
[Scatterplots of Wal-Mart revenue against Retail Sales, Personal Consumption, and CPI]
Summary: each independent variable shows a roughly linear relationship with the dependent variable (Wal-Mart revenue).

2. Homoscedasticity (check a scatterplot of residuals vs. predicted/fitted values)
[Scatterplot: Residuals vs. Fitted values]
Summary: no particular pattern (bend) is observed.

3. Independence (whether there is serial correlation)
[Scatterplot: Residuals vs. Time]
Summary: the residuals show a bumping pattern over time, indicating serial correlation; the independence assumption may be violated, and a model other than linear regression may be more appropriate.

4. Normality (check the normal probability plot of the residuals)
Summary: a largely straight line is shown, indicating that the normality assumption is met.
Data Analysis ToolPak: Regression – Standardized Residuals – Residual Plot

g) Regression Results

Source    SS          df  MS
Model     378.748744   3  126.249581
Residual  189.474058  35  5.4135445
Total     568.222802  38  14.9532316

Number of obs  = 39
F(3, 35)       = 23.32
Prob > F       = 0.0000
R-squared      = 0.6665
Adj R-squared  = 0.6380
Root MSE       = 2.3267

walmartrev~e  Coef.      Std. Err.  t      P>|t|  [95% Conf. Interval]
retailsale~x  .0001032   .0000155    6.67  0.000  .0000718    .0001345
personalco~n  .0000111   4.40e-06    2.52  0.017  2.15e-06    .00002
cpi           -.3447946  .120335    -2.87  0.007  -.5890876   -.1005017
_cons         87.00878   33.59896    2.59  0.014  18.79926    155.2183

h) Interpret the Coefficients and Test the Hypotheses
The coefficient for Retail Sales Index is 0.0001032 and highly significant (p < 0.001), suggesting that Retail Sales Index is positively associated with Wal-Mart revenue. H1 is supported.
The coefficient for Personal Consumption is 0.0000111 and significant (p = 0.017 < 0.05), suggesting that Personal Consumption is positively associated with Wal-Mart revenue. H2 is supported.
The coefficient for CPI is -0.3447946 and significant (p = 0.007 < 0.01), suggesting that CPI is negatively associated with Wal-Mart revenue. H3 is supported.

i) Conclusion
Wal-Mart revenue is closely related to the general state of the U.S. economy.
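The summary statistics in the regression output of step g) follow directly from the sums of squares via the formulas in Section II. The short Python sketch below is an illustration added for checking the arithmetic, not part of the original analysis; the variable names are arbitrary, and the inputs are the SS values, n, and k copied from the output table.

```python
import math

# Values copied from the ANOVA portion of the regression output in step g)
ssr = 378.748744   # model (regression) sum of squares
sse = 189.474058   # residual sum of squares
n, k = 39, 3       # number of observations and number of predictors

sst = ssr + sse                  # total sum of squares
msr = ssr / k                    # mean square for the model
mse = sse / (n - k - 1)          # mean square for the residuals

f_stat = msr / mse                               # F = MSR / MSE
r2 = ssr / sst                                   # R^2 = SSR / SST
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)    # adjusted R^2
root_mse = math.sqrt(mse)                        # standard deviation of residuals

# The same F statistic expressed through R^2
f_from_r2 = (r2 / k) / ((1 - r2) / (n - k - 1))

print(f"F({k}, {n - k - 1}) = {f_stat:.2f}")
print(f"R-squared     = {r2:.4f}")
print(f"Adj R-squared = {adj_r2:.4f}")
print(f"Root MSE      = {root_mse:.4f}")
print(f"F from R^2    = {f_from_r2:.2f}")
```

The printed values should agree with the reported F(3, 35), R-squared, Adj R-squared, and Root MSE to the precision shown in the output table.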
Outlier: Regression Results without Outlier Values of Revenue

Regression Statistics
Multiple R          0.649807474
R Square            0.422249753
Adjusted R Square   0.366338439
Standard Error      1.87418242
Observations        35

ANOVA
            df  SS           MS           F           Significance F
Regression   3  79.58196867  26.52732289  7.55213429  0.0006233
Residual    31  108.8893521  3.512559744
Total       34  188.4713207

                      Coefficients  Standard Error  t Stat     P-value   Lower 95%   Upper 95%
Intercept             -22.641017    35.989036       -0.629109  0.533887  -96.041140  50.759106
CPI                   0.047569      0.129729        0.366678   0.716350  -0.217015   0.312152
Personal Consumption  0.000001      0.000004        0.183319   0.855741  -0.000008   0.000010
Retail Sales Index    0.000013      0.000023        0.591614   0.558399  -0.000033   0.000060

Conclusion: When the outliers (December holiday sales) are excluded, none of the three predictors is individually significant (all p-values exceed 0.5), even though the overall F-test remains significant (Significance F = 0.0006). On this evidence, there is no clear relationship between Wal-Mart revenue and the general state of the U.S. economy.
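The multicollinearity concern noted with the correlation table in step e) can be roughly quantified with the VIF formula from Section II. The sketch below is an illustration only, not part of the handout's analysis. It assumes the rounded pairwise correlations from that table (0.98, 0.63, 0.64) and uses the standard closed-form R² for a regression on two predictors, so the resulting VIFs are approximate; the function names `r2_on_two` and `vif` are hypothetical.

```python
def r2_on_two(r_ja, r_jb, r_ab):
    """R^2 from regressing predictor j on predictors a and b,
    computed from pairwise correlations only."""
    return (r_ja**2 + r_jb**2 - 2 * r_ja * r_jb * r_ab) / (1 - r_ab**2)


def vif(r_ja, r_jb, r_ab):
    """VIF_j = 1 / (1 - R_j^2) for a model with three predictors."""
    return 1.0 / (1.0 - r2_on_two(r_ja, r_jb, r_ab))


# Rounded correlations taken from the correlation table in step e)
r_cpi_pc = 0.98   # CPI vs. Personal Consumption
r_cpi_rs = 0.63   # CPI vs. Retail Sales Index
r_pc_rs = 0.64    # Personal Consumption vs. Retail Sales Index

vifs = {
    "CPI": vif(r_cpi_pc, r_cpi_rs, r_pc_rs),
    "Personal Consumption": vif(r_cpi_pc, r_pc_rs, r_cpi_rs),
    "Retail Sales Index": vif(r_cpi_rs, r_pc_rs, r_cpi_pc),
}

for name, value in vifs.items():
    print(f"VIF({name}) = {value:.1f}")
```

With these rounded inputs, the VIFs for CPI and Personal Consumption come out well above the common rule-of-thumb threshold of 10, while the VIF for Retail Sales Index stays below 2. Because a correlation of 0.98 is close to 1, small rounding differences change the first two values noticeably, so they should be read as orders of magnitude rather than exact figures.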