Chapter 13: Multiple Regression Analysis
© 2002 Thomson / South-Western

Slide 13-2: Learning Objectives
• Develop a multiple regression model.
• Understand and apply techniques that can be used to determine how well a regression model fits data.
• Analyze and interpret nonlinear variables and understand how to use them in multiple regression analysis.
• Understand the role of qualitative variables and how to use them in multiple regression analysis.
• Learn how to build and evaluate multiple regression models.

Slide 13-3: The Multiple Regression Model
• Multiple regression is regression analysis with one dependent variable and either two or more independent variables or at least one nonlinear independent variable.
• The response variable is the dependent variable, the variable that the business analyst is trying to predict.

Slide 13-4: The Probabilistic Multiple Regression Model

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_k X_k + \epsilon$$

where:
Y = the value of the dependent (response) variable
$\beta_0$ = the regression constant
$\beta_1$ = the partial regression coefficient of independent variable 1
$\beta_2$ = the partial regression coefficient of independent variable 2
$\beta_k$ = the partial regression coefficient of independent variable k
k = the number of independent variables
$\epsilon$ = the error of prediction

Slide 13-5: Estimated Regression Model

$$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + \dots + b_k X_k$$

where:
$\hat{Y}$ = predicted value of Y
$b_0$ = estimate of regression constant
$b_1$ = estimate of regression coefficient 1
$b_2$ = estimate of regression coefficient 2
$b_3$ = estimate of regression coefficient 3
$b_k$ = estimate of regression coefficient k
k = number of independent variables

Slide 13-6: Multiple Regression Model with Two Independent Variables (First-Order)

Population model:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon$$
where:
$\beta_0$ = the regression constant
$\beta_1$ = the partial regression coefficient for independent variable 1
$\beta_2$ = the partial regression coefficient for independent variable 2
$\epsilon$ = the error of prediction

Estimated model:
$$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2$$
where:
$\hat{Y}$ = predicted value of Y
$b_0$ = estimate of regression constant
$b_1$ = estimate of regression coefficient 1
$b_2$ = estimate of regression coefficient 2

Slide 13-7: Response Plane for the First-Order Two-Predictor Multiple Regression Model
[Figure: the fitted values form a response plane over the (X1, X2) plane; the vertical intercept of the plane is the regression constant.]

Slide 13-8: Least Squares Equations for k = 2
Least squares analysis is the process by which a regression model is developed, based on calculus techniques that produce the minimum sum of squared error values. For k = 2, the estimates $b_0$, $b_1$, and $b_2$ solve the normal equations:

$$\begin{aligned}
b_0 n + b_1 \sum X_1 + b_2 \sum X_2 &= \sum Y \\
b_0 \sum X_1 + b_1 \sum X_1^2 + b_2 \sum X_1 X_2 &= \sum X_1 Y \\
b_0 \sum X_2 + b_1 \sum X_1 X_2 + b_2 \sum X_2^2 &= \sum X_2 Y
\end{aligned}$$
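As an illustration of how these three normal equations can be solved in practice, here is a minimal NumPy sketch. The x1, x2, and y arrays are small made-up values for demonstration only (not data from the text), and the use of np.linalg.solve is this sketch's choice.

```python
import numpy as np

# Illustrative data only -- any two predictors and a response will do.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y  = np.array([3.1, 4.2, 7.9, 8.1, 11.0])
n  = len(y)

# Coefficient matrix and right-hand side of the three normal equations
A = np.array([
    [n,        x1.sum(),       x2.sum()],
    [x1.sum(), (x1**2).sum(),  (x1*x2).sum()],
    [x2.sum(), (x1*x2).sum(),  (x2**2).sum()],
])
rhs = np.array([y.sum(), (x1*y).sum(), (x2*y).sum()])

# Solve for the least squares estimates b0, b1, b2
b0, b1, b2 = np.linalg.solve(A, rhs)
print(b0, b1, b2)
```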
Slide 13-9: Real Estate Data

Obs   Market Price ($1,000) Y   Square Feet X1   Age (Years) X2
1     63.0     1,605   35
2     65.1     2,489   45
3     69.9     1,553   20
4     76.8     2,404   32
5     73.9     1,884   25
6     77.9     1,558   14
7     74.9     1,748   8
8     78.0     3,105   10
9     79.0     1,682   28
10    63.4     2,470   30
11    79.5     1,820   2
12    83.9     2,143   6
13    79.7     2,121   14
14    84.5     2,485   9
15    96.0     2,300   19
16    109.5    2,714   4
17    102.5    2,463   5
18    121.0    3,076   7
19    104.9    3,048   3
20    128.0    3,267   6
21    129.0    3,069   10
22    117.9    4,765   11
23    140.0    4,540   8

Slide 13-10: Predicting the Price of a Home

$$\hat{Y} = 57.351 + 0.0177 X_1 - 0.666 X_2$$

For $X_1 = 2500$ and $X_2 = 12$:
$$\hat{Y} = 57.351 + 0.0177(2500) - 0.666(12) \approx 93.605 \text{ thousand dollars}$$

Slide 13-11: Evaluating the Multiple Regression Model

Testing the overall model:
H0: $\beta_1 = \beta_2 = \beta_3 = \dots = \beta_k = 0$
Ha: At least one of the regression coefficients is $\neq 0$

Significance tests for individual regression coefficients:
H0: $\beta_1 = 0$  vs.  Ha: $\beta_1 \neq 0$
H0: $\beta_2 = 0$  vs.  Ha: $\beta_2 \neq 0$
H0: $\beta_3 = 0$  vs.  Ha: $\beta_3 \neq 0$
...
H0: $\beta_k = 0$  vs.  Ha: $\beta_k \neq 0$

Slide 13-12: Testing the Overall Model for the Real Estate Example

H0: $\beta_1 = \beta_2 = 0$
Ha: At least one of the regression coefficients is $\neq 0$

$$F = \frac{MSR}{MSE}, \qquad MSR = \frac{SSR}{k}, \qquad MSE = \frac{SSE}{n - k - 1}$$

$F_{.01,2,20} = 5.85$. Since $F_{Cal} = 28.63 > 5.85$, reject H0.

ANOVA
Source             df   SS          MS         F       p
Regression         2    8189.723    4094.862   28.63   .0000014
Residual (Error)   20   2861.017    143.051
Total              22   11050.740

Slide 13-13: Significance Tests of the Regression Coefficients for the Real Estate Example

H0: $\beta_1 = 0$, Ha: $\beta_1 \neq 0$
H0: $\beta_2 = 0$, Ha: $\beta_2 \neq 0$

$t_{.025,20} = 2.086$. For $X_1$, $t_{Cal} = 5.63 > 2.086$, so reject H0; for $X_2$, $|t_{Cal}| = 2.92 > 2.086$, so reject H0.

Variable        Coefficient   Std Dev    t Stat   p
x1 (Sq. Feet)   0.0177        0.003146   5.63     .000016
x2 (Age)        -0.666        0.2280     -2.92    .008418

Slide 13-14: Residuals
• A residual is the difference between the actual Y value and the Y value predicted by the regression model.
• It is the error of the regression model in predicting each value of the dependent variable.

Slide 13-15: SSE and Standard Error of the Estimate

From the ANOVA table (Slide 13-12), SSE = 2861.0, so

$$S_e = \sqrt{\frac{SSE}{n - k - 1}} = \sqrt{\frac{2861}{23 - 2 - 1}} = 11.96$$

where:
n = number of observations
k = number of independent variables

Slide 13-16: Coefficient of Multiple Determination (R²)

With SSR = 8189.723, SSE = 2861.017, and $SS_{yy}$ = 11050.74 from the ANOVA table:

$$R^2 = \frac{SSR}{SS_{yy}} = \frac{8189.723}{11050.74} = .741$$

$$R^2 = 1 - \frac{SSE}{SS_{yy}} = 1 - \frac{2861.017}{11050.74} = .741$$

Slide 13-17: Adjusted R²

$$\text{adj. } R^2 = 1 - \frac{SSE/(n - k - 1)}{SS_{yy}/(n - 1)} = 1 - \frac{2861.017/(23 - 2 - 1)}{11050.74/(23 - 1)} = 1 - .285 = .715$$

Slide 13-18: Indicator (Dummy) Variables
• Qualitative variables are represented in a regression model by indicator (dummy) variables.
• The number of dummy variables needed for a qualitative variable is the number of categories (c) less one.
• For dichotomous variables, such as gender, only one dummy variable is needed: there are two categories (female, male), so c = 2 and c - 1 = 1.
• Your office is located in which region of the country?
  ___ Northeast  ___ Midwest  ___ South  ___ West
  Number of dummy variables = c - 1 = 4 - 1 = 3 (a coding sketch follows below).
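Here is a minimal sketch of this c - 1 coding rule, applied to the four-region question above. The list of offices is invented for illustration, and treating West as the reference category (coded by all three dummies being 0) is an arbitrary choice.

```python
import numpy as np

# Region names follow the slide; the offices list is hypothetical.
regions = ["Northeast", "Midwest", "South", "West"]
offices = ["South", "West", "Northeast", "South", "Midwest"]

# Build c - 1 = 3 dummy columns; "West" is the reference level,
# represented by a row of all zeros.
dummies = np.array([[1 if r == cat else 0 for cat in regions[:-1]]
                    for r in offices])
print(dummies)
```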
Slide 13-19: Data for the Monthly Salary Example

Obs   Monthly Salary ($1,000)   Age (10 Years)   Gender (1 = Male, 0 = Female)
1     1.548   3.2   1
2     1.629   3.8   1
3     1.011   2.7   0
4     1.229   3.4   0
5     1.746   3.6   1
6     1.528   4.1   1
7     1.018   3.8   0
8     1.190   3.4   0
9     1.551   3.3   1
10    0.985   3.2   0
11    1.610   3.5   1
12    1.432   2.9   1
13    1.215   3.3   0
14    0.990   2.8   0
15    1.585   3.5   1

Slide 13-20: Regression Output for the Monthly Salary Example

The regression equation is
Salary = 0.732 + 0.111 Age + 0.459 Gender

Predictor   Coef      StDev     T      P
Constant    0.7321    0.2356    3.11   0.009
Age         0.11122   0.07208   1.54   0.149
Gender      0.45868   0.05346   8.58   0.000

S = 0.09679   R-Sq = 89.0%   R-Sq(adj) = 87.2%

Analysis of Variance
Source       DF   SS        MS        F       P
Regression   2    0.90949   0.45474   48.54   0.000
Error        12   0.11242   0.00937
Total        14   1.02191

Slide 13-21: Regression Model Depicted with Males and Females Separated
[Figure: fitted salary ($1,000) versus age (10 years) drawn as two parallel lines, with the male line lying 0.459 above the female line.]

Slide 13-22: More Complex Regression Models

First-order with two independent variables:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon$$

Second-order with one independent variable:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \epsilon$$

Second-order with an interaction term:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon$$

Second-order with two independent variables:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \beta_4 X_2^2 + \beta_5 X_1 X_2 + \epsilon$$

Slide 13-23: Example: Sales Data and Scatter Plot for 13 Manufacturing Companies

Manufacturer   Sales ($1,000,000)   Number of Manufacturing Representatives
1     2.1     2
2     3.6     1
3     6.2     2
4     10.4    3
5     22.8    4
6     35.6    4
7     57.1    5
8     83.5    5
9     109.4   6
10    128.6   7
11    196.8   8
12    280.0   10
13    462.3   11

[Figure: scatter plot of sales versus number of representatives, showing a curved, increasing pattern.]

Slide 13-24: Excel Simple Linear Regression Output for the Manufacturing Example

Regression Statistics
Multiple R          0.933
R Square            0.870
Adjusted R Square   0.858
Standard Error      51.10
Observations        13

            Coefficients   Standard Error   t Stat   P-value
Intercept   -107.03        28.737           -3.72    0.003
numreps     41.026         4.779            8.58     0.000

ANOVA
Source       df   SS       MS       F       Significance F
Regression   1    192395   192395   73.69   0.000
Residual     11   28721    2611
Total        12   221117

Slide 13-25: Manufacturing Data with Newly Created Variable

Manufacturer   Sales ($1,000,000)   No. Mgfr Reps X1   (No. Mgfr Reps)² X2 = (X1)²
1     2.1     2    4
2     3.6     1    1
3     6.2     2    4
4     10.4    3    9
5     22.8    4    16
6     35.6    4    16
7     57.1    5    25
8     83.5    5    25
9     109.4   6    36
10    128.6   7    49
11    196.8   8    64
12    280.0   10   100
13    462.3   11   121

Slide 13-26: Scatter Plots Using Original and Transformed Data
[Figure: two scatter plots of sales, one against the number of representatives (curved pattern) and one against the number of manufacturing representatives squared (roughly linear pattern).]
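For comparison with the Excel output on the next slide, here is a sketch of the same quadratic fit using the 13 manufacturers' data above. The choice of statsmodels is this example's, not the text's; any OLS routine would do.

```python
import numpy as np
import statsmodels.api as sm

# Sales data for the 13 manufacturing companies (from the table above)
sales = np.array([2.1, 3.6, 6.2, 10.4, 22.8, 35.6, 57.1,
                  83.5, 109.4, 128.6, 196.8, 280.0, 462.3])
reps  = np.array([2, 1, 2, 3, 4, 4, 5, 5, 6, 7, 8, 10, 11], dtype=float)

# Create the new variable X2 = X1**2 and fit Y = b0 + b1*X1 + b2*X1**2
X = sm.add_constant(np.column_stack([reps, reps**2]))
model = sm.OLS(sales, X).fit()
print(model.params)    # approx. [18.07, -15.72, 4.75], per the next slide
print(model.rsquared)  # approx. 0.973
```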
Slide 13-27: Excel Output for the Quadratic Model to Predict Sales

Regression Statistics
Multiple R          0.986
R Square            0.973
Adjusted R Square   0.967
Standard Error      24.593
Observations        13

            Coefficients   Standard Error   t Stat   P-value
Intercept   18.067         24.673           0.73     0.481
MfgrRp      -15.723        9.5450           -1.65    0.131
MfgrRpSq    4.750          0.776            6.12     0.000

ANOVA
Source       df   SS       MS       F        Significance F
Regression   2    215069   107534   177.79   0.000
Residual     10   6048     605
Total        12   221117

Slide 13-28: Regression Models with Interaction. Example: Prices of Three Stocks over a 15-Month Period

Stock 1   Stock 2   Stock 3
41    36    35
39    36    35
38    38    32
45    51    41
41    52    39
43    55    55
47    57    52
49    58    54
41    62    65
35    70    77
36    72    75
39    74    74
33    83    81
28    101   92
31    107   91

Slide 13-29: Regression Models for the Three Stocks

First-order with two independent variables:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon$$
where:
Y = price of stock 1
$X_1$ = price of stock 2
$X_2$ = price of stock 3

Second-order with an interaction term:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon$$
where:
Y = price of stock 1
$X_1$ = price of stock 2
$X_2$ = price of stock 3
$X_3 = X_1 X_2$ (the interaction term)

Slide 13-30: Regression for Three Stocks: Two Predictors, No Interaction

The regression equation is
Stock 1 = 50.9 - 0.119 Stock 2 - 0.071 Stock 3

Predictor   Coef      StDev    T       P
Constant    50.855    3.791    13.41   0.000
Stock 2     -0.1190   0.1931   -0.62   0.549
Stock 3     -0.0708   0.1990   -0.36   0.728

S = 4.570   R-Sq = 47.2%   R-Sq(adj) = 38.4%

Analysis of Variance
Source       DF   SS       MS       F      Sig. F
Regression   2    224.29   112.15   5.37   0.022
Error        12   250.64   20.89
Total        14   474.93

Slide 13-31: Regression for Three Stocks: With Interaction Term

The regression equation is
Stock 1 = 12.0 + 0.879 Stock 2 + 0.220 Stock 3 - 0.00998 Inter

Predictor   Coef        StDev      T       P
Constant    12.046      9.312      1.29    0.222
Stock 2     0.8788      0.2619     3.36    0.006
Stock 3     0.2205      0.1435     1.54    0.153
Inter       -0.009985   0.002314   -4.31   0.001

S = 2.909   R-Sq = 80.4%   R-Sq(adj) = 75.1%

Analysis of Variance
Source       DF   SS       MS       F       Sig. F
Regression   3    381.85   127.28   15.04   0.000
Error        11   93.09    8.46
Total        14   474.93
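The interaction model just shown can be refit from the stock price data above with a few lines of code. This is a sketch; the library choice (statsmodels) and the variable names are this example's, not the text's.

```python
import numpy as np
import statsmodels.api as sm

# Monthly prices of the three stocks (from the table on Slide 13-28)
stock1 = np.array([41, 39, 38, 45, 41, 43, 47, 49, 41, 35, 36, 39, 33, 28, 31], float)
stock2 = np.array([36, 36, 38, 51, 52, 55, 57, 58, 62, 70, 72, 74, 83, 101, 107], float)
stock3 = np.array([35, 35, 32, 41, 39, 55, 52, 54, 65, 77, 75, 74, 81, 92, 91], float)

# X3 = X1 * X2 is the interaction term added to the first-order model
X = sm.add_constant(np.column_stack([stock2, stock3, stock2 * stock3]))
model = sm.OLS(stock1, X).fit()
print(model.params)    # approx. [12.05, 0.879, 0.220, -0.00998], per the slide above
print(model.rsquared)  # approx. 0.804
```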
Slide 13-32: Nonlinear Regression Models: Model Transformation

Exponential model:
$$\hat{Y} = b_0 b_1^X$$

Taking logarithms of both sides gives a linear model:
$$\log \hat{Y} = \log b_0 + X \log b_1$$
$$\hat{Y}' = b_0' + b_1' X$$
where:
$\hat{Y}' = \log \hat{Y}$
$b_0' = \log b_0$
$b_1' = \log b_1$

Slide 13-33: Data Set for Model Transformation Example

ORIGINAL DATA                  TRANSFORMED DATA
Company   Y       X            Company   log Y      X
1         2580    1.2          1         3.41162    1.2
2         11942   2.6          2         4.077077   2.6
3         9845    2.2          3         3.993216   2.2
4         27800   3.2          4         4.444045   3.2
5         18926   2.9          5         4.277059   2.9
6         4800    1.5          6         3.681241   1.5
7         14550   2.7          7         4.162863   2.7

Y = Sales ($ million/year)   X = Advertising ($ million/year)

Slide 13-34: Regression Output for the Model Transformation Example

Regression Statistics
Multiple R          0.990
R Square            0.980
Adjusted R Square   0.977
Standard Error      0.054
Observations        7

            Coefficients   Standard Error   t Stat   P-value
Intercept   2.9003         0.0729           39.80    0.000
X           0.4751         0.0300           15.82    0.000

ANOVA
Source       df   SS       MS       F        Significance F
Regression   1    0.7392   0.7392   250.36   0.000
Residual     5    0.0148   0.0030
Total        6    0.7540

Slide 13-35: Prediction with the Transformed Model

$$\hat{Y} = b_0 b_1^X \quad\Rightarrow\quad \log \hat{Y} = \log b_0 + X \log b_1 = 2.900364 + 0.475127 X$$

For X = 2:
$$\log \hat{Y} = 2.900364 + 0.475127(2) = 3.850618$$
$$\hat{Y} = \text{antilog}(3.850618) = 7087.61$$

Slide 13-36: Prediction with the Transformed Model (continued)

The same prediction can be made in the original (untransformed) scale by back-transforming the coefficients:
$$\log b_0 = 2.900364 \quad\Rightarrow\quad b_0 = \text{antilog}(2.900364) = 794.99427$$
$$\log b_1 = 0.475127 \quad\Rightarrow\quad b_1 = \text{antilog}(0.475127) = 2.986256$$

For X = 2:
$$\hat{Y} = 794.99427(2.986256)^2 = 7089.5$$
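The whole transformation round trip can be checked in a few lines. This sketch uses base-10 logarithms (matching the slides' antilog arithmetic) and NumPy's polyfit for the straight-line fit; both are this example's choices.

```python
import numpy as np

# Sales and advertising data for the seven companies (from Slide 13-33)
y = np.array([2580, 11942, 9845, 27800, 18926, 4800, 14550], float)
x = np.array([1.2, 2.6, 2.2, 3.2, 2.9, 1.5, 2.7])

# Fit log10(Y) = log10(b0) + X * log10(b1) by ordinary least squares
slope, intercept = np.polyfit(x, np.log10(y), 1)
print(intercept, slope)   # approx. 2.9004 and 0.4751, per the output slide

# Back-transform to predict Y for X = 2, as on the prediction slides
y_hat = 10 ** (intercept + slope * 2.0)
print(y_hat)              # approx. 7088
```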
Slide 13-37: Model-Building: Search Procedures
• All Possible Regressions
• Forward Selection
• Backward Elimination
• Stepwise Regression

Slide 13-38: Data for Multiple Regression to Predict Crude Oil Production

Y = World Crude Oil Production
X1 = U.S. Energy Consumption
X2 = U.S. Nuclear Generation
X3 = U.S. Coal Production
X4 = U.S. Fuel Rate for Autos

Y      X1     X2      X3       X4
55.7   74.3   83.5    598.6    13.30
55.7   72.5   114.0   610.0    13.42
52.8   70.5   172.5   654.6    13.52
57.3   74.4   191.1   684.9    13.53
59.7   76.3   250.9   697.2    13.80
60.2   78.1   276.4   670.2    14.04
62.7   78.9   255.2   781.1    14.41
59.6   76.0   251.1   829.7    15.46
56.1   74.0   272.7   823.8    15.94
53.5   70.8   282.8   838.1    16.65
53.3   70.5   293.7   782.1    17.14
54.5   74.1   327.6   895.9    17.83
54.0   74.0   383.7   883.6    18.20
56.2   74.3   414.0   890.3    18.27
56.7   76.9   455.3   918.8    19.20
58.7   80.2   527.0   950.3    19.87
59.9   81.3   529.4   980.7    20.31
60.6   81.3   576.9   1029.1   21.02
60.2   81.1   612.6   996.0    21.69
60.2   82.1   618.8   997.5    21.68
60.6   83.9   610.3   945.4    21.04
60.9   85.6   640.4   1033.5   21.48

Slide 13-39: Example: All Possible Regressions with Four Independent Variables

Single predictor:   X1 | X2 | X3 | X4
Two predictors:     X1, X2 | X1, X3 | X1, X4 | X2, X3 | X2, X4 | X3, X4
Three predictors:   X1, X2, X3 | X1, X2, X4 | X1, X3, X4 | X2, X3, X4
Four predictors:    X1, X2, X3, X4

Slide 13-40: Forward Selection
Like stepwise regression, except that variables are not reevaluated after entering the model.

Slide 13-41: Backward Elimination
• Start with the "full model" (all k predictors).
• If all predictors are significant, stop.
• Otherwise, eliminate the least significant predictor and return to the previous step.

Slide 13-42: Stepwise Regression
• Perform k simple regressions and select the best as the initial model.
• Evaluate each variable not in the model:
  – If none meets the entry criterion, stop.
  – Otherwise, add the best variable to the model; reevaluate the variables already in the model, and drop any that are no longer significant.
• Return to the previous step.

Slide 13-43: Multicollinearity
Multicollinearity is the condition that occurs when two or more of the independent variables of a multiple regression model are highly correlated. Its consequences include (a diagnostic sketch follows below):
– It becomes difficult to interpret the estimates of the regression coefficients.
– Inordinately small t values for the regression coefficients may result.
– The standard deviations of the regression coefficients are overestimated.
– The sign of a predictor variable's coefficient may be the opposite of what is expected.
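The chapter does not prescribe a specific diagnostic, but one common screen for multicollinearity is the variance inflation factor, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing predictor j on the remaining predictors. The sketch below computes VIFs for the four crude oil predictors above; the rule of thumb that VIF values well above 10 signal trouble is a common convention, not from the text.

```python
import numpy as np

# The four predictors from the crude oil data (Slide 13-38)
x1 = np.array([74.3, 72.5, 70.5, 74.4, 76.3, 78.1, 78.9, 76.0, 74.0, 70.8, 70.5,
               74.1, 74.0, 74.3, 76.9, 80.2, 81.3, 81.3, 81.1, 82.1, 83.9, 85.6])
x2 = np.array([83.5, 114.0, 172.5, 191.1, 250.9, 276.4, 255.2, 251.1, 272.7, 282.8,
               293.7, 327.6, 383.7, 414.0, 455.3, 527.0, 529.4, 576.9, 612.6, 618.8,
               610.3, 640.4])
x3 = np.array([598.6, 610.0, 654.6, 684.9, 697.2, 670.2, 781.1, 829.7, 823.8, 838.1,
               782.1, 895.9, 883.6, 890.3, 918.8, 950.3, 980.7, 1029.1, 996.0, 997.5,
               945.4, 1033.5])
x4 = np.array([13.30, 13.42, 13.52, 13.53, 13.80, 14.04, 14.41, 15.46, 15.94, 16.65,
               17.14, 17.83, 18.20, 18.27, 19.20, 19.87, 20.31, 21.02, 21.69, 21.68,
               21.04, 21.48])
X = np.column_stack([x1, x2, x3, x4])

def vif(X, j):
    """VIF of predictor j: regress column j on the others, then 1/(1 - R^2)."""
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    coefs, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    resid = X[:, j] - others @ coefs
    r2 = 1 - resid.var() / X[:, j].var()
    return 1 / (1 - r2)

for j in range(4):
    print(f"VIF for X{j+1}: {vif(X, j):.1f}")
```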