Chapter 18: Multiple Regression

18.1 Introduction
• In this chapter we extend the simple linear regression model to allow for any number of independent variables.
• We expect to build a model that fits the data better than the simple linear regression model.
• We will use computer printout to
 – Assess the model
  • How well does it fit the data?
  • Is it useful?
  • Are any required conditions violated?
 – Employ the model
  • Interpreting the coefficients
  • Making predictions with the prediction equation
  • Estimating the expected value of the dependent variable

18.2 Model and Required Conditions
• We allow for k independent variables to potentially be related to the dependent variable:
 y = β0 + β1x1 + β2x2 + … + βkxk + ε
 where y is the dependent variable, x1, …, xk are the independent variables, β0, β1, …, βk are the coefficients, and ε is the random error variable.
• The simple linear regression model allows for one independent variable x: y = β0 + β1x + ε. Its graph is a straight line.
• The multiple linear regression model allows for more than one independent variable. With two independent variables, y = β0 + β1x1 + β2x2 + ε, the straight line becomes a plane over the (x1, x2) space; likewise, in y = β0 + β1x1² + β2x2 + ε a parabola becomes a parabolic surface.
• Required conditions for the error variable ε:
 – ε is normally distributed with mean zero and a constant standard deviation σε (independent of the value of y). σε is unknown.
 – The errors are independent.
• These conditions are required in order to
 – estimate the model coefficients,
 – assess the resulting model.

18.3 Estimating the Coefficients and Assessing the Model
• The procedure:
 – Obtain the model coefficients and statistics using statistical software.
 – Diagnose violations of the required conditions, and try to remedy any problems identified.
 – Assess the model's fit and usefulness using the model statistics.
 – If the model passes the assessment tests, use it to interpret the coefficients and generate predictions.
• A minimal sketch of fitting such a model in software appears below.
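The slides work from Excel printouts; as a rough equivalent, the sketch below fits a two-variable model of the form above in Python with statsmodels. The data are simulated, and all names and numbers here are illustrative only, not taken from the example that follows.

```python
# Minimal sketch: fitting y = b0 + b1*x1 + b2*x2 + e by least squares
# on simulated data (true coefficients chosen arbitrarily).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
x1 = rng.uniform(0.0, 10.0, n)
x2 = rng.uniform(0.0, 5.0, n)
e = rng.normal(0.0, 2.0, n)            # normal errors: mean 0, constant sd
y = 4.0 + 1.5 * x1 - 2.0 * x2 + e      # true b0 = 4, b1 = 1.5, b2 = -2

X = sm.add_constant(np.column_stack([x1, x2]))  # prepend the intercept column
model = sm.OLS(y, X).fit()
print(model.params)                    # estimates of b0, b1, b2
print(model.rsquared, model.rsquared_adj)
```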
Example 18.1: Where to locate a new motor inn?
– La Quinta Motor Inns is planning an expansion.
– Management wishes to predict which sites are likely to be profitable.
– Several areas in which predictors of profitability can be identified:
 • Competition
 • Market awareness
 • Demand generators
 • Demographics
 • Physical quality

 Area              Variable  Description
 Profitability     Margin    Operating margin
 Competition       Rooms     Number of hotel/motel rooms within 3 miles of the site
 Market awareness  Nearest   Distance to the nearest La Quinta inn
 Customers         Office    Amount of office space near the site
 Customers         College   College enrollment
 Community         Income    Median household income
 Physical          Disttwn   Distance to downtown

– Data were collected from 100 randomly selected La Quinta inns, and the following model was run:
 Margin = β0 + β1Rooms + β2Nearest + β3Office + β4College + β5Income + β6Disttwn + ε

 INN  MARGIN  ROOMS  NEAREST  OFFICE  COLLEGE  INCOME  DISTTWN
 1    55.5    3203   0.1      549     8        37      12.1
 2    33.8    2810   1.5      496     17.5     39      0.4
 3    49      2890   1.9      254     20       39      12.2
 4    31.9    3422   1        434     15.5     36      2.7
 5    57.4    2687   3.4      678     15.5     32      7.9
 6    49      3759   1.4      635     19       41      4

• Excel output

 SUMMARY OUTPUT
 Regression Statistics
 Multiple R         0.724611
 R Square           0.525062
 Adjusted R Square  0.49442
 Standard Error     5.512084
 Observations       100

 ANOVA
             df  SS        MS        F         Significance F
 Regression   6  3123.832  520.6387  17.13581  3.03E-13
 Residual    93  2825.626  30.38307
 Total       99  5949.458

            Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
 Intercept  72.45461      7.893104        9.179483  1.11E-14  56.78049   88.12874
 ROOMS      -0.00762      0.001255        -6.06871  2.77E-08  -0.01011   -0.00513
 NEAREST    -1.64624      0.632837        -2.60136  0.010803  -2.90292   -0.38955
 OFFICE     0.019766      0.00341         5.795594  9.24E-08  0.012993   0.026538
 COLLEGE    0.211783      0.133428        1.587246  0.115851  -0.05318   0.476744
 INCOME     -0.41312      0.139552        -2.96034  0.003899  -0.69025   -0.136
 DISTTWN    0.225258      0.178709        1.260475  0.210651  -0.12962   0.580138

• The sample regression equation (sometimes called the prediction equation) is
 MARGIN = 72.455 − 0.008ROOMS − 1.646NEAREST + 0.02OFFICE + 0.212COLLEGE − 0.413INCOME + 0.225DISTTWN
• Let us assess this equation.

• Standard error of estimate
 – We need to estimate the standard error of estimate:
  sε = √[SSE/(n − k − 1)]
 – Compare sε to the mean value of y.
  • From the printout, Standard Error = 5.5121.
  • The mean value of y is ȳ = 45.739.
 – It seems sε is not particularly small. Can we conclude that the model does not fit the data well? Not from this statistic alone; the assessment measures below are also needed.

• Coefficient of determination
 – By definition, R² = 1 − SSE/Σ(yi − ȳ)².
 – From the printout, R² = 0.5251: 52.51% of the variation in the measure of profitability is explained by the linear regression model formulated above.
 – When adjusted for degrees of freedom,
  Adjusted R² = 1 − [SSE/(n − k − 1)] / [SS(Total)/(n − 1)] = 49.44%.

• Testing the validity of the model
 – We pose the question: is there at least one independent variable linearly related to the dependent variable?
 – To answer it we test the hypotheses
  H0: β1 = β2 = … = βk = 0
  H1: at least one βi is not equal to zero.
 – If at least one βi is not equal to zero, the model is valid.
• To test these hypotheses we perform an analysis of variance (ANOVA) procedure.
• The F test
 – The variation in y partitions as SS(Total) = SSR + SSE. A large F results from a large SSR; then much of the variation in y is explained by the regression model, and the null hypothesis should be rejected: the model is valid.
 – The statistic is F = MSR/MSE, where MSR = SSR/k and MSE = SSE/(n − k − 1).
 – Rejection region: F > Fα,k,n−k−1. The required conditions must be satisfied.

• Example 18.1 continued. From the ANOVA table above, SSR = 3123.832 and SSE = 2825.626, so MSR = 520.6387, MSE = 30.38307, and F = MSR/MSE = 17.13581, with Significance F = 3.03382E-13. (The sketch below recomputes these summary statistics from the table.)
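As a check on the arithmetic, this sketch reproduces the printout's summary statistics (sε, R², adjusted R², F, and its p-value) from the ANOVA table entries alone; scipy supplies the F distribution. The numbers are copied from the printout above.

```python
# Sketch: recomputing the Example 18.1 summary statistics from the ANOVA table.
import math
from scipy import stats

n, k = 100, 6
SSR, SSE = 3123.832, 2825.626
SS_total = SSR + SSE                      # 5949.458 = SS(Total)

MSR = SSR / k                             # 520.6387
MSE = SSE / (n - k - 1)                   # 30.38307
s_e = math.sqrt(MSE)                      # standard error of estimate = 5.5121
R2 = 1 - SSE / SS_total                   # 0.5251
R2_adj = 1 - (SSE / (n - k - 1)) / (SS_total / (n - 1))   # 0.4944

F = MSR / MSE                             # 17.136
p = stats.f.sf(F, k, n - k - 1)           # "Significance F", about 3.03e-13
F_crit = stats.f.ppf(0.95, k, n - k - 1)  # about 2.2 (the slides' table value is 2.17)
print(s_e, R2, R2_adj, F, p, F_crit)
```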
 – The rejection region is F > Fα,k,n−k−1 = F0.05,6,100−6−1 = 2.17, and F = 17.14 > 2.17.
 – Also, the p-value (Significance F) = 3.03382×10⁻¹³. Clearly α = 0.05 > 3.03382×10⁻¹³, and the null hypothesis is rejected.
 – Conclusion: there is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. At least one of the βi is not equal to zero; thus, at least one independent variable is linearly related to y. This linear regression model is valid.

• Let us interpret the coefficients
 – b0 = 72.5. This is the intercept, the value of y when all the variables take the value zero. Since the data ranges of the independent variables do not cover the value zero, do not interpret the intercept.
 – b1 = −0.0076. In this model, for each additional 1000 rooms within 3 miles of the La Quinta inn, the operating margin decreases on average by 7.6% (assuming the other variables are held constant).
 – b2 = −1.65. For each additional mile between the nearest competitor and the La Quinta inn, the average operating margin decreases by 1.65%.
 – b3 = 0.02. For each additional 1000 sq-ft of office space, the average operating margin increases by 0.02%.
 – b4 = 0.21. For each additional thousand students, MARGIN increases on average by 0.21%.
 – b5 = −0.41. For each additional $1000 of median household income, MARGIN decreases on average by 0.41%.
 – b6 = 0.23. For each additional mile to the downtown center, MARGIN increases on average by 0.23%.

• Testing the coefficients
 – For each βi the hypotheses are
  H0: βi = 0
  H1: βi ≠ 0
 – Test statistic: t = (bi − βi)/sbi, where sbi is the standard error of bi, with d.f. = n − k − 1 = 93.
 – From the coefficients table in the printout above, ROOMS, NEAREST, OFFICE, and INCOME have p-values below 0.05, while COLLEGE (p = 0.1159) and DISTTWN (p = 0.2107) do not appear to be linearly related to the operating margin.

• Using the linear regression equation
 – The model can be used for
  • producing a prediction interval for a particular value of y, for a given set of values of the xi;
  • producing an interval estimate for the expected value of y, for a given set of values of the xi.
 – The model can also be used to learn about the relationships between the independent variables xi and the dependent variable y, by interpreting the coefficients bi.

• Example 18.1 continued: produce predictions
 – Predict the MARGIN of an inn at a site with the following characteristics:
  • 3815 rooms within 3 miles,
  • closest competitor 3.4 miles away,
  • 476,000 sq-ft of office space,
  • 24,500 college students,
  • $39,000 median household income,
  • 3.6 miles to the downtown center.
 – MARGIN = 72.455 − 0.008(3815) − 1.646(3.4) + 0.02(476) + 0.212(24.5) − 0.413(39) + 0.225(3.6) = 37.1%
 – (The sketch below reproduces this prediction with the full-precision coefficients.)
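A small sketch verifying the point prediction. Note that the quoted 37.1% follows from the full-precision coefficients in the printout; plugging the rounded coefficients shown in the prediction equation into the same arithmetic gives roughly 35.8.

```python
# Sketch: the point prediction for the proposed site, using the
# full-precision coefficients from the printout.
import numpy as np

# Order: intercept, ROOMS, NEAREST, OFFICE, COLLEGE, INCOME, DISTTWN.
b = np.array([72.45461, -0.00762, -1.64624, 0.019766, 0.211783, -0.41312, 0.225258])
x = np.array([1.0, 3815, 3.4, 476, 24.5, 39, 3.6])   # the 1.0 multiplies the intercept
margin_hat = b @ x
print(round(margin_hat, 1))   # 37.1 (operating margin, in %)
```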
18.4 Regression Diagnostics - II
• The required conditions for the model assessment to apply must be checked:
 – Is the error variable normally distributed? Draw a histogram of the residuals.
 – Is the error variance constant? Plot the residuals versus the predicted values ŷ.
 – Are the errors independent? Plot the residuals versus the time periods.
 – Can we identify outliers?
 – Is multicollinearity a problem?

• Example 18.2: House price and multicollinearity
 – A real estate agent believes that a house's selling price can be predicted using the house size, the number of bedrooms, and the lot size.
 – A random sample of 100 houses was drawn and the data recorded (first rows shown):

  Price   Bedrooms  H Size  Lot Size
  124100  3         1290    3900
  218300  4         2080    6600
  117800  3         1250    3750
  …       …         …       …

 – Analyze the relationship among the four variables.
• Solution
 – The proposed model is PRICE = β0 + β1BEDROOMS + β2H-SIZE + β3LOTSIZE + ε
 – Excel solution:

  Regression Statistics
  Multiple R         0.74833
  R Square           0.559998
  Adjusted R Square  0.546248
  Standard Error     25022.71
  Observations       100

  ANOVA
              df  SS        MS        F        Significance F
  Regression   3  7.65E+10  2.55E+10  40.7269  4.57E-17
  Residual    96  6.01E+10  6.26E+08
  Total       99  1.37E+11

             Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
  Intercept  37717.59      14176.74        2.660526  0.009145  9576.963   65858.23
  Bedrooms   2306.081      6994.192        0.329714  0.742335  -11577.3   16189.45
  H Size     74.29681      52.97858        1.402393  0.164023  -30.8649   179.4585
  Lot Size   -4.36378      17.024          -0.25633  0.798244  -38.1562   29.42862

 – The model is valid (Significance F = 4.57E-17), but no variable is significantly related to the selling price!
• However,
 – when the price is regressed on each independent variable alone, each variable is found to be strongly related to the selling price.
 – Multicollinearity is the source of this problem, as the correlation matrix shows:

            Price     Bedrooms  H Size    Lot Size
  Price     1
  Bedrooms  0.645411  1
  H Size    0.747762  0.846454  1
  Lot Size  0.740874  0.83743   0.993615  1

• Multicollinearity causes two kinds of difficulties:
 – The t statistics appear to be too small.
 – The b coefficients cannot be interpreted as "slopes".

• Remedying violations of the required conditions
 – Nonnormality or heteroscedasticity can be remedied using transformations of the y variable.
 – The transformations can also improve the linear relationship between the dependent variable and the independent variables.
 – Many statistical software packages make these transformations easy.
• A brief list of transformations:
 – y′ = log y (for y > 0): use when σε increases with y, or when the error distribution is positively skewed.
 – y′ = y²: use when σε² is proportional to E(y), or when the error distribution is negatively skewed.
 – y′ = y^(1/2) (for y > 0): use when σε² is proportional to E(y).
 – y′ = 1/y: use when σε² increases significantly when y increases beyond some value.

• Example 18.3: Analysis, diagnostics, transformations
 – A statistics professor wanted to know whether the time limit affects the marks on a quiz.
 – A random sample of 100 students was split into 5 groups.
 – Each student wrote a quiz, but each group was given a different time limit (marks by time limit; first two students per group shown):

  Time:   40  45  50  55  60
  Marks:  20  24  26  30  32
          23  26  25  32  31
          …   …   …   …   …

 – Analyze these results, and include diagnostics.
 – The model tested: MARK = β0 + β1TIME + ε

  SUMMARY OUTPUT
  Regression Statistics
  Multiple R         0.86254
  R Square           0.743974
  Adjusted R Square  0.741362
  Standard Error     2.304609
  Observations       100

  This model is useful and provides a good fit.

  ANOVA
              df  SS      MS        F         Significance F
  Regression   1  1512.5  1512.5    284.7743  9.42E-31
  Residual    98  520.5   5.311224
  Total       99  2033

             Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
  Intercept  -2.2          1.64582         -1.33672  0.184409  -5.46608   1.066077
  Time       0.55          0.032592        16.87526  9.42E-31  0.485322   0.614678

 – [Histogram of the residuals] The errors seem to be normally distributed.
 – [Standardized residuals vs. predicted mark] The standard error of estimate seems to increase with the predicted value of y.
 – Two transformations are used to remedy this problem: 1. y′ = loge y, and 2. y′ = 1/y. (A sketch of this diagnose-and-transform step follows.)
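A sketch of the diagnostic plot and the log remedy. The slides do not reproduce the 100 quiz marks, so the data here are simulated from the fitted log-model (LogMark = 2.1295 + 0.0217·Time) so that the variance grows with the mean, as in the example; the array names are placeholders.

```python
# Sketch: detect heteroscedasticity with a residual plot, then refit on log(y).
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
time = np.repeat([40.0, 45.0, 50.0, 55.0, 60.0], 20)   # 5 groups of 20 students
# Multiplicative errors around the fitted log-model, so the simulated
# marks show a spread that increases with the mean.
mark = np.exp(2.1295 + 0.0217 * time + rng.normal(0.0, 0.0844, 100))

X = sm.add_constant(time)
fit = sm.OLS(mark, X).fit()

# Residuals vs. predicted: a funnel widening to the right indicates that
# the error variance increases with y (heteroscedasticity).
plt.scatter(fit.fittedvalues, fit.resid)
plt.xlabel("predicted mark")
plt.ylabel("residual")
plt.show()

# Remedy: regress log(y) on Time instead.
log_fit = sm.OLS(np.log(mark), X).fit()
print(log_fit.params)   # close to (2.1295, 0.0217) by construction
```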
• Let us see what happens when a transformation is applied.
 – In the original data, "Mark" is a function of "Time"; in the modified data, LogMark is a function of "Time". For example, the point (Time = 40, Mark = 23) becomes (40, loge23 = 3.135), and (40, Mark = 18) becomes (40, loge18 = 2.89).
 – [Scatter plots of Mark vs. Time and of LogMark vs. Time]
• The new regression analysis and the diagnostics are:
 – The model tested: LOGMARK = β′0 + β′1TIME + ε′
 – Predicted LogMark = 2.1295 + 0.0217·Time

  SUMMARY OUTPUT
  Regression Statistics
  Multiple R         0.8783
  R Square           0.771412
  Adjusted R Square  0.769079
  Standard Error     0.084437
  Observations       100

  This model is useful and provides a good fit.

  ANOVA
              df  SS        MS        F         Significance F
  Regression   1  2.357901  2.357901  330.7181  3.58E-33
  Residual    98  0.698705  0.00713
  Total       99  3.056606

             Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
  Intercept  2.129582      0.0603          35.31632  1.51E-57  2.009918   2.249246
  Time       0.021716      0.001194        18.18566  3.58E-33  0.019346   0.024086

 – [Histogram of the residuals] The errors seem to be normally distributed.
 – [Standardized residuals vs. predicted LogMark] The standard error still changes with the predicted y, but the change is smaller than before.
• How do we use the modified model to predict?
 – Let TIME = 55 minutes. Then LogMark = 2.1295 + 0.0217(55) = 3.323.
 – To find the predicted mark, take the antilog: antiloge 3.323 = e^3.323 = 27.743.

18.5 Regression Diagnostics - III
• The Durbin-Watson test
 – This test detects first-order autocorrelation between consecutive residuals in a time series.
 – If autocorrelation exists, the error variables are not independent.
 – With ri denoting the residual at time i, the statistic is
  d = Σ(i=2..n) (ri − ri−1)² / Σ(i=1..n) ri²
 – The range of d is 0 ≤ d ≤ 4.
 – Positive first-order autocorrelation occurs when consecutive residuals tend to be similar; then the value of d is small (less than 2).
 – Negative first-order autocorrelation occurs when consecutive residuals tend to differ markedly; then the value of d is large (greater than 2).
• One-tail test for positive first-order autocorrelation:
 – If d < dL, there is enough evidence to conclude that positive first-order autocorrelation exists.
 – If d > dU, there is not enough evidence to conclude that positive first-order autocorrelation exists.
 – If d is between dL and dU, the test is inconclusive.
• One-tail test for negative first-order autocorrelation:
 – If d > 4 − dL, negative first-order autocorrelation exists.
 – If d < 4 − dU, negative first-order autocorrelation does not exist.
 – If d falls between 4 − dU and 4 − dL, the test is inconclusive.
• Two-tail test for first-order autocorrelation:
 – If d < dL or d > 4 − dL, first-order autocorrelation exists.
 – If d falls between dL and dU, or between 4 − dU and 4 − dL, the test is inconclusive.
 – If d falls between dU and 4 − dU, there is no evidence of first-order autocorrelation.
 – Summary on the d scale:
  0 …(exists)… dL …(inconclusive)… dU …(no evidence)… 4−dU …(inconclusive)… 4−dL …(exists)… 4
 – A small sketch computing d directly from its definition follows.
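The Durbin-Watson statistic is easy to compute from its definition; statsmodels also provides a ready-made version (statsmodels.stats.stattools.durbin_watson). The residual series below are simulated to illustrate the two regimes the slides describe.

```python
# Sketch: the Durbin-Watson statistic d, computed from its definition.
import numpy as np

def durbin_watson(r):
    """d = sum_{i=2..n} (r_i - r_{i-1})^2 / sum_{i=1..n} r_i^2, with 0 <= d <= 4."""
    r = np.asarray(r, dtype=float)
    return float(np.sum(np.diff(r) ** 2) / np.sum(r ** 2))

rng = np.random.default_rng(0)
noise = rng.normal(size=200)            # independent residuals
ar = np.empty(200)                      # positively autocorrelated residuals
ar[0] = noise[0]
for i in range(1, 200):
    ar[i] = 0.8 * ar[i - 1] + noise[i]  # consecutive values tend to be similar

print(durbin_watson(noise))             # close to 2
print(durbin_watson(ar))                # well below 2: positive autocorrelation
```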
• Example 18.4
 – How does the weather affect the sales of lift tickets at a ski resort?
 – Data on the past 20 years' ticket sales, along with the total snowfall and the average temperature during Christmas week in each year, were collected.
 – The model hypothesized: TICKETS = β0 + β1SNOWFALL + β2TEMPERATURE + ε
 – Regression analysis yielded the following results:

  SUMMARY OUTPUT
  Regression Statistics
  Multiple R         0.3464529
  R Square           0.1200296
  Adjusted R Square  0.0165037
  Standard Error     1711.6764
  Observations       20

  ANOVA
               df  SS         MS         F       Significance F
  Regression    2  6793798.2  3396899.1  1.1594  0.3372706
  Residual     17  49807214   2929836.1
  Total        19  56601012

               Coefficients  Standard Error  t Stat     P-value  Lower 95%  Upper 95%
  Intercept    8308.0114     903.7285        9.1930391  5E-08    6401.3083  10214.715
  Snowfall     74.593249     51.574829       1.4463111  0.1663   -34.22028  183.40678
  Temperature  -8.753738     19.704359       -0.444254  0.6625   -50.32636  32.818884

• The model seems to be very poor:
 – The fit is very low (R² = 0.12).
 – It is not valid (Significance F = 0.33).
 – No variable is linearly related to sales.
• Diagnosis of the required conditions produced the following findings:
 – [Histogram of the errors] The errors may be normally distributed.
 – [Residuals vs. predicted y] The error variance is constant.
 – [Residuals over time] The errors are not independent.
• Test for positive first-order autocorrelation:
 – n = 20, k = 2. From the Durbin-Watson table: dL = 1.10, dU = 1.54.
 – The statistic is d = 0.59.
 – Conclusion: because d < dL, there is sufficient evidence to infer that positive first-order autocorrelation exists.
• Using the computer (Excel):
 – Tools > Data Analysis > Regression (check the residual option, then OK).
 – Tools > Data Analysis Plus > Durbin-Watson Statistic > highlight the range of the residuals from the regression run > OK.
 – The Durbin-Watson statistic is d = 0.5931. (The first few residuals: -2793.99, -1723.23, -2342.03, -956.955, -1963.73, …)
• The modified regression model
 – The autocorrelation has occurred over time; therefore, adding a time-dependent variable to the model may correct the problem:
  TICKETS = β0 + β1SNOWFALL + β2TEMPERATURE + β3YEARS + ε
• For this model (a sketch of such a refit follows):
 – All the required conditions are met.
 – The fit of this model is high: R² = 0.74.
 – The model is useful: Significance F = 5.93E-5.
 – SNOWFALL and YEARS are linearly related to ticket sales.
 – TEMPERATURE is not linearly related to ticket sales.
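A sketch of the remedy: add a time-trend variable and refit. The resort's raw data are not reproduced in the slides, so the arrays below are hypothetical stand-ins for the 20 yearly observations, simulated with an upward trend so the mechanics can be shown end to end.

```python
# Sketch: refitting with a time trend, TICKETS = b0 + b1*SNOWFALL
# + b2*TEMPERATURE + b3*YEARS + e, then rechecking Durbin-Watson.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
years = np.arange(1.0, 21.0)                     # time-trend variable, 20 years
snowfall = rng.uniform(100.0, 300.0, 20)         # hypothetical snowfall totals
temperature = rng.uniform(-10.0, 5.0, 20)        # hypothetical Christmas-week temps
tickets = 5000 + 20 * snowfall + 150 * years + rng.normal(0.0, 400.0, 20)

X = sm.add_constant(np.column_stack([snowfall, temperature, years]))
fit = sm.OLS(tickets, X).fit()
print(fit.rsquared, fit.f_pvalue)   # overall fit and validity
print(durbin_watson(fit.resid))     # with the trend included, d sits near 2 here
```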