Slide 1: Chapter 16 - Regression Analysis: Model Building
Slides prepared by John S. Loucks, St. Edward's University
© 2003 South-Western/Thomson Learning™

Slide 2: Chapter 16 Overview
- General Linear Model
- Determining When to Add or Delete Variables
- Analysis of a Larger Problem
- Variable-Selection Procedures
- Residual Analysis
- Multiple Regression Approach to Analysis of Variance and Experimental Design

Slide 3: General Linear Model
Models in which the parameters ($\beta_0, \beta_1, \ldots, \beta_p$) all have exponents of one are called linear models.
- First-order model with one predictor variable:
  $y = \beta_0 + \beta_1 x_1 + \varepsilon$
- Second-order model with one predictor variable:
  $y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \varepsilon$
- Second-order model with two predictor variables, with interaction:
  $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2 + \varepsilon$

Slide 4: General Linear Model
Often the problem of nonconstant variance can be corrected by transforming the dependent variable to a different scale.
- Logarithmic transformations: most statistical packages can apply logarithmic transformations using either base 10 (common log) or base e = 2.71828... (natural log).
- Reciprocal transformation: use 1/y as the dependent variable instead of y.

Slide 5: General Linear Model
Models in which the parameters ($\beta_0, \beta_1, \ldots, \beta_p$) have exponents other than one are called nonlinear models. In some cases we can perform a transformation of variables that enables us to use regression analysis with the general linear model.
- Exponential model: the exponential model involves the regression equation
  $E(y) = \beta_0 \beta_1^{x}$
  We can transform this nonlinear model to a linear model by taking the logarithm of both sides.

Slide 6: Variable Selection Procedures
- Stepwise Regression, Forward Selection, Backward Elimination: iterative procedures in which one independent variable at a time is added or deleted based on the F statistic.
- Best-Subsets Regression: different subsets of the independent variables are evaluated.

Slide 7: Variable Selection Procedures - F Test
To test whether the addition of $x_2$ to a model involving $x_1$ (or the deletion of $x_2$ from a model involving $x_1$ and $x_2$) is statistically significant:
$$F = \frac{[\text{SSE(reduced)} - \text{SSE(full)}]/\text{number of extra terms}}{\text{MSE(full)}}$$
$$F = \frac{[\text{SSE}(x_1) - \text{SSE}(x_1, x_2)]/1}{\text{SSE}(x_1, x_2)/(n - p - 1)}$$
The p-value corresponding to the F statistic is the criterion used to determine whether a variable should be added or deleted.

Slide 8: Stepwise Regression (flowchart)
1. Start.
2. Compute the F statistic and p-value for each independent variable in the model.
3. Is any p-value greater than alpha to remove? If yes, remove the independent variable with the largest p-value and return to step 2.
4. If no, compute the F statistic and p-value for each independent variable not in the model.
5. Is any p-value less than alpha to enter? If yes, enter the independent variable with the smallest p-value and return to step 2.
6. If no, stop.

Slide 9: Forward Selection
This procedure is similar to stepwise regression but does not permit a variable to be deleted. The forward selection procedure starts with no independent variables. It adds variables one at a time as long as a significant reduction in the error sum of squares (SSE) can be achieved.

Slide 10: Forward Selection (flowchart)
1. Start with no independent variables in the model.
2. Compute the F statistic and p-value for each independent variable not in the model.
3. Is any p-value less than alpha to enter? If yes, enter the independent variable with the smallest p-value and return to step 2.
4. If no, stop.
(A computational sketch of this procedure follows below.)
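The partial F test on Slide 7 and the forward-selection loop on Slide 10 can be made concrete with a short sketch. The code below is an illustration added here, not part of the original slides or of any particular package: it assumes NumPy arrays y and X of candidate predictors, uses scipy.stats.f for the p-values, and the names sse, forward_select, and alpha_enter are placeholders chosen for this example.

```python
import numpy as np
from scipy import stats

def sse(y, X):
    """Sum of squared errors from an OLS fit of y on X (intercept included)."""
    Xc = np.column_stack([np.ones(len(y)), X]) if X is not None else np.ones((len(y), 1))
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    return float(resid @ resid)

def forward_select(y, X, alpha_enter=0.05):
    """Forward selection: add one variable at a time while its partial F test
    has a p-value below alpha_enter (Slide 10 flowchart)."""
    n, k = X.shape
    in_model, out_model = [], list(range(k))
    while out_model:
        sse_reduced = sse(y, X[:, in_model] if in_model else None)
        best = None
        for j in out_model:
            cols = in_model + [j]
            sse_full = sse(y, X[:, cols])
            df_error = n - len(cols) - 1                    # n - p - 1
            F = (sse_reduced - sse_full) / (sse_full / df_error)
            p = stats.f.sf(F, 1, df_error)
            if best is None or p < best[1]:
                best = (j, p)
        if best[1] < alpha_enter:                           # enter the smallest p-value
            in_model.append(best[0])
            out_model.remove(best[0])
        else:
            break                                           # no candidate qualifies: stop
    return in_model
```

Stepwise regression and backward elimination differ only in also testing, at each pass, whether any variable already in the model has a p-value above an alpha-to-remove threshold.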
Slide 11: Backward Elimination
This procedure begins with a model that includes all the independent variables the modeler wants considered. It then attempts to delete one variable at a time by determining whether the least significant variable currently in the model can be removed because its p-value is greater than the user-specified or default alpha value. Once a variable has been removed from the model, it cannot reenter at a subsequent step.

Slide 12: Backward Elimination (flowchart)
1. Start with all independent variables in the model.
2. Compute the F statistic and p-value for each independent variable in the model.
3. Is any p-value greater than alpha to remove? If yes, remove the independent variable with the largest p-value and return to step 2.
4. If no, stop.

Slide 13: Example: Clarksville Homes
Tony Zamora, a real estate investor, has just moved to Clarksville and wants to learn about the city's residential real estate market. Tony has randomly selected 25 house-for-sale listings from the Sunday newspaper and collected the data listed on the next three slides. Develop, using the backward elimination procedure, a multiple regression model to predict the selling price of a house in Clarksville.

Slide 14: Using Excel to Perform the Backward Elimination Procedure
Worksheet (showing partial data; rows 10-26 are not shown)

  Segment    Selling Price  House Size    Number of  Number of  Garage Size
  of City    ($000)         (00 sq. ft.)  Bedrms.    Bathrms.   (cars)
  Northwest  290            21            4          2          2
  South       95            11            2          1          0
  Northeast  170            19            3          2          2
  Northwest  375            38            5          4          3
  West       350            24            4          3          2
  South      125            10            2          2          0
  West       310            31            4          4          2
  West       275            25            3          2          2

Slide 15: Using Excel to Perform the Backward Elimination Procedure
Worksheet (showing partial data; rows 2-9 are hidden and rows 18-26 are not shown)

  Segment    Selling Price  House Size    Number of  Number of  Garage Size
  of City    ($000)         (00 sq. ft.)  Bedrms.    Bathrms.   (cars)
  Northwest  340            27            5          3          3
  Northeast  215            22            4          3          2
  Northwest  295            20            4          3          2
  South      190            24            4          3          2
  Northwest  385            36            5          4          3
  West       430            32            5          4          2
  South      185            14            3          2          1
  South      175            18            4          2          2

Slide 16: Using Excel to Perform the Backward Elimination Procedure
Worksheet (showing partial data; rows 2-17 are hidden)

  Segment    Selling Price  House Size    Number of  Number of  Garage Size
  of City    ($000)         (00 sq. ft.)  Bedrms.    Bathrms.   (cars)
  Northeast  190            19            4          2          2
  Northwest  330            29            4          4          3
  West       405            33            5          4          3
  Northeast  170            23            4          2          2
  West       365            34            5          4          3
  Northwest  280            25            4          2          2
  South      135            17            3          1          1
  Northeast  205            21            4          3          2
  West       260            26            4          3          2
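Before stepping through the Excel output on the following slides, here is a minimal sketch of the backward-elimination loop of Slide 12 applied to data like the listings above. It is an added illustration, not the Excel procedure shown in the slides: it assumes pandas and statsmodels, uses only the first eight listings from Slide 14 for brevity, and the names homes, backward_eliminate, and alpha_remove are placeholders. For a single coefficient the OLS t-test p-value equals the partial F-test p-value (t² = F), so dropping the largest p-value matches the slides' criterion.

```python
import pandas as pd
import statsmodels.api as sm

# First eight listings from Slide 14; the full example uses all 25 listings.
homes = pd.DataFrame({
    "Price":     [290, 95, 170, 375, 350, 125, 310, 275],   # selling price ($000)
    "HouseSize": [21, 11, 19, 38, 24, 10, 31, 25],          # hundreds of sq. ft.
    "Bedrooms":  [4, 2, 3, 5, 4, 2, 4, 3],
    "Bathrooms": [2, 1, 2, 4, 3, 2, 4, 2],
    "Cars":      [2, 0, 2, 3, 2, 0, 2, 2],                  # garage size
})

def backward_eliminate(df, response, alpha_remove=0.05):
    """Refit repeatedly, dropping the predictor with the largest p-value,
    until every remaining p-value is below alpha_remove (Slide 12 flowchart)."""
    predictors = [c for c in df.columns if c != response]
    while predictors:
        X = sm.add_constant(df[predictors])
        fit = sm.OLS(df[response], X).fit()
        pvals = fit.pvalues.drop("const")        # ignore the intercept
        worst = pvals.idxmax()
        if pvals[worst] > alpha_remove:
            predictors.remove(worst)             # remove and refit
        else:
            return fit                           # all remaining predictors significant
    return None

result = backward_eliminate(homes, "Price")
```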
Slide 17: Using Excel to Perform the Backward Elimination Procedure
Value Worksheet (partial): SUMMARY OUTPUT - Regression Statistics

  Multiple R          0.898964443
  R Square            0.80813707
  Adjusted R Square   0.769764484
  Standard Error      45.87155025
  Observations        25

Slide 18: Using Excel to Perform the Backward Elimination Procedure
Value Worksheet (partial): ANOVA

               df   SS         MS         F          Significance F
  Regression    4   177260     44315      21.06027   6.1385E-07
  Residual     20   42083.98   2104.199
  Total        24   219344

Slide 19: Using Excel to Perform the Backward Elimination Procedure
Value Worksheet (partial): Coefficients

               Coeffic.   Std. Err.   t Stat    P-value
  Intercept    -59.416    54.6072     -1.0881   0.28951
  House Size     6.50587   3.24687     2.0037   0.05883
  Bedrooms      29.1013   26.2148      1.1101   0.28012
  Bathrooms     26.4004   18.8077      1.4037   0.17574
  Cars         -10.803    27.329      -0.3953   0.6968

Slide 20: Using Excel to Perform the Backward Elimination Procedure
- Cars (garage size) is the independent variable with the highest p-value (.697) > .05.
- Cars is removed from the model.
- Multiple regression is performed again on the remaining independent variables.

Slide 21: Using Excel to Perform the Backward Elimination Procedure
Value Worksheet (partial): SUMMARY OUTPUT - Regression Statistics

  Multiple R          0.898130279
  R Square            0.806637998
  Adjusted R Square   0.779014855
  Standard Error      44.94059302
  Observations        25

Slide 22: Using Excel to Perform the Backward Elimination Procedure
Value Worksheet (partial): ANOVA (three remaining independent variables; values derived from the regression statistics above)

               df   SS         MS         F       Significance F
  Regression    3   176931.2   58977.1    29.20   1.1E-07
  Residual     21   42412.8    2019.7
  Total        24   219344

Slide 23: Using Excel to Perform the Backward Elimination Procedure
Value Worksheet (partial): Coefficients

               Coeffic.   Std. Err.   t Stat    P-value
  Intercept    -47.342    44.3467     -1.0675   0.29785
  House Size     6.02021   2.94446     2.0446   0.05363
  Bedrooms      23.0353   20.8229      1.1062   0.28113
  Bathrooms     27.0286   18.3601      1.4721   0.15581

Slide 24: Using Excel to Perform the Backward Elimination Procedure
- Bedrooms is the independent variable with the highest p-value (.281) > .05.
- Bedrooms is removed from the model.
- Multiple regression is performed again on the remaining independent variables.

Slide 25: Using Excel to Perform the Backward Elimination Procedure
Value Worksheet (partial): SUMMARY OUTPUT - Regression Statistics

  Multiple R          0.891835053
  R Square            0.795369762
  Adjusted R Square   0.776767013
  Standard Error      45.1685807
  Observations        25

Slide 26: Using Excel to Perform the Backward Elimination Procedure
Value Worksheet (partial): ANOVA

               df   SS          MS          F         Significance F
  Regression    2   174459.6    87229.79    42.7555   2.63432E-08
  Residual     22   44884.42    2040.201
  Total        24   219344

Slide 27: Using Excel to Perform the Backward Elimination Procedure
Value Worksheet (partial): Coefficients

               Coeffic.   Std. Err.   t Stat    P-value
  Intercept    -12.349    31.2392     -0.3953   0.69642
  House Size     7.94652   2.38644     3.3299   0.00304
  Bathrooms     30.3444   18.2056      1.6668   0.10974
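The partial F test from Slide 7 can be checked against the output above. The snippet below is an added back-of-the-envelope check, not part of the slides: it uses the SSE values from Slides 18 and 26 to test whether Bedrooms and Cars can be dropped together from the four-variable model, and assumes scipy for the F distribution.

```python
from scipy import stats

# From Slide 18 (full model: House Size, Bedrooms, Bathrooms, Cars)
sse_full, df_full = 42083.98, 20          # MSE(full) = 2104.199
# From Slide 26 (reduced model: House Size, Bathrooms)
sse_reduced = 44884.42

extra_terms = 2                           # Bedrooms and Cars were dropped
F = ((sse_reduced - sse_full) / extra_terms) / (sse_full / df_full)
p_value = stats.f.sf(F, extra_terms, df_full)

print(f"F = {F:.3f}, p-value = {p_value:.3f}")  # F ~ 0.665, p ~ 0.53: dropping both is justified
```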
Slide 28: Using Excel to Perform the Backward Elimination Procedure
- Bathrooms is the independent variable with the highest p-value (.110) > .05.
- Bathrooms is removed from the model.
- Regression is performed again on the remaining independent variable.

Slide 29: Using Excel to Perform the Backward Elimination Procedure
Value Worksheet (partial): SUMMARY OUTPUT - Regression Statistics

  Multiple R          0.877228487
  R Square            0.769529819
  Adjusted R Square   0.759509376
  Standard Error      46.88202186
  Observations        25

Slide 30: Using Excel to Perform the Backward Elimination Procedure
Value Worksheet (partial): ANOVA

               df   SS         MS          F          Significance F
  Regression    1   168791.7   168791.7    76.79599   8.67454E-09
  Residual     23   50552.25   2197.924
  Total        24   219344

Slide 31: Using Excel to Perform the Backward Elimination Procedure
Value Worksheet (partial): Coefficients

               Coeffic.   Std. Err.   t Stat    P-value
  Intercept    -9.8669    32.3874     -0.3047   0.76337
  House Size   11.3383     1.29384     8.7633   8.7E-09

Slide 32: Using Excel to Perform the Backward Elimination Procedure
- House size is the only independent variable remaining in the model.
- The estimated regression equation is $\hat{y} = -9.8669 + 11.3383(\text{House Size})$.
- The Adjusted R Square value is .760.

Slide 33: Variable-Selection Procedures - Best-Subsets Regression
- The three preceding procedures are one-variable-at-a-time methods offering no guarantee that the best model for a given number of variables will be found.
- Some statistical software packages include best-subsets regression, which enables the user to find, for a specified number of independent variables, the best regression model.
- Typical output identifies the two best one-variable estimated regression equations, the two best two-variable equations, and so on. (A sketch of the idea follows below.)
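A brute-force sketch of the best-subsets idea is shown below. It is an added illustration, not the search algorithm used by statistical packages (which use smarter searches and also report measures such as C-p and s, as in the output a few slides ahead). It assumes NumPy arrays y and X with a parallel list of variable names; adjusted_r2 and best_subsets are placeholder names.

```python
from itertools import combinations
import numpy as np

def adjusted_r2(y, X):
    """Adjusted R-square from an OLS fit of y on X (intercept added here)."""
    n, p = X.shape
    Xc = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    sse = float(np.sum((y - Xc @ beta) ** 2))
    sst = float(np.sum((y - y.mean()) ** 2))
    return 1 - (sse / (n - p - 1)) / (sst / (n - 1))

def best_subsets(y, X, names, top=2):
    """Fit every subset of each size and report the `top` models per size,
    ranked by adjusted R-square."""
    results = {}
    for size in range(1, X.shape[1] + 1):
        scored = []
        for cols in combinations(range(X.shape[1]), size):
            score = adjusted_r2(y, X[:, list(cols)])
            scored.append(([names[c] for c in cols], score))
        scored.sort(key=lambda item: item[1], reverse=True)
        results[size] = scored[:top]
    return results
```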
Slide 34: Example: PGA Tour Data
The Professional Golfers Association keeps a variety of statistics regarding performance measures. Data include the average driving distance, percentage of drives that land in the fairway, percentage of greens hit in regulation, average number of putts, percentage of sand saves, and average score. The variable names and definitions are shown on the next slide.

Slide 35: Example: PGA Tour Data - Variable Names and Definitions
- Drive: average length of a drive in yards
- Fair: percentage of drives that land in the fairway
- Green: percentage of greens hit in regulation (a par-3 green is "hit in regulation" if the player's first shot lands on the green)
- Putt: average number of putts for greens that have been hit in regulation
- Sand: percentage of sand saves (landing in a sand trap and still scoring par or better)
- Score: average score for an 18-hole round

Slide 36: Example: PGA Tour Data - Sample Data
  Drive   Fair   Green   Putt    Sand   Score
  277.6   .681   .667    1.768   .550   69.10
  259.6   .691   .665    1.810   .536   71.09
  269.1   .657   .649    1.747   .472   70.12
  267.0   .689   .673    1.763   .672   69.88
  267.3   .581   .637    1.781   .521   70.71
  255.6   .778   .674    1.791   .455   69.76
  272.9   .615   .667    1.780   .476   70.19
  265.4   .718   .699    1.790   .551   69.73

Slide 37: Example: PGA Tour Data - Sample Data (continued)
  Drive   Fair   Green   Putt    Sand   Score
  272.6   .660   .672    1.803   .431   69.97
  263.9   .668   .669    1.774   .493   70.33
  267.0   .686   .687    1.809   .492   70.32
  266.0   .681   .670    1.765   .599   70.09
  258.1   .695   .641    1.784   .500   70.46
  255.6   .792   .672    1.752   .603   69.49
  261.3   .740   .702    1.813   .529   69.88
  262.2   .721   .662    1.754   .576   70.27

Slide 38: Example: PGA Tour Data - Sample Data (continued)
  Drive   Fair   Green   Putt    Sand   Score
  260.5   .703   .623    1.782   .567   70.72
  271.3   .671   .666    1.783   .492   70.30
  263.3   .714   .687    1.796   .468   69.91
  276.6   .634   .643    1.776   .541   70.69
  252.1   .726   .639    1.788   .493   70.59
  263.0   .687   .675    1.786   .486   70.20
  263.0   .639   .647    1.760   .374   70.81
  253.5   .732   .693    1.797   .518   70.26
  266.2   .681   .657    1.812   .472   70.96

Slide 39: Example: PGA Tour Data - Sample Correlation Coefficients
           Score   Drive   Fair    Green   Putt
  Drive    -.154
  Fair     -.427   -.679
  Green    -.556   -.045    .421
  Putt      .258   -.139    .101    .354
  Sand     -.278   -.024    .265    .083   -.296

Slide 40: Example: PGA Tour Data - Best Subsets Regression of SCORE
  Vars   R-sq   R-sq(a)   C-p    s
   1     30.9    27.9     26.9   .39685
   1     18.2    14.6     35.7   .43183
   2     54.7    50.5     12.4   .32872
   2     54.6    50.5     12.5   .32891
   3     60.7    55.1     10.2   .31318
   3     59.1    53.3     11.4   .31957
   4     72.2    66.8      4.2   .26913
   4     60.9    53.1     12.1   .32011
   5     72.6    65.4      6.0   .27499
(The output also marks, with an X under the columns D, F, G, P, and S, which of Drive, Fair, Green, Putt, and Sand are included in each model; the best four-variable model, with s = .26913, contains Drive, Fair, Green, and Putt.)

Slide 41: Example: PGA Tour Data
The regression equation:
Score = 74.678 - .0398(Drive) - 6.686(Fair) - 10.342(Green) + 9.858(Putt)

  Predictor   Coef      Stdev     t-ratio   p
  Constant    74.678    6.952     10.74     .000
  Drive       -.0398     .01235   -3.22     .004
  Fair        -6.686    1.939     -3.45     .003
  Green      -10.342    3.561     -2.90     .009
  Putt         9.858    3.180      3.10     .006

  s = .2691   R-sq = 72.4%   R-sq(adj) = 66.8%

Slide 42: Example: PGA Tour Data - Analysis of Variance
  SOURCE        DF   SS        MS       F       P
  Regression     4   3.79469   .94867   13.10   .000
  Error         20   1.44865   .07243
  Total         24   5.24334

Slide 43: Residual Analysis: Autocorrelation
Durbin-Watson Test for Autocorrelation
- Statistic:
  $$d = \frac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$$
- The statistic ranges in value from zero to four.
- If successive values of the residuals are close together (positive autocorrelation), the statistic will be small.
- If successive values are far apart (negative autocorrelation), the statistic will be large.
- A value of two indicates no autocorrelation.
(A computational sketch of the statistic follows the final slide.)

Slide 44: End of Chapter 16
© 2003 South-Western/Thomson Learning™
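For reference, the Durbin-Watson statistic on Slide 43 can be computed directly from a model's residuals, ordered in time. This sketch is an addition to the deck, with a hypothetical residual series used only for illustration; statsmodels provides an equivalent durbin_watson function.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic d from Slide 43: values near 2 suggest no
    autocorrelation, small values positive, large values negative autocorrelation."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

# Hypothetical residuals, ordered in time, for illustration only.
print(durbin_watson([1.2, 0.8, 0.9, -0.4, -1.1, -0.6, 0.3, 1.0]))
```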