Regression Model Building
LPGA Golf Performance - 2008

Data Description

• Response: log(Prize Winnings/Round) (prize winnings are skewed)
• Potential Predictors:
    Average Drive Distance
    Percentage of Drives Reaching Fairway
    Percentage of Greens Reached in Regulation
    Average Putts per Hole
    Average Number of Sand Traps Hit per Round (Sandshot)
    Percentage of Sand Saves
• Samples:
    Training Sample: 100 Randomly Sampled Golfers
    Validation Sample: 57 Remaining Golfers, used to assess fit

Modeling Strategies

• Select Training Sample
• Select "best" subset of predictors based on Backward Elimination, Forward Selection, Stepwise Regression and/or All Possible Regressions, based on minimizing
    $AIC = n \log(SSE_{Model}) - n \log(n) + 2p'$, where $p'$ = # parameters in model
• Identify any Influential Observations (based on Outliers, Leverage Values, DFFITS, DFBETAS, Cook's D)
• Test Model Assumptions: Normality (Shapiro-Wilk), Constant Variance (Brown-Forsythe and Breusch-Pagan)
• Determine Validity of model by obtaining prediction errors for validation sample

Top of Entire Sample (First 20 Golfers)

    golfer                 drive  fairway  green  putts  sandshot  sandsave    prz  logprz
    Ahn, Shi Hyun          249.4     64.6   61.2  27.44        55      34.5   6063  8.7099
    Alfredsson, Helen      253.8     62.7   68.2  29.36        49      38.8  19343  9.8701
    Ammaccapane, Dina      246.3     70.2   64.6  30.20        37      40.5   1873  7.5353
    Bader, Beth            249.1     64.1   61.2  29.78        73      41.1   1212  7.1004
    Bae, Kyeong            244.0     62.4   60.7  28.38        66      43.9   2555  7.8459
    Baena, Marisa          254.2     64.7   60.9  29.21        66      33.3   2282  7.7327
    Bastel, Emily          237.4     73.6   60.5  30.60        59      28.8    921  6.8258
    Blasberg, Erica        245.4     69.2   63.2  28.68        44      27.3   1923  7.5614
    Blomqvist, Minea       253.2     62.6   59.7  27.35        70      44.3   6726  8.8137
    Bowie Young, Heather   251.0     67.4   63.0  28.83        58      34.5   2689  7.8969
    Bunch, Ashli           246.6     70.1   64.7  31.36        35      42.9   1281  7.1551
    Burks, Audra           239.2     68.6   60.5  30.11        28      39.3   1460  7.2863
    Burton, Brandie        244.2     65.5   67.3  30.62        47      27.7   1668  7.4193
    Castrale, Nicole       245.2     71.3   67.1  28.92        61      27.9   7209  8.8830
    Cavalleri, Silvia      240.7     69.1   59.6  30.08        53      35.8   1947  7.5742
    Cho, Irene             243.5     70.2   63.3  29.29        49      42.9   3214  8.0754
    Choi, Hye Jung         242.5     69.3   60.9  27.78        73      37.0   3470  8.1518
    Choi, Na Yeon          257.4     68.5   68.5  28.43        55      45.5  14808  9.6029
    Chung, Ilmi            242.6     64.6   63.0  28.54        75      29.3   2827  7.9470
    Coutu, Taylor          241.0     70.0   63.0  30.13        37      48.6   2252  7.7194

Backward Elimination (RSS = SSE)

Step 1: Start: AIC=-200.22
logprz ~ drive + fairway + green + putts + sandshot + sandsave

                Df  Sum of Sq     RSS       AIC
    - fairway    1      0.010  11.750  -202.132
    <none>                     11.740  -200.216
    - drive      1      0.397  12.138  -198.887
    - sandsave   1      0.405  12.145  -198.827
    - sandshot   1      1.030  12.770  -193.806
    - green      1     24.960  36.700   -88.238
    - putts      1     35.360  47.100   -63.289

Step 2: AIC=-202.13
logprz ~ drive + green + putts + sandshot + sandsave

                Df  Sum of Sq     RSS       AIC
    <none>                     11.750  -202.132
    - sandsave   1      0.400  12.150  -200.784
    - drive      1      0.537  12.287  -199.665
    - sandshot   1      1.034  12.784  -195.698
    - green      1     32.091  43.841   -72.461
    - putts      1     35.688  47.438   -64.575

• At Step 1, fairway is eliminated because dropping it gives the smallest AIC (-202.132 < -200.216)
• At Step 2, no other variables are removed (no candidate model has AIC < -202.132)

Forward Selection (RSS = SSE)

Step 1: Start: AIC=-6.61
logprz ~ 1

                Df  Sum of Sq     RSS      AIC
    + green      1     38.599  53.150  -59.206
    + putts      1     33.043  58.706  -49.263
    + drive      1     11.622  80.126  -18.156
    + sandshot   1      8.951  82.798  -14.876
    + sandsave   1      3.118  88.631   -8.069
    <none>                     91.749   -6.611
    + fairway    1      0.409  91.340   -5.058

Step 2: AIC=-59.21
logprz ~ green

                Df  Sum of Sq     RSS       AIC
    + putts      1     39.514  13.636  -193.246
    + sandsave   1      4.859  48.291   -66.793
    <none>                     53.150   -59.206
    + fairway    1      0.635  52.514   -58.408
    + drive      1      0.361  52.788   -57.888
    + sandshot   1      0.004  53.146   -57.214

Step 3: AIC=-193.25
logprz ~ green + putts

                Df  Sum of Sq     RSS      AIC
    + sandshot   1    0.73688  12.899  -196.80
    + sandsave   1    0.66486  12.971  -196.25
    + drive      1    0.31495  13.321  -193.58
    <none>                     13.636  -193.25
    + fairway    1    0.09401  13.542  -191.94

Step 4: AIC=-196.8
logprz ~ green + putts + sandshot

                Df  Sum of Sq     RSS      AIC
    + drive      1    0.74905  12.150  -200.78
    + sandsave   1    0.61234  12.287  -199.66
    <none>                     12.899  -196.80
    + fairway    1    0.25056  12.649  -196.76

Step 5: AIC=-200.78
logprz ~ green + putts + sandshot + drive

                Df  Sum of Sq     RSS      AIC
    + sandsave   1    0.40005  11.750  -202.13
    <none>                     12.150  -200.78
    + fairway    1    0.00524  12.145  -198.83

Step 6: AIC=-202.13
logprz ~ green + putts + sandshot + drive + sandsave

                Df  Sum of Sq    RSS      AIC
    <none>                     11.75  -202.13
    + fairway    1  0.0099086  11.74  -200.22

Model: green, putts, sandshot, sandsave, drive

    Call:
    lm(formula = logprz ~ green + putts + sandshot + sandsave + drive, data = lpga.cv.in)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.72852 -0.20634  0.01067  0.22439  0.72316

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 14.272879   1.580975   9.028 2.14e-14 ***
    green        0.210379   0.013130  16.023  < 2e-16 ***
    putts       -0.625367   0.037011 -16.897  < 2e-16 ***
    sandshot     0.790771   0.274937   2.876  0.00498 **
    sandsave     0.008334   0.004658   1.789  0.07684 .
    drive       -0.009563   0.004615  -2.072  0.04098 *

    Residual standard error: 0.3536 on 94 degrees of freedom
    Multiple R-squared: 0.8719, Adjusted R-squared: 0.8651
    F-statistic: 128 on 5 and 94 DF, p-value: < 2.2e-16

$\hat{Y} = 14.2729 + 0.2104\,green - 0.6254\,putts + 0.7908\,sandshot + 0.0083\,sandsave - 0.0096\,drive$
$s = 0.354 \qquad R^2 = 0.8719$

Influence Measures (n=100, p'=6)

• Studentized Residuals: Outlier if $|r_i| > t_{\alpha/(2n),\, n-p'-1} = t_{.05/(2(100)),\, 93} = 3.607$
• Leverage Values: Potentially highly influential if $h_i > \frac{2p'}{n} = \frac{12}{100} = 0.12$
• DFFITS: Highly influential with respect to own fitted value if $|DFFITS_i| > 2\sqrt{\frac{p'}{n}} = 2\sqrt{\frac{6}{100}} = 0.49$
• DFBETAS: Highly influential with respect to regression coefficient $j$ if $|DFBETAS_{i(j)}| > \frac{2}{\sqrt{n}} = \frac{2}{\sqrt{100}} = 0.20$
• Cook's D: Large aggregate impact on all regression coefficients and fitted values if $D_i > 1$
  Another often used rule for Cook's D: $D_i > F_{0.50,\, p',\, n-p'}$ (graphics are also used to detect influential cases)

Summary of Influence Measures - I

• Studentized Residuals (Exceed 3.607 in absolute value)
    Extreme values: -2.172 and +2.112, so no golfer is flagged as an outlier
• Leverage Values (Exceed 0.12)
    Golfers 111 (h=0.1543), 127 (0.1263), 113 (0.1213) (No big problem)
• DFFITS (Exceed 0.49 in absolute value)
    Three Golfers between -0.61 and -0.49 (Golfers 142, 91, and 117)
    One Golfer between 0.49 and 0.59 (Golfer 59)
• Cook's D (Exceed 1, sometimes suggested to exceed 0.5)
    Max value is 0.0626. None come close to 1 (or the sometimes suggested 0.5)

Summary of Influence Measures - II

• DFBETAS (Exceed 0.20 in absolute value)
    Intercept: Golfers 117 (-0.54), 28 (0.24), 45 (0.29), 59 (0.34), 142 (0.45)
    Greens: Golfers 132 (-0.25), 91 (0.24), 110 (0.25), 142 (0.33)
    Putts: Golfers 142 (-0.41), 25 (0.24), 117 (0.43)
    Sandshots: Golfers 132 (-0.25), 111 (0.23), 39 (0.23), 110 (0.24)
    Sandsaves: Golfers 59 (-0.43), 22 (-0.31), 91 (-0.30), 102 (-0.25), 115 (0.23), 47 (0.43)
    Drive: Golfers 142 (-0.49), 59 (-0.24), 56 (0.28), 117 (0.29), 48 (0.30)
• Note that while some of these exceed the "threshold", none is extreme. However, golfers 142 and 117 appear repeatedly and should be examined

Normality of Errors

Residuals appear to be (reasonably) approximately normal. The Shapiro-Wilk test does not reject the hypothesis of normal errors.

    > shapiro.test(residuals(lpga.mod1))

            Shapiro-Wilk normality test

    data:  residuals(lpga.mod1)
    W = 0.9833, p-value = 0.2390

No evidence of non-constant error variance (data had been transformed prior to fitting model).

Equal (Homogeneous) Variance - I

Brown-Forsythe Test:
$H_0$: Equal Variance Among Errors, $V(\varepsilon_i) = \sigma^2 \;\; \forall i$
$H_A$: Unequal Variance Among Errors (Increasing or Decreasing in $X$)

1) Split the dataset into 2 groups based on levels of $\hat{Y}$, with sample sizes $n_1, n_2$
2) Compute the median residual in each group: $\tilde{e}_1, \tilde{e}_2$
3) Compute the absolute deviation from the group median for each residual: $d_{ij} = |e_{ij} - \tilde{e}_j|, \quad i = 1, \ldots, n_j, \quad j = 1, 2$
4) Compute the mean and variance for each group of $d_{ij}$: $\bar{d}_1, s_1^2$ and $\bar{d}_2, s_2^2$
5) Compute the pooled variance: $s^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$

Test Statistic: $t_{BF} = \dfrac{\bar{d}_1 - \bar{d}_2}{s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \overset{H_0}{\sim} t_{n_1 + n_2 - 2}$

    Group  Yhat_L  Yhat_H  n(i)   med(e)  dbar(i)   s2(i)
    1       5.976   7.972    50   0.0379   0.2493  0.0404
    2       8.005  10.217    50  -0.0310   0.3031  0.0427

    s2       0.0415
    t(BF)   -1.3211
    t(.025)  1.9646
    P-value  0.1871

No evidence to reject the null hypothesis of equal variance among errors.

Equal (Homogeneous) Variance - II

Breusch-Pagan (aka Cook-Weisberg) Test:
$H_0$: Equal Variance Among Errors, $V(\varepsilon_i) = \sigma^2 \;\; \forall i$
$H_A$: Unequal Variance Among Errors, $\sigma_i^2 = \sigma^2 h(\gamma_1 X_{i1} + \cdots + \gamma_p X_{ip})$

1) Let $SSE = \sum_{i=1}^{n} e_i^2$
2) Fit the regression of $e_i^2$ on $X_{i1}, \ldots, X_{ip}$ and obtain $SS(Reg^*)$

Test Statistic: $X^2_{BP} = \dfrac{SS(Reg^*)/2}{\left(\sum_{i=1}^{n} e_i^2 / n\right)^2} \overset{H_0}{\sim} \chi^2_p$

    ANOVA (regression of squared residuals on the predictors)
                df        SS
    Regression   5  0.053308
    Residual    94  1.941871

    SS(Reg*)       0.053308
    SSE            11.74995
    SS(Reg*)/2     0.026654
    SSE/n          0.1175
    X2(BP)         1.930591
    X2(.05,df=5)   11.0705
    P-value        0.858663

            Breusch-Pagan test

    data:  logprz ~ green + putts + sandshot + sandsave + drive
    BP = 1.9306, df = 5, p-value = 0.8587

There is no evidence of unequal variance, based on either the Brown-Forsythe or the Breusch-Pagan test.
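
As a rough check on the selection criterion, the AIC formula from the Modeling Strategies slide can be applied to the RSS values in the backward-elimination tables. The sketch below is in Python rather than the R session that produced the slides. The function name `aic` is hypothetical, and the parameter counts (7 for the full model, 6 after dropping fairway) assume the intercept is counted in p'.

```python
import math

def aic(sse, n, p_prime):
    """AIC = n*log(SSE) - n*log(n) + 2*p', as written on the slides."""
    return n * math.log(sse) - n * math.log(n) + 2 * p_prime

n = 100  # training-sample size

# RSS values copied from the backward-elimination tables.
# Assumption: p' counts the intercept, so the full model has 7 parameters.
aic_full = aic(11.740, n, 7)
aic_no_fairway = aic(11.750, n, 6)

print(round(aic_full, 2), round(aic_no_fairway, 2))

# Backward elimination drops fairway because doing so lowers AIC.
drop_fairway = aic_no_fairway < aic_full
```

Because the RSS values are rounded to three decimals on the slides, the recomputed AIC values should agree with the reported -200.216 and -202.132 only up to rounding.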
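
The influence cutoffs quoted for n=100 and p'=6 follow directly from the rules of thumb on the Influence Measures slide. A minimal sketch of that arithmetic follows, using only the Python standard library. The variable names are illustrative, and the studentized-residual cutoff (3.607) is not recomputed because it needs a t quantile.

```python
import math

n, p_prime = 100, 6  # values stated on the Influence Measures slide

leverage_cutoff = 2 * p_prime / n           # 2p'/n
dffits_cutoff = 2 * math.sqrt(p_prime / n)  # 2*sqrt(p'/n)
dfbetas_cutoff = 2 / math.sqrt(n)           # 2/sqrt(n)

# Leverage values reported on the summary slide for three golfers.
leverages = {111: 0.1543, 127: 0.1263, 113: 0.1213}
flagged = sorted(g for g, h in leverages.items() if h > leverage_cutoff)

print(leverage_cutoff, round(dffits_cutoff, 2), dfbetas_cutoff, flagged)
```

All three reported golfers exceed the 0.12 leverage cutoff, though only slightly in two cases, which is consistent with the "No big problem" remark on the slide.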
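
The two variance-test statistics can be reproduced from the summary numbers on the slides. The sketch below is an illustration in Python, not the original computation, and the function names `brown_forsythe_t` and `breusch_pagan_chi2` are hypothetical. It computes only the test statistics. Critical values and p-values would need a statistics library and are left out.

```python
import math

def brown_forsythe_t(d1, d2, s1_sq, s2_sq, n1, n2):
    """Pooled-variance t statistic on absolute deviations from group medians."""
    pooled = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    t_bf = (d1 - d2) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return pooled, t_bf

def breusch_pagan_chi2(ss_reg_star, sse, n):
    """X^2_BP = (SS(Reg*)/2) / (SSE/n)^2."""
    return (ss_reg_star / 2) / (sse / n) ** 2

# Group means and variances of d_ij taken from the Brown-Forsythe table.
pooled_var, t_bf = brown_forsythe_t(0.2493, 0.3031, 0.0404, 0.0427, 50, 50)

# SS(Reg*) and SSE taken from the Breusch-Pagan table.
x2_bp = breusch_pagan_chi2(0.053308, 11.74995, 100)

print(round(pooled_var, 4), round(t_bf, 3), round(x2_bp, 4))
```

The Brown-Forsythe statistic computed this way differs from the reported -1.3211 in the third decimal place because the table inputs are rounded. The Breusch-Pagan value matches the reported 1.9306.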