LPGA 2008 Case Study

Regression Model Building
LPGA Golf Performance - 2008
Data Description
• Response: log(Prize Winnings/Round) – Skewed data
• Potential Predictors:
  - Average Drive Distance
  - Percentage of Drives Reaching Fairway
  - Percentage of Greens Reached in Regulation
  - Average Putts per Hole
  - Average Number of Sand Traps Hit per Round (Sandshot)
  - Percentage of Sand Saves
• Samples:
  - Training Sample: 100 Randomly Sampled Golfers
  - Validation Sample: 57 Remaining Golfers used to assess fit
Modeling Strategies
• Select Training Sample
• Select “best” subset of predictors based on Backward
Elimination, Forward Selection, Stepwise Regression
and/or All Possible Regressions based on Minimizing:
$\mathrm{AIC} = n \log\left(\mathrm{SSE}(\text{Model})\right) - n \log(n) + 2p'$
where $p'$ = # parameters in model
• Identify any Influential Observations (based on Outliers,
Leverage Values, DFFITS, DFBETAS, Cook’s D)
• Test Model Assumptions: Normality (Shapiro-Wilk),
Constant Variance (Brown-Forsythe and Breusch-Pagan)
• Determine Validity of model by obtaining prediction
errors for validation sample
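The training/validation split in the first step can be sketched as follows. This is a minimal illustration in Python (not the R session behind the slides); the golfer IDs 1 to 157, the seed, and the function name `split_sample` are assumptions made for the example, not details from the original analysis.

```python
import random

def split_sample(golfer_ids, n_train, seed=2008):
    """Randomly split IDs into a training list and a validation list."""
    rng = random.Random(seed)                 # fixed seed so the split is reproducible
    train = rng.sample(golfer_ids, n_train)   # sampling without replacement
    train_set = set(train)
    valid = [g for g in golfer_ids if g not in train_set]
    return sorted(train), valid

golfer_ids = list(range(1, 158))              # 157 golfers, hypothetical IDs 1..157
train_ids, valid_ids = split_sample(golfer_ids, n_train=100)
print(len(train_ids), len(valid_ids))         # 100 57
```

The model is then fit on the training IDs only, and prediction errors are computed on the held-out validation IDs.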
Top of Entire Sample (First 20 Golfers)
golfer                 drive  fairway  green  putts  sandshot  sandsave    prz  logprz
Ahn, Shi Hyun          249.4     64.6   61.2  27.44        55      34.5   6063  8.7099
Alfredsson, Helen      253.8     62.7   68.2  29.36        49      38.8  19343  9.8701
Ammaccapane, Dina      246.3     70.2   64.6  30.20        37      40.5   1873  7.5353
Bader, Beth            249.1     64.1   61.2  29.78        73      41.1   1212  7.1004
Bae, Kyeong            244.0     62.4   60.7  28.38        66      43.9   2555  7.8459
Baena, Marisa          254.2     64.7   60.9  29.21        66      33.3   2282  7.7327
Bastel, Emily          237.4     73.6   60.5  30.60        59      28.8    921  6.8258
Blasberg, Erica        245.4     69.2   63.2  28.68        44      27.3   1923  7.5614
Blomqvist, Minea       253.2     62.6   59.7  27.35        70      44.3   6726  8.8137
Bowie Young, Heather   251.0     67.4   63.0  28.83        58      34.5   2689  7.8969
Bunch, Ashli           246.6     70.1   64.7  31.36        35      42.9   1281  7.1551
Burks, Audra           239.2     68.6   60.5  30.11        28      39.3   1460  7.2863
Burton, Brandie        244.2     65.5   67.3  30.62        47      27.7   1668  7.4193
Castrale, Nicole       245.2     71.3   67.1  28.92        61      27.9   7209  8.8830
Cavalleri, Silvia      240.7     69.1   59.6  30.08        53      35.8   1947  7.5742
Cho, Irene             243.5     70.2   63.3  29.29        49      42.9   3214  8.0754
Choi, Hye Jung         242.5     69.3   60.9  27.78        73      37.0   3470  8.1518
Choi, Na Yeon          257.4     68.5   68.5  28.43        55      45.5  14808  9.6029
Chung, Ilmi            242.6     64.6   63.0  28.54        75      29.3   2827  7.9470
Coutu, Taylor          241.0     70.0   63.0  30.13        37      48.6   2252  7.7194
Backward Elimination (RSS = SSE)
Step 1:
Start: AIC=-200.22
logprz ~ drive + fairway + green + putts + sandshot + sandsave

            Df Sum of Sq    RSS       AIC
- fairway    1     0.010 11.750  -202.132
<none>                   11.740  -200.216
- drive      1     0.397 12.138  -198.887
- sandsave   1     0.405 12.145  -198.827
- sandshot   1     1.030 12.770  -193.806
- green      1    24.960 36.700   -88.238
- putts      1    35.360 47.100   -63.289

Step 2: AIC=-202.13
logprz ~ drive + green + putts + sandshot + sandsave

            Df Sum of Sq    RSS       AIC
<none>                   11.750  -202.132
- sandsave   1     0.400 12.150  -200.784
- drive      1     0.537 12.287  -199.665
- sandshot   1     1.034 12.784  -195.698
- green      1    32.091 43.841   -72.461
- putts      1    35.688 47.438   -64.575

• At Step 1, fairway is eliminated: removing it minimizes AIC (-202.132 < -200.216)
• At Step 2, no other variables are removed (no AIC < -202.132)
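As a check on the AIC criterion, the Step 1 values can be reproduced by hand from the reported RSS. This worked calculation assumes natural logarithms, $n = 100$, and that $p'$ counts the intercept (7 parameters in the full model, 6 after dropping fairway); those assumptions are inferred from the reported numbers rather than stated on the slides.

$$\mathrm{AIC}_{\text{full}} = 100\ln(11.740) - 100\ln(100) + 2(7) \approx 246.30 - 460.52 + 14 = -200.22$$

$$\mathrm{AIC}_{-\text{fairway}} = 100\ln(11.750) - 100\ln(100) + 2(6) \approx 246.39 - 460.52 + 12 = -202.13$$

Dropping fairway raises the RSS only slightly (by 0.010), so the penalty saved by removing one parameter outweighs the loss of fit.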
Forward Selection (RSS = SSE)
Step 1: Start: AIC=-6.61
logprz ~ 1

            Df Sum of Sq    RSS      AIC
+ green      1    38.599 53.150  -59.206
+ putts      1    33.043 58.706  -49.263
+ drive      1    11.622 80.126  -18.156
+ sandshot   1     8.951 82.798  -14.876
+ sandsave   1     3.118 88.631   -8.069
<none>                   91.749   -6.611
+ fairway    1     0.409 91.340   -5.058

Step 2: AIC=-59.21
logprz ~ green

            Df Sum of Sq    RSS       AIC
+ putts      1    39.514 13.636  -193.246
+ sandsave   1     4.859 48.291   -66.793
<none>                   53.150   -59.206
+ fairway    1     0.635 52.514   -58.408
+ drive      1     0.361 52.788   -57.888
+ sandshot   1     0.004 53.146   -57.214

Step 3: AIC=-193.25
logprz ~ green + putts

            Df Sum of Sq    RSS      AIC
+ sandshot   1   0.73688 12.899  -196.80
+ sandsave   1   0.66486 12.971  -196.25
+ drive      1   0.31495 13.321  -193.58
<none>                   13.636  -193.25
+ fairway    1   0.09401 13.542  -191.94

Step 4: AIC=-196.8
logprz ~ green + putts + sandshot

            Df Sum of Sq    RSS      AIC
+ drive      1   0.74905 12.150  -200.78
+ sandsave   1   0.61234 12.287  -199.66
<none>                   12.899  -196.80
+ fairway    1   0.25056 12.649  -196.76

Step 5: AIC=-200.78
logprz ~ green + putts + sandshot + drive

            Df Sum of Sq    RSS      AIC
+ sandsave   1   0.40005 11.750  -202.13
<none>                   12.150  -200.78
+ fairway    1   0.00524 12.145  -198.83

Step 6: AIC=-202.13
logprz ~ green + putts + sandshot + drive + sandsave

            Df Sum of Sq    RSS      AIC
<none>                    11.75  -202.13
+ fairway    1 0.0099086  11.74  -200.22
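The logic of a single forward-selection step can be sketched in a few lines. The example below is in Python rather than the R used for the output above, and the helper names `aic` and `best_addition` are illustrative, not part of any library. It takes the RSS each candidate model would have and returns the addition with the lowest AIC, or `None` when no addition improves on the current model. The RSS values are copied from Steps 1 and 6 above; natural logs and $n = 100$ are assumed.

```python
from math import log

def aic(sse, n, n_params):
    """AIC = n*log(SSE) - n*log(n) + 2*p', with natural logs."""
    return n * log(sse) - n * log(n) + 2 * n_params

def best_addition(rss_if_added, rss_current, n, n_params_current):
    """Return (term, AIC) for the best single addition, or (None, current AIC)."""
    best_term, best_aic = None, aic(rss_current, n, n_params_current)
    for term, rss in rss_if_added.items():
        candidate = aic(rss, n, n_params_current + 1)  # one extra parameter
        if candidate < best_aic:
            best_term, best_aic = term, candidate
    return best_term, best_aic

# Step 1: intercept-only model (1 parameter, RSS = 91.749)
step1 = {"green": 53.150, "putts": 58.706, "drive": 80.126,
         "sandshot": 82.798, "sandsave": 88.631, "fairway": 91.340}
term1, aic1 = best_addition(step1, 91.749, n=100, n_params_current=1)

# Step 6: five predictors plus intercept (6 parameters, RSS = 11.75)
term6, aic6 = best_addition({"fairway": 11.74}, 11.75, n=100, n_params_current=6)

print(term1, round(aic1, 2))
print(term6, round(aic6, 2))
```

Under these inputs the first call selects green and the last call selects nothing, which matches where the procedure starts and stops above.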
Model – green, putts, sandshot, sandsave, drive
Call:
lm(formula = logprz ~ green + putts + sandshot + sandsave + drive,
    data = lpga.cv.in)

Residuals:
     Min       1Q   Median       3Q      Max
-0.72852 -0.20634  0.01067  0.22439  0.72316

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.272879   1.580975   9.028 2.14e-14 ***
green        0.210379   0.013130  16.023  < 2e-16 ***
putts       -0.625367   0.037011 -16.897  < 2e-16 ***
sandshot     0.790771   0.274937   2.876  0.00498 **
sandsave     0.008334   0.004658   1.789  0.07684 .
drive       -0.009563   0.004615  -2.072  0.04098 *
---
Residual standard error: 0.3536 on 94 degrees of freedom
Multiple R-squared: 0.8719, Adjusted R-squared: 0.8651
F-statistic: 128 on 5 and 94 DF, p-value: < 2.2e-16
$\hat{Y} = 14.2729 + 0.2104\,\text{green} - 0.6254\,\text{putts} + 0.7908\,\text{sandshot} + 0.0083\,\text{sandsave} - 0.0096\,\text{drive}$
$s = 0.354$
$R^2 = 0.8719$
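Because the response is on the log scale, each coefficient is easier to read after exponentiating. The short Python sketch below (not from the original R session) converts the fitted coefficients into multiplicative effects on prize winnings per round for a one-unit increase in each predictor, holding the others fixed. It assumes the natural log was used, which agrees with the data table (for example, ln 6063 ≈ 8.7099).

```python
from math import exp

# Fitted coefficients copied from the lm() summary above
coef = {
    "green": 0.210379,
    "putts": -0.625367,
    "sandshot": 0.790771,
    "sandsave": 0.008334,
    "drive": -0.009563,
}

# exp(b) is the factor by which predicted prize/round changes per one-unit increase
multipliers = {name: exp(b) for name, b in coef.items()}

for name, m in multipliers.items():
    print(f"{name:9s} x{m:.3f}")
```

For instance, one additional percentage point of greens in regulation corresponds to roughly a 23% higher predicted prize per round, while one additional putt corresponds to a drop of roughly 46%.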
Influence Measures (n=100, p’=6)
Studentized Residuals: Outlier if $|r_i| > t_{\alpha/(2n),\; n-p'-1} = t_{.05/(2(100)),\; 93} = 3.607$
Leverage Values: Potentially highly influential if $h_i > \frac{2p'}{n} = \frac{12}{100} = 0.12$
DFFITS: Highly influential wrt own fitted value if $|\mathrm{DFFITS}_i| > 2\sqrt{\frac{p'}{n}} = 2\sqrt{\frac{6}{100}} = 0.49$
DFBETAS: Highly influential wrt regression coefficient if $|\mathrm{DFBETAS}_{i(j)}| > \frac{2}{\sqrt{n}} = \frac{2}{\sqrt{100}} = 0.20$
Cook's D: Large aggregate impact on all regression coefficients and fitted values if $D_i > 1$
Another often used rule for Cook's D: $D_i > F(0.50;\; p',\; n-p')$ (graphics are also used to detect influential cases)
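The size-based cutoffs above depend only on $n$ and $p'$, so they are easy to recompute for other samples. The Python sketch below uses a hypothetical helper name (`influence_cutoffs`) and covers only the three rules of thumb that need no distribution tables; the studentized-residual and Cook's D cutoffs require $t$ and $F$ quantiles and are left out.

```python
from math import sqrt

def influence_cutoffs(n, p_prime):
    """Rule-of-thumb cutoffs for leverage, DFFITS, and DFBETAS."""
    return {
        "leverage": 2 * p_prime / n,        # h_i > 2p'/n
        "dffits": 2 * sqrt(p_prime / n),    # |DFFITS_i| > 2*sqrt(p'/n)
        "dfbetas": 2 / sqrt(n),             # |DFBETAS_i(j)| > 2/sqrt(n)
    }

cutoffs = influence_cutoffs(n=100, p_prime=6)
for name, value in cutoffs.items():
    print(name, round(value, 2))
```

With n = 100 and p' = 6 these round to the 0.12, 0.49, and 0.20 used on the slide.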
Summary of Influence Measures - I
• Studentized Residuals (Exceed 3.607 in absolute value)
  - Extreme values (in absolute value): -2.172 and +2.112
• Leverage Values (Exceed 0.12)
  - Golfers 111 (h=0.1543), 127 (0.1263), 113 (0.1213) (No big problem)
• DFFITS (Exceed 0.49 in absolute value)
  - Three Golfers between -0.61 and -0.49 (Golfers 142, 91, and 117)
  - One Golfer between 0.49 and 0.59 (Golfer 59)
• Cook’s D (Exceed 1, sometimes suggested to exceed 0.5)
  - Max value is 0.0626. None come close to 1 (or the sometimes suggested 0.5)
Summary of Influence Measures
• DFBETAS (Exceed 0.20 in absolute value)
  - Intercept: Golfers 117 (-0.54), 28 (0.24), 45 (0.29), 59 (0.34), 142 (0.45)
  - Greens: Golfers 132 (-0.25), 91 (0.24), 110 (0.25), 142 (0.33)
  - Putts: Golfers 142 (-0.41), 25 (0.24), 117 (0.43)
  - Sandshots: Golfers 132 (-0.25), 111 (0.23), 39 (0.23), 110 (0.24)
  - Sandsaves: Golfers 59 (-0.43), 22 (-0.31), 91 (-0.30), 102 (-0.25), 115 (0.23), 47 (0.43)
  - Drive: Golfers 142 (-0.49), 59 (-0.24), 56 (0.28), 117 (0.29), 48 (0.30)
• Note that while some of these exceed the “threshold”,
none is far beyond it. However, golfers 142 and 117
appear repeatedly and should be checked.
Residuals appear to be reasonably close to normal. The Shapiro-Wilk
test does not reject the hypothesis of normal errors.
> shapiro.test(residuals(lpga.mod1))
Shapiro-Wilk normality test
data: residuals(lpga.mod1)
W = 0.9833, p-value = 0.2390
No evidence of non-constant error variance (data had been
transformed prior to fitting model).
Equal (Homogeneous) Variance - I
Brown-Forsythe Test:
$H_0$: Equal Variance Among Errors: $V(\varepsilon_i) = \sigma^2 \;\; \forall i$
$H_A$: Unequal Variance Among Errors (Increasing or Decreasing in $X$)
1) Split Dataset into 2 groups based on levels of $\hat{Y}$ with sample sizes: $n_1, n_2$
2) Compute the median residual in each group: $\tilde{e}_1, \tilde{e}_2$
3) Compute absolute deviation from group median for each residual:
$d_{ij} = |e_{ij} - \tilde{e}_j| \qquad i = 1, \ldots, n_j \qquad j = 1, 2$
4) Compute the mean and variance for each group of $d_{ij}$: $\bar{d}_1, s_1^2$ and $\bar{d}_2, s_2^2$
5) Compute the pooled variance: $s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
Test Statistic: $t_{BF} = \frac{\bar{d}_1 - \bar{d}_2}{s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \overset{H_0}{\sim} t_{n_1 + n_2 - 2}$

Group  Yhat_L  Yhat_H  n(i)   med(e)  dbar(i)   s2(i)
1       5.976   7.972    50   0.0379   0.2493  0.0404
2       8.005  10.217    50  -0.0310   0.3031  0.0427

s2       0.0415
t(BF)   -1.3211
t(.025)  1.9646
P-value  0.1871

No evidence to reject the null hypothesis of equal
variance among errors
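The five steps of the Brown-Forsythe test can be sketched directly from residuals and fitted values. The Python version below is an illustration under stated assumptions, not the code used for the slides: it splits the ordered fitted values into two halves (the slides do not say exactly how the cut point was chosen), and the function names are hypothetical. A second helper recomputes $t_{BF}$ from the summary table above as a consistency check.

```python
from math import sqrt
from statistics import mean, median, variance

def brown_forsythe_t(residuals, fitted):
    """Return (t_BF, df) after splitting cases into low/high fitted-value halves."""
    # Step 1: order residuals by fitted value and split into two groups
    ordered = [e for _, e in sorted(zip(fitted, residuals))]
    half = len(ordered) // 2
    groups = [ordered[:half], ordered[half:]]
    # Steps 2-3: absolute deviations from each group's median residual
    devs = [[abs(e - median(g)) for e in g] for g in groups]
    n1, n2 = len(devs[0]), len(devs[1])
    # Step 4: group means and sample variances of the deviations
    d1, d2 = mean(devs[0]), mean(devs[1])
    # Step 5: pooled variance
    s2 = ((n1 - 1) * variance(devs[0]) + (n2 - 1) * variance(devs[1])) / (n1 + n2 - 2)
    return (d1 - d2) / (sqrt(s2) * sqrt(1 / n1 + 1 / n2)), n1 + n2 - 2

def bf_t_from_summary(d1, d2, s2, n1, n2):
    """t_BF computed from group means of deviations and the pooled variance."""
    return (d1 - d2) / (sqrt(s2) * sqrt(1 / n1 + 1 / n2))

# Consistency check against the summary table (values are rounded in the table)
t_slide = bf_t_from_summary(0.2493, 0.3031, 0.0415, 50, 50)
print(round(t_slide, 2))

# Tiny made-up example: spread grows with the fitted value, so t is negative
toy_t, toy_df = brown_forsythe_t(
    residuals=[0.1, -0.1, 0.2, -0.2, 1.0, -1.0, 2.0, -2.0],
    fitted=[1, 2, 3, 4, 5, 6, 7, 8],
)
```

The recomputed statistic agrees with the reported -1.3211 up to rounding of the tabled inputs.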
Equal (Homogeneous) Variance
Breusch-Pagan (aka Cook-Weisberg) Test:
$H_0$: Equal Variance Among Errors: $V(\varepsilon_i) = \sigma^2 \;\; \forall i$
$H_A$: Unequal Variance Among Errors: $\sigma_i^2 = \sigma^2 h(\gamma_1 X_{i1} + \cdots + \gamma_p X_{ip})$
1) Let $SSE = \sum_{i=1}^{n} e_i^2$
2) Fit Regression of $e_i^2$ on $X_{i1}, \ldots, X_{ip}$ and obtain $SS(Reg^*)$
Test Statistic: $X^2_{BP} = \frac{SS(Reg^*)/2}{\left(\sum_{i=1}^{n} e_i^2 / n\right)^2} \overset{H_0}{\sim} \chi^2_p$

ANOVA
            df        SS
Regression   5  0.053308
Residual    94  1.941871

SS(Reg*)       0.053308
SSE            11.74995
SS(Reg*)/2     0.026654
SSE/n          0.1175
X2(BP)         1.930591
X2(.05,df=5)   11.0705
P-value        0.858663

Breusch-Pagan test
data: logprz ~ green + putts + sandshot + sandsave + drive
BP = 1.9306, df = 5, p-value = 0.8587

There is no evidence of unequal variance,
based on either Brown-Forsythe or Breusch-Pagan tests
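The Breusch-Pagan statistic can be recomputed from the two sums of squares reported above. The Python sketch below uses a hypothetical helper name (`breusch_pagan_x2`) and assumes n = 100, the training sample size; the comparison uses the 5% chi-square critical value quoted in the table rather than computing a p-value.

```python
def breusch_pagan_x2(ss_reg_star, sse, n):
    """X2_BP = (SS(Reg*)/2) / (SSE/n)^2."""
    return (ss_reg_star / 2) / (sse / n) ** 2

x2_bp = breusch_pagan_x2(ss_reg_star=0.053308, sse=11.74995, n=100)
critical_value = 11.0705          # chi-square(.05, df=5), as quoted in the table

print(round(x2_bp, 4))
print("reject H0" if x2_bp > critical_value else "fail to reject H0")
```

The statistic comes out near 1.93, far below the critical value, in line with the conclusion that there is no evidence of unequal variance.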