What Factors Influence a Golfer's Score?
By: Travis Atkinson
Abstract
Have you ever wondered what would be a quick way to lower your average golf score? Many factors
lie behind a score of 72 strokes for a round of golf. Some of them, such as the course and the weather,
are outside a golfer's control, but others, such as clubs, swing, and mental decisions, are not. Which do
you think affects your average golf score the most?
Introduction
This project focuses on several variables that are related to a golfer's average score. The problem
is to determine what golfers should focus on most to improve their average score. This is worth solving
because it will help golfers know which part of their game to improve in order to shoot a low score. I
decided to focus on the golfer's first shot, the drive, because without a good drive the rest of the round
can suffer. The question my project will help answer is whether the number of rounds golfed per year,
driving accuracy, and driving distance are all related to the average score.
Potential Solution
The statistical tool I plan to use is multiple linear regression. By regressing average score on
these variables, I will be able to tell whether each variable is related to average score and how strongly.
With average score as my response variable and the driving and rounds statistics as my explanatory
variables, I will see which is more important for golfers to focus on to lower their average score.
Environment
The population for my project is golfers in the Professional Golfers' Association (PGA) from 1998
to present. I chose a sample of roughly ten years because it is large enough to be useful but does not go
back so far that equipment technology changes greatly across the sample. Professionals are also the best
group to study because they are paid to focus on and improve their golf game, and their statistics are
already kept by the PGA, with easy access and over 25 years of data. There were 48 golfers who played
from 1998 to present, a few of them missing years because of injury or declining performance. The data
include all 48 golfers' totals and averages for each year. The variables recorded were rounds, driving
distance average, total distance, total drives, driving accuracy average, total fairways hit, possible
fairways hit, scoring average, and total strokes. The response variable is the player's scoring average.
Golfers can be compared by who hits the ball farther or straighter, but in the end it comes down to the
final score and who has the lowest. The explanatory variables I compare to the response variable are
driving accuracy average, driving distance average, and number of rounds. Many more variables could be
examined, but I chose these because I believe the first shot is important to how the rest of the round
goes.
Model
The model will show whether the variables are actually related to the average score and how
closely. Plugging the response and explanatory variables into the multiple linear regression equation,
my model looked like:
$$\text{Average score} = \beta_0 + \beta_1(\text{Driving distance}) + \beta_2(\text{Driving accuracy}) + \beta_3(\text{Rounds}) + \text{Error}$$
How the error term relates to the regressors, for example whether they are correlated, is a
crucial consideration in formulating a linear regression model, because it determines which estimation
method to use. This relationship plays an important role in determining whether an estimation procedure
has desirable sampling properties such as being unbiased and consistent. Here the coefficients are
estimated by least squares: we find the coefficients for which the sum of the squared errors is a
minimum. Using this equation will help me answer my question.
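The least-squares idea described above can be sketched in a few lines of Python. This is only an illustration, not the Minitab procedure used in this project: the data are randomly generated stand-ins (not the PGA data), and the names `driving_distance`, `driving_accuracy`, `rounds`, and `true_beta` are assumptions made for the example.

```python
import numpy as np

# Hypothetical illustration data (NOT the PGA data from this project):
# each entry stands for one golfer-season.
rng = np.random.default_rng(0)
n = 200
driving_distance = rng.normal(283, 10, n)   # yards
driving_accuracy = rng.normal(66, 6, n)     # percent of fairways hit
rounds = rng.normal(81, 18, n)              # rounds played

# Build scores from known coefficients plus random error, so we can see
# that least squares recovers values close to them.
true_beta = np.array([86.0, -0.036, -0.061, -0.009])
X = np.column_stack([np.ones(n), driving_distance, driving_accuracy, rounds])
score = X @ true_beta + rng.normal(0, 0.6, n)

# Least squares: choose the coefficients that minimize the sum of squared errors.
beta_hat, _, _, _ = np.linalg.lstsq(X, score, rcond=None)
residuals = score - X @ beta_hat
sse = float(residuals @ residuals)

# For comparison: the sum of squared errors at the "true" coefficients.
sse_true = float(((score - X @ true_beta) ** 2).sum())

print("estimated coefficients:", np.round(beta_hat, 4))
print("sum of squared errors at the estimate:", round(sse, 3))
print("sum of squared errors at true_beta:   ", round(sse_true, 3))
```

The fitted coefficients always give a sum of squared errors no larger than any other choice of coefficients, including the ones used to generate the data.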
Descriptive Stats
My minimums and maximums may be far from the median because some golfers are strong, in
their prime, and focus on distance, while others are older, hit the ball shorter, and focus on accuracy.
There were many variables to pick from, but I believe these three are closely related to a golfer's
average score.
Summary for Rounds
[Figure: Minitab graphical summary for Rounds (histogram and 95% confidence intervals for the mean and median)]
Anderson-Darling Normality Test
  A-Squared        7.24
  P-Value        < 0.005
Mean              81.138
StDev             17.676
Variance         312.445
Skewness          -0.764977
Kurtosis           0.315890
N                559
Minimum           26.000
1st Quartile      73.000
Median            83.000
3rd Quartile      94.000
Maximum          121.000
95% Confidence Interval for Mean      79.669 to 82.606
95% Confidence Interval for Median    82.000 to 85.000
95% Confidence Interval for StDev     16.697 to 18.778
The median is very similar to the mean, but the minimum and maximum are 95 rounds apart,
which is a large gap, and the graph shows that the distribution is left skewed (the skewness is negative
and the mean is below the median). The standard deviation for rounds is about 18, which is somewhat high.
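The sign of the skewness statistic indicates which tail of a distribution is longer: a negative value means a longer left tail, a positive value a longer right tail. The sketch below computes the mean, median, and one common sample-skewness formula for a small made-up sample. The `rounds` list and the helper `sample_skewness` are hypothetical and are not the 559 observations summarized above.

```python
import statistics

def sample_skewness(values):
    """Sample skewness: n / ((n - 1)(n - 2)) * sum of cubed standardized deviations."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    cubed = sum(((x - mean) / sd) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed

# Hypothetical rounds-per-year values with one unusually low season.
rounds = [30, 70, 75, 80, 82, 84, 86, 88, 90, 92]

mean = statistics.mean(rounds)
median = statistics.median(rounds)
skew = sample_skewness(rounds)

if skew < 0:
    shape = "left skewed"
elif skew > 0:
    shape = "right skewed"
else:
    shape = "symmetric"

print(f"mean = {mean:.1f}, median = {median:.1f}, skewness = {skew:.3f} ({shape})")
```

In this sample the single low value pulls the mean below the median and makes the skewness negative.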
Summary for Drv. Acc. %
[Figure: Minitab graphical summary for Drv. Acc. % (histogram and 95% confidence intervals for the mean and median)]
Anderson-Darling Normality Test
  A-Squared        0.76
  P-Value          0.049
Mean              66.222
StDev              5.689
Variance          32.362
Skewness          -0.330688
Kurtosis           0.241694
N                557
Minimum           46.190
1st Quartile      62.500
Median            66.700
3rd Quartile      70.185
Maximum           81.100
95% Confidence Interval for Mean      65.749 to 66.696
95% Confidence Interval for Median    66.009 to 67.156
95% Confidence Interval for StDev     5.373 to 6.044
This graph closely resembles a bell curve, which is also reflected in how close the mean and
median are. The minimum and maximum are far apart, but understandably so: an 81% driving accuracy
compared with 46% is a big difference, yet it is reasonable that two golfers can differ that much. The
standard deviation is only about 6 percentage points, but that is not something that can be ignored
either.
Summary for Avg. Drv. Dis.
[Figure: Minitab graphical summary for Avg. Drv. Dis. (histogram and 95% confidence intervals for the mean and median)]
Anderson-Darling Normality Test
  A-Squared        0.61
  P-Value          0.115
Mean             282.91
StDev             10.22
Variance         104.38
Skewness           0.091037
Kurtosis           0.300973
N                559
Minimum          249.00
1st Quartile     275.50
Median           282.60
3rd Quartile     289.00
Maximum          316.10
95% Confidence Interval for Mean      282.06 to 283.76
95% Confidence Interval for Median    281.70 to 283.92
95% Confidence Interval for StDev     9.65 to 10.85
This graph also closely resembles a bell curve, which is reflected in the mean and median being
almost exactly the same. Looking at the minimum and maximum, it is understandable that someone like
Tiger Woods can average 316 yards on a drive because he is in prime shape and has an almost perfect
swing, while an older, smaller golfer with a weaker swing might only be able to drive the ball 249 yards.
The standard deviation is about 10 yards, which might not seem like a lot, but in some situations it can
mean hitting a different club and not getting as close to the pin on the second shot.
Summary for Scoring Avg.
[Figure: Minitab graphical summary for Scoring Avg. (histogram and 95% confidence intervals for the mean and median)]
Anderson-Darling Normality Test
  A-Squared        1.07
  P-Value          0.008
Mean              70.822
StDev              0.688
Variance           0.474
Skewness          -0.03362
Kurtosis           1.09826
N                510
Minimum           68.170
1st Quartile      70.407
Median            70.825
3rd Quartile      71.240
Maximum           73.730
95% Confidence Interval for Mean      70.762 to 70.882
95% Confidence Interval for Median    70.764 to 70.880
95% Confidence Interval for StDev     0.649 to 0.733
The graph is similar to a bell curve, and most of the data is clearly in the middle. That is also
apparent from the minimum and maximum being less than 6 strokes apart. The median and mean are very
similar, and the fact that all 48 golfers' scoring averages are so close shows that it is a very
competitive sport. The standard deviation of about 0.7 strokes tells you how close all the golfers are.
Estimation:
After collecting the data for all 48 PGA golfers, I ran a regression analysis in Minitab, entering
my response and explanatory variables for the years 1998 to 2008. Although I also collected data for
2009, I left that year out of the regression on purpose so that I could use it to test my equation later
on. My initial regression produced the following output:
Regression Analysis: Scoring Avg. versus Avg. Drv. Di, Drv. Acc. %, ...
The regression equation is
Scoring Avg. = 85.9 - 0.0364 Avg. Drv. Dis. - 0.0609 Drv. Acc. % - 0.00894 Rounds
557 cases used, 31 cases contain missing values

Predictor            Coef    SE Coef        T      P
Constant           85.917      1.269    67.70  0.000
Avg. Drv. Dis.  -0.036449   0.003418   -10.66  0.000
Drv. Acc. %     -0.060855   0.006318    -9.63  0.000
Rounds          -0.008943   0.001553    -5.76  0.000

S = 0.624756   R-Sq = 25.0%   R-Sq(adj) = 24.6%

Analysis of Variance
Source           DF       SS      MS      F      P
Regression        3   71.845  23.948  61.36  0.000
Residual Error  553  215.847   0.390
Total           556  287.692

Source          DF  Seq SS
Avg. Drv. Dis.   1   8.951
Drv. Acc. %      1  49.960
Rounds           1  12.935
As you can see, my P-values all fall below the 0.05 level I was looking for. My coefficient
of determination was 25%, which means that 75% of the variation in the golfers' average scores is
not explained by these three variables. Because the test worked out but the coefficient of
determination was low, I was curious to see whether the equation could predict the 2009 average scores.
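As a check on how the coefficient of determination relates to the Analysis of Variance output, the short sketch below recomputes R-Sq, R-Sq(adj), and F from the reported sums of squares and degrees of freedom. The numbers are copied from the Minitab output; the variable names are chosen for this example only.

```python
# Sums of squares and degrees of freedom from the Analysis of Variance table.
ss_regression = 71.845
ss_residual = 215.847
ss_total = 287.692
df_regression = 3
df_residual = 553
df_total = 556

# R-Sq: share of the total variation explained by the regression.
r_sq = ss_regression / ss_total

# R-Sq(adj): compares mean squares, so it penalizes extra predictors.
r_sq_adj = 1 - (ss_residual / df_residual) / (ss_total / df_total)

# F: mean square for regression divided by mean square for residual error.
f_stat = (ss_regression / df_regression) / (ss_residual / df_residual)

print(f"R-Sq      = {r_sq:.1%}")
print(f"R-Sq(adj) = {r_sq_adj:.1%}")
print(f"F         = {f_stat:.2f}")
```

These reproduce the 25.0%, 24.6%, and 61.36 shown in the output.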
[Figure: Histogram of residuals (response is Scoring Avg.); x-axis: Residual; y-axis: Frequency]
The histogram shows how close the residuals are to a bell curve. This is a good sign for the
model, because it supports the assumption that the errors are normally distributed. The frequencies peak
near 0 and fall away on both sides, so the bell shape expected of the residuals is met here.
[Figure: Normal Probability Plot of residuals (response is Scoring Avg.); x-axis: Residual; y-axis: Percent]
This normal probability plot shows that the distribution of the error term is close enough to the
bell shape. The points follow a straight line, which indicates that the normality assumption of the model
is reasonable.
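To show what "the points follow a line" means numerically, the sketch below builds the points of a normal probability plot and measures how straight they are with a correlation coefficient. The `residuals` list is made up for illustration and is not the residuals from my regression. The plotting positions `(i - 0.375) / (n + 0.25)` are one common convention and are an assumption here, not necessarily the one Minitab uses.

```python
from statistics import NormalDist, mean

def normal_plot_points(residuals):
    """Pair each ordered residual with the standard normal quantile expected at its rank."""
    ordered = sorted(residuals)
    n = len(ordered)
    std_normal = NormalDist()
    quantiles = [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))

def correlation(pairs):
    """Pearson correlation between the two coordinates of a list of (x, y) pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical residuals, roughly symmetric around zero.
residuals = [-1.4, -0.9, -0.6, -0.4, -0.2, -0.1, 0.0, 0.1, 0.3, 0.4, 0.7, 0.9, 1.3]

points = normal_plot_points(residuals)
r = correlation(points)
print(f"correlation between residuals and normal quantiles: {r:.3f}")
```

A correlation close to 1 corresponds to points lying near a straight line on the plot; strongly skewed or heavy-tailed residuals would give a visibly lower value.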
[Figure: Residuals Versus Fits (response is Scoring Avg.); x-axis: Fitted Value; y-axis: Residual]
This graph plots the residuals against the fitted values and is used to look for randomness in the
regression results. There is a fair amount of randomness in this graph, which suggests that the sample is
decent. The points scatter evenly around the horizontal line at zero, which suggests a constant variance.
The lack of any visible pattern also suggests that there is no obvious problem with the form of the
equation.
Validation
The table below compares each golfer's actual 2009 scoring average with the average estimated
by the model fitted to the 1998 to 2008 data. It shows that my equation was able to predict the golfers'
scores to within a small margin, similar to the standard deviation of the average score, without
considering the many outside factors. The standard deviation of the average score was 0.7, and most of
the differences between the actual score and the model's score are within 0.7, so predicting 2009 with
an error allowance of one standard deviation (0.7) appears reasonable.
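The prediction step can be sketched as a small function that plugs a golfer's statistics into the fitted regression equation. The coefficients are the ones reported in the Minitab output; the function name `predict_scoring_avg` and the example golfer (285 yards, 65% accuracy, 80 rounds) are hypothetical and are not a row from the 2009 table.

```python
def predict_scoring_avg(avg_drv_dis, drv_acc_pct, rounds):
    """Estimated scoring average from the fitted regression equation."""
    return (85.917
            - 0.036449 * avg_drv_dis   # yards
            - 0.060855 * drv_acc_pct   # percent of fairways hit
            - 0.008943 * rounds)       # rounds played

# Hypothetical golfer, for illustration only.
example = predict_scoring_avg(285.0, 65.0, 80)

# Effect of 10 extra yards of driving distance, everything else held fixed.
ten_yard_effect = predict_scoring_avg(295.0, 65.0, 80) - example

print(f"estimated scoring average: {example:.2f}")
print(f"change for 10 extra yards: {ten_yard_effect:.3f} strokes")
```

Because all three coefficients are negative, the equation estimates a lower scoring average for golfers who drive farther, hit more fairways, or play more rounds.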
2009
Name
Stephen Ames
Billy Andrade
Stuart Appleby
Tommy Armour III
Mark Brooks
Mark Calcavecchia
Stewart Cink
Fred Couples
John Daly
Glen Day
Chris DiMarco
Joe Durant
David Duval
Ernie Els
Bob Estes
Brad Faxon
Steve Flesch
Harrison Frazar
Jim Furyk
Paul Goydos
Dudley Hart
Tim Herron
John Huston
Lee Janzen
Kent Jones
Jonathan Kaye
Jerry Kelly
Greg Kraft
Tom Lehman
Justin Leonard
Davis Love III
Steve Lowery
Jeff Maggert
Billy Mayfair
Scott McCarron
Rocco Mediate
Phil Mickelson
Jesper Parnevik
Corey Pavin
Tom Pernice, Jr.
Kenny Perry
Brett Quigley
Vijay Singh
Kevin Sutherland
Kirk Triplett
David Toms
Scott Verplank
Mike Weir
Tiger Woods
[2009 data: 48 rows in source order. The name list above has 49 entries, so the rows are not paired with names here.]
Rounds  Drv. Acc. %  Avg. Drv. Dis.  Real Scoring Avg.  Model Est. Scoring Avg.  Difference
43      66.37        288.1           70.29              71.14                    0.85
30      65.66        282             72.83              71.55                    -1.28
43      56.34        286.7           72.45              71.77                    -0.68
45      66.83        285.7           71.00              71.18                    0.18
32      70.36        264.8           71.38              71.85                    0.47
40      57.19        288.8           71.45              71.69                    0.24
47      57.44        292.2           71.15              71.47                    0.32
32      53.03        299.1           70.88              71.67                    0.79
35      68.21        292.2           71.04              70.99                    -0.05
62      66.17        277.5           70.42              71.30                    0.88
38      72.64        277.3           71.05              71.21                    0.16
34      46.19        292.4           72.59              72.27                    -0.32
42      65.49        290.7           70.42              71.12                    0.70
53      68.11        281.5           70.04              71.15                    1.11
35      56.58        267.2           74.09              72.54                    -1.55
45      62.66        282.6           71.31              71.53                    0.22
52      47.98        300.6           70.38              71.67                    1.29
42      69.69        275.1           70.15              71.42                    1.27
41      70.17        274             70.83              71.44                    0.61
38      63           273.5           71.79              71.91                    0.12
51      55.65        289.7           71.27              71.61                    0.34
39      58.31        290             71.67              71.59                    -0.08
45      60.7         281.7           70.93              71.67                    0.74
41      64.71        277.9           71.71              71.62                    -0.09
30      60.48        288.1           71.50              71.64                    0.14
48      60.22        281.7           70.17              71.66                    1.49
37      61.12        270.6           72.78              72.13                    -0.65
33      69.83        269.4           71.45              71.71                    0.26
53      68.6         284.3           69.78              71.03                    1.25
49      58.7         297.8           70.68              71.18                    0.50
50      62.12        288             71.78              71.31                    -0.47
44      66.01        287.1           70.98              71.19                    0.21
42      66.9         280.1           73.14              71.40                    -1.74
49      71.58        284             70.98              70.91                    -0.07
47      63.55        280.7           71.72              71.52                    -0.20
38      51.12        298.4           70.71              71.73                    1.02
30      62.07        285.2           71.60              71.65                    0.05
41      68.56        257.9           70.93              72.09                    1.16
50      61.76        279.8           71.60              71.62                    0.02
52      64.52        292.6           69.86              70.99                    1.13
55      57.98        286.9           71.05              71.53                    0.48
41      61.38        294.8           71.41              71.22                    -0.19
55      62.75        289.9           70.83              71.15                    0.32
38      69.33        284.5           71.71              71.16                    -0.55
52      74.74        286.3           69.42              70.61                    1.19
52      71.68        281.7           70.12              70.95                    0.83
45      64.6         280.9           70.43              71.47                    1.04
26      61.9         293.4           69.58              71.42                    1.84
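The claim that most differences fall within 0.7 strokes can be checked directly from the Difference column of the 2009 table. The sketch below copies those 48 values (model estimate minus real scoring average), counts how many are within 0.7 in absolute value, and also prints the average difference, which indicates whether the model tends to over- or under-estimate. The variable names are chosen for this example only.

```python
# Difference column from the 2009 table (model estimate minus real scoring average).
differences = [
    0.85, -1.28, -0.68, 0.18, 0.47, 0.24, 0.32, 0.79,
    -0.05, 0.88, 0.16, -0.32, 0.70, 1.11, -1.55, 0.22,
    1.29, 1.27, 0.61, 0.12, 0.34, -0.08, 0.74, -0.09,
    0.14, 1.49, -0.65, 0.26, 1.25, 0.50, -0.47, 0.21,
    -1.74, -0.07, -0.20, 1.02, 0.05, 1.16, 0.02, 1.13,
    0.48, -0.19, 0.32, -0.55, 1.19, 0.83, 1.04, 1.84,
]

tolerance = 0.7  # about one standard deviation of scoring average

within = sum(1 for d in differences if abs(d) <= tolerance)
share = within / len(differences)
mean_difference = sum(differences) / len(differences)

print(f"{within} of {len(differences)} differences are within {tolerance} ({share:.0%})")
print(f"average difference (model minus real): {mean_difference:.2f}")
```

A positive average difference means the model's estimates are, on balance, higher than the scores the golfers actually shot.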
Conclusion
In conclusion, my assumptions were correct. All three of my variables helped to determine the
golfers' average score. That was seen in the small P-values and the small standard errors of the
coefficients. However, the explanatory variables are not the only variables that determine the response
variable, as shown by the R-sq of 25% in my model. R-sq measures how well the model fits the data, and
it can help select the model with the best fit. It describes the proportion of variation in the observed
response values that is explained by the predictors. R-sq never decreases when predictors are added, so
the best three-predictor model will always have an R-sq at least as high as the best two-predictor
model. In my model the R-sq was only 25%, which means my variables account for only 25% of the variation
in the golfers' scoring averages and the other 75% is left unexplained. Examples of other variables I
could have included are putting, greens in regulation, sand saves, and scrambling. Other possible factors
are the size of the golfer, age, weather, difficulty of courses, swing, clubs, and even mental decisions.
I am very happy with the 25% I got from my model. My recommendation for someone interested in
continuing this research would be to focus more on putting; I believe the relationship between putts per
round and average score would be stronger than the relationship with driving. As most golfers say, drive
for show and putt for dough.
Bibliography
PGA Tour, Statistics, 15 June 2009, http://www.pgatour.com/r/stats/
Moore, David and McCabe, George and Craig, Bruce. Introduction to the Practice of Statistics;
New York, W. H. Freeman and Company, 2009