What Factors Influence a Golfer’s Score?
By: Travis Atkinson

Abstract
Have you ever wondered what would be a quick way to lower your average golf score? There are many factors to consider behind a score of 72 strokes for a round of golf. Some of them, such as the course and the weather, are outside your control, but a few, such as your clubs, swing, and mental decisions, are within it. Which do you think affects your average golf score the most?

Introduction
This project focuses on factors and variables related to a golfer’s average score. The problem is to determine what golfers should focus on most to improve their average score. Solving it matters because it tells golfers which part of their game to improve in order to shoot a low score. I decided to focus on the golfer’s first shot, the drive, because a poor drive can undermine the rest of the round. Whether the number of rounds played per year, driving accuracy, and driving distance are related to the average score is the question my project will help answer.

Potential Solution
The statistical tool I plan to use is multiple linear regression. By using it to measure the strength of the relationship between these variables and average score, I will be able to tell whether the variables are related. After assigning my response variable and entering my explanatory variables, I will see which is more important for golfers to focus on to lower their average score.

Environment
The population for my project is golfers in the Professional Golfers’ Association (PGA) from 1998 to the present. I chose a roughly ten-year sample because it is not too large and does not go back so far that equipment technology changes greatly over the period. Professionals are also the best group to study because they are paid to focus on and improve their golf game.
Their statistics are already kept by the PGA, with easy access and over 25 years of data. There were 48 golfers who played from 1998 to the present, a few of them missing years because of injury or declining performance. The data include all 48 golfers’ shots and averages over those years. The variables available were rounds, driving distance average, total distance, total drives, driving accuracy average, total fairways hit, possible fairways hit, scoring average, and total strokes.

The response variable is the player’s scoring average. Golfers can be compared by who hits it farther or straighter, but in the end it comes down to the final score and who has the lowest. The explanatory variables I will compare against the response variable are driving accuracy average, driving distance average, and number of rounds. There are many more variables that could be examined, but I chose these because I believe the first shot is important to how the rest of the round goes.

Model
The model will show whether the variables are actually related to the average score and how closely. Using the multiple linear regression equation and plugging in the response and explanatory variables, my equation is:

$$\text{Average score} = \beta_0 + \beta_1\,\text{Driving distance} + \beta_2\,\text{Driving accuracy} + \beta_3\,\text{Rounds} + \text{Error}$$

The relationship between the error term and the regressors, for example whether they are correlated, is a crucial consideration in formulating a linear regression model, as it determines the method to use for estimation. The statistical relationship between the error terms and the regressors plays an important role in determining whether an estimation procedure has desirable sampling properties, such as being unbiased and consistent. Essentially, we find the coefficients $\beta_0, \beta_1, \beta_2, \beta_3$ for which the sum of the squares of the errors is a minimum. Using this equation will help me answer my question.
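The least-squares idea described above can be sketched in a few lines of Python. This is only an illustration, not the Minitab procedure used in this project: the data are synthetic, the "true" coefficients and spreads are assumed values loosely in the range of the golf variables, and the function name fit_least_squares is hypothetical. It solves the normal equations for the coefficients that minimize the sum of squared errors.

```python
import random


def fit_least_squares(rows, y):
    """Return [b0, b1, ..., bp] minimising the sum of squared errors.

    rows: list of predictor lists (no intercept column); y: list of responses.
    Solves the normal equations (X'X) b = X'y by Gauss-Jordan elimination.
    """
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * yi for x, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Synthetic data only (assumed values, not the PGA data set).
random.seed(0)
TRUE_COEFS = [85.0, -0.04, -0.06, -0.01]  # hypothetical b0, distance, accuracy, rounds
rows, scores = [], []
for _ in range(500):
    distance = random.gauss(283, 10)  # yards
    accuracy = random.gauss(66, 6)    # percent of fairways hit
    rounds = random.gauss(81, 18)     # rounds per year
    noise = random.gauss(0, 0.6)      # everything the three predictors miss
    rows.append([distance, accuracy, rounds])
    scores.append(
        TRUE_COEFS[0]
        + TRUE_COEFS[1] * distance
        + TRUE_COEFS[2] * accuracy
        + TRUE_COEFS[3] * rounds
        + noise
    )

coefs = fit_least_squares(rows, scores)
print([round(c, 4) for c in coefs])
```

The fitted values should land near the assumed coefficients but not exactly on them, because of the noise term. A statistics package also reports the standard errors, T statistics, and P-values, which this sketch does not compute.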
Descriptive Stats
My minimums and maximums may lie far from the median because some golfers are strong, in their prime, and focus on distance, whereas others are older and smaller and focus on accuracy. There were many variables to pick from, but I believe these three are closely related to a golfer’s average score.

Summary for Rounds
[Figure: graphical summary for Rounds, with 95% confidence intervals for the mean and median]
Anderson-Darling Normality Test: A-Squared 7.24, P-Value < 0.005
Mean 81.138, StDev 17.676, Variance 312.445, Skewness -0.764977, Kurtosis 0.315890, N 559
Minimum 26.000, 1st Quartile 73.000, Median 83.000, 3rd Quartile 94.000, Maximum 121.000
95% Confidence Interval for Mean: 79.669 to 82.606
95% Confidence Interval for Median: 82.000 to 85.000
95% Confidence Interval for StDev: 16.697 to 18.778

The median (83) is close to the mean (81.1), but the minimum and maximum are 95 rounds apart, which is a large gap, and the graph shows that the distribution is left skewed (skewness of about -0.76). The standard deviation of about 18 rounds is somewhat high.

Summary for Drv. Acc. %
[Figure: graphical summary for Drv. Acc. %, with 95% confidence intervals for the mean and median]
Anderson-Darling Normality Test: A-Squared 0.76, P-Value 0.049
Mean 66.222, StDev 5.689, Variance 32.362, Skewness -0.330688, Kurtosis 0.241694, N 557
Minimum 46.190, 1st Quartile 62.500, Median 66.700, 3rd Quartile 70.185, Maximum 81.100
95% Confidence Interval for Mean: 65.749 to 66.696
95% Confidence Interval for Median: 66.009 to 67.156
95% Confidence Interval for StDev: 5.373 to 6.044

This graph closely resembles a bell curve, which is also reflected in how close the mean and median are. The minimum and maximum are far apart, but understandably so: an 81% driving accuracy compared with 46% is a big difference, yet two golfers can plausibly differ that much.
The standard deviation for accuracy is only about 6 percentage points, but 6% is not something that can be ignored either.

Summary for Avg. Drv. Dis.
[Figure: graphical summary for Avg. Drv. Dis., with 95% confidence intervals for the mean and median]
Anderson-Darling Normality Test: A-Squared 0.61, P-Value 0.115
Mean 282.91, StDev 10.22, Variance 104.38, Skewness 0.091037, Kurtosis 0.300973, N 559
Minimum 249.00, 1st Quartile 275.50, Median 282.60, 3rd Quartile 289.00, Maximum 316.10
95% Confidence Interval for Mean: 282.06 to 283.76
95% Confidence Interval for Median: 281.70 to 283.92
95% Confidence Interval for StDev: 9.65 to 10.85

This graph also closely resembles a bell curve, which is reflected in the mean and median being almost exactly the same. Looking at the minimum and maximum, it is understandable that someone like Tiger Woods can average 316 yards on a drive, because he is in prime shape and has an almost perfect swing, while an older and smaller golfer with a weaker swing might only drive the ball 249 yards. The standard deviation is 10 yards, which might not seem like a lot, but in some situations it is the difference between clubs on the second shot and how close to the pin that shot can finish.

Summary for Scoring Avg.
[Figure: graphical summary for Scoring Avg., with 95% confidence intervals for the mean and median]
Anderson-Darling Normality Test: A-Squared 1.07, P-Value 0.008
Mean 70.822, StDev 0.688, Variance 0.474, Skewness -0.03362, Kurtosis 1.09826, N 510
Minimum 68.170, 1st Quartile 70.407, Median 70.825, 3rd Quartile 71.240, Maximum 73.730
95% Confidence Interval for Mean: 70.762 to 70.882
95% Confidence Interval for Median: 70.764 to 70.880
95% Confidence Interval for StDev: 0.649 to 0.733

The graph resembles a bell curve, and it is easy to see that most of the data are in the middle.
This is also evident from the minimum and maximum being less than 6 strokes apart. The median and mean are very similar, and the fact that all 48 golfers’ scoring averages are so close shows that it is a very competitive sport. The standard deviation of 0.7 tells you how close all the golfers are.

Estimation
After collecting the data for all 48 PGA golfers over the ten years, I ran a regression analysis in Minitab. I entered my response and explanatory variables for the years 1998 to 2008. Although I collected data for 2009, I left that year out of the regression on purpose so that I could test my equation on it later. When I initially tested my variables I received the following output:

Regression Analysis: Scoring Avg. versus Avg. Drv. Di, Drv. Acc. %, ...

The regression equation is
Scoring Avg. = 85.9 - 0.0364 Avg. Drv. Dis. - 0.0609 Drv. Acc. % - 0.00894 Rounds

557 cases used, 31 cases contain missing values

Predictor        Coef       SE Coef   T       P
Constant         85.917     1.269     67.70   0.000
Avg. Drv. Dis.   -0.036449  0.003418  -10.66  0.000
Drv. Acc. %      -0.060855  0.006318  -9.63   0.000
Rounds           -0.008943  0.001553  -5.76   0.000

S = 0.624756   R-Sq = 25.0%   R-Sq(adj) = 24.6%

Analysis of Variance
Source          DF   SS       MS      F      P
Regression      3    71.845   23.948  61.36  0.000
Residual Error  553  215.847  0.390
Total           556  287.692

Source          DF  Seq SS
Avg. Drv. Dis.  1   8.951
Drv. Acc. %     1   49.960
Rounds          1   12.935

As you can see, my P-values all fall below the 0.05 level I was looking for. My coefficient of determination was 25%, suggesting that 75% of the variation in golfers’ scoring averages comes from other variables. Because the test worked out but with a low coefficient of determination, I was curious to see whether the equation could predict the 2009 scoring averages.

[Figure: Histogram of residuals (response is Scoring Avg.); x-axis: Residual, y-axis: Frequency]

The histogram shows how close the residuals are to a natural bell curve.
This is a good sign, because regression assumes the errors are roughly normally distributed. Since the frequencies peak around a residual of 0, that bell-shaped assumption for the residuals appears to be met here.

[Figure: Normal Probability Plot (response is Scoring Avg.); x-axis: Residual, y-axis: Percent]

This Normal Probability Plot shows that the distribution of the error term is close enough to the bell shape. The graph indicates a good model in that the points follow a straight line.

[Figure: Versus Fits (response is Scoring Avg.); x-axis: Fitted Value, y-axis: Residual]

This graph plots the error term (residual) against the fitted value. It is used to look for randomness in the regression results. There is a fair amount of randomness in this graph, which is what a sound model should produce. The number of points scattered evenly around and along the zero horizontal line suggests a constant variance. It also suggests there is no obvious problem with the form of the equation: the residuals show no pattern that a different equation in these variables would remove.

Validation
The chart below shows that the model fitted to the 1998 to 2008 data can predict the golfers’ 2009 scoring averages from the explanatory variables. It shows that my model and equation were able to predict the golfers’ scores to within a small margin, similar to the standard deviation of the scoring average, though without considering many outside factors. The standard deviation of the scoring average was 0.7, and most of the differences between the actual score and the model’s score are within 0.7, so predicting 2009 with an error allowance of one standard deviation (0.7) appears reasonable.
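To make the "within .7" claim concrete, the short Python sketch below counts how many of the 2009 differences fall within 0.7 strokes. The values are copied from the Difference column of the 2009 chart (model estimate minus real scoring average); the variable names are illustrative only.

```python
# Differences (model estimate minus real 2009 scoring average),
# copied in order from the Difference column of the 2009 chart.
differences = [
    0.85, -1.28, -0.68, 0.18, 0.47, 0.24, 0.32, 0.79,
    -0.05, 0.88, 0.16, -0.32, 0.70, 1.11, -1.55, 0.22,
    1.29, 1.27, 0.61, 0.12, 0.34, -0.08, 0.74, -0.09,
    0.14, 1.49, -0.65, 0.26, 1.25, 0.50, -0.47, 0.21,
    -1.74, -0.07, -0.20, 1.02, 0.05, 1.16, 0.02, 1.13,
    0.48, -0.19, 0.32, -0.55, 1.19, 0.83, 1.04, 1.84,
]

threshold = 0.7  # standard deviation of the scoring average, rounded as in the text

# Count predictions whose absolute error is no larger than the threshold.
within = sum(1 for d in differences if abs(d) <= threshold)
share = within / len(differences)

print(f"{within} of {len(differences)} differences are within {threshold} ({share:.0%})")
```

On these 48 values, 29 (about 60%) fall within 0.7 strokes. That is consistent with "most", but it also means roughly four in ten predictions miss by more than one standard deviation.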
2009

Name (in the order listed): Stephen Ames, Billy Andrade, Stuart Appleby, Tommy Armour III, Mark Brooks, Mark Calcavecchia, Stewart Cink, Fred Couples, John Daly, Glen Day, Chris DiMarco, Joe Durant, David Duval, Ernie Els, Bob Estes, Brad Faxon, Steve Flesch, Harrison Frazar, Jim Furyk, Paul Goydos, Dudley Hart, Tim Herron, John Huston, Lee Janzen, Kent Jones, Jonathan Kaye, Jerry Kelly, Greg Kraft, Tom Lehman, Justin Leonard, Davis Love III, Steve Lowery, Jeff Maggert, Billy Mayfair, Scott McCarron, Rocco Mediate, Phil Mickelson, Jesper Parnevik, Corey Pavin, Tom Pernice, Jr., Kenny Perry, Brett Quigley, Vijay Singh, Kevin Sutherland, Kirk Triplett, David Toms, Scott Verplank, Mike Weir, Tiger Woods

Rounds  Drv. Acc. %  Avg. Drv. Dis.  Real Scoring Avg.  Model Est. Scoring Avg.  Difference
43      66.37        288.1           70.29              71.14                    0.85
30      65.66        282             72.83              71.55                    -1.28
43      56.34        286.7           72.45              71.77                    -0.68
45      66.83        285.7           71.00              71.18                    0.18
32      70.36        264.8           71.38              71.85                    0.47
40      57.19        288.8           71.45              71.69                    0.24
47      57.44        292.2           71.15              71.47                    0.32
32      53.03        299.1           70.88              71.67                    0.79
35      68.21        292.2           71.04              70.99                    -0.05
62      66.17        277.5           70.42              71.30                    0.88
38      72.64        277.3           71.05              71.21                    0.16
34      46.19        292.4           72.59              72.27                    -0.32
42      65.49        290.7           70.42              71.12                    0.70
53      68.11        281.5           70.04              71.15                    1.11
35      56.58        267.2           74.09              72.54                    -1.55
45      62.66        282.6           71.31              71.53                    0.22
52      47.98        300.6           70.38              71.67                    1.29
42      69.69        275.1           70.15              71.42                    1.27
41      70.17        274             70.83              71.44                    0.61
38      63           273.5           71.79              71.91                    0.12
51      55.65        289.7           71.27              71.61                    0.34
39      58.31        290             71.67              71.59                    -0.08
45      60.7         281.7           70.93              71.67                    0.74
41      64.71        277.9           71.71              71.62                    -0.09
30      60.48        288.1           71.50              71.64                    0.14
48      60.22        281.7           70.17              71.66                    1.49
37      61.12        270.6           72.78              72.13                    -0.65
33      69.83        269.4           71.45              71.71                    0.26
53      68.6         284.3           69.78              71.03                    1.25
49      58.7         297.8           70.68              71.18                    0.50
50      62.12        288             71.78              71.31                    -0.47
44      66.01        287.1           70.98              71.19                    0.21
42      66.9         280.1           73.14              71.40                    -1.74
49      71.58        284             70.98              70.91                    -0.07
47      63.55        280.7           71.72              71.52                    -0.20
38      51.12        298.4           70.71              71.73                    1.02
30      62.07        285.2           71.60              71.65                    0.05
41      68.56        257.9           70.93              72.09                    1.16
50      61.76        279.8           71.60              71.62                    0.02
52      64.52        292.6           69.86              70.99                    1.13
55      57.98        286.9           71.05              71.53                    0.48
41      61.38        294.8           71.41              71.22                    -0.19
55      62.75        289.9           70.83              71.15                    0.32
38      69.33        284.5           71.71              71.16                    -0.55
52      74.74        286.3           69.42              70.61                    1.19
52      71.68        281.7           70.12              70.95                    0.83
45      64.6         280.9           70.43              71.47                    1.04
26      61.9         293.4           69.58              71.42                    1.84

Conclusion
In conclusion, my assumptions were correct. All three of my variables helped to determine a golfer’s average score. That was seen in the small P-values and the small standard errors of the coefficients. However, the explanatory variables are not the only variables that decide the response variable. That is shown by the R-sq of 25% in my model. R-sq measures how well the model fits the data, and this value can help select the model with the best fit. R-sq describes the amount of variation in the observed response values that is explained by the predictors. R-sq never decreases when predictors are added, so the best three-predictor model will always have an R-sq at least as high as the best two-predictor model. In my model the R-sq was only 25%, which means my variables account for only 25% of the variation in golfers’ scoring averages. That leaves 75% of the variation unexplained. Examples of other variables I could have included are putting, greens in regulation, sand saves, and scrambling.
Other possible influences include the size of the golfer, age, weather, difficulty of the course, swing, clubs, and even mental decisions. I am very happy with the 25% I got from my model. My recommendation for someone interested in continuing this research would be to focus more on putting. I believe the relationship between putts per round and average score would be stronger than the relationship with the drives. As most golfers say, drive for show and putt for dough.

Bibliography
PGA Tour. Statistics. 15 June 2009. http://www.pgatour.com/r/stats/
Moore, David, George McCabe, and Bruce Craig. Introduction to the Practice of Statistics. New York: W. H. Freeman and Company, 2009.