A Quantitative Analysis of Success Factors in the Association of Tennis Professionals (ATP)
www.talksport.co.uk
Nick Korach
UP-STAT 2013

Overview
I. Introduction
II. Research Objective
III. Research Process
   A. Data Collection
   B. Supervised Learning Techniques
   C. Unsupervised Learning Techniques
IV. Results
V. Conclusions
VI. Extensions

Introduction
Why Choose Tennis?
1. In the ever-growing field of sports statistics, very little research has been done on tennis.
2. It is one of my favorite sports.

Research Objectives
• To discover which factors are most important in determining the success of male singles players, measured by ATP Singles Points.
• To reduce the dimensionality of the predictor variables in order to identify new, significant underlying variables.

Data Collection
ATP Singles Data
• Five years: 2008, 2009, 2010, 2011, 2012
• Top 100 ranked male singles players
www.faconnable.com

Data Collection
Cumulative season “Match Stats”
• 1 response (Y) variable
• 10 offense/serving predictor variables (Xi)
• 7 defense/return predictor variables (Xi)
• 1 additional predictor variable (Xi)
www.atpworldtour.com

Response (Y) Variable
1. ATP Singles Points
• Each ATP tournament is worth a certain number of ATP Singles Points.
• Generally 250, 500, 1000, or 2000 (Grand Slam).
• Points depend on how far a player advances in a tournament.
• The rankings period is the past 52 weeks.

Current ATP Rankings
Rank  Name                    Nationality  Points  Week Change  Tourn. Played
1     Novak Djokovic          SRB          12,370        0      19
2     Andy Murray             GBR           8,750        1      19
3     Roger Federer           SUI           8,670       -1      20
4     David Ferrer            ESP           7,050        1      26
5     Rafael Nadal            ESP           6,385       -1      20
6     Tomas Berdych           CZE           5,145        0      24
7     Juan Martin Del Potro   ARG           4,750        0      22
8     Jo-Wilfried Tsonga      FRA           3,660        0      26
9     Richard Gasquet         FRA           3,230        1      23
10    Janko Tipsarevic        SRB           3,000       -1      29

Predictor Variables - Serving
• Number of Aces
• Number of Double Faults
• 1st Serve Percentage
• Win Percentage of 1st Serve Points
• Win Percentage of 2nd Serve Points
• Number of Break Points Faced
• Percentage of Break Points Saved
• Service Games Played
• Win Percentage of Service Games
• Win Percentage of Service Points
www.bleacherreport.com

Predictor Variables - Returning
• Win Percentage of 1st Serve Return Points
• Win Percentage of 2nd Serve Return Points
• Number of Break Point Opportunities
• Percentage of Break Points Converted
• Return Games Played
• Win Percentage of Return Games
• Win Percentage of Return Points

Predictor Variables - Other
• Win Percentage of Total Points

Data Mining Techniques
1. Supervised Learning Techniques
   • Both the response variable (Y) and the explanatory variables (Xi) are used.
   • Multiple Linear Regression
2. Unsupervised Learning Techniques
   • Only explanatory variables (Xi) are used.
   • Cluster Analysis
   • Principal Component Analysis

Supervised Learning
• Regression Analysis: a statistical technique for finding the relationship between one or more predictor variables (Xi) and a response (Y).
Y = β0 + β1X1 + β2X2 + … + βnXn + ε

2012 – ATP Singles Points (Y)
[Figure: Histogram of ATP Singles Points; x-axis: ATP Singles Points, y-axis: Frequency]

Call:
lm(formula = ATP.Singles.Pts ~ ., data = xy)
Note: log(Y) used

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                 0.8386094  1.7168561   0.488  0.62655
Aces                        0.0001113  0.0004352   0.256  0.79883
Double.Faults              -0.0017176  0.0007569  -2.269  0.02591 *
X1st.Serve..                0.0106702  0.0119668   0.892  0.37522
W...1st.Serve.Pts           0.0516267  0.0335954   1.537  0.12826
W...2nd.Serve.Pts           0.0428350  0.0249383   1.718  0.08968 .
No..Break.Pts.Faced        -0.0039531  0.0009885  -3.999  0.00014 ***
X..Break.Pts.Saved          0.0391788  0.0085976   4.557 1.81e-05 ***
Service.Gms.Played          0.0045717  0.0031527   1.450  0.15090
Service.Gms.W..            -0.0538452  0.0223019  -2.414  0.01802 *
Service.Pts.W..             0.0095231  0.0614412   0.155  0.87721
W...1st.Serve.Return.Pts    0.0054565  0.0340298   0.160  0.87301
W...2nd.Serve.Return.Pts   -0.0067646  0.0214298  -0.316  0.75307
No..Break.Pt.Opportunities  0.0051992  0.0010445   4.978 3.56e-06 ***
X..Break.Pts.Converted      0.0309062  0.0094591   3.267  0.00159 **
Return.Gms.Played          -0.0042145  0.0031852  -1.323  0.18952
W...Return.Gms             -0.0234497  0.0247880  -0.946  0.34696
W...Return.Pts              0.0492521  0.0487705   1.010  0.31556
Total.Pts.W..              -0.0389876  0.0759609  -0.513  0.60917
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2147 on 81 degrees of freedom
Multiple R-squared: 0.9215, Adjusted R-squared: 0.9041
F-statistic: 52.83 on 18 and 81 DF, p-value: < 2.2e-16

[Figure: Pairwise Scatter Plot with Y = ATP Singles Points; panels for ATP.Singles.Pts and the 18 predictor variables]

Possible Multicollinearity?
Multicollinearity occurs when two or more predictor variables are highly correlated with one another.
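One way to check a pair of predictors for this kind of correlation is the Pearson correlation coefficient. The sketch below is only an illustration: the function name `pearson` and both lists of percentages are made up for the example and are not taken from the ATP data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical season percentages for five players (NOT ATP data).
service_pts_won = [61.0, 63.5, 65.0, 67.2, 70.1]
service_gms_won = [74.0, 79.0, 82.5, 86.0, 91.0]

r = pearson(service_pts_won, service_gms_won)
print(f"correlation = {r:.3f}")
```

A value of r close to 1 or -1 for two predictors suggests that they carry largely the same information, which is the situation described above.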
Two best examples:
• Win Percentage of Service Points and Win Percentage of Service Games
• Win Percentage of Return Points and Win Percentage of Return Games

Reduced Model
Obtained by stepwise regression using the Bayesian Information Criterion

Call:
lm(formula = ATP.Singles.Pts ~ Double.Faults + W...1st.Serve.Pts + W...2nd.Serve.Pts + No..Break.Pts.Faced + X..Break.Pts.Saved + Service.Gms.Played + Service.Gms.W.. + No..Break.Pt.Opportunities + X..Break.Pts.Converted, data = xy)

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                 2.5973047  0.8292794   3.132 0.002343 **
Double.Faults              -0.0018275  0.0007234  -2.526 0.013273 *
W...1st.Serve.Pts           0.0334651  0.0139064   2.406 0.018152 *
W...2nd.Serve.Pts           0.0350553  0.0134740   2.602 0.010845 *
No..Break.Pts.Faced        -0.0044682  0.0007290  -6.129 2.30e-08 ***
X..Break.Pts.Saved          0.0390309  0.0073757   5.292 8.46e-07 ***
Service.Gms.Played          0.0010213  0.0004333   2.357 0.020581 *
Service.Gms.W..            -0.0455592  0.0146466  -3.111 0.002501 **
No..Break.Pt.Opportunities  0.0047017  0.0004306  10.919  < 2e-16 ***
X..Break.Pts.Converted      0.0234435  0.0058110   4.034 0.000115 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2113 on 90 degrees of freedom
Multiple R-squared: 0.9155, Adjusted R-squared: 0.9071
F-statistic: 108.3 on 9 and 90 DF, p-value: < 2.2e-16

Unsupervised Learning
• Cluster Analysis: the process of organizing objects into groups whose elements are similar in some way.
• Principal Component Analysis: the process of reducing the number of predictor variables into “components” to discover new underlying variables.
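As a rough sketch of how a PCA summary turns component standard deviations into "proportion of variance" figures, the snippet below squares each standard deviation and divides by the total. The standard deviations are copied from the 2012 PCA output in this deck; the variable names and the helper `first_component_reaching` are assumptions made for illustration.

```python
# Standard deviations of the 18 components, copied from the 2012 PCA output.
sdev = [
    2.8192100, 2.2102469, 1.19581683, 1.05021129, 0.86589111,
    0.70453764, 0.61115543, 0.57260441, 0.47311020, 0.37507924,
    0.212694821, 0.157319153, 0.152737548, 0.1324079805,
    0.1281058505, 0.0882134002, 0.0803585561, 0.0199374495,
]

# Each component's variance is its squared standard deviation.
variances = [s ** 2 for s in sdev]
total = sum(variances)
proportion = [v / total for v in variances]

# Running sum of the proportions gives the cumulative proportion.
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)


def first_component_reaching(threshold):
    """1-based index of the first component whose cumulative share >= threshold."""
    for i, c in enumerate(cumulative, start=1):
        if c >= threshold:
            return i
    return len(cumulative)


for t in (0.90, 0.95, 0.99):
    print(f"{t:.0%} of variance reached at component {first_component_reaching(t)}")
```

The same cumulative shares are what the 90%, 95% and 99% rows of the PCA results table summarize for each year.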
2012 Data – Cluster Dendrogram
[Figure: Cluster Dendrogram; leaves labeled by Player Singles Ranking; hclust (*, "ward")]

2012 Data – PCA
Importance of components:
                          Comp.1    Comp.2     Comp.3     Comp.4     Comp.5
Standard deviation     2.8192100 2.2102469 1.19581683 1.05021129 0.86589111
Proportion of Variance 0.4460126 0.2741409 0.08024567 0.06189359 0.04207449
Cumulative Proportion  0.4460126 0.7201536 0.80039923 0.86229282 0.90436731
                           Comp.6     Comp.7     Comp.8     Comp.9    Comp.10
Standard deviation     0.70453764 0.61115543 0.57260441 0.47311020 0.37507924
Proportion of Variance 0.02785484 0.02096021 0.01839932 0.01256079 0.00789475
Cumulative Proportion  0.93222215 0.95318236 0.97158168 0.98414247 0.99203722
                           Comp.11     Comp.12     Comp.13      Comp.14
Standard deviation     0.212694821 0.157319153 0.152737548 0.1324079805
Proportion of Variance 0.002538669 0.001388851 0.001309133 0.0009838313
Cumulative Proportion  0.994575889 0.995964739 0.997273873 0.9982577041
                            Comp.15      Comp.16      Comp.17      Comp.18
Standard deviation     0.1281058505 0.0882134002 0.0803585561 0.0199374495
Proportion of Variance 0.0009209376 0.0004366781 0.0003623736 0.0000223065
Cumulative Proportion  0.9991786418 0.9996153199 0.9999776935 1.0000000000

2012 Data – PCA Scree Plot
[Figure: scree plot; x-axis: Component Number, y-axis: Variance]

[Figure: numbered points plotted on Comp.1 (x-axis) against Comp.2 (y-axis)]

Principal Components
                            Comp.1  Comp.2  Comp.3  Comp.4  Comp.5  Comp.6  Comp.7
Aces                        -0.273   0.000   0.000   0.000   0.000   0.000   0.000
Double.Faults                0.000   0.000   0.479   0.000   0.000   0.000   0.000
X1st.Serve..                 0.000   0.000   0.000  -0.858   0.000   0.000   0.000
W...1st.Serve.Pts            0.000   0.000   0.000   0.301   0.000   0.000   0.000
W...2nd.Serve.Pts            0.000   0.000   0.000   0.000   0.000  -0.511   0.000
No..Break.Pts.Faced          0.000   0.000   0.362   0.000   0.000   0.000   0.000
X..Break.Pts.Saved           0.000   0.000   0.000   0.000   0.000   0.575   0.000
Service.Gms.Played          -0.334   0.000   0.000   0.000   0.000   0.000   0.000
Service.Gms.W..              0.000   0.268   0.000   0.000   0.000   0.000   0.000
Service.Pts.W..              0.000   0.275   0.000   0.000   0.000   0.000   0.000
W...1st.Serve.Return.Pts     0.000   0.000   0.000   0.000   0.000   0.000   0.601
W...2nd.Serve.Return.Pts     0.000   0.000   0.000   0.000   0.000   0.000  -0.739
No..Break.Pt.Opportunities  -0.320   0.000   0.000   0.000   0.000   0.000   0.000
X..Break.Pts.Converted       0.000   0.000   0.000   0.000   0.841   0.000   0.000
Return.Gms.Played           -0.333   0.000   0.000   0.000   0.000   0.000   0.000
W...Return.Gms               0.000  -0.389   0.000   0.000   0.000   0.000   0.000
W...Return.Pts               0.000  -0.392   0.000   0.000   0.000   0.000   0.000
Total.Pts.W..               -0.320   0.000   0.000   0.000   0.000   0.000   0.000

Results
Stepwise Regression Results
Columns: Predictor Variable, 2008, 2009, 2010, 2011, 2012
[X marks are listed per row in source order; their year columns are not recoverable]
Number of Aces:
Number of Double Faults: X
1st Serve Percentage: X
Win Percentage of 1st Serve Points: X X X X
Win Percentage of 2nd Serve Points: X
Number of Break Points Faced: X X X X X
Percentage of Break Points Saved: X X X X X
Service Games Played: X X X X
Win Percentage of Service Games: X X X
Win Percentage of Service Points: X
Win Percentage of 1st Serve Return Points:
Win Percentage of 2nd Serve Return Points:
Number of Break Point Opportunities: X X X
Percentage of Break Points Converted: X
Return Games Played: X
Win Percentage of Return Games:
Win Percentage of Return Points:
Win Percentage of Total Points: X X X X X X X

Results
PCA - Percent of Variation Explained
Year                         2008          2009          2010          2011          2012
Scree Plot Elbow             Component 5   Component 5   Component 5   Component 5   Component 5
90% of Variation Explained   Component 5   Component 5   Component 5   Component 5   Component 5
95%                          Component 8   Component 7   Component 7   Component 7   Component 7
99%                          Component 11  Component 10  Component 10  Component 10  Component 10

Results
PCA - Components
Predictor Variable                          2008  2009  2010  2011  2012
Number of Aces                                 2     6     1     1     1
Number of Double Faults                        3     3     3     3     3
1st Serve Percentage                           4     4     4     4     4
Win Percentage of 1st Serve Points             4     2     4     4     4
Win Percentage of 2nd Serve Points             6     3     6     7     6
Number of Break Points Faced                   3     3     3     4     3
Percentage of Break Points Saved               5     6     5     6     6
Service Games Played                           1     1     1     1     1
Win Percentage of Service Games                2     2     1     2     2
Win Percentage of Service Points               2     2     1     2     2
Win Percentage of 1st Serve Return Points      7     7     5     7     7
Win Percentage of 2nd Serve Return Points      7     5     7     2     7
Number of Break Point Opportunities            1     1     1     1     1
Percentage of Break Points Converted           5     7     5     5     5
Return Games Played                            1     1     1     1     1
Win Percentage of Return Games                 2     2     2     2     2
Win Percentage of Return Points                2     2     2     2     2
Win Percentage of Total Points                 1     1     1     1     1

New Underlying Variables
• Component 1 – “Physical”: Service Games Played, Return Games Played, No. of Break Point Opportunities, Win % of Total Points
• Component 2 – “Technical”: Win % of Service Games, Win % of Service Points, No. of Aces, Win % of Return Games, Win % of Return Points
• Component 3 – “Tactical”: No. of Double Faults, No. of Break Points Faced
• Component 4 – “Mechanical”: 1st Serve Percentage, Win % of 1st Serve Points
• Component 5 – “Psychological/Mental”: % of Break Points Saved, % of Break Points Converted

Conclusions
• The factors that are most important in determining the success of a male tennis player almost all deal with break of service.
• Reducing the dimensionality of the data allows us to identify new underlying variables: Physical, Technical, Tactical, Mechanical, Mental.

Extensions
• Perform regression analysis on additional response variables such as win percentage or prize money won.
• Decompose the data by studying variables that are not accumulated over a season, that is, match by match.
nbcsports.msnbc.com

Questions?
“Probability Formulas and Statistical Analysis in Tennis”, Journal of Quantitative Analysis of Sports
www.atpworldtour.com
http://statracket.net
www.stevegtennis.com
www.inc-anto.net

Special Thanks to Dr. Ernest Fokoue