Give Me A Break

A Quantitative Analysis of Success Factors in the
Association of Tennis Professionals (ATP)
www.talksport.co.uk
Nick Korach
UP-STAT 2013
Overview
I. Introduction
II. Research Objective
III. Research Process
A. Data Collection
B. Supervised Learning Techniques
C. Unsupervised Learning Techniques
IV. Results
V. Conclusions
VI. Extensions
Introduction
Why Choose Tennis?
1. Sports statistics is an ever-growing field, yet very little
research has been done on tennis.
2. Tennis is one of my favorite sports.
Research Objectives
• To discover which factors are most important in determining the
success of male singles players, measured by ATP Singles Points.
• To reduce the dimensionality of the predictor variables in order
to identify new, significant underlying variables.
Data Collection
• ATP Singles Data
  - Five Years: 2008, 2009, 2010, 2011, 2012
  - Top 100 Ranked Male Singles Players
www.faconnable.com
Data Collection
• Cumulative Season “Match Stats”
  - 1 Response (Y) Variable
  - 10 Offense/Serving Predictor Variables (Xi)
  - 7 Defense/Return Predictor Variables (Xi)
  - 1 Additional Predictor Variable (Xi)
• www.atpworldtour.com
Response (Y) Variable
1. ATP Singles Points
• Each ATP tournament is worth a certain number of ATP Singles Points.
• Generally 250, 500, 1000, or 2000 (Grand Slam).
• Points depend on how far a player advances in a tournament.
• The rankings period is the past 52 weeks.
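The 52-week rule can be pictured as a rolling sum. The snippet below is a minimal Python sketch, not the ATP's actual procedure: the function name `ranking_points`, the dates, and the point values are hypothetical, and any further ranking rules are ignored.

```python
from datetime import date, timedelta

def ranking_points(results, as_of):
    """Sum the points earned in the 52 weeks ending on `as_of`.

    `results` is a list of (date_played, points_earned) pairs.
    """
    window_start = as_of - timedelta(weeks=52)
    return sum(points for played_on, points in results
               if window_start < played_on <= as_of)

# Hypothetical results for one player (illustrative, not real data).
results = [
    (date(2012, 3, 1), 250),    # more than 52 weeks before the ranking date
    (date(2012, 6, 10), 1000),
    (date(2013, 1, 27), 2000),
]
total = ranking_points(results, as_of=date(2013, 4, 1))
```

Only the two results inside the window contribute to `total`; the oldest one has dropped out.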
Current ATP Rankings
Rank  Name                   Nationality  Points  Week Change  Tourn. Played
1     Novak Djokovic         SRB          12,370        0            19
2     Andy Murray            GBR           8,750        1            19
3     Roger Federer          SUI           8,670       -1            20
4     David Ferrer           ESP           7,050        1            26
5     Rafael Nadal           ESP           6,385       -1            20
6     Tomas Berdych          CZE           5,145        0            24
7     Juan Martin Del Potro  ARG           4,750        0            22
8     Jo-Wilfried Tsonga     FRA           3,660        0            26
9     Richard Gasquet        FRA           3,230        1            23
10    Janko Tipsarevic       SRB           3,000       -1            29
Predictor Variables - Serving
• Number of Aces
• Number of Double Faults
• 1st Serve Percentage
• Win Percentage of 1st Serve Points
• Win Percentage of 2nd Serve Points
• Number of Break Points Faced
• Percentage of Break Points Saved
• Service Games Played
• Win Percentage of Service Games
• Win Percentage of Service Points
www.bleacherreport.com
Predictor Variables - Returning
• Win Percentage of 1st Serve Return Points
• Win Percentage of 2nd Serve Return Points
• Number of Break Point Opportunities
• Percentage of Break Points Converted
• Return Games Played
• Win Percentage of Return Games
• Win Percentage of Return Points
www.bleacherreport.com
Predictor Variables - Other
• Win Percentage of Total Points
www.bleacherreport.com
Data Mining Techniques
1. Supervised Learning Techniques
• Both the response variable (Y) and the explanatory variables (Xi) are used.
  - Multiple Linear Regression
2. Unsupervised Learning Techniques
• Only explanatory variables (Xi) are used.
  - Cluster Analysis
  - Principal Component Analysis
Supervised Learning
• Regression Analysis: a statistical technique for
finding the relationship between one or more
predictor variables (Xi) and a response (Y).
Y = β0 + β1X1 + β2X2 + … + βnXn + ε
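To make the model equation concrete, here is a minimal Python sketch of ordinary least squares solved through the normal equations, fitted to log(Y) as the deck's regression output notes. It is an illustration under stated assumptions, not the author's code (the deck uses R's `lm`): the helper names `solve` and `fit_ols` and the toy data are hypothetical, and the toy response is generated without noise so the fit recovers the chosen coefficients.

```python
import math

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_ols(rows, y):
    """Return [b0, b1, ..., bn] minimizing the squared error of y on the predictors."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept column
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)

# Toy data (hypothetical): Y = exp(1.0 + 0.5*X1 - 0.2*X2), with no noise term.
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
points = [math.exp(1.0 + 0.5 * x1 - 0.2 * x2) for x1, x2 in rows]

# Regress log(Y) on the predictors.
beta = fit_ols(rows, [math.log(v) for v in points])
```

With a logged response, each coefficient describes an approximate proportional change in Y per unit change in its predictor, which suits a strongly right-skewed response.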
2012 – ATP Singles Points (Y)
[Figure: Histogram of ATP Singles Points. x-axis: ATP Singles Points (0 to 14000); y-axis: Frequency (0 to 80)]
Call:
lm(formula = ATP.Singles.Pts ~ ., data = xy)

Note: log(Y) used

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                 0.8386094  1.7168561   0.488  0.62655
Aces                        0.0001113  0.0004352   0.256  0.79883
Double.Faults              -0.0017176  0.0007569  -2.269  0.02591 *
X1st.Serve..                0.0106702  0.0119668   0.892  0.37522
W...1st.Serve.Pts           0.0516267  0.0335954   1.537  0.12826
W...2nd.Serve.Pts           0.0428350  0.0249383   1.718  0.08968 .
No..Break.Pts.Faced        -0.0039531  0.0009885  -3.999  0.00014 ***
X..Break.Pts.Saved          0.0391788  0.0085976   4.557 1.81e-05 ***
Service.Gms.Played          0.0045717  0.0031527   1.450  0.15090
Service.Gms.W..            -0.0538452  0.0223019  -2.414  0.01802 *
Service.Pts.W..             0.0095231  0.0614412   0.155  0.87721
W...1st.Serve.Return.Pts    0.0054565  0.0340298   0.160  0.87301
W...2nd.Serve.Return.Pts   -0.0067646  0.0214298  -0.316  0.75307
No..Break.Pt.Opportunities  0.0051992  0.0010445   4.978 3.56e-06 ***
X..Break.Pts.Converted      0.0309062  0.0094591   3.267  0.00159 **
Return.Gms.Played          -0.0042145  0.0031852  -1.323  0.18952
W...Return.Gms             -0.0234497  0.0247880  -0.946  0.34696
W...Return.Pts              0.0492521  0.0487705   1.010  0.31556
Total.Pts.W..              -0.0389876  0.0759609  -0.513  0.60917
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2147 on 81 degrees of freedom
Multiple R-squared: 0.9215, Adjusted R-squared: 0.9041
F-statistic: 52.83 on 18 and 81 DF, p-value: < 2.2e-16
Pairwise Scatter Plot with Y = ATP Singles Points
[Figure: pairwise scatter-plot matrix of ATP.Singles.Pts and the 18 predictor variables]
• Possible multicollinearity?
  - Multicollinearity occurs when two or more predictor variables are
    highly correlated with one another.
  - Two best examples:
    Win Percentage of Service Points and Win Percentage of Service Games
    Win Percentage of Return Points and Win Percentage of Return Games
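A quick way to screen a pair of predictors for this problem is their Pearson correlation. The sketch below is illustrative only: the function name `pearson` and the two short series are hypothetical values, not taken from the ATP data. With exactly two predictors, the variance inflation factor reduces to 1 / (1 - r²), which is the simplified case shown here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical percentages for five players (not real data).
service_pts_won = [60.0, 62.0, 64.0, 66.0, 68.0]
service_gms_won = [72.0, 76.0, 81.0, 85.0, 88.0]

r = pearson(service_pts_won, service_gms_won)

# Variance inflation factor for the two-predictor case only.
vif = 1.0 / (1.0 - r * r)
```

A correlation close to 1 (and hence a large inflation factor) suggests the two variables carry nearly the same information, which makes their individual coefficients unstable.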
Reduced Model Using Stepwise Regression
using the Bayesian Information Criterion
Call:
lm(formula = ATP.Singles.Pts ~ Double.Faults + W...1st.Serve.Pts +
    W...2nd.Serve.Pts + No..Break.Pts.Faced + X..Break.Pts.Saved +
    Service.Gms.Played + Service.Gms.W.. + No..Break.Pt.Opportunities +
    X..Break.Pts.Converted, data = xy)

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                 2.5973047  0.8292794   3.132 0.002343 **
Double.Faults              -0.0018275  0.0007234  -2.526 0.013273 *
W...1st.Serve.Pts           0.0334651  0.0139064   2.406 0.018152 *
W...2nd.Serve.Pts           0.0350553  0.0134740   2.602 0.010845 *
No..Break.Pts.Faced        -0.0044682  0.0007290  -6.129 2.30e-08 ***
X..Break.Pts.Saved          0.0390309  0.0073757   5.292 8.46e-07 ***
Service.Gms.Played          0.0010213  0.0004333   2.357 0.020581 *
Service.Gms.W..            -0.0455592  0.0146466  -3.111 0.002501 **
No..Break.Pt.Opportunities  0.0047017  0.0004306  10.919  < 2e-16 ***
X..Break.Pts.Converted      0.0234435  0.0058110   4.034 0.000115 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2113 on 90 degrees of freedom
Multiple R-squared: 0.9155, Adjusted R-squared: 0.9071
F-statistic: 108.3 on 9 and 90 DF, p-value: < 2.2e-16
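The comparison behind the reduced model can be checked from the two summaries alone. The sketch below uses one common form of the Bayesian Information Criterion for a Gaussian linear model, n·ln(RSS/n) + k·ln(n), which drops an additive constant shared by both models. This is an assumption about the form of the criterion, not a record of how the stepwise search was run; the function name `bic_from_summary` is hypothetical. The residual standard errors and degrees of freedom are the ones reported in the full and reduced model summaries.

```python
import math

def bic_from_summary(sigma, resid_df, n_obs, n_coef):
    """BIC (up to a constant) from a residual standard error and its degrees of freedom."""
    rss = sigma ** 2 * resid_df  # residual sum of squares
    return n_obs * math.log(rss / n_obs) + n_coef * math.log(n_obs)

n_obs = 100  # 81 residual df + 19 coefficients in the full model

# Full model: 18 predictors plus intercept.
bic_full = bic_from_summary(sigma=0.2147, resid_df=81, n_obs=n_obs, n_coef=19)

# Reduced model: 9 predictors plus intercept.
bic_reduced = bic_from_summary(sigma=0.2113, resid_df=90, n_obs=n_obs, n_coef=10)

prefers_reduced = bic_reduced < bic_full
```

Lower is better: the reduced model gives up very little fit while saving nine coefficients, so its criterion value is smaller.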
Unsupervised Learning
• Cluster Analysis: the process of organizing objects into groups
whose elements are similar in some way.
• Principal Component Analysis: the process of reducing the number
of predictor variables into “components” to discover new
underlying variables.
2012 Data – Cluster Dendrogram
[Figure: cluster dendrogram of the 100 players; leaves labeled by player singles ranking; method: hclust (*, "ward")]
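For readers unfamiliar with Ward-style agglomerative clustering, the sketch below shows the idea in plain Python: start with every observation in its own cluster and repeatedly merge the pair whose union increases the within-cluster sum of squares the least. It is a naive illustration under stated assumptions, and it is not guaranteed to reproduce the output of R's `hclust`; the function name `ward_clusters` and the toy points are hypothetical.

```python
def ward_clusters(points, k):
    """Greedily merge clusters until k remain, using Ward's merge cost."""
    clusters = [[i] for i in range(len(points))]
    dim = len(points[0])

    def centroid(members):
        return [sum(points[i][d] for i in members) / len(members) for d in range(dim)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares if a and b are merged.
        ca, cb = centroid(a), centroid(b)
        dist2 = sum((u - v) ** 2 for u, v in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * dist2

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Toy data (hypothetical): two well-separated groups in two dimensions.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
groups = ward_clusters(points, k=2)
```

Cutting a dendrogram at a chosen height corresponds to stopping the merging at a chosen number of clusters, which is what the `k` argument does here.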
2012 Data – PCA
Importance of components:
                          Comp.1    Comp.2     Comp.3     Comp.4     Comp.5
Standard deviation     2.8192100 2.2102469 1.19581683 1.05021129 0.86589111
Proportion of Variance 0.4460126 0.2741409 0.08024567 0.06189359 0.04207449
Cumulative Proportion  0.4460126 0.7201536 0.80039923 0.86229282 0.90436731
                           Comp.6     Comp.7     Comp.8     Comp.9    Comp.10
Standard deviation     0.70453764 0.61115543 0.57260441 0.47311020 0.37507924
Proportion of Variance 0.02785484 0.02096021 0.01839932 0.01256079 0.00789475
Cumulative Proportion  0.93222215 0.95318236 0.97158168 0.98414247 0.99203722
                           Comp.11     Comp.12     Comp.13      Comp.14
Standard deviation     0.212694821 0.157319153 0.152737548 0.1324079805
Proportion of Variance 0.002538669 0.001388851 0.001309133 0.0009838313
Cumulative Proportion  0.994575889 0.995964739 0.997273873 0.9982577041
                            Comp.15      Comp.16      Comp.17      Comp.18
Standard deviation     0.1281058505 0.0882134002 0.0803585561 0.0199374495
Proportion of Variance 0.0009209376 0.0004366781 0.0003623736 0.0000223065
Cumulative Proportion  0.9991786418 0.9996153199 0.9999776935 1.0000000000
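The three rows of this table are linked: each component's variance is its standard deviation squared, and the proportion of variance is that value divided by the total. The Python sketch below recomputes the proportions from the 18 reported standard deviations and finds the first component at which the cumulative proportion reaches a threshold. The helper name `components_needed` is hypothetical.

```python
# Standard deviations of the 18 components, as reported for the 2012 data.
sdev = [
    2.8192100, 2.2102469, 1.19581683, 1.05021129, 0.86589111,
    0.70453764, 0.61115543, 0.57260441, 0.47311020, 0.37507924,
    0.212694821, 0.157319153, 0.152737548, 0.1324079805,
    0.1281058505, 0.0882134002, 0.0803585561, 0.0199374495,
]

variances = [s * s for s in sdev]
total = sum(variances)
proportion = [v / total for v in variances]

# Running total of the proportions.
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)

def components_needed(threshold):
    """Smallest number of components whose cumulative proportion reaches threshold."""
    return next(i + 1 for i, c in enumerate(cumulative) if c >= threshold)

needed = {t: components_needed(t) for t in (0.90, 0.95, 0.99)}
```

For these values, 90% of the variation is reached at component 5, 95% at component 7, and 99% at component 10.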
2012 Data – PCA Scree Plot
[Figure: scree plot. x-axis: Component Number; y-axis: Variance]
[Figure: scatter plot of the 100 players on Comp.1 (x-axis) versus Comp.2 (y-axis), points labeled by number]
Principal Components
                           Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
Aces                       -0.273  0.000  0.000  0.000  0.000  0.000  0.000
Double.Faults               0.000  0.000  0.479  0.000  0.000  0.000  0.000
X1st.Serve..                0.000  0.000  0.000 -0.858  0.000  0.000  0.000
W...1st.Serve.Pts           0.000  0.000  0.000  0.301  0.000  0.000  0.000
W...2nd.Serve.Pts           0.000  0.000  0.000  0.000  0.000 -0.511  0.000
No..Break.Pts.Faced         0.000  0.000  0.362  0.000  0.000  0.000  0.000
X..Break.Pts.Saved          0.000  0.000  0.000  0.000  0.000  0.575  0.000
Service.Gms.Played         -0.334  0.000  0.000  0.000  0.000  0.000  0.000
Service.Gms.W..             0.000  0.268  0.000  0.000  0.000  0.000  0.000
Service.Pts.W..             0.000  0.275  0.000  0.000  0.000  0.000  0.000
W...1st.Serve.Return.Pts    0.000  0.000  0.000  0.000  0.000  0.000  0.601
W...2nd.Serve.Return.Pts    0.000  0.000  0.000  0.000  0.000  0.000 -0.739
No..Break.Pt.Opportunities -0.320  0.000  0.000  0.000  0.000  0.000  0.000
X..Break.Pts.Converted      0.000  0.000  0.000  0.000  0.841  0.000  0.000
Return.Gms.Played          -0.333  0.000  0.000  0.000  0.000  0.000  0.000
W...Return.Gms              0.000 -0.389  0.000  0.000  0.000  0.000  0.000
W...Return.Pts              0.000 -0.392  0.000  0.000  0.000  0.000  0.000
Total.Pts.W..              -0.320  0.000  0.000  0.000  0.000  0.000  0.000
Results
• Stepwise Regression Results
Predictor Variable
2008
2009
2010
2011
2012
Number of Aces
Number of Double Faults
X
1st Serve Percentage
X
Win Percentage of 1st Serve Points
X
X
X
X
Win Percentage of 2nd Serve Points
X
Number of Break Points Faced
X
X
X
X
X
Percentage of Break Points Saved
X
X
X
X
X
Service Games Played
X
X
X
X
Win Percentage of Service Games
X
X
X
Win Percentage of Service Points
X
Win Percentage of 1st Serve Return Points
Win Percentage of 2nd Serve Return Points
Number of Break Point Opportunities
X
X
X
Percentage of Break Points Converted
X
Return Games Played
X
Win Percentage of Return Games
Win Percentage of Return Points
Win Percentage of Total Points
X
X
X
X
X
X
X
Results
• PCA - Percent of Variation Explained
Year                        2008          2009          2010          2011          2012
Scree Plot Elbow            Component 5   Component 5   Component 5   Component 5   Component 5
90% of Variation Explained  Component 5   Component 5   Component 5   Component 5   Component 5
95%                         Component 8   Component 7   Component 7   Component 7   Component 7
99%                         Component 11  Component 10  Component 10  Component 10  Component 10
Results
• PCA - Components
Predictor Variable                         2008  2009  2010  2011  2012
Number of Aces                                2     6     1     1     1
Number of Double Faults                       3     3     3     3     3
1st Serve Percentage                          4     4     4     4     4
Win Percentage of 1st Serve Points            4     2     4     4     4
Win Percentage of 2nd Serve Points            6     3     6     7     6
Number of Break Points Faced                  3     3     3     4     3
Percentage of Break Points Saved              5     6     5     6     6
Service Games Played                          1     1     1     1     1
Win Percentage of Service Games               2     2     1     2     2
Win Percentage of Service Points              2     2     1     2     2
Win Percentage of 1st Serve Return Points     7     7     5     7     7
Win Percentage of 2nd Serve Return Points     7     5     7     2     7
Number of Break Point Opportunities           1     1     1     1     1
Percentage of Break Points Converted          5     7     5     5     5
Return Games Played                           1     1     1     1     1
Win Percentage of Return Games                2     2     2     2     2
Win Percentage of Return Points               2     2     2     2     2
Win Percentage of Total Points                1     1     1     1     1
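The component numbers in this table appear to record, for each year, the component on which a variable loads most heavily. That reading is an assumption, though it agrees with the 2012 loadings shown earlier. The Python sketch below applies the rule to a few rows of those 2012 loadings; the function name `dominant_component` is hypothetical, and loadings displayed as 0.000 are taken at face value.

```python
# Loadings on Comp.1 .. Comp.7 for a subset of variables (2012 data).
loadings = {
    "Aces":                   [-0.273, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000],
    "Double.Faults":          [0.000, 0.000, 0.479, 0.000, 0.000, 0.000, 0.000],
    "X1st.Serve..":           [0.000, 0.000, 0.000, -0.858, 0.000, 0.000, 0.000],
    "X..Break.Pts.Saved":     [0.000, 0.000, 0.000, 0.000, 0.000, 0.575, 0.000],
    "X..Break.Pts.Converted": [0.000, 0.000, 0.000, 0.000, 0.841, 0.000, 0.000],
    "W...Return.Pts":         [0.000, -0.392, 0.000, 0.000, 0.000, 0.000, 0.000],
}

def dominant_component(row):
    """1-based index of the component with the largest absolute loading."""
    return max(range(len(row)), key=lambda i: abs(row[i])) + 1

assignment = {name: dominant_component(row) for name, row in loadings.items()}
```

The sign of a loading is ignored on purpose: a large negative loading ties a variable to a component just as strongly as a large positive one.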
New Underlying Variables
• Component 1 – “Physical”
  - Service Games Played, Return Games Played,
    No. of Break Point Opportunities, Win % of Total Points
• Component 2 – “Technical”
  - Win % of Service Games, Win % of Service Points, No. of Aces,
    Win % of Return Games, Win % of Return Points
• Component 3 – “Tactical”
  - No. of Double Faults, No. of Break Points Faced
• Component 4 – “Mechanical”
  - 1st Serve Percentage, Win % of 1st Serve Points
• Component 5 – “Psychological/Mental”
  - % of Break Points Saved, % of Break Points Converted
Conclusions
• The factors most important in determining the success of male
tennis players almost all involve breaks of serve.
• Reducing the dimensionality of the data allows us to identify
new underlying variables:
Physical, Technical, Tactical, Mechanical, Mental
Extensions
• Perform regression analysis on additional response variables,
such as win percentage or prize money won.
• Decompose the data by studying variables that are not accumulated
over a season, that is, match by match.
nbcsports.msnbc.com
Questions?
“Probability Formulas and Statistical
Analysis in Tennis”
Journal of Quantitative Analysis of Sports
www.atpworldtour.com
http://statracket.net
www.stevegtennis.com
www.inc-anto.net
Special Thanks to Dr. Ernest Fokoue