TampaBayBucs

INTRODUCTION
The Tampa Bay Buccaneers football team has a history of futility when they have played
in cold temperatures. In fact, the team has lost so many times in colder weather that they
are legendary for their inability to win unless the temperature at kickoff is above a certain
degree. But perhaps there are other factors that have influenced Tampa Bay’s losses. We
want to try and predict Tampa Bay’s win/loss margin based upon four different variables:
1) Home/Away – Does Tampa Bay fare better with a home field advantage? Is it enough
to make a real difference and can it help predict their margin of victory?
2) Altitude - At high altitudes, the oxygen level per breath is reduced by about 20%
compared to sea level, so heart rate and breathing rates increase to compensate. This is
experienced as shortness of breath and early fatigue. The altitude in Tampa Bay is 15 ft.
above sea level. Is the team affected when playing in cities with altitudes much above
this level?
3) Opponent’s Prior Year Winning Percentage – The relative strength or weakness of the
opposing team may strongly influence Tampa Bay’s ability to win and its margin of
victory. We decided to use the opponent’s prior season winning percentage as the
measurement of the teams’ strength since it was the most relevant statistic we could
obtain.
4) Temperature – Finally, this is the statistic that led us to begin this project. We
want to see whether there is any truth to the articles that have been written
regarding Tampa Bay’s inability to win in colder temperatures, and whether the team’s
performance actually improves in warmer temperatures or whether some of the above
variables have had a greater influence on its success.
We obtained data from Tampa Bay’s past four full seasons (1997 – 2000), which gave us
63 observations. (Note that we removed all games that were played in domed stadiums,
because conditions inside a dome are artificial and would distort the effects of
altitude and temperature.) We gathered the data for our project from various sports and
weather websites, including official NFL team sites for prior season winning
percentages, www.wundergound.com (this site gave the average temperature for each day of
a game), and www.mit.edu/geo (this site provided altitudes for each city).
The null hypothesis for our project is that none of the above variables has any effect on
Tampa Bay’s win/loss margin. We are looking for evidence that this is not the case.
DESCRIPTIVE STATISTICS
We start by running descriptive statistics on our data:
Descriptive Statistics: Home/Away, Margin, PY winning %, Altitude, Temperature

Variable        N      Mean    Median    TrMean     StDev   SE Mean
Home/Awa       63    0.5873    1.0000    0.5965    0.4963    0.0625
Margin         63      4.21      3.00      4.42     14.26      1.80
PY winni       63    0.5323    0.5000    0.5324    0.1850    0.0233
Altitude       63     148.7      15.0     124.5     257.4      32.4
Temperat       63     63.12     66.60     64.33     17.24      2.17

Variable      Minimum   Maximum        Q1        Q3
Home/Awa       0.0000    1.0000    0.0000    1.0000
Margin         -45.00     41.00     -4.00     13.00
PY winni       0.0630    0.9380    0.4380    0.6880
Altitude          8.0     812.0      15.0      60.0
Temperat         5.60     84.70     52.30     77.40
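As a quick consistency check on the table above, the SE Mean column should equal StDev divided by the square root of N. The short Python sketch below recomputes it from the reported values; the variable names are ours and are not part of the Minitab output.

```python
import math

# N and StDev copied from the descriptive statistics table above.
n = 63
stdev = {
    "Home/Away": 0.4963,
    "Margin": 14.26,
    "PY winning %": 0.1850,
    "Altitude": 257.4,
    "Temperature": 17.24,
}

# Standard error of the mean = StDev / sqrt(N).
se_mean = {name: sd / math.sqrt(n) for name, sd in stdev.items()}

for name, se in se_mean.items():
    print(f"{name:<13} SE Mean = {se:.4f}")
```

The recomputed values agree with the reported SE Mean column to the precision shown in the table.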
One thing that jumped out at us in the above data is the large difference between the
mean altitude (148.7 ft.) and the median altitude (15.0 ft.). This can be explained by
the fact that most of our data comes from games that were played at below 50 ft.
elevation, while a few games were played at much higher elevations, such as games in
Kansas City (750 ft.) and Minnesota (812 ft.). This also indicates to us that the
altitude variable has a long right tail. A histogram of the altitude variable confirms
this.
[Figure: histogram of Altitude; x-axis: Altitude, y-axis: Frequency]
However, when we log the altitude, we do not see a significant improvement in terms of
normality.
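The persistence of the skew can be illustrated from the five-number summary of altitude reported earlier. The sketch below assumes a base-10 logarithm, which matches the 1.0 to 3.0 range of the log altitude axis but is our assumption; the variable names are ours.

```python
import math

# Five-number summary of Altitude (ft.) from the descriptive statistics.
five_number = {
    "Minimum": 8.0,
    "Q1": 15.0,
    "Median": 15.0,
    "Q3": 60.0,
    "Maximum": 812.0,
}

# Assumption: the log transform is base 10.
log_summary = {label: math.log10(value) for label, value in five_number.items()}

# Distance from the median to each extreme on the log scale.
lower_tail = log_summary["Median"] - log_summary["Minimum"]
upper_tail = log_summary["Maximum"] - log_summary["Median"]

for label, value in log_summary.items():
    print(f"{label:<8} log10 = {value:.3f}")
print(f"lower tail = {lower_tail:.3f}, upper tail = {upper_tail:.3f}")
```

Even on the log scale, the upper tail remains several times longer than the lower tail, because Q1 and the median are both 15 ft.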
[Figure: histogram of Log altitude; x-axis: Log altitude, y-axis: Frequency]
We also run a box plot of the variable that we are trying to predict, the margin, to get a
better sense of what is going on in the data.
[Figure: box plot of Margin]
We notice a few outliers in the results, which we will keep in mind as observations that
might have a significant impact on our model. After we run the model, we may have to
go back and take a closer look at these outliers.
FITTED LINE PLOTS
After running the descriptive statistics, we ran fitted line plots of margin against each
of our individual variables in order to get a better understanding of our predictors.
Regression Plot
Margin = -0.115385 + 7.35863 Home/Away
S = 13.8986
R-Sq = 6.6 %
R-Sq(adj) = 5.0 %
[Figure: fitted line plot of Margin versus Home/Away]
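Because Home/Away is a 0/1 variable, the fitted line has a direct reading: the intercept is the mean margin when Home/Away = 0, and the intercept plus the slope is the mean margin when Home/Away = 1. The sketch below checks this reading against the overall mean margin of 4.21 from the descriptive statistics. It assumes that 1 denotes a home game (the coding is not stated explicitly); the variable names are ours.

```python
# Coefficients from the fitted line: Margin = -0.115385 + 7.35863 Home/Away.
intercept = -0.115385
slope = 7.35863

# From the descriptive statistics: N = 63 and mean of Home/Away = 0.5873.
n = 63
home_share = 0.5873

# Assumption: Home/Away = 1 is a home game, 0 is an away game.
away_mean = intercept
home_mean = intercept + slope

n_home = round(home_share * n)
n_away = n - n_home

# The weighted average of the two group means should match the overall mean.
overall_mean = (n_home * home_mean + n_away * away_mean) / n

print(f"games coded 1: {n_home}, games coded 0: {n_away}")
print(f"mean margin (coded 1) = {home_mean:.2f}, (coded 0) = {away_mean:.2f}")
print(f"implied overall mean margin = {overall_mean:.2f}")
```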
Regression Plot
Margin = 4.13695 + 0.0004667 Altitude
S = 14.3775
R-Sq = 0.0 %
R-Sq(adj) = 0.0 %
[Figure: fitted line plot of Margin versus Altitude]
Regression Plot
Margin = 3.58975 + 1.15837 PY winning %
S = 14.3764
R-Sq = 0.0 %
R-Sq(adj) = 0.0 %
[Figure: fitted line plot of Margin versus PY winning %]
Regression Plot
Margin = -10.7611 + 0.239294 Temperature
S = 11.8216
R-Sq = 10.7 %
R-Sq(adj) = 9.2 %
[Figure: fitted line plot of Margin versus Temperature]
Based solely on the above graphs, temperature and home/away appear to be the only two
variables with which margin has some type of relationship. However, that does not mean we
are discounting the other variables: a relationship that is not apparent in a simple
linear regression may still emerge when the variable is included in a multiple
regression with the other variables.
Thus, our next step is to run a multiple regression with all four variables.
MULTIPLE REGRESSION
The results of our regression are as follows:
The regression equation is
Margin = - 12.0 + 10.6 Home/Away - 5.92 PY winning % + 0.0178 Altitude
+ 0.166 Temperature

Predictor        Coef     SE Coef        T        P      VIF
Constant      -11.990       8.719    -1.38    0.174
Home/Awa       10.609       4.865     2.18    0.033      1.9
PY winni       -5.916       9.651    -0.61    0.542      1.1
Altitude     0.017816    0.008791     2.03    0.047      1.7
Temperat       0.1658      0.1192     1.39    0.170      1.4

S = 13.64     R-Sq = 14.4%     R-Sq(adj) = 8.5%

Analysis of Variance
Source            DF         SS        MS       F       P
Regression         4     1821.9     455.5    2.45   0.056
Residual Error    58    10788.4     186.0
Total             62    12610.3
The F-value of 2.45 is rather low. Its p-value of .056 is just above the .05 level, so we
cannot reject the null hypothesis at the 5% level, and the low R-square of 14.4% suggests
that our model is not very useful in predicting the margin of victory. Collinearity,
however, does not appear to be a problem with our model, since all of the VIF values are
low (below 2).
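The summary figures in the regression output are linked by simple arithmetic, which the following Python sketch reproduces from the reported sums of squares, coefficients, and standard errors. Only values printed in the output are used; the variable names are ours.

```python
# Values copied from the regression output and ANOVA table above.
n = 63  # observations
p = 4   # predictors

ss_regression = 1821.9
ss_error = 10788.4
ss_total = 12610.3

# Mean squares and the overall F statistic.
ms_regression = ss_regression / p
ms_error = ss_error / (n - p - 1)
f_stat = ms_regression / ms_error

# R-square and adjusted R-square.
r_sq = ss_regression / ss_total
r_sq_adj = 1 - (ss_error / (n - p - 1)) / (ss_total / (n - 1))

# Each t statistic is the coefficient divided by its standard error.
coefficients = {
    "Constant": (-11.990, 8.719),
    "Home/Away": (10.609, 4.865),
    "PY winning %": (-5.916, 9.651),
    "Altitude": (0.017816, 0.008791),
    "Temperature": (0.1658, 0.1192),
}
t_stats = {name: coef / se for name, (coef, se) in coefficients.items()}

print(f"F = {f_stat:.2f}, R-Sq = {100 * r_sq:.1f}%, R-Sq(adj) = {100 * r_sq_adj:.1f}%")
for name, t in t_stats.items():
    print(f"{name:<13} T = {t:.2f}")
```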
When we examine the individual p-values of each variable, we see that home/away
(p = .033) and altitude (p = .047) appear to be variables that may have a relationship
with the margin of victory. The p-values for the other variables were high enough that we
feel we could take them out of the equation. However, we still believe that temperature
is related to the victory margin because of the pattern that we noticed in our earlier
fitted line plot: the larger losses seem to show up at temperatures below 45 degrees. We
decide to change the temperature variable into a 0/1 variable, with 0 corresponding to
games played in temperatures of 45 degrees or above and 1 corresponding to games played
in temperatures below 45 degrees.
After making the change, our next step was to run a fitted line plot of the temp < 45
variable to gain a better understanding of the data and the variable, and to re-run the
multiple regression model, with the temp < 45 variable replacing the original temperature
variable.
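The recoding itself is a one-line rule. A minimal Python sketch follows; the function name and the example temperatures are ours for illustration (5.6, 66.6, and 84.7 are the minimum, median, and maximum from the descriptive statistics, while 44.9 and 45.0 are hypothetical boundary cases, not game data).

```python
def cold_indicator(temperature_f: float, threshold: float = 45.0) -> int:
    """Return 1 if the temperature is below the threshold, else 0."""
    return 1 if temperature_f < threshold else 0


# Illustrative temperatures only; 44.9 and 45.0 are hypothetical boundary cases.
example_temps = [5.6, 44.9, 45.0, 66.6, 84.7]
coded = [cold_indicator(t) for t in example_temps]

for temp, flag in zip(example_temps, coded):
    print(f"{temp:5.1f} degrees -> temp < 45 = {flag}")
```

Note that a game played at exactly 45 degrees is coded 0 under this rule.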
The fitted line plot showed some outliers that we will keep in mind when running our
regression. It appears that there was one game when the temperature was above 45
degrees but Tampa Bay lost by a large margin, and another game where the temperature was
below 45 degrees but the team won by a significant margin. These margins are
unusual given the value of the predictor, and we may decide to remove them later:
Regression Plot
Margin = 5.61818 - 11.1182 temp < 45
S = 13.8771
R-Sq = 6.8 %
R-Sq(adj) = 5.3 %
[Figure: fitted line plot of Margin versus temp < 45]
The results of our new regression model, with temperature as a 0/1 variable, are shown
below:
The regression equation is
Margin = 0.42 + 10.8 Home/Away - 7.45 PY winning % + 0.0186 Altitude
- 10.4 temp < 45

Predictor        Coef     SE Coef        T        P      VIF
Constant        0.417       6.162     0.07    0.946
Home/Awa       10.752       4.696     2.29    0.026      1.8
PY winni       -7.454       9.634    -0.77    0.442      1.1
Altitude     0.018609    0.008729     2.13    0.037      1.7
temp < 4      -10.436       5.883    -1.77    0.081      1.3

S = 13.50     R-Sq = 16.1%     R-Sq(adj) = 10.4%

Analysis of Variance
Source            DF         SS        MS       F       P
Regression         4     2035.9     509.0    2.79   0.034
Residual Error    58    10574.5     182.3
Total             62    12610.3
The R-square has improved only slightly from our original model (from 14.4% to 16.1%),
and the F-value of 2.79 is still modest. We still do not seem to have an extremely useful
model. However, the overall p-value of .034 is now below .05, which leads us to conclude
that there is a relationship between some of these variables and the margin of victory,
even if it is not a very strong relationship.
We decide that removing those two outliers we noticed in the above fitted line plot of
temp < 45 may improve our model.
Before simply removing the outliers, we review NFL records to see if we can identify any
extenuating circumstances surrounding the two “unusual” margins. One outlier was a
game where Tampa Bay had a huge margin of victory over Cincinnati in 1998. We note
that in 1998, Cincinnati had a horrendous record. Finishing dead last in their division, the
Bengals sported a 1-7 record at home that year, often suffering from large blowouts, such
as the one shown from our data.
The other outlier was a substantial loss to Oakland in 1999, in warmer weather. We
expected to see that the Oakland Raiders were dominant during the 1999 season, but this
was not the case (they finished .500 for that year). Upon closer review, we did note
that, for whatever reason, the Raiders appear to “have Tampa’s number.” When looking
at Oakland’s historical winning percentage versus specific teams, we found that their
highest winning percentage by far is against the Tampa Bay Buccaneers (.800).
Again, when looking at this regression plot, it becomes apparent that these two points
could be having an extreme effect on our model, and as such, we remove these points
from our data set. When we rerun the regression with all four variables, minus the
outliers, we get the following results:
Regression Analysis: Margin versus Home/Away, PY winning %, ...

The regression equation is
Margin = 5.57 + 5.13 Home/Away - 6.34 PY winning % + 0.00855 Altitude
- 17.3 temp < 45

Predictor        Coef     SE Coef        T        P      VIF
Constant        5.567       5.203     1.07    0.289
Home/Awa        5.126       4.042     1.27    0.210      1.9
PY winni       -6.341       7.975    -0.80    0.430      1.1
Altitude     0.008546    0.007465     1.14    0.257      1.7
temp < 4      -17.340       5.082    -3.41    0.001      1.3

S = 11.17     R-Sq = 24.3%     R-Sq(adj) = 18.9%
We now see a marked improvement in our regression once these outliers have been
removed. Unfortunately, an R-square of merely 24.3% still leaves most of the variation in
margin unexplained. In addition, the adjusted R-square of only 18.9% indicates that some
of the predictors are contributing little more than random noise. At this point, it
appears clear that not all of our variables (home/away, prior year winning percentage,
altitude, and temperature < 45 degrees) are needed in this model. The question then
becomes which combination of variables would be most effective.
Since the p-value for the temp < 45 variable is so low (.001), we feel comfortable that
this variable has some impact on Tampa Bay’s margin of victory. Thus, we will keep this
variable in our model. However, we note the large p-values for the other three predictor
variables. We believe that we can remove these other variables, and end up with a one
variable model.
We decide to run a “Best Subsets” test to confirm our theory. We ran the test both with
and without the outliers. Upon running these two separate tests, it becomes quite clear
that, for all variable combinations, the models fit better once the outliers have
been removed. Following is the result of the Best Subsets test on the data where the
outliers are removed (note that from this point on, all our tests will be based on this new
data set):
Best Subsets Regression: Margin versus Home/Away, PY winning %, ...

Vars    R-Sq    R-Sq(adj)     C-p         S
   1    21.4         20.0     1.2    11.095
   1     7.6          6.1    11.3    12.025
   2    22.0         19.3     2.7    11.147
   2    21.8         19.1     2.9    11.160
   3    23.4         19.4     3.6    11.137
   3    22.5         18.5     4.3    11.204
   4    24.3         18.9     5.0    11.173

(Predictor columns: Home/Away, PY winning %, Altitude, temp < 45. The best one-variable
model uses temp < 45, and the four-variable model uses all four predictors.)
As expected, the R-squares become higher as more variables are added to the equation.
This may seem to indicate that the 4 variables should be included, but upon closer
examination of the results, we see two indications that this is not true. First, we note
that the four variables offer an R-square of 24.3%, yet this is not a large jump at all
from the R-square provided by the one simple variable of above or below 45 degrees. In
fact, it can be concluded that the strength of any relationship is coming mainly from
this one variable, which offers an R-square of 21.4% on its own. Secondly, we note that
the adjusted R-square tends to decrease as more variables enter into the model: the best
value at each model size falls from 20.0% with one variable to 18.9% with all four. This
indicates that these variables are actually contributing more noise than useful
information. The highest adjusted R-square value (and the lowest C-p value, 1.2) occurs
when just that one variable is used, once again suggesting that we should only include
this one variable in our model.
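The penalty that adjusted R-square applies for each extra variable can be made explicit. The Python sketch below recomputes R-Sq(adj) for every row of the Best Subsets output from its R-Sq and number of variables, using n = 61 (the 63 games minus the two outliers). Because the R-Sq inputs are rounded to one decimal place, the recomputed values can differ from the reported ones by about 0.1; the function and variable names are ours.

```python
# n = 63 games minus the 2 removed outliers.
n = 61

# (number of variables, R-Sq %, reported R-Sq(adj) %) from the Best Subsets output.
rows = [
    (1, 21.4, 20.0),
    (1, 7.6, 6.1),
    (2, 22.0, 19.3),
    (2, 21.8, 19.1),
    (3, 23.4, 19.4),
    (3, 22.5, 18.5),
    (4, 24.3, 18.9),
]


def adjusted_r_sq(r_sq_pct: float, n_vars: int, n_obs: int) -> float:
    """Adjusted R-square (in percent) from R-square (in percent)."""
    unexplained = 1 - r_sq_pct / 100
    return 100 * (1 - unexplained * (n_obs - 1) / (n_obs - n_vars - 1))


recomputed = [adjusted_r_sq(r_sq, k, n) for k, r_sq, _ in rows]

for (k, r_sq, reported), value in zip(rows, recomputed):
    print(f"vars={k}  R-Sq={r_sq:5.1f}  reported adj={reported:5.1f}  recomputed={value:5.1f}")

# The row with the highest reported adjusted R-square.
best = max(rows, key=lambda row: row[2])
print(f"best model by R-Sq(adj): {best[0]} variable(s), R-Sq(adj) = {best[2]}%")
```

Going from one variable to four raises R-Sq by only 2.9 points, which is not enough to offset the loss of three degrees of freedom, so the adjusted value falls.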
Next, we run a regression with just this 0/1 variable (again, where 0 indicates equal to or
above 45 degrees, and 1 indicates below 45 degrees). This regression gives us the
following results:
Regression Analysis: Margin versus temp < 45

The regression equation is
Margin = 6.56 - 17.8 temp < 45

Predictor        Coef     SE Coef        T        P
Constant        6.556       1.510     4.34    0.000
temp < 4      -17.841       4.457    -4.00    0.000

S = 11.09     R-Sq = 21.4%     R-Sq(adj) = 20.0%

Analysis of Variance
Source            DF         SS        MS        F       P
Regression         1     1972.5    1972.5    16.02   0.000
Residual Error    59     7262.8     123.1
Total             60     9235.2
Our original null hypothesis was that none of our variables affected the Buccaneers’
success (winning margin). Clearly, the t-statistic of -4.00, along with a p-value
reported as 0.000 (that is, below .0005), indicates that there is something going on, and
as such, we can strongly reject the null hypothesis. It can also be noted that, with this
model, t² equals our F statistic (16.02), which makes sense since we have now reduced our
model to a simple regression.
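The relationship between the t and F statistics in a one-predictor regression can be checked directly from the output. The Python sketch below uses only the reported coefficient, standard error, and mean squares; the variable names are ours.

```python
# Slope and standard error for temp < 45 from the regression output.
coef = -17.841
se_coef = 4.457

# Mean squares from the ANOVA table.
ms_regression = 1972.5
ms_error = 123.1

t_stat = coef / se_coef
t_squared = t_stat ** 2
f_from_anova = ms_regression / ms_error

print(f"t = {t_stat:.2f}, t^2 = {t_squared:.2f}, F = {f_from_anova:.2f}")
```

Both routes give approximately 16.02, the F value reported in the output.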
Because this regression contains only a single 0/1 variable, it is equivalent to a
two-sample t-test (with pooled variance). Thus, we can interpret these regression results
quite easily. Simply said, when the Buccaneers play in warmer weather (45 degrees and
over), their average margin is a win by 6.556 points. When they play in more extreme cold
weather (below 45 degrees), their average margin is 17.841 points lower. More technically
stated, the coefficient says that a 1-unit change in the variable moves the margin by
-17.841 points; since our variable is simply a 0/1 variable, this is the difference
between playing in warmer (45 degrees and above) and extremely cold (below 45 degrees)
weather.
Perhaps an easier way to view this data is to look at the descriptive statistics of margin,
by this one predictor variable. These descriptive statistics are as follows:
Descriptive Statistics: Margin by temp < 45

Variable   temp < 4      N      Mean    Median    TrMean     StDev
Margin            0     54      6.56      6.00      6.21     11.16
                  1      7    -11.29     -6.00    -11.29     10.48

Variable   temp < 4   SE Mean   Minimum   Maximum        Q1        Q3
Margin            0      1.52    -18.00     41.00     -2.25     14.00
                  1      3.96    -31.00     -3.00    -18.00     -3.00
Again, this shows that when playing in the warmer temperatures, the average winning
margin is 6.56 points, and when playing in the extremely cold (below 45 degrees)
temperatures, there is an average losing margin of 11.29 points (a decrease of 17.84
points).
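The equivalence between the one-variable regression and a pooled two-sample t-test can be verified from the group statistics alone. The Python sketch below recomputes the pooled standard deviation, the standard error of the difference in means, and the t statistic, and compares them with S = 11.09, SE Coef = 4.457, and T = -4.00 from the regression. Inputs are the rounded group summaries, so small rounding differences are expected; the variable names are ours.

```python
import math

# Group summaries of Margin by temp < 45 (0 = warm, 1 = cold).
n_warm, mean_warm, sd_warm = 54, 6.56, 11.16
n_cold, mean_cold, sd_cold = 7, -11.29, 10.48

# Pooled variance with n_warm + n_cold - 2 degrees of freedom.
df = n_warm + n_cold - 2
pooled_var = ((n_warm - 1) * sd_warm**2 + (n_cold - 1) * sd_cold**2) / df
pooled_sd = math.sqrt(pooled_var)

# Standard error of the difference in means, and the t statistic.
se_diff = pooled_sd * math.sqrt(1 / n_warm + 1 / n_cold)
t_stat = (mean_cold - mean_warm) / se_diff

print(f"df = {df}")
print(f"pooled SD = {pooled_sd:.2f}   (regression S = 11.09)")
print(f"SE of difference = {se_diff:.3f}   (regression SE Coef = 4.457)")
print(f"t = {t_stat:.2f}   (regression T = -4.00)")
```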
Now, we will run a fitted line plot in order to, once again, visualize this relationship:
Regression Plot
Margin = 6.55556 - 17.8413 temp < 45
S = 11.0949
R-Sq = 21.4 %
R-Sq(adj) = 20.0 %
[Figure: fitted line plot of Margin versus temp < 45, outliers removed]
RESIDUAL PLOTS AND CHECKING ASSUMPTIONS
Now that we have a regression model, albeit not a perfect one, it is time to test our
regression assumptions. In order to do so, we will run a histogram of residuals, a normal
probability plot of residuals, and a residual vs. fitted plot:
[Figure: Histogram of the Residuals (response is Margin); x-axis: Standardized Residual, y-axis: Frequency]
[Figure: Normal Probability Plot of the Residuals (response is Margin); x-axis: Standardized Residual, y-axis: Normal Score]
These first two graphs indicate that the errors are reasonably normally distributed; in
particular, the points in the normal probability plot follow a reasonably straight line.
Now we will look at the third graph:
[Figure: Residuals Versus the Fitted Values (response is Margin); x-axis: Fitted Value, y-axis: Standardized Residual]
Although this picture doesn’t show violations of the assumption that the errors are
uncorrelated or the assumption that they average to zero, it does suggest that the final
regression assumption, constant variance, is violated.
In the plot, the spread of the residuals appears much larger for the games where the
temperature was above 45 degrees (fitted value 6.56) than for the cold-weather games
(fitted value -11.29). A violation of homoscedasticity is often a warning that
well-defined subgroups in the data are causing the problem. In this case, the subgroups
are obvious: extremely cold weather versus not extremely cold weather. We note, however,
that the two group standard deviations reported earlier are similar (11.16 versus 10.48),
so part of the visual difference may simply reflect the much larger number of
warm-weather games (54 versus 7). In other words, while we may be able to predict the
results for the Buccaneers with somewhat more accuracy when they play in extremely cold
weather, we have less predictive power when the weather is not as extreme.
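A rough numerical companion to the residual plot is the ratio of the two group variances, computed here from the group standard deviations of margin reported earlier (11.16 for the 54 warmer games and 10.48 for the 7 cold games). This is only an informal check under those reported values, not a formal test, and with just 7 cold-weather games it should be read cautiously; the variable names are ours.

```python
# Group standard deviations of Margin by temp < 45 (0 = warm, 1 = cold).
sd_warm = 11.16
sd_cold = 10.48

# Ratio of group variances; a value near 1 suggests similar spread.
variance_ratio = sd_warm**2 / sd_cold**2

print(f"variance ratio (warm / cold) = {variance_ratio:.3f}")
```

The ratio is fairly close to 1, so some of the difference in spread seen in the plot may come from the unequal group sizes rather than from unequal variances.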
CONCLUSION
When we began this paper, we did so in an effort to prove that some of our variables
systematically affected the Tampa Bay Buccaneers’ winning margin. Through a series of
tests, we were able to reject our null hypothesis to show, in fact, that something was
going on.
Virtually all football fans have heard the argument that the Tampa Bay Buccaneers just
cannot win in extremely cold weather. We were hoping, at the beginning of this project,
to conclude much more. We were looking for trends, driven by altitudes, home field
advantages, or opponents’ prior year success records, which could help us to predict the
Buccaneers’ margins more accurately. In addition, we were hoping to find a more
predictable relationship between temperature and winning margin across all temperatures
(for example, that the Bucs were not only worse in cold weather but also better as the
weather became warmer). Unfortunately, no such trends appeared.
In hindsight, if there had been such a trend to report, the hundreds of sports announcers
and statisticians who are always desperately looking for something would likely have
reported it. Although disappointed, we now know that there is no such predictive
formula, at least not involving these variables. We have found evidence, however, to
support that one known theory: if there’s going to be a blizzard, you should bet against
the Buccaneers.