INTRODUCTION

The Tampa Bay Buccaneers football team has a history of futility when playing in cold temperatures. In fact, the team has lost so many times in colder weather that it is legendary for its inability to win unless the temperature at kickoff is above a certain level. But perhaps other factors have influenced Tampa Bay’s losses. We want to try to predict Tampa Bay’s win/loss margin based on four variables:

1) Home/Away: Does Tampa Bay fare better with a home field advantage? Is it enough to make a real difference, and can it help predict the team’s margin of victory?

2) Altitude: At high altitudes, the oxygen level per breath is reduced by about 20% compared to sea level, so heart rate and breathing rate increase to compensate. This is experienced as shortness of breath and early fatigue. The altitude in Tampa Bay is 15 ft. above sea level. Is the team affected when playing in cities with altitudes much above this level?

3) Opponent’s Prior Year Winning Percentage: The relative strength or weakness of the opposing team may strongly influence Tampa Bay’s ability to win and its margin of victory. We decided to use the opponent’s prior season winning percentage as the measure of the team’s strength, since it was the most relevant statistic we could obtain.

4) Temperature: Finally, the statistic that led us to begin this project. We want to see whether there is any truth to the articles that have been written about Tampa Bay’s inability to win in colder temperatures, whether the team’s performance actually improves in warmer temperatures, and whether some of the variables above may have had a greater influence on its success.

We obtained data from Tampa Bay’s past four full seasons (1997 – 2000), which gave us 63 observations. (Note that we removed all games played in domed stadiums, because domes artificially alter the effects of altitude and temperature.)
We gathered the data for our project from various sports and weather websites, including official NFL team sites for prior season winning percentages, www.wundergound.com (this site gave the average temperature for each game day), and www.mit.edu/geo (this site provided altitudes for each city).

The null hypothesis for our project is that none of the above variables has any effect on Tampa Bay’s win/loss margin. We are attempting to show that this is not the case.

DESCRIPTIVE STATISTICS

We start by running descriptive statistics on our data:

Descriptive Statistics: Home/Away, Margin, PY winning %, Altitude, Temperature

Variable     N     Mean   Median   TrMean    StDev  SE Mean
Home/Awa    63   0.5873   1.0000   0.5965   0.4963   0.0625
Margin      63     4.21     3.00     4.42    14.26     1.80
PY winni    63   0.5323   0.5000   0.5324   0.1850   0.0233
Altitude    63    148.7     15.0    124.5    257.4     32.4
Temperat    63    63.12    66.60    64.33    17.24     2.17

Variable   Minimum  Maximum       Q1       Q3
Home/Awa    0.0000   1.0000   0.0000   1.0000
Margin      -45.00    41.00    -4.00    13.00
PY winni    0.0630   0.9380   0.4380   0.6880
Altitude       8.0    812.0     15.0     60.0
Temperat      5.60    84.70    52.30    77.40

One thing that jumped out at us in the above data is the large difference between the mean altitude (148.7 ft.) and the median altitude (15.0 ft.). This can be explained by the fact that most of our data comes from games played below 50 ft. elevation, while a few games were played at much higher elevations, such as those in Kansas City (750 ft.) and Minnesota (812 ft.). This also indicates that the altitude distribution is long right-tailed. A histogram of the altitude variable confirms this.

[Figure: histogram of Altitude (x-axis: Altitude, 0 to 900; y-axis: Frequency, 0 to 40)]

However, when we take the log of the altitude, we do not see a significant improvement in terms of normality.

[Figure: histogram of Log altitude (x-axis: Log altitude, 1.0 to 3.0; y-axis: Frequency, 0 to 40)]

We also run a box plot of the variable that we are trying to predict, the margin, to get a better sense of what is going on in the data.
[Figure: box plot of Margin (y-axis: Margin, -50 to 40)]

We notice a few outliers in the results, which we will keep in mind as observations that might have a significant impact on our model. After we run the model, we may have to go back and take a closer look at these outliers.

FITTED LINE PLOTS

After running the descriptive statistics, we ran fitted line plots of margin against each of our individual variables in order to get a better understanding of our predictors.

Regression Plot: Margin = -0.115385 + 7.35863 Home/Away
S = 13.8986   R-Sq = 6.6 %   R-Sq(adj) = 5.0 %
[Figure: fitted line plot of Margin versus Home/Away]

Regression Plot: Margin = 4.13695 + 0.0004667 Altitude
S = 14.3775   R-Sq = 0.0 %   R-Sq(adj) = 0.0 %
[Figure: fitted line plot of Margin versus Altitude]

Regression Plot: Margin = 3.58975 + 1.15837 PY winning %
S = 14.3764   R-Sq = 0.0 %   R-Sq(adj) = 0.0 %
[Figure: fitted line plot of Margin versus PY winning %]

Regression Plot: Margin = -10.7611 + 0.239294 Temperature
S = 11.8216   R-Sq = 10.7 %   R-Sq(adj) = 9.2 %
[Figure: fitted line plot of Margin versus Temperature]

Based solely on the above graphs, temperature and home/away appear to be the two variables with which margin has some type of relationship. However, that does not mean we are discounting the other variables: a relationship that is not apparent in a simple linear regression may still factor into the equation when the variable is run in a multiple regression with the others.

Thus, our next step is to run a multiple regression with all four variables.
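Before moving on, the home/away fitted line above can be read as a comparison of two group means, because Home/Away takes only the values 0 and 1: the intercept is the average away margin and the intercept plus the slope is the average home margin. The following is a minimal sketch of that arithmetic, using only the coefficients and the Home/Away mean reported above; the variable names are illustrative, not part of the original analysis.

```python
# Coefficients from the fitted line: Margin = -0.115385 + 7.35863 Home/Away
intercept = -0.115385
slope = 7.35863

# Share of home games, i.e. the reported mean of the 0/1 Home/Away variable
home_share = 0.5873

# With a 0/1 predictor, the fitted values are just the two group means
away_mean = intercept          # Home/Away = 0
home_mean = intercept + slope  # Home/Away = 1

# Weighting the two group means by their shares should recover the
# overall mean margin from the descriptive statistics (4.21)
overall_mean = home_share * home_mean + (1.0 - home_share) * away_mean

print(f"average away margin: {away_mean:.2f}")
print(f"average home margin: {home_mean:.2f}")
print(f"implied overall mean margin: {overall_mean:.2f}")
```

The implied overall mean agrees with the mean margin in the descriptive statistics, which is a quick consistency check on the reported output.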
MULTIPLE REGRESSION

The results of our regression are as follows:

The regression equation is
Margin = - 12.0 + 10.6 Home/Away - 5.92 PY winning % + 0.0178 Altitude + 0.166 Temperature

Predictor        Coef    SE Coef       T       P    VIF
Constant      -11.990      8.719   -1.38   0.174
Home/Awa       10.609      4.865    2.18   0.033    1.9
PY winni       -5.916      9.651   -0.61   0.542    1.1
Altitude     0.017816   0.008791    2.03   0.047    1.7
Temperat       0.1658     0.1192    1.39   0.170    1.4

S = 13.64   R-Sq = 14.4%   R-Sq(adj) = 8.5%

Analysis of Variance

Source            DF        SS      MS      F      P
Regression         4    1821.9   455.5   2.45  0.056
Residual Error    58   10788.4   186.0
Total             62   12610.3

The F-value of 2.45 is rather low, and its p-value of .056 is not low enough for us to reject the null hypothesis at the 5% level. The low R-square also suggests that our model is not very useful in predicting the margin of victory. On the other hand, the low VIF values indicate that collinearity is not a problem in our model.

When we examine the individual p-values of each variable, we see that home/away (p = .033) and altitude (p = .047) appear to have a relationship with the margin of victory. The p-values for the other variables are high enough that we feel we could take them out of the equation. However, we still believe that temperature is related to the victory margin because of the pattern we noticed in our earlier fitted line plot: larger losing margins seem to show up at temperatures below 45 degrees.

We decide to change the temperature variable into a 0/1 variable, with 0 corresponding to games played at 45 degrees or above and 1 corresponding to games played below 45 degrees. After making the change, our next step was to run a fitted line plot of the temp < 45 variable to gain a better understanding of the data and the variable, and then to re-run the multiple regression model with the temp < 45 variable replacing the original temperature variable.
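The recoding described above can be sketched in a few lines. This is only an illustration: the function name and the sample temperatures are hypothetical (5.6, 66.6 and 84.7 are the minimum, median and maximum from the descriptive statistics; 44.9 and 45.0 are made-up boundary cases), and a game at exactly 45 degrees is assumed to be coded 0, following the definition given later in the paper.

```python
def cold_indicator(temperature_f, threshold=45.0):
    """Return 1 for a game played below the threshold, otherwise 0.

    Assumption: a game at exactly the threshold counts as "not cold" (0).
    """
    return 1 if temperature_f < threshold else 0


# Hypothetical sample of game-day temperatures in degrees Fahrenheit
sample_temps = [5.6, 44.9, 45.0, 66.6, 84.7]

# The 0/1 column that would replace Temperature in the regression
temp_below_45 = [cold_indicator(t) for t in sample_temps]

for temp, flag in zip(sample_temps, temp_below_45):
    print(f"{temp:5.1f} degrees -> temp < 45 = {flag}")
```

Coding the variable this way trades the information in the exact temperature for a simple cold/not-cold contrast, which is what the fitted line plot suggested.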
The fitted line plot showed some outliers that we will keep in mind when running our regression. It appears that there was one game where the temperature was above 45 degrees but Tampa Bay lost by a large margin, and another game where the team won by a significant margin even though the temperature was below 45 degrees. These margins are unusual given the predicting variable, and we may decide to remove them later:

Regression Plot: Margin = 5.61818 - 11.1182 temp < 45
S = 13.8771   R-Sq = 6.8 %   R-Sq(adj) = 5.3 %
[Figure: fitted line plot of Margin versus temp < 45]

The results of our new regression model, with temperature as a 0/1 variable, are shown below:

The regression equation is
Margin = 0.42 + 10.8 Home/Away - 7.45 PY winning % + 0.0186 Altitude - 10.4 temp < 45

Predictor        Coef    SE Coef       T       P    VIF
Constant        0.417      6.162    0.07   0.946
Home/Awa       10.752      4.696    2.29   0.026    1.8
PY winni       -7.454      9.634   -0.77   0.442    1.1
Altitude     0.018609   0.008729    2.13   0.037    1.7
temp < 4      -10.436      5.883   -1.77   0.081    1.3

S = 13.50   R-Sq = 16.1%   R-Sq(adj) = 10.4%

Analysis of Variance

Source            DF        SS      MS      F      P
Regression         4    2035.9   509.0   2.79  0.034
Residual Error    58   10574.5   182.3
Total             62   12610.3

The R-square has improved only slightly from our original model, and the F-value of 2.79 is still modest. We still do not seem to have a very useful model. However, the lower p-value of .034 leads us to conclude that there is a relationship between some of these variables and the margin of victory, even if it is not a very strong one.

We decide that removing the two outliers we noticed in the above fitted line plot of temp < 45 may improve our model. Before simply removing the outliers, we review NFL records to see if we can identify any extenuating circumstances surrounding the two “unusual” margins.

One outlier was a game where Tampa Bay had a huge margin of victory over Cincinnati in 1998. We note that in 1998, Cincinnati had a horrendous record.
Finishing dead last in their division, the Bengals sported a 1-7 record at home that year, often suffering large blowouts such as the one in our data.

The other outlier was a substantial loss to Oakland in 1999, in warmer weather. We expected to see that the Oakland Raiders were dominant during the 1999 season, but this was not the case (they finished at .500 for that year). Upon closer review, we did note that, for whatever reason, the Raiders appear to “have Tampa’s number.” When looking at Oakland’s historical winning percentage versus specific teams, their highest winning percentage by far is against the Tampa Bay Buccaneers (.800 success rate).

Looking again at the regression plot, it becomes apparent that these two points could be having an extreme effect on our model, and so we remove them from our data set. When we rerun the regression with all four variables, minus the outliers, we get the following results:

Regression Analysis: Margin versus Home/Away, PY winning %, ...

The regression equation is
Margin = 5.57 + 5.13 Home/Away - 6.34 PY winning % + 0.00855 Altitude - 17.3 temp < 45

Predictor        Coef    SE Coef       T       P    VIF
Constant        5.567      5.203    1.07   0.289
Home/Awa        5.126      4.042    1.27   0.210    1.9
PY winni       -6.341      7.975   -0.80   0.430    1.1
Altitude     0.008546   0.007465    1.14   0.257    1.7
temp < 4      -17.340      5.082   -3.41   0.001    1.3

S = 11.17   R-Sq = 24.3%   R-Sq(adj) = 18.9%

Now we do see a substantial improvement in our regression, once these outliers have been removed. Unfortunately, an R-square of merely 24.3% is not nearly as high as we were hoping for. In addition, the adjusted R-square of only 18.9% indicates that there is some extra random noise in this regression.

At this point, it appears clear that not all of our variables (home/away, prior year winning percentage, altitude, and temperature < 45 degrees) are needed in this model. The question then becomes which combination of variables would be most effective.
Since the p-value for the temp < 45 variable is so low (.001), we feel comfortable that this variable has some impact on Tampa Bay’s margin of victory. Thus, we will keep this variable in our model. However, we note the large p-values for the other three predictor variables. We believe that we can remove these other variables and end up with a one-variable model.

We decide to run a “Best Subsets” test to confirm our theory. We ran the test both with and without the outliers. Upon running these two separate tests, it becomes quite clear that, for all variable combinations, the models fit better once the outliers have been removed. Following is the result of the Best Subsets test on the data with the outliers removed (note that from this point on, all our tests will be based on this new data set):

Best Subsets Regression: Margin versus Home/Away, PY winning %, ...

Vars   R-Sq   R-Sq(adj)    C-p        S
   1   21.4        20.0    1.2   11.095
   1    7.6         6.1   11.3   12.025
   2   22.0        19.3    2.7   11.147
   2   21.8        19.1    2.9   11.160
   3   23.4        19.4    3.6   11.137
   3   22.5        18.5    4.3   11.204
   4   24.3        18.9    5.0   11.173

[Columns of X marks showing which predictors enter each model are not reproduced here.]

As expected, the R-squares become higher as more variables are added to the equation. This may seem to indicate that all 4 variables should be included, but upon closer examination of the results, we see two indications that this is not true.

First, we note that the four variables together offer an R-square of 24.3%, yet this is not a large jump at all from the R-square provided by the single variable of above or below 45 degrees. In fact, the strength of any relationship is coming mainly from this one variable, which offers an R-square of 21.4% on its own.

Secondly, we note that the adjusted R-squares actually decrease as more variables enter the model. This indicates that these variables are contributing more noise than useful information.
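The adjusted R-square column in the table above can be approximately reproduced from the unadjusted values with the standard formula R-Sq(adj) = 1 - (1 - R-Sq)(n - 1)/(n - p - 1), where p is the number of predictors. The sketch below is a check on the reported figures rather than part of the original analysis; it assumes n = 61 (the 63 games minus the two removed outliers) and uses the best model of each size from the table. Because the inputs are rounded to one decimal place, the recomputed values can differ from the reported ones by about 0.1 percentage points.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for a regression model that includes an intercept."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


N_OBS = 61  # assumption: 63 games minus the two removed outliers

# (number of predictors, reported R-Sq %, reported R-Sq(adj) %)
# for the best model of each size in the best-subsets table
reported_rows = [
    (1, 21.4, 20.0),
    (2, 22.0, 19.3),
    (3, 23.4, 19.4),
    (4, 24.3, 18.9),
]

recomputed = []
for n_predictors, r_sq_pct, adj_pct in reported_rows:
    value = 100.0 * adjusted_r_squared(r_sq_pct / 100.0, N_OBS, n_predictors)
    recomputed.append(value)
    print(f"{n_predictors} predictor(s): reported {adj_pct:.1f}%, "
          f"recomputed {value:.1f}%")
```

The formula makes the penalty explicit: each added predictor shrinks the denominator n - p - 1, so a small gain in R-square can still lower the adjusted value.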
The highest adjusted R-square value occurs when just that one variable is used, once again suggesting that we should include only this one variable in our model.

Next, we run a regression with just this 0/1 variable (again, where 0 indicates 45 degrees or above, and 1 indicates below 45 degrees). This regression gives us the following results:

Regression Analysis: Margin versus temp < 45

The regression equation is
Margin = 6.56 - 17.8 temp < 45

Predictor        Coef    SE Coef       T       P
Constant        6.556      1.510    4.34   0.000
temp < 4      -17.841      4.457   -4.00   0.000

S = 11.09   R-Sq = 21.4%   R-Sq(adj) = 20.0%

Analysis of Variance

Source            DF       SS       MS       F      P
Regression         1   1972.5   1972.5   16.02  0.000
Residual Error    59   7262.8    123.1
Total             60   9235.2

Our original null hypothesis was that none of our variables affected the Buccaneers’ success (winning margin). Clearly, the t-statistic of -4.00, along with a p-value reported as 0.000, indicates that there is something going on, and so we can strongly reject the null hypothesis. It can also be noted that, in this model, the square of our t-statistic equals our F-statistic, which makes sense since we have now reduced our model to a simple regression.

Because all variables except this 0/1 variable have been eliminated, the regression is now equivalent to a two-sample t-test, so we can interpret the results quite easily. Simply said, if the Buccaneers are playing in warmer weather (45 degrees and over), they have an average winning margin of 6.556 points. If they are playing in more extreme cold (below 45 degrees), their margin decreases by an average of 17.841 points, to an average losing margin of about 11.3 points. More technically stated, the coefficient says that the change in margin per 1 unit of the variable is -17.841 points; since our variable is simply a 0/1 variable, this is the difference between playing in warmer weather (45 degrees or above) and playing in extremely cold weather (below 45 degrees).
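The two observations above, that the squared t-statistic matches F and that the coefficients translate directly into group means, can be checked with a few lines of arithmetic. This is a minimal sketch using only the values printed in the regression output; the variable names are illustrative.

```python
# Values taken from the one-variable regression output
intercept = 6.556       # Constant coefficient
slope = -17.841         # temp < 45 coefficient
se_slope = 4.457        # standard error of the temp < 45 coefficient
ms_regression = 1972.5  # mean square for regression
ms_residual = 123.1     # mean square for residual error

# t-statistic for the slope and F-statistic for the model
t_stat = slope / se_slope
f_stat = ms_regression / ms_residual

# In a simple regression (one predictor), t squared equals F
print(f"t = {t_stat:.2f}, t^2 = {t_stat ** 2:.2f}, F = {f_stat:.2f}")

# With a 0/1 predictor, the fitted values are the two group means
warm_mean = intercept          # temp < 45 = 0
cold_mean = intercept + slope  # temp < 45 = 1
print(f"average margin at 45 degrees or above: {warm_mean:.2f}")
print(f"average margin below 45 degrees: {cold_mean:.2f}")
```

Any small gap between the squared t-statistic and F comes from rounding in the printed output, not from the model.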
Perhaps an easier way to view this data is to look at the descriptive statistics of margin, grouped by this one predictor variable. These descriptive statistics are as follows:

Descriptive Statistics: Margin by temp < 45

Variable  temp < 4     N     Mean   Median   TrMean    StDev  SE Mean
Margin           0    54     6.56     6.00     6.21    11.16     1.52
                 1     7   -11.29    -6.00   -11.29    10.48     3.96

Variable  temp < 4   Minimum  Maximum       Q1       Q3
Margin           0    -18.00    41.00    -2.25    14.00
                 1    -31.00    -3.00   -18.00    -3.00

Again, this shows that when playing in warmer temperatures, the average winning margin is 6.56 points, and when playing in extremely cold (below 45 degrees) temperatures, there is an average losing margin of 11.29 points (a decrease of about 17.84 points).

Now, we will run a fitted line plot in order to, once again, visualize this relationship:

Regression Plot: Margin = 6.55556 - 17.8413 temp < 45
S = 11.0949   R-Sq = 21.4 %   R-Sq(adj) = 20.0 %
[Figure: fitted line plot of Margin versus temp < 45]

RESIDUAL PLOTS AND CHECKING ASSUMPTIONS

Now that we have a regression model, albeit not a perfect one, it is time to check our regression assumptions. To do so, we will run a histogram of residuals, a normal probability plot of residuals, and a residuals versus fitted values plot:

[Figure: Histogram of the Residuals (response is Margin); x-axis: Standardized Residual; y-axis: Frequency]

[Figure: Normal Probability Plot of the Residuals (response is Margin); x-axis: Standardized Residual; y-axis: Normal Score]

These first two graphs indicate that the errors are reasonably normally distributed: the histogram is roughly bell-shaped, and the points in the normal probability plot fall close to a straight line.
Now we will look at the third graph:

[Figure: Residuals Versus the Fitted Values (response is Margin); x-axis: Fitted Value; y-axis: Standardized Residual]

Although this picture doesn’t show violations of the assumption regarding correlation of the errors or the assumption that they average to zero, it does suggest that the final regression assumption, constant variance, is violated. In this graph, the spread of the residuals looks much larger for the games where the temperature was 45 degrees or above than for the cold-weather games. We note, however, that the two groups’ standard deviations reported above (11.16 and 10.48) are similar, so part of the visual difference may simply reflect the much larger number of warm-weather games (54 versus 7).

A violation of homoscedasticity is often a warning that well-defined subgroups in the data are causing the problem. In this case, that is quite obvious, considering that the subgroups are simply extremely cold weather versus not extremely cold weather. In other words, while we may be able to predict, with more accuracy, the results for the Buccaneers when playing in extremely cold weather, we have far less predictive power when the weather is not as extreme.

CONCLUSION

When we began this paper, we did so in an effort to show that some of our variables systematically affected the Tampa Bay Buccaneers’ winning margin. Through a series of tests, we were able to reject our null hypothesis and show that, in fact, something was going on. Virtually all football fans have heard the argument that the Tampa Bay Buccaneers just cannot win in extremely cold weather.

We were hoping, at the beginning of this project, to conclude much more. We were looking for trends, driven by altitudes, home field advantages, or opponents’ prior year success records, which could help us to predict the Buccaneers’ margins more accurately.
In addition, we were hoping to find a more predictable relationship between temperature and winning margin across all temperatures (for example, that the Bucs were both worse in cold weather and actually better as the weather became warmer). Unfortunately, no such trends appeared. In hindsight, if there had been such a trend to report, the hundreds of sports announcers and statisticians who are always desperately looking for something would likely have reported it.

Although disappointed, we now know that there is no such predictive formula, at least not one involving these variables. We have found evidence, however, to support that one known theory: if there’s going to be a blizzard, you should bet against the Buccaneers.