Uploaded by Rishi GR

Math IA Final Draft

Mathematics IA First Draft
Rishi Ganesh Ravichandran
Aim: Examining the performances of Quarterback Russell Wilson and
Running Back Marshawn Lynch in the 2014-2015 NFL season for the
Seattle Seahawks.
Rationale:
As a young child being brought up in the United States of America, I grew up watching all kinds of
sports, such as Cricket, Soccer, Basketball and Gridiron Football. With my father being a die-hard fan
of the Seattle Seahawks in the NFL, I also became a huge fan of Gridiron football and the Seattle
Seahawks, waking up in the wee hours of the morning just to catch the games after we moved to
Singapore.
Gridiron football fascinated me because it had multiple types of play through which the ball could be
advanced and points scored, namely Passing and Rushing of the ball. The Quarterback mainly advances
the ball by passing, while the Running Back advances the ball by rushing. The eye test and accounts by
so-called ‘TV analysts’ suggest that rushing and passing offensive moves are completely independent of
each other, but I believed that I could find an underlying link, especially in the context of my favorite team.
The reason I chose the combination of Russell Wilson and Marshawn Lynch from the years 2013-2015
is that, in my opinion, these two epitomized the peak performance of a Quarterback-Running Back duo,
as they made it to two consecutive Super Bowls from 2013-2015. Furthermore, the style of offense that
Seattle ran in these years was a precursor to many of the offenses that NFL teams run today, and is
therefore a good representative set for modern NFL offenses.
Introduction:
While doing the chapter ‘Statistics and Probability’ in the IB HL Mathematics course, my teacher talked
about the usage of probability distributions in sports. I realized that I could explore this area of
Mathematics to fulfil my natural curiosity, especially in the context of the modern NFL which is heavily
reliant on statistical analysis. Coaches and Bookies in the NFL use distributions to model the probability
of a player performing in a certain way to adjust line-ups and set the odds respectively.
Therefore, to examine the performances of Marshawn Lynch and Russell Wilson, it is important to
understand how players can impact a game in a substantial fashion i.e. through having a ‘good game’.
In layman’s terms, a good game would be defined as a performance where a player would have a
positive impact on winning. When a player has multiple good games, said player is an impact player.
As we now have the advantage of hindsight, I will be using the actual game log data of both players
from the 2013-14 and the 2014-15 seasons in my exploration to validate the significance of my findings.
Aim:
Examining the performances of Quarterback Russell Wilson and Running Back Marshawn Lynch in
the 2014-2015 season for the Seattle Seahawks.
Background Information:
Before delving into the exploration, the reader must understand some basic Gridiron Football concepts.
The goal of the game is to advance the football as far as one can into the opponents’ half for better
scoring positions. One can advance the football by Passing (throwing the ball forward to a receiver) or
Rushing (the player in possession of the ball running further into the opponents’ half).
Passing the football is generally the role of the Quarterback (QB), while rushing the football is generally
carried out by the Running Back (RB). Most modern NFL offenses try to balance their offensive
approaches to avoid repetition of plays (which make offenses easier to defend).
Keeping this in mind, QBs and RBs can impact the ballgame by having a ‘good game’. We can define
the good-game parameter for Marshawn Lynch and Russell Wilson as the following.
For QB Russell Wilson, a ‘good game’ shall be defined as more than a 95-passer rating (a.k.a
quarterback rating) (a measure of the performance of passers, primarily quarterbacks, in gridiron
football)
For RB Marshawn Lynch, a ‘good game’ shall be defined as more than 80 rushing yards in a game
(Running with the ball when starting from behind the line of scrimmage with an intent of gaining
yardage. While this usually means a running play, any offensive play that does not involve a forward
pass is a rush - also called a run.)
I will now provide a high-level overview of my methodology. Coaches around the league use previous
season performances to forecast future performance capabilities, therefore we shall model the data sets
for both Marshawn Lynch and Russell Wilson during the 2013-14 season using the Poisson and Normal
distribution models. Subsequently, we shall see which of our models best fit the reality of the 2014-15
season using the chi-square hypothesis test for best fit and the T hypothesis test. I will further examine
whichever model is chosen to see if I can observe any underlying links in between the Marshawn Lynch
and Russell Wilson data sets.
Playoff performances are more difficult to forecast because the nature of each game varies wildly, and
factors like ability to perform in close game situations and stress in high-stakes games are difficult to
quantify. Therefore, we shall restrict the investigation to the NFL 2013-14 regular season.
Tables 1 and 2 below are logs of Marshawn Lynch and Russell Wilson’s games for the 2013-2014
season. These statistics were taken from an online football database.
Table 1: Marshawn Lynch game log (2013-2014)
Table 2: Russell Wilson game log (2013-2014)
Mathematical Investigation:
As stated before, forecasting performances for both players can be done through the usage of valid
probability distributions. We shall examine the Normal and Poisson distribution models for both
players’ 2013-14 season to see which one better adheres to reality.
First, we shall observe the data sets under the Poisson distribution model. Tables 3 and 4 respectively
show the data set calculations for Marshawn Lynch and Russell Wilson from the previous NFL season.
Poisson Distribution Model
Table 3: Marshawn Lynch game log data calculations
From Table 3, we can calculate the following (Calculation 1):
Average/Mean rushing yards for Marshawn Lynch is given by the formula:
$$\hat{\mu}_1 = \frac{\sum_{i=1}^{N} f_i \cdot y_i}{N} = \frac{1257}{16} = 78.56$$
$f_i \cdot y_i$: Rushing yards for a particular game
$\hat{\mu}_1$: Mean number of rushing yards
The variance of the data set is given by the formula:
$$\sigma_1^2 = \frac{\sum (Y - \hat{\mu}_1)^2}{N}$$
From Table 3:
$$\sum (Y - \hat{\mu}_1)^2 = 15641.94$$
$$\sigma_1^2 = \frac{15641.94}{16} = 977.62$$
$\sigma_1^2$: Variance of the data set
Y: Rushing Yards in a game
N: Total number of games played
$\sigma_1$: Standard deviation
The standard deviation is given as the square root of the variance. This leads us to:
$$\sigma_1 = \sqrt{977.62} = 31.27$$
Now that the data set has been analyzed, we can use the calculated mean $\hat{\mu}_1$ to model a Poisson
distribution for the data set. As the Poisson distribution is discrete, its probability mass function is
given by the following formula:
$$P(Y = y) = \frac{\hat{\mu}_1^{\,y} \cdot e^{-\hat{\mu}_1}}{y!} = \frac{78.56^{\,y} \cdot e^{-78.56}}{y!}$$
y: Plausible rushing yards in a game in the 2013-14 season
Our goal is to find out the probability of Marshawn Lynch having a ‘good game’. A ‘good game' had
been defined as when Lynch has more than 80 rushing yards in a game.
Therefore, the following calculation can be conducted:
$$Y \sim \mathrm{Poi}(\hat{\mu}_1)$$
The TI-Nspire CX graphic display calculator was used in this calculation. To find the Poisson cumulative
probability, the following is entered into the calculator in the format denoted below:
poissCDF($\hat{\mu}_1$, lower bound, upper bound)
$$P(Y \geq 80) = \mathrm{poissCDF}(78.56, 80, 10000) = 0.450$$
$$p_{good} = 0.450$$
$p_{good}$: Probability of Marshawn Lynch having a ‘good game’ under the Poisson distribution
Now, we have two defined outcomes for Marshawn Lynch: either he has a ‘good game’ or he does not
have a ‘good game’. This situation allows for the usage of the binomial distribution.
$$X \sim B(N, p_{good})$$
X: Number of good games for Marshawn Lynch (Poisson distribution)
The expected value for the Binomial distribution is given by the formula:
$$E(X) = N \cdot p_{good}$$
Therefore, the expected number of games in which Marshawn Lynch will perform is given by:
$$E(X) = 16 \cdot 0.450 \approx 7 \text{ games}$$
Now, we shall conduct a similar calculation under the Poisson model for Russell Wilson, whose
analyzed data is shown in Table 4 below.
Table 4: Russell Wilson game log data calculations
By similar calculation to Calculation 1, we can derive the following (Calculation 2):
Average/Mean Passer rating for Russell Wilson is given by:
$$\hat{\mu}_2 = 100.03$$
The variance of the data set is given by:
$$\sigma_2^2 = 886.44$$
The standard deviation of the data set is given by:
$$\sigma_2 = 29.77$$
$\hat{\mu}_2$: Mean passer rating for Russell Wilson
Z: Quarterback Rating in a game
$\sigma_2^2$: Variance of the data set
$\sigma_2$: Standard deviation
To model the Poisson distribution for Russell Wilson, we can use the calculated mean $\hat{\mu}_2$:
$$P(Z = z) = \frac{\hat{\mu}_2^{\,z} \cdot e^{-\hat{\mu}_2}}{z!} = \frac{100.03^{\,z} \cdot e^{-100.03}}{z!}$$
z: Plausible passer rating in a game
As the Poisson distribution model is being used, the mean and the variance of this distribution are both
equal to 100.03.
Our goal is to find out the probability of Russell Wilson having a ‘good game’. A ‘good game' had been
defined as when Wilson has more than a 95-passer rating in a game.
Therefore, the following calculation can be conducted:
$$Z \sim \mathrm{Poi}(\hat{\mu}_2)$$
By similar calculation to that for Marshawn Lynch (Calculation 1):
$$P(Z \geq 95) = \mathrm{poissCDF}(100.03, 95, 10000) = 0.701$$
$$p_{good} = 0.701$$
$p_{good}$: Probability of Russell Wilson performing under the Poisson distribution
Like the calculation for Marshawn Lynch, there are 2 defined outcomes; therefore, a Binomial
distribution can be used.
$$W \sim B(n, p_{good})$$
W: Number of good games for Russell Wilson (Poisson distribution)
The expected value of ‘good games’ for Russell Wilson is given below, using a similar calculation to the
one done for Marshawn Lynch:
$$E(W) = 16 \cdot 0.701 \approx 11 \text{ games}$$
Normal Distribution Model
Now that we have the expected number of games in which Russell Wilson and Marshawn Lynch will
perform under the Poisson distribution, we can move onto the Normal Distribution model.
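Before moving on, the Poisson tail probability for Marshawn Lynch (Calculation 1) can be cross-checked without a calculator by summing the probability mass function directly. The following is a minimal Python sketch using only the standard library. The function names `poisson_pmf` and `poisson_tail` are illustrative and not part of the original investigation, and the sketch assumes, as in Calculation 1, that the tail starts at 80 rushing yards.

```python
import math

def poisson_pmf(k, mean):
    """P(Y = k) for a Poisson variable, computed in log space to avoid overflow."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def poisson_tail(lower, mean):
    """P(Y >= lower), obtained as 1 - P(Y <= lower - 1)."""
    return 1.0 - sum(poisson_pmf(k, mean) for k in range(lower))

MEAN_RUSHING_YARDS = 78.56  # mean from Calculation 1
GAMES = 16                  # regular-season games in the data set

p_good = poisson_tail(80, MEAN_RUSHING_YARDS)
expected_good_games = GAMES * p_good  # binomial expectation N * p

print(f"P(Y >= 80) = {p_good:.3f}")
print(f"Expected good games = {expected_good_games:.1f}")
```

Summing the first 80 terms replaces the calculator's upper bound of 10000, since the remaining probability is simply the complement.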
From the prior calculation, we have the calculated arithmetic mean for the Marshawn Lynch data set,
$\hat{\mu}_1$, and the standard deviation $\sigma_1$. Therefore, we can model a normal distribution
function for the data set, as shown below. (Calculation 3)
$$Y \sim N(\hat{\mu}_1, \sigma_1^2)$$
$$\hat{\mu}_1 = 78.56$$
$$\sigma_1 = 31.27$$
The TI-Nspire CX graphic display calculator was used in this calculation. To find the Normal cumulative
probability, the following is entered into the calculator in the format denoted below. Note that the
calculator takes the standard deviation $\sigma_1$, not the variance, as its last argument.
normCDF(lower bound, upper bound, $\hat{\mu}_1$, $\sigma_1$)
$$P(Y \geq 80) = \mathrm{normCDF}(80, 10000, 78.56, 31.27) = 0.482$$
$$p_{good} = 0.482$$
$p_{good}$: Probability of Marshawn Lynch having a good game under the Normal Distribution
Therefore, the probability of Marshawn Lynch having a ‘good game’ under the Normal distribution
model is given by 0.482.
Like the Poisson distribution calculation, we can use the binomial distribution to find the expected
number of games Marshawn Lynch will perform in under this model.
$$X_2 \sim B(N, p_{good})$$
$X_2$: Number of good games by Marshawn Lynch (Normal Distribution)
The expected value for the Binomial distribution is given by the formula:
$$E(X_2) = N \cdot p_{good}$$
Therefore, the expected number of games in which Marshawn Lynch will perform is given by:
$$E(X_2) = 16 \cdot 0.482 \approx 8 \text{ games}$$
We can now analyze the Russell Wilson data set using the normal distribution model. (Calculation 4)
By similar calculation to Calculation 3, we can derive the following normal distribution function using
$\hat{\mu}_2$ and $\sigma_2^2$:
$$Z \sim N(\hat{\mu}_2, \sigma_2^2)$$
$$\hat{\mu}_2 = 100.03$$
$$\sigma_2 = 29.77$$
Like the previous calculation for the Marshawn Lynch Normal Distribution data set:
$$P(Z \geq 95) = \mathrm{normCDF}(95, 10000, 100.03, 29.77) = 0.567$$
$$p_{good} = 0.567$$
$p_{good}$: Probability of Russell Wilson having a good game under the Normal Distribution
We can use the binomial distribution to find the expected number of games Russell Wilson will perform
in under this model.
$$W_2 \sim B(n, p_{good})$$
$W_2$: Number of good games for Russell Wilson (Normal distribution)
Therefore, the expected number of games in which Russell Wilson will perform is given by:
$$E(W_2) = 16 \cdot 0.567 \approx 9 \text{ games}$$
Therefore, we can see that the probabilities of a ‘good game’ differ between the Normal and the Poisson
distribution models. Under the Poisson distribution, the probability of Russell Wilson having a ‘good
game’ (0.701) is much higher than that of Marshawn Lynch (0.450). However, under the Normal
distribution, the probabilities are much closer numerically (0.567 and 0.482). This is because the
Poisson distribution forces the variance to equal the mean, whereas the Normal distribution uses the
much larger variances calculated from the data (977.62 and 886.44), which spread the probability more
widely around the mean.
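The Normal model probabilities from Calculations 3 and 4 can be reproduced with the complementary error function from Python's standard library. This is a small sketch rather than the calculator procedure used in the investigation; the helper name `normal_tail` and the `players` dictionary are illustrative, and the calculator's upper bound of 10000 is treated as infinity.

```python
import math

def normal_tail(threshold, mean, sd):
    """P(X >= threshold) for X ~ N(mean, sd^2), via the complementary error function."""
    z = (threshold - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

GAMES = 16

# (good-game threshold, mean, standard deviation) from Calculations 3 and 4
players = {
    "Marshawn Lynch": (80, 78.56, 31.27),
    "Russell Wilson": (95, 100.03, 29.77),
}

results = {}
for name, (threshold, mean, sd) in players.items():
    p_good = normal_tail(threshold, mean, sd)
    results[name] = (p_good, GAMES * p_good)  # probability, expected good games
    print(f"{name}: p_good = {p_good:.3f}, expected good games = {GAMES * p_good:.1f}")
```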
Therefore, to compare which model would fit better, we must validate our model with the real
performances by both the players in the 2014-15 NFL Season. The game logs for both players are
included below.
Table 5: Russell Wilson game log (2014-2015) data calculations
‘Z₂’ is the passer rating for a given game in the 2014-2015 season
Variance for the data in Table 5 is calculated in the same way it was in Table 4
$$\sigma_3^2 = 641.42$$
Table 6: Marshawn Lynch game log (2014-2015) data calculations
‘Y₂’ is the rushing yards for a given game in the 2014-15 NFL season
Variance for the data in Table 6 is calculated in the same way it was in Table 3
$$\sigma_4^2 = 871.48$$
To observe which probability distribution model is a better fit, we shall use the Chi square test for best
fit. The Chi square test is used to observe whether the observed value of a given phenomenon is
significantly different from the expected value. Effectively, we can see how well the theoretical
distribution fits the empirical distribution for both the Poisson and Normal distribution models.
Calculation 5: Chi-square best fit set for Russell Wilson data sets.
Poisson distribution:
Our null hypothesis $H_0$ is that the performances of Russell Wilson follow the Poisson distribution
outlined in Calculation 2. Our alternative hypothesis $H_a$ is that the performances do not follow the
Poisson distribution. We can categorize his performances as him either having a ‘good game’ or not
having a ‘good game’. Based on this, we can construct Table 7 below.
Table 7: Comparing outcome of the Poisson distribution model for Russell Wilson.
Outcome                        No Good Game    Good Game
Actual Occurrence (Games)      6               10
Expected Occurrence (Games)    5               11
Based on the data from Table 7, we can construct our $\chi^2$ statistic as given below:
$$\chi^2 = \frac{(6 - 5)^2}{5} + \frac{(10 - 11)^2}{11}$$
$$\chi^2 = 0.291$$
$$\text{Degrees of Freedom} = \text{Number of outcome categories} - 1$$
$$\text{Degrees of Freedom} = 2 - 1 = 1$$
For the purpose of practicality, I will be taking a 0.10 significance level.
Therefore, we can calculate the p-value, i.e. the probability that the chi-square statistic is more extreme
than the calculated value:
$$p\text{-value} = P(\chi^2 > 0.291) = 0.59$$
This is significantly larger than our significance level. Therefore, our Null hypothesis cannot be
rejected, and the Poisson distribution is applicable.
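The chi-square statistic and its p-value for one degree of freedom can be checked with a few lines of Python. This is a sketch under the assumption that the test has exactly two categories (so one degree of freedom); in that case the chi-square variable is the square of a standard normal variable, and the p-value reduces to a complementary error function. The function names are illustrative.

```python
import math

def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_one_df(chi_sq):
    """P(chi^2 > chi_sq) for 1 degree of freedom.

    With 1 degree of freedom, chi^2 is the square of a standard normal
    variable, so the upper tail equals erfc(sqrt(chi_sq / 2)).
    """
    return math.erfc(math.sqrt(chi_sq / 2))

# Table 7: [no good game, good game] for Russell Wilson under the Poisson model
observed = [6, 10]
expected = [5, 11]

chi_sq = chi_square_statistic(observed, expected)
p_value = p_value_one_df(chi_sq)

print(f"chi-square = {chi_sq:.3f}, p-value = {p_value:.2f}")
```

The same two functions can be reused for Tables 8 to 10 by swapping in the corresponding observed and expected counts.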
Normal distribution:
Our null hypothesis $H_0$ is that the performances of Russell Wilson follow the Normal distribution
outlined in Calculation 4. Our alternative hypothesis $H_a$ is that the performances do not follow the
Normal distribution. Therefore, we can construct Table 8 below based on the previous categorization.
Table 8: Comparing outcome of the Normal distribution model for Russell Wilson.
Outcome                        No Good Game    Good Game
Actual Occurrence (Games)      6               10
Expected Occurrence (Games)    7               9
By similar calculation to the previous part of this calculation, the chi-square statistic is as given below,
with each squared difference divided by the expected occurrence from Table 8:
$$\chi^2 = \frac{(6 - 7)^2}{7} + \frac{(10 - 9)^2}{9}$$
$$\chi^2 = 0.254$$
The degrees of freedom will remain 1, and the significance level will also remain 0.10.
$$p\text{-value} = P(\chi^2 > 0.254) = 0.614$$
This value is larger than our significance level. Therefore, our Null hypothesis cannot be rejected, and
the Normal distribution is applicable.
However, it can be observed that the p-value for the Normal distribution model is higher than the
p-value for the Poisson distribution model. Therefore, the logical conclusion would be that the Normal
distribution model is a better fit for Russell Wilson’s performances.
Calculation 6: Chi-square best fit set for Marshawn Lynch data sets.
Poisson distribution:
Our null hypothesis $H_0$ is that the performances of Marshawn Lynch follow the Poisson distribution
outlined in Calculation 1. Our alternative hypothesis $H_a$ is that the performances do not follow the
Poisson distribution. We can also categorize his performances as him either having a ‘good game’ or
not having a ‘good game’. Based on this categorization, we can construct Table 9 below.
Table 9: Comparing outcome of the Poisson distribution model for Marshawn Lynch.
Outcome                        No Good Game    Good Game
Actual Occurrence (Games)      8               8
Expected Occurrence (Games)    9               7
Based on the data from Table 9, we can construct our $\chi^2$ statistic as given below:
$$\chi^2 = \frac{(8 - 9)^2}{9} + \frac{(8 - 7)^2}{7}$$
$$\chi^2 = 0.254$$
The degrees of freedom will remain 1, and the significance level will also remain 0.10.
$$p\text{-value} = P(\chi^2 > 0.254) = 0.614$$
The p value is larger than our significance level. Therefore, our Null hypothesis cannot be rejected, and
the Poisson distribution is applicable.
Normal distribution:
Our null hypothesis $H_0$ is that the performances of Marshawn Lynch follow the Normal distribution
outlined in Calculation 3. Our alternative hypothesis $H_a$ is that the performances do not follow the
Normal distribution. Therefore, we can construct Table 10 below based on the previous categorization.
Table 10: Comparing outcome of the Normal distribution model for Marshawn Lynch.
Outcome                        No Good Game    Good Game
Actual Occurrence (Games)      8               8
Expected Occurrence (Games)    8               8
Based on the data from Table 10, we can construct our $\chi^2$ statistic as given below:
$$\chi^2 = \frac{(8 - 8)^2}{8} + \frac{(8 - 8)^2}{8}$$
$$\chi^2 = 0$$
The degrees of freedom will remain 1, and the significance level will also remain 0.10.
$$p\text{-value} = P(\chi^2 > 0) = 1$$
This value is larger than our significance level. Therefore, our Null hypothesis cannot be rejected, and
the Normal distribution is applicable.
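However, surprisingly, the expected occurrences under the Normal distribution model match the actual
occurrences exactly, making it a perfect fit for the performances of Marshawn Lynch at this
two-category level.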
Therefore, the Chi square test of best fit suggests that the performances in a game for Russell Wilson
and Marshawn Lynch are both normally distributed. Furthermore, we can say that the performances of
both players from the previous season are similar to their performances in the 2014-2015 season, thus
validating our previously made assertion that the performances of a player’s previous season can be
used to model his/her next one (to an extent).
However, there remains a small flaw with this test. 16 games is a relatively small sample size when
considering that both Russell Wilson and Marshawn Lynch have played numerous games together over
the course of their careers. Therefore, we must ensure that the performances of both Marshawn Lynch
and Russell Wilson come from the same population as an additional validation test.
Therefore, the T-value test will be used, which indicates how significant the differences between the
two groups (the 2013-2014 and 2014-2015 seasons for each player) are, and whether the observed
differences between the two data sets could be due to chance. If the calculated T value is smaller in
magnitude than the critical T value from the T-distribution tables, there is insufficient evidence of a
significant difference between the 2013-14 and 2014-15 seasons for the player in question.
As the size of the data sets is the same for the 2013-14 and the 2014-15 NFL seasons for both players,
the equal variance t-test will be used.
Calculation 7: T-value test for the Russell Wilson data sets to ensure uniformity of the model.
$$H_0: \hat{\mu}_2 = \hat{\mu}_3$$
$$H_a: \hat{\mu}_2 \neq \hat{\mu}_3$$
$$t = \frac{\hat{\mu}_2 - \hat{\mu}_3}{\sqrt{\dfrac{(N - 1) \times \sigma_2^2 + (N - 1) \times \sigma_3^2}{N + N - 2}} \times \sqrt{\dfrac{1}{N} + \dfrac{1}{N}}}$$
$$\text{Degrees of Freedom} = N + N - 2$$
N: number of games in each data set.
$\hat{\mu}_3$, $\sigma_3^2$: mean and variance of the 2014-2015 passer ratings (Table 5).
The first square root is the pooled standard deviation, i.e. the square root of the pooled variance of
the two seasons.
Substituting the values from the 2013-2014 Normal distribution model and the 2014-2015 season from
the Russell Wilson data sets gives us:
$$t = \frac{100.03 - 95.1}{\sqrt{\dfrac{(16 - 1) \times 886.44 + (16 - 1) \times 641.42}{16 + 16 - 2}} \times \sqrt{\dfrac{1}{16} + \dfrac{1}{16}}} = \frac{4.93}{27.64 \times 0.3536} = 0.504$$
$$\text{Degrees of Freedom} = 16 + 16 - 2 = 30$$
Using the Degrees of Freedom value 30 and a significance level of 0.05, the T-value distribution tables
give us a value of 1.697 (the one-tailed critical value; the two-tailed value of 2.042 leads to the same
conclusion). Since 0.504 < 1.697, the calculated t-value is smaller than the t-value from the distribution
tables, so the null hypothesis cannot be rejected. We can surmise that the differences between the
modelled distribution and what actually occurred during the 2014-2015 season with regards to Russell
Wilson’s performances were not due to intrinsic differences between the data sets, and that the two
seasons are consistent with having come from the same population.
Calculation 8: T-value test for the Marshawn Lynch data sets to ensure uniformity of the model.
The T-value formula is the same as the one used in Calculation 7, and the degrees of freedom value
also remains 30, as the Marshawn Lynch data set sizes are equal to the Russell Wilson data set sizes.
The means are 78.56 (2013-2014) and 81.625 (2014-2015), and the variances are 977.62 and 871.48.
$$t = \frac{78.56 - 81.625}{\sqrt{\dfrac{(16 - 1) \times 977.62 + (16 - 1) \times 871.48}{16 + 16 - 2}} \times \sqrt{\dfrac{1}{16} + \dfrac{1}{16}}} = \frac{-3.065}{30.41 \times 0.3536} = -0.285$$
As the degrees of freedom and the significance level remain the same as in Calculation 7, the T-value
distribution tables give us a value of 1.697. The magnitude of the calculated T-value, 0.285, is well
below the one from the distribution tables, so the null hypothesis cannot be rejected. Therefore, we
can also surmise that the differences between the modelled distribution and what actually occurred
during the 2014-2015 season with regards to Marshawn Lynch’s performances were not due to intrinsic
differences between the data sets, and that the two seasons are consistent with having come from the
same population.
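For readers who want to check the two t-tests numerically, the sketch below computes the equal-variance (pooled) t statistic from the season means and variances quoted in this investigation. The function name `pooled_t_statistic` is illustrative. The key assumption, stated in the code, is that the denominator uses the pooled standard deviation, which is the square root of the pooled variance.

```python
import math

def pooled_t_statistic(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Equal-variance two-sample t statistic from summary statistics."""
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    # Standard error uses the pooled standard deviation, not the variance
    standard_error = math.sqrt(pooled_var) * math.sqrt(1 / n_a + 1 / n_b)
    return (mean_a - mean_b) / standard_error

N = 16  # games per regular season

# (2013-14 mean, variance) against (2014-15 mean, variance)
t_wilson = pooled_t_statistic(100.03, 886.44, N, 95.1, 641.42, N)
t_lynch = pooled_t_statistic(78.56, 977.62, N, 81.625, 871.48, N)
degrees_of_freedom = N + N - 2

print(f"Wilson: t = {t_wilson:.3f}, Lynch: t = {t_lynch:.3f}, df = {degrees_of_freedom}")
```

Both statistics are compared in absolute value with the critical value from the t-distribution tables for 30 degrees of freedom.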
Now, we have clarified that the performances of Marshawn Lynch and Russell Wilson can be modelled
best through a normal distribution. However, in our validation tests, we have benefitted from the
advantage of hindsight, as we were able to make comparisons between the actual occurrences and the
expected occurrences of our models. To look at the examined phenomenon from a different angle, I
have decided to take a perspective similar to that of the Seattle Seahawks coach right after the 2013-14
season, as he would not have had the advantage of hindsight that we have currently.
Based on the eye test, we have established that the prevailing NFL opinion was that Russell Wilson’s and
Marshawn Lynch’s performances in games were independent of each other, due to them controlling
different facets of the offense. The calculation below shows what independence implies, using the
‘good game’ probabilities from the Normal distribution model (0.482 for Marshawn Lynch and 0.567 for
Russell Wilson). For independent events A and B:
$$P(A \cap B) = P(A)P(B)$$
$$P(ML_{gg} \cap RW_{n\text{-}gg}) = 0.482 \times 0.433 = 0.2087$$
$$P(ML_{n\text{-}gg} \cap RW_{gg}) = 0.518 \times 0.567 = 0.2937$$
$$P(ML_{gg} \cap RW_{gg}) = 0.482 \times 0.567 = 0.2733$$
Here the subscript gg denotes a ‘good game’ and n-gg denotes no ‘good game’ for Marshawn Lynch (ML)
or Russell Wilson (RW).
This information can be represented in the form of a Venn Diagram, as shown below:
$$P(ML_{gg} \cup RW_{gg}) = 0.2087 + 0.2733 + 0.2937 = 0.78$$
The probability of 0.78 is relatively high. Thus, we can say that the probability of the Seattle Seahawks
offense, which relies heavily on the performances of the two players, being effective is also relatively
high. This allows bookmakers and Coaches to forecast the effectiveness of the offense and plan
accordingly.
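The independence calculation can be laid out as a short Python sketch that fills in every region of the Venn diagram from the two Normal-model probabilities. The function name `joint_probabilities` and the dictionary keys are illustrative labels, not notation from the investigation.

```python
P_ML_GOOD = 0.482  # Marshawn Lynch good game (Normal model)
P_RW_GOOD = 0.567  # Russell Wilson good game (Normal model)

def joint_probabilities(p_a, p_b):
    """Region probabilities for two independent events A and B."""
    return {
        "only_a": p_a * (1 - p_b),
        "only_b": (1 - p_a) * p_b,
        "both": p_a * p_b,
        "neither": (1 - p_a) * (1 - p_b),
    }

regions = joint_probabilities(P_ML_GOOD, P_RW_GOOD)

# Probability that at least one of the two players has a good game
p_at_least_one = regions["only_a"] + regions["only_b"] + regions["both"]

for label, value in regions.items():
    print(f"{label}: {value:.4f}")
print(f"at least one good game: {p_at_least_one:.2f}")
```

The four regions always sum to 1, which gives a quick consistency check on the hand calculation.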
However, is there a link that can be drawn in between the performance models of Marshawn Lynch and
Russell Wilson derived from the 2013-14 season? To see if
this was the case, I decided to construct a Pearson’s
correlation coefficient.
Calculation 8: A test for correlation between the data sets
$$\rho_{YZ} = \frac{Cov(Y, Z)}{\sigma_1 \sigma_2}$$
$$Cov(Y, Z) = \frac{\sum (Y - \hat{\mu}_1) \times (Z - \hat{\mu}_2)}{N - 1}$$
$\rho_{YZ}$: Pearson’s correlation coefficient
$Cov(Y, Z)$: Co-variance of the performances of Marshawn Lynch and Russell Wilson from the 2013-14
season
$\sigma_1$: Standard deviation of Marshawn Lynch data set
$\sigma_2$: Standard deviation of Russell Wilson data set
$\hat{\mu}_1$: Mean of Marshawn Lynch data set
$\hat{\mu}_2$: Mean of Russell Wilson data set
N: Number of values
Co-variance of the datasets was calculated by substituting the values from Tables 3 and 4:
$$Cov(Y, Z) = -221.996$$
$$\rho_{YZ} = \frac{-221.996}{29.77 \times 31.27}$$
$$\rho_{YZ} = -0.24$$
Thus, we reach a surprising result. It appears that there is a weak negative correlation between the
Marshawn Lynch and Russell Wilson data set models. The result of this calculation tells us that the
assumption of a completely independent relationship between rushing and passing offenses in the
Seattle Seahawks is not entirely correct, and therefore, the probability given by the Venn Diagram is
only an approximation.
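A short Python sketch of the correlation step is given below. The first part reproduces the coefficient from the covariance and standard deviations quoted above. The second part is a generic helper, `pearson_r`, for paired game-by-game values; its name and the small sample lists are made up purely for illustration, since the game logs themselves are not reproduced here.

```python
import math

# Values quoted in the investigation
COVARIANCE = -221.996
SD_LYNCH = 31.27
SD_WILSON = 29.77

rho = COVARIANCE / (SD_LYNCH * SD_WILSON)
print(f"rho = {rho:.2f}")

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists of paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dev_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_sq_x = sum((x - mean_x) ** 2 for x in xs)
    sum_sq_y = sum((y - mean_y) ** 2 for y in ys)
    return dev_products / math.sqrt(sum_sq_x * sum_sq_y)

# Hypothetical paired values (not real game data), for illustration only
sample_yards = [60, 95, 110, 72]
sample_ratings = [120.0, 88.5, 79.0, 101.2]
print(f"sample r = {pearson_r(sample_yards, sample_ratings):.2f}")
```

The helper uses the same denominator for the covariance and both standard deviations, so the choice between N and N - 1 cancels out.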
Furthermore, we can use this data to construct a bi-variate normal distribution for both Marshawn Lynch
and Russell Wilson’s performances, seeing as to how the models we have adopted for both players
suggest that their performances can be approximated by a normal distribution. The bi-variate normal
distribution in this case will be made up of two random variables: Rushing Yards in a game by
Marshawn Lynch (Y) and Passer Rating for Russell Wilson (Z), which have been defined earlier in the
investigation.
I have sourced a paper which gives us the joint probability density function for the bi-variate distribution
of Y and Z. The form of the equation for the joint probability density function is given below as:
$$f_{YZ}(y, z) = \frac{1}{2\pi\sqrt{1 - \rho_{YZ}^2} \times \sigma_1 \times \sigma_2} \times e^{-\frac{a}{2(1 - \rho_{YZ}^2)}}$$
Where:
$$a = \frac{(y - \hat{\mu}_1)^2}{\sigma_1^2} - \frac{2\rho_{YZ}(y - \hat{\mu}_1)(z - \hat{\mu}_2)}{\sigma_1 \sigma_2} + \frac{(z - \hat{\mu}_2)^2}{\sigma_2^2}$$
Substituting the values (which have been defined earlier in the investigation) gives us the equation
below for the probability density function.
$$f_{YZ}(y, z) = (1.76 \times 10^{-4}) \times e^{-\frac{a}{1.885}}$$
Where:
$$a = \frac{(y - 78.56)^2}{977.62} + \frac{0.48(y - 78.56)(z - 100.03)}{930.91} + \frac{(z - 100.03)^2}{886.44}$$
The middle term is positive because $-2\rho_{YZ} = 0.48$, and its denominator is
$\sigma_1 \sigma_2 = 31.27 \times 29.77 = 930.91$.
However, this only gives us the joint probability density function. We require the joint cumulative
distribution function (cdf) to arrive at a possibly meaningful result. A joint cdf would require a
calculation of the form below:
$$F(y, z) = \int_{q}^{r} \int_{s}^{t} f_{YZ}(y, z) \, dy \, dz$$
s, t: lower and upper limits for the ‘y’ variable
q, r: lower and upper limits for the ‘z’ variable
Based on the above formula, we can derive probabilities for when either or both Marshawn Lynch and
Russell Wilson perform in a given game. The usage of double integrals was not taught in class and was
a difficult concept to grasp, so I decided to use Wolfram Alpha to expedite the process.
Calculation 9: Showing the probability divergence without independence
Case 1: Marshawn Lynch has a ‘good game’ and Russell Wilson doesn’t
$$\int_{0}^{95} \int_{80}^{296} (1.76 \times 10^{-4}) \times e^{-\frac{a}{1.885}} \, dy \, dz$$
Using Wolfram Alpha, the probability is 0.2443
Case 2: Russell Wilson has a ‘good game’ and Marshawn Lynch doesn’t
$$\int_{95}^{158.3} \int_{0}^{80} (1.76 \times 10^{-4}) \times e^{-\frac{a}{1.885}} \, dy \, dz$$
Using Wolfram Alpha, the probability is 0.3104
Case 3: Both Russell Wilson and Marshawn Lynch have ‘good games’
$$\int_{95}^{158.3} \int_{80}^{296} (1.76 \times 10^{-4}) \times e^{-\frac{a}{1.885}} \, dy \, dz$$
Using Wolfram Alpha, the probability is 0.2299
The upper bound values of 158.3 and 296 were chosen because they are natural limits for the two
variables: 296 is the highest number of rushing yards ever recorded in a game, while 158.3 is the
highest possible value of the passer rating statistic.
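The three double integrals can also be approximated without Wolfram Alpha. The sketch below is one possible approach, assuming a simple midpoint rule on a 300 by 300 grid and the parameter values quoted in this investigation (means 78.56 and 100.03, standard deviations 31.27 and 29.77, correlation -0.24). The function names and the grid size are illustrative choices, and the results should be close to, but not digit-for-digit identical with, the Wolfram Alpha figures.

```python
import math

MU_Y, SD_Y = 78.56, 31.27    # Marshawn Lynch rushing yards
MU_Z, SD_Z = 100.03, 29.77   # Russell Wilson passer rating
RHO = -0.24                  # correlation coefficient

def joint_pdf(y, z):
    """Bivariate normal probability density at (y, z)."""
    u = (y - MU_Y) / SD_Y
    v = (z - MU_Z) / SD_Z
    a = u * u - 2 * RHO * u * v + v * v
    norm = 2 * math.pi * SD_Y * SD_Z * math.sqrt(1 - RHO ** 2)
    return math.exp(-a / (2 * (1 - RHO ** 2))) / norm

def rect_prob(y_lo, y_hi, z_lo, z_hi, steps=300):
    """Midpoint-rule approximation of the double integral over a rectangle."""
    dy = (y_hi - y_lo) / steps
    dz = (z_hi - z_lo) / steps
    total = 0.0
    for i in range(steps):
        y = y_lo + (i + 0.5) * dy
        for j in range(steps):
            z = z_lo + (j + 0.5) * dz
            total += joint_pdf(y, z)
    return total * dy * dz

case_probs = {
    "lynch_only": rect_prob(80, 296, 0, 95),      # Case 1
    "wilson_only": rect_prob(0, 80, 95, 158.3),   # Case 2
    "both": rect_prob(80, 296, 95, 158.3),        # Case 3
}

for label, value in case_probs.items():
    print(f"{label}: {value:.4f}")
```

A midpoint rule is used here only because it needs no external libraries; any numerical integration routine would serve the same purpose.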
Using the above values, we can construct another Venn Diagram:
If we compare the 2 Venn diagrams that we have constructed for the performances of Russell Wilson
and Marshawn Lynch, we can see that the probabilities of only Marshawn Lynch or only Russell Wilson
having a ‘good game’ are slightly higher using the bi-variate normal distribution (0.2443 and 0.3104)
than under the assumption of complete independence (0.2087 and 0.2937), while the probability of both
having ‘good games’ is lower (0.2299 compared with about 0.273). This is because of the slight negative
relationship between the performances of both players (through the modelling of the 2013-2014
performances of the players). However, the overall probability of Marshawn Lynch and/or Russell
Wilson performing in a game remains approximately the same (0.2443 + 0.3104 + 0.2299 ≈ 0.78)
because the negative correlation coefficient is still relatively small in magnitude. Therefore,
independence would still give a close approximation, but based on the examination of the data set,
using a bivariate distribution model would yield more in-depth results.
Looking at this phenomenon from a different perspective allows us to understand how coaches can
examine similar relationships and probabilities across multiple players, which would allow them to
determine the optimal line-ups for the Seattle Seahawks for the next season based on historical evidence.
Conclusion
Based on our examination of the performances of Russell Wilson and Marshawn Lynch from 2013-2015,
we saw that we could use the previous season’s data to model a close approximation of a player’s
performance for the next season. Using the Chi-square and T value tests, we validated that the Normal
distribution model would be a close fit for their individual performances during the time period.
Subsequently, to look at the data sets from a different angle, we adopted the perspective of coaches
during the period right after the 2013-14 season and before the 2014-2015 season. Examining the data
set models for the performances led us to determine that Marshawn Lynch’s and Russell Wilson’s
performances were not completely independent of each other, but that there was a slight negative
correlation between the two. Based on this, we were able to construct a bi-variate normal cdf, which
showed us that the probabilities of both/either player having a ‘good game’ would be different from
when complete independence was assumed. The probability of only one of the players having a ‘good
game’ was higher, and the probability of both having a ‘good game’ in the same game was lower, than
under independence.
Significance
However, is there any possible reason as to why complete independence was not observed i.e. a result
contrary to NFL opinion? If one reflects on it, Offenses tend to subconsciously lean towards the plays
that allow the team to advance up the field. When the passing offense isn’t working effectively (Russell
Wilson), teams will tend to favor rushing the ball (Marshawn Lynch) and vice versa. However, the
correlation coefficient doesn’t have a larger value likely because teams still must vary their offensive
approach to prevent predictability. This is why joint performances are still relatively difficult to forecast.
The result I obtained was also significant because I realized that more detailed statistical analysis can
be conducted with even more players by using multivariate distributions, which will allow coaches to
identify the best-performing line-ups for games.
Limitations
We defined the ‘good game’ parameter for this investigation as something rigid. In reality, each football
game differs, and as someone who watches games regularly, I know that the definition of a good game can
vary from game to game. Therefore, quantifying this parameter is a lot more difficult than our assumption
suggests. In line with this, defining different degrees of a ‘good game’ could have led to more categories
in the Chi-square goodness-of-fit test (which could have increased the accuracy of our test result).
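To illustrate how the number of categories enters the Chi-square goodness-of-fit test, here is a minimal Python sketch. The 16 yardage values, the model parameters (mean 210, standard deviation 60) and the category boundaries are all hypothetical and chosen only for illustration; the helper names are my own. The degrees of freedom are taken as (number of categories - 1), which assumes the model parameters were not estimated from the same games being tested.

```python
from statistics import NormalDist


def bin_probabilities(dist, cuts):
    """Model probabilities of the bins (-inf, c1], (c1, c2], ..., (ck, inf)."""
    cdfs = [0.0] + [dist.cdf(c) for c in cuts] + [1.0]
    return [hi - lo for lo, hi in zip(cdfs, cdfs[1:])]


def count_in_bins(values, cuts):
    """Observed counts per bin; `cuts` must be sorted in increasing order."""
    counts = [0] * (len(cuts) + 1)
    for v in values:
        counts[sum(v > c for c in cuts)] += 1
    return counts


def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical passing yards for 16 games (NOT real data)
yards = [172, 240, 198, 305, 151, 226, 263, 189,
         211, 284, 142, 233, 257, 176, 219, 321]
model = NormalDist(mu=210, sigma=60)  # assumed model from a previous season

for label, cuts in [("two categories", [210]),
                    ("three categories", [180, 240])]:
    observed = count_in_bins(yards, cuts)
    expected = [p * len(yards) for p in bin_probabilities(model, cuts)]
    stat = chi_square_statistic(observed, expected)
    dof = len(observed) - 1
    print(f"{label}: chi-square = {stat:.3f}, degrees of freedom = {dof}")
```

One design consideration: with only 16 games per season, adding categories also shrinks the expected count in each category, and a common rule of thumb asks for expected counts of about 5 or more, so the number of usable categories is limited.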
Furthermore, probability in sports is something that is extremely difficult to model, as there are a lot of
external factors that affect each game, such as the talent level of opponents, the defensive strategies the
opponents adopt against the Seattle Seahawks offense and general intangibles such as team morale,
fitness, travel time etc. It is impossible to quantify the effects of all of these factors, which is why
models have a relatively high chance of diverging from reality.
Extensions and Links to other fields
We can use other probability distributions. In this investigation, we were limited to the Normal and
Poisson distributions. In reality, there may be other distributions that are an even better fit, such as the
log-normal distribution (which restricts the minimum possible value of the random variable to zero,
thus improving accuracy) or the gamma distribution (due to its longer tail accounting for the
possibility of higher values such as record-breaking performances).
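The effect of the zero lower bound can be seen in a short Python sketch. It compares a Normal model with a log-normal model that has the same mean and standard deviation; the values (mean 80 yards, standard deviation 40 yards) and the 150-yard threshold are hypothetical, and the function names are my own. The gamma distribution is not covered here.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def lognormal_from_moments(mean, sd):
    """Return (mu_log, sigma_log) of the log-normal with the given mean and sd."""
    sigma_log_sq = log(1.0 + (sd / mean) ** 2)
    mu_log = log(mean) - sigma_log_sq / 2.0
    return mu_log, sqrt(sigma_log_sq)


def lognormal_cdf(x, mu_log, sigma_log):
    """P(X <= x) for a log-normal variable; zero for all x <= 0."""
    if x <= 0:
        return 0.0
    return NormalDist(mu_log, sigma_log).cdf(log(x))


# Hypothetical rushing-yard model parameters (NOT fitted to real data)
mean_yards, sd_yards = 80.0, 40.0
normal_model = NormalDist(mean_yards, sd_yards)
mu_log, sigma_log = lognormal_from_moments(mean_yards, sd_yards)

# Probability the model assigns to (impossible) negative values
normal_below_zero = normal_model.cdf(0.0)
lognormal_below_zero = lognormal_cdf(0.0, mu_log, sigma_log)

# Probability of a very large game, here more than 150 yards
normal_tail = 1.0 - normal_model.cdf(150.0)
lognormal_tail = 1.0 - lognormal_cdf(150.0, mu_log, sigma_log)

print(f"P(X < 0):   normal {normal_below_zero:.4f}, log-normal {lognormal_below_zero:.4f}")
print(f"P(X > 150): normal {normal_tail:.4f}, log-normal {lognormal_tail:.4f}")
```

Under these assumed parameters, the Normal model places a small but non-zero probability on negative values, while the log-normal model places none and gives more weight to very large values.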
The mathematics we used was predominantly based on probability and statistics, but the data analysed
can be extended to the mathematical fields of estimation and hypothesis testing. If we were to look for
alternative applications of this area of mathematics, the same methodology can be used by soccer coaches
to model player performances in a time period, and to see which player combinations can give
the highest probability of the whole team performing in the same game. Similarly, it could also be
applied to other sports by coaches looking to determine what their best team is before the start of the
season, through deeper analysis beyond the eye-test.
Evaluation
As stated before in the Limitations section, probability in sports is difficult to forecast due to the
varying nature of games. Until the games are actually played, the best we can do is an approximation
that may not always hold true.
As stated before, the calculation done in this investigation can be used by coaches during play calling
for either a pass or a rush, depending on which player appears to be performing and how many games
they have performed in to date. I have used standard level statistics such as calculations of the mean
and variance, higher level probability through the usage of Poisson and Normal distribution models,
and mathematics beyond the scope of the syllabus such as the Chi-square test, the t-value test and
bivariate normal distributions, and I benefitted greatly from the usage of online software such as
Desmos and from the TI-Nspire CX calculator. Therefore, I can state that I now understand the
importance sports statistics can have in making important team decisions (which is perhaps why many
sports teams rely on analytics today to optimize the performance of the team). However, sports like
gridiron football are, in the end, based on a lot more factors than just numbers, which is what makes
them so exciting.
Bibliography
Data and Education Resources
“4.2 - Bivariate Normal Distribution | STAT 505.” Accessed February 10, 2020.
https://online.stat.psu.edu/stat505/lesson/4/4.2.
“2017 NFL Rushing | Pro-Football-Reference.Com.” Accessed February 10, 2020.
https://www.pro-football-reference.com/years/2017/rushing.htm.
“Bivariate Normal Distribution -- from Wolfram MathWorld.” Accessed February 10, 2020.
http://mathworld.wolfram.com/BivariateNormalDistribution.html.
“Chi-Square Goodness of Fit Test - Statistics Solutions.” Accessed February 10, 2020.
https://www.statisticssolutions.com/chi-square-goodness-of-fit-test/.
“Marshawn Lynch 2013 Game Log | Pro-Football-Reference.Com.” Accessed February 10, 2020.
https://www.pro-football-reference.com/players/L/LyncMa00/gamelog/2013/.
“Marshawn Lynch 2014 Game Log | Pro-Football-Reference.Com.” Accessed February 10, 2020.
https://www.pro-football-reference.com/players/L/LyncMa00/gamelog/2014/.
“NFL Player Stats | NFL.Com.” Accessed February 10, 2020. http://www.nfl.com/stats/player.
“Rush (Gridiron Football) - Wikipedia.” Accessed February 10, 2020.
https://en.wikipedia.org/wiki/Rush_(gridiron_football).
“Russell Wilson 2013 Game Log | Pro-Football-Reference.Com.” Accessed February 10, 2020.
https://www.pro-football-reference.com/players/W/WilsRu00/gamelog/2013/.
“Russell Wilson 2014 Game Log | Pro-Football-Reference.Com.” Accessed February 10, 2020.
https://www.pro-football-reference.com/players/W/WilsRu00/gamelog/2014/.
“T-Test Definition.” Accessed February 10, 2020. https://www.investopedia.com/terms/t/t-test.asp.
“What Is Passing Yards? Definition from SportingCharts.Com.” Accessed February 10, 2020.
https://www.sportingcharts.com/dictionary/nfl/passing-yards/.
“What Is Receiving Yards? Definition from SportingCharts.Com.” Accessed February 10, 2020.
https://www.sportingcharts.com/dictionary/nfl/receiving-yards/.
Online Calculators:
“Double Integral Calculator: Wolfram|Alpha.” Accessed February 10, 2020.
https://www.wolframalpha.com/calculators/double-integral-calculator.
“Normal Distribution.” Accessed February 10, 2020.
https://www.desmos.com/calculator/2kmx0enkkz.
“Poisson Distribution Formula.” Accessed February 10, 2020.
https://www.desmos.com/calculator/qo0442muda.
“Quick P Value from Chi-Square Score Calculator.” Accessed February 10, 2020.
https://www.socscistatistics.com/pvalues/chidistribution.aspx.
“T-Value Calculator | Good Calculators.” Accessed February 10, 2020.
https://goodcalculators.com/student-t-value-calculator/.
Video Tutorials
Maths Resources, director. Joint Probability Distributions for Continuous Random Variables - Worked
Example. YouTube, YouTube, 14 Nov. 2015, www.youtube.com/watch?v=Om68Hkd7pfw.
Stepbil, director. Joint PDF #3 - Deriving Joint Cumulative Distribution Function from Joint PDF. YouTube,
YouTube, 8 Jan. 2011, www.youtube.com/watch?v=EPjnUF952B8.