Math IA Final Draft

Mathematics IA First Draft
Rishi Ganesh Ravichandran

Aim: Examining the performances of Quarterback Russell Wilson and Running Back Marshawn Lynch in the 2014-2015 NFL season for the Seattle Seahawks.

Rationale: As a child brought up in the United States of America, I grew up watching all kinds of sports, such as Cricket, Soccer, Basketball and Gridiron Football. With my father being a die-hard fan of the Seattle Seahawks in the NFL, I also became a huge fan of Gridiron football and of the Seahawks, waking up in the wee hours of the morning just to catch the games after we moved to Singapore. Gridiron football fascinated me because it offers multiple types of play through which points can be scored, namely Passing and Rushing of the ball. The Quarterback mainly advances the ball by passing, while the Running Back advances the ball by rushing. The eye-test and the accounts of so-called ‘TV analysts’ state that rushing and passing moves are completely independent of each other, but I believed that I could find an underlying link, especially in the context of my favorite team. I chose the combination of Russell Wilson and Marshawn Lynch from the years 2013-2015 because, in my opinion, these two epitomized the peak performance of a Quarterback-Running Back duo: they made it to two consecutive Super Bowls in that period. Furthermore, the style of offense that Seattle ran in these years was a precursor of many of the offenses that NFL teams run today, which makes it a good representative set for modern NFL offenses.

Introduction: While studying the chapter ‘Statistics and Probability’ in the IB HL Mathematics course, my teacher talked about the use of probability distributions in sports. I realized that I could explore this area of Mathematics to satisfy my natural curiosity, especially in the context of the modern NFL, which relies heavily on statistical analysis.
Coaches and bookmakers in the NFL use distributions to model the probability of a player performing in a certain way, in order to adjust line-ups and to set the odds respectively. Therefore, to examine the performances of Marshawn Lynch and Russell Wilson, it is important to understand how a player can impact a game in a substantial fashion, i.e. by having a ‘good game’. In layman’s terms, a good game is a performance in which a player has a positive impact on winning. A player who has multiple good games is an impact player. Since we now have the advantage of hindsight, I will use the actual game log data of both players from the 2013-14 and the 2014-15 seasons in my exploration to validate the significance of my findings.

Background Information: Before delving into the exploration, the reader must understand some basic Gridiron Football concepts. The goal of the game is to advance the football as far as one can into the opponents’ half for better scoring positions. One can advance the ball by Passing (throwing the ball forward to a receiver) or Rushing (the person in possession of the ball running further into the opponents’ half). Passing the football is generally the role of the Quarterback (QB), while rushing the football is generally carried out by the Running Back (RB). Most modern NFL offenses try to balance their offensive approaches to avoid repetition of plays (which makes offenses easier to defend). Keeping this in mind, QBs and RBs can impact the ballgame by having a ‘good game’. We can define the good-game parameter for Marshawn Lynch and Russell Wilson as follows.
For QB Russell Wilson, a ‘good game’ shall be defined as a passer rating of more than 95. (The passer rating, also known as the quarterback rating, is a measure of the performance of passers, primarily quarterbacks, in gridiron football.)

For RB Marshawn Lynch, a ‘good game’ shall be defined as more than 80 rushing yards in a game. (Rushing is running with the ball when starting from behind the line of scrimmage with the intent of gaining yardage. While this usually means a running play, any offensive play that does not involve a forward pass is a rush, also called a run.)

I will now provide a high-level overview of my methodology. Coaches around the league use previous-season performances to forecast future performance, so we shall model the data sets for both Marshawn Lynch and Russell Wilson from the 2013-14 season using the Poisson and Normal distribution models. Subsequently, we shall see which of our models best fits the reality of the 2014-15 season using the chi-square test for goodness of fit and the t-test. I will then examine whichever model is chosen to see whether I can observe any underlying link between the Marshawn Lynch and Russell Wilson data sets. Playoff performances are more difficult to forecast because the nature of each game varies wildly, and factors such as the ability to perform in close games and stress in high-stakes games are difficult to quantify. Therefore, we shall restrict the investigation to regular-season games. Tables 1 and 2 below are the logs of Marshawn Lynch’s and Russell Wilson’s games for the 2013-2014 season. These statistics were taken from an online football database.

Table 1: Marshawn Lynch game log (2013-2014)

Table 2: Russell Wilson game log (2013-2014)

Mathematical Investigation: As stated before, forecasting performances for both players can be done through the use of valid probability distributions.
We shall examine the Normal and Poisson distribution models for both players’ 2013-14 season to see which one better adheres to reality. First, we shall observe the data sets under the Poisson distribution model. Tables 3 and 4 respectively show the data set calculations for Marshawn Lynch and Russell Wilson from the previous NFL season.

Poisson Distribution Model

Table 3: Marshawn Lynch game log data calculations

From Table 3, we can calculate the following (Calculation 1):

The average (mean) rushing yards for Marshawn Lynch is given by the formula:

$$\hat{\mu}_1 = \frac{\sum_{i=1}^{N} f_i \cdot y_i}{N} = \frac{1257}{16} = 78.56$$

where $f_i \cdot y_i$ is the rushing yards for a particular game and $\hat{\mu}_1$ is the mean number of rushing yards.

The variance of the data set is given by the formula:

$$\sigma_1^2 = \frac{\sum (Y - \hat{\mu}_1)^2}{N}$$

From Table 3:

$$\sum (Y - \hat{\mu}_1)^2 = 15641.94, \qquad \sigma_1^2 = \frac{15641.94}{16} = 977.62$$

where $\sigma_1^2$ is the variance of the data set, $Y$ is the rushing yards in a game, $N$ is the total number of games played and $\sigma_1$ is the standard deviation.

The standard deviation is the square root of the variance. This leads us to:

$$\sigma_1 = \sqrt{977.62} = 31.27$$

Now that the data set has been analyzed, we can use the calculated mean $\hat{\mu}_1$ to model a Poisson distribution for the data set. The Poisson probability mass function is given by the following formula:

$$P(Y = y) = \frac{\hat{\mu}_1^{\,y} \cdot e^{-\hat{\mu}_1}}{y!} = \frac{78.56^{y} \cdot e^{-78.56}}{y!}$$

where $y$ is a plausible number of rushing yards in a game in the 2013-14 season.

Our goal is to find the probability of Marshawn Lynch having a ‘good game’. A ‘good game’ was defined as Lynch having more than 80 rushing yards in a game. Therefore, the following calculation can be conducted:

$$Y \sim \mathrm{Poi}(\hat{\mu}_1)$$

The TI-Nspire CX graphic display calculator was used in this calculation.
To find the Poisson cumulative probability, the following is entered into the calculator in the format denoted below:

poissCDF($\hat{\mu}_1$, lower bound, upper bound)

$$P(Y \ge 80) = \mathrm{poissCDF}(78.56,\ 80,\ 10000) = 0.450$$

$$p_{good} = 0.450$$

where $p_{good}$ is the probability of Marshawn Lynch having a ‘good game’ under the Poisson distribution.

Now, we have two defined outcomes for Marshawn Lynch: either he has a ‘good game’ or he does not have a ‘good game’. This situation allows for the use of the binomial distribution:

$$X \sim B(N, p_{good})$$

where $X$ is the number of good games for Marshawn Lynch (Poisson distribution). The expected value for the Binomial distribution is given by the formula:

$$E(X) = N \cdot p_{good}$$

Therefore, the expected number of games in which Marshawn Lynch will perform is given by:

$$E(X) = 16 \cdot 0.450 \approx 7 \text{ games}$$

Now, we shall conduct a similar calculation under the Poisson model for Russell Wilson, whose analyzed data is shown in Table 4 below.

Table 4: Russell Wilson game log data calculations

By a calculation similar to Calculation 1, we can derive the following (Calculation 2):

The average (mean) passer rating for Russell Wilson is given by:

$$\hat{\mu}_2 = 100.03$$

The variance of the data set is given by:

$$\sigma_2^2 = 886.44$$

The standard deviation of the data set is given by:

$$\sigma_2 = 29.77$$

where $\hat{\mu}_2$ is the mean passer rating for Russell Wilson, $\sigma_2^2$ is the variance of the data set, $Z$ is the quarterback (passer) rating in a game and $\sigma_2$ is the standard deviation.

To model the Poisson distribution for Russell Wilson, we can use the calculated mean $\hat{\mu}_2$:

$$P(Z = z) = \frac{\hat{\mu}_2^{\,z} \cdot e^{-\hat{\mu}_2}}{z!} = \frac{100.03^{z} \cdot e^{-100.03}}{z!}$$

where $z$ is a plausible passer rating in a game.

As the Poisson distribution model is being used, the variance and the mean of this distribution are both equal to 100.03. Our goal is to find the probability of Russell Wilson having a ‘good game’. A ‘good game’ was defined as Wilson having a passer rating of more than 95 in a game.
Therefore, the following calculation can be conducted:

$$Z \sim \mathrm{Poi}(\hat{\mu}_2)$$

By a calculation similar to the one for Marshawn Lynch (Calculation 1):

$$P(Z \ge 95) = \mathrm{poissCDF}(100.03,\ 95,\ 10000) = 0.701$$

$$p_{good} = 0.701$$

where $p_{good}$ is the probability of Russell Wilson having a ‘good game’ under the Poisson distribution.

As in the calculation for Marshawn Lynch, there are two defined outcomes, so a Binomial distribution can be used:

$$W \sim B(n, p_{good})$$

where $W$ is the number of good games for Russell Wilson (Poisson distribution). The expected number of ‘good games’ for Russell Wilson is given below, using the same calculation as for Marshawn Lynch:

$$E(W) = 16 \cdot 0.701 \approx 11 \text{ games}$$

Normal Distribution Model

Now that we have the expected number of games in which Russell Wilson and Marshawn Lynch will perform under the Poisson distribution, we can move on to the Normal distribution model. From the prior calculation, we have the arithmetic mean of the Marshawn Lynch data set, $\hat{\mu}_1$, and its standard deviation, $\sigma_1$. Therefore, we can model a normal distribution for the data set, as shown below (Calculation 3):

$$Y \sim N(\hat{\mu}_1, \sigma_1^2), \qquad \hat{\mu}_1 = 78.56, \qquad \sigma_1 = 31.27$$

The TI-Nspire CX graphic display calculator was used in this calculation. To find the Normal cumulative probability, the following is entered into the calculator in the format denoted below. Note that the calculator takes the standard deviation $\sigma_1$, not the variance, as its last argument.

normCDF(lower bound, upper bound, $\hat{\mu}_1$, $\sigma_1$)

$$P(Y \ge 80) = \mathrm{normCDF}(80,\ 10000,\ 78.56,\ 31.27) = 0.482$$

$$p_{good} = 0.482$$

where $p_{good}$ is the probability of Marshawn Lynch having a good game under the Normal distribution.

Therefore, the probability of Marshawn Lynch having a ‘good game’ under the Normal distribution model is 0.482. As in the Poisson calculation, we can use the binomial distribution to find the expected number of games in which Marshawn Lynch will perform under this model.
$$X_2 \sim B(N, p_{good})$$

where $X_2$ is the number of good games for Marshawn Lynch (Normal distribution). The expected value for the Binomial distribution is given by the formula:

$$E(X_2) = N \cdot p_{good}$$

Therefore, the expected number of games in which Marshawn Lynch will perform is given by:

$$E(X_2) = 16 \cdot 0.482 \approx 8 \text{ games}$$

We can now analyze the Russell Wilson data set using the normal distribution model (Calculation 4). By a calculation similar to Calculation 3, we can derive the following normal distribution using $\hat{\mu}_2$ and $\sigma_2^2$:

$$Z \sim N(\hat{\mu}_2, \sigma_2^2), \qquad \hat{\mu}_2 = 100.03, \qquad \sigma_2 = 29.77$$

As in the previous calculation for the Marshawn Lynch data set:

$$P(Z \ge 95) = \mathrm{normCDF}(95,\ 10000,\ 100.03,\ 29.77) = 0.567$$

$$p_{good} = 0.567$$

where $p_{good}$ is the probability of Russell Wilson having a good game under the Normal distribution.

We can use the binomial distribution to find the expected number of games in which Russell Wilson will perform under this model:

$$W_2 \sim B(n, p_{good})$$

where $W_2$ is the number of good games for Russell Wilson (Normal distribution). Therefore, the expected number of games in which Russell Wilson will perform is given by:

$$E(W_2) = 16 \cdot 0.567 \approx 9 \text{ games}$$

We can see that the probabilities of the players performing differ between the Normal and the Poisson distributions. Under the Poisson distribution, the probability of Russell Wilson performing (0.701) is much higher than the probability of Marshawn Lynch performing (0.450). Under the Normal distribution, however, the probabilities are numerically much closer (0.567 and 0.482). This is because the Poisson distribution forces the variance to equal the mean, which implies standard deviations of only about $\sqrt{78.56} \approx 8.9$ and $\sqrt{100.03} \approx 10.0$, whereas the Normal model uses the much larger observed standard deviations of 31.27 and 29.77. Therefore, to compare which model fits better, we must validate our models against the real performances of both players in the 2014-15 NFL season. The game logs for both players are included below.
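The tail probabilities from Calculations 1 to 4 can also be checked without a graphing calculator. The following is a minimal sketch in Python using only the standard library; the function names `normal_sf` and `poisson_sf` are illustrative choices and are not part of the original investigation. It assumes the same means and standard deviations quoted above, and small differences from the calculator figures may arise from rounding.

```python
import math

def normal_sf(x, mu, sigma):
    """P(X >= x) for X ~ N(mu, sigma^2), using the complementary error function."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))

def poisson_sf(k, mu):
    """P(X >= k) for X ~ Poisson(mu): one minus the sum of the pmf for 0..k-1.
    Each pmf term is evaluated in log space to avoid overflow in mu**i and i!."""
    below = sum(math.exp(i * math.log(mu) - mu - math.lgamma(i + 1)) for i in range(k))
    return 1.0 - below

GAMES = 16  # regular-season games per player, as in the text

# Means and standard deviations are the 2013-14 values quoted in the text.
models = {
    "Lynch, Poisson": poisson_sf(80, 78.56),
    "Wilson, Poisson": poisson_sf(95, 100.03),
    "Lynch, Normal": normal_sf(80, 78.56, 31.27),
    "Wilson, Normal": normal_sf(95, 100.03, 29.77),
}

for label, p_good in models.items():
    # Expected number of good games under a Binomial(GAMES, p_good) model.
    print(f"{label}: p_good = {p_good:.3f}, expected good games = {GAMES * p_good:.1f}")
```

Running the sketch prints one line per model, which makes it easy to see how strongly the choice of distribution changes the good-game probability for the same mean.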
Table 5: Russell Wilson game log (2014-2015) data calculations

‘$Z_2$’ is the passer rating for a given game in the 2014-2015 season. The variance for the data in Table 5 is calculated in the same way as it was for Table 4:

$$\sigma_3^2 = 641.42$$

Table 6: Marshawn Lynch game log (2014-2015) data calculations

‘$Y_2$’ is the rushing yards for a given game in the 2014-15 NFL season. The variance for the data in Table 6 is calculated in the same way as it was for Table 3:

$$\sigma_4^2 = 871.48$$

To observe which probability distribution model is a better fit, we shall use the chi-square test for goodness of fit. The chi-square test is used to assess whether the observed value of a given phenomenon is significantly different from the expected value. Effectively, we can see how well the theoretical distribution fits the empirical distribution for both the Poisson and Normal distribution models.

Calculation 5: Chi-square goodness-of-fit test for the Russell Wilson data sets.

Poisson distribution: Our null hypothesis $H_0$ is that the performances of Russell Wilson follow the Poisson distribution outlined in Calculation 2. Our alternative hypothesis $H_a$ is that the performances do not follow the Poisson distribution. We can categorize each performance as either a ‘good game’ or not a ‘good game’. Based on this, we can construct Table 7 below.

Table 7: Comparing outcome of the Poisson distribution model for Russell Wilson.

Outcome (games)        No Good Game    Good Game
Actual Occurrence      6               10
Expected Occurrence    5               11

Based on the data from Table 7, we can construct our $\chi^2$ statistic as given below:

$$\chi^2 = \frac{(6 - 5)^2}{5} + \frac{(10 - 11)^2}{11} = 0.291$$

$$\text{Degrees of Freedom} = \text{Number of parameters (outcome categories)} - 1 = 2 - 1 = 1$$

For the purpose of practicality, I will be taking a 0.10 significance level. Therefore, we can calculate the p-value, i.e. the probability that the chi-square statistic is more extreme than the calculated value:

$$p\text{-value} = P(\chi^2 > 0.291) = 0.59$$

This is considerably larger than our significance level. Therefore, our null hypothesis cannot be rejected, and the Poisson distribution remains an applicable model.

Normal distribution: Our null hypothesis $H_0$ is that the performances of Russell Wilson follow the Normal distribution outlined in Calculation 4. Our alternative hypothesis $H_a$ is that the performances do not follow the Normal distribution. Therefore, we can construct Table 8 below based on the previous categorization.

Table 8: Comparing outcome of the Normal distribution model for Russell Wilson.

Outcome (games)        No Good Game    Good Game
Actual Occurrence      6               10
Expected Occurrence    7               9

By a calculation similar to the previous part, the chi-square statistic is as given below:

$$\chi^2 = \frac{(6 - 7)^2}{7} + \frac{(10 - 9)^2}{9} = 0.254$$

The degrees of freedom will remain 1, and the significance level will also remain 0.10.

$$p\text{-value} = P(\chi^2 > 0.254) = 0.614$$

This value is larger than our significance level. Therefore, our null hypothesis cannot be rejected, and the Normal distribution is applicable. Moreover, the p-value for the Normal distribution model is higher than the p-value for the Poisson distribution model. Therefore, the logical conclusion is that the Normal distribution is the better fit for Russell Wilson’s performances.

Calculation 6: Chi-square goodness-of-fit test for the Marshawn Lynch data sets.

Poisson distribution: Our null hypothesis $H_0$ is that the performances of Marshawn Lynch follow the Poisson distribution outlined in Calculation 1. Our alternative hypothesis $H_a$ is that the performances do not follow the Poisson distribution. We can again categorize each performance as either a ‘good game’ or not a ‘good game’. Based on this categorization, we can construct Table 9 below.
Table 9: Comparing outcome of the Poisson distribution model for Marshawn Lynch.

Outcome (games)        No Good Game    Good Game
Actual Occurrence      8               8
Expected Occurrence    9               7

Based on the data from Table 9, we can construct our $\chi^2$ statistic as given below:

$$\chi^2 = \frac{(8 - 9)^2}{9} + \frac{(8 - 7)^2}{7} = 0.254$$

The degrees of freedom will remain 1, and the significance level will also remain 0.10.

$$p\text{-value} = P(\chi^2 > 0.254) = 0.614$$

The p-value is larger than our significance level. Therefore, our null hypothesis cannot be rejected, and the Poisson distribution is applicable.

Normal distribution: Our null hypothesis $H_0$ is that the performances of Marshawn Lynch follow the Normal distribution outlined in Calculation 3. Our alternative hypothesis $H_a$ is that the performances do not follow the Normal distribution. Therefore, we can construct Table 10 below based on the previous categorization.

Table 10: Comparing outcome of the Normal distribution model for Marshawn Lynch.

Outcome (games)        No Good Game    Good Game
Actual Occurrence      8               8
Expected Occurrence    8               8

Based on the data from Table 10, we can construct our $\chi^2$ statistic as given below:

$$\chi^2 = \frac{(8 - 8)^2}{8} + \frac{(8 - 8)^2}{8} = 0$$

The degrees of freedom will remain 1, and the significance level will also remain 0.10.

$$p\text{-value} = P(\chi^2 > 0) = 1$$

This value is larger than our significance level. Therefore, our null hypothesis cannot be rejected, and the Normal distribution is applicable. Surprisingly, under the good-game categorization the expected counts of the Normal distribution model match the observed counts for Marshawn Lynch exactly.

Therefore, the chi-square goodness-of-fit test suggests that the per-game performances of both Russell Wilson and Marshawn Lynch are better described by the Normal distribution. Furthermore, we can say that the performances of both players in the previous season are similar to their performances in the 2014-2015 season, which supports our earlier assertion that a player’s previous season can be used to model his or her next one (to an extent).
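The chi-square calculations above can be reproduced in a few lines. The following is a minimal sketch in Python using only the standard library; the function names are illustrative and not part of the original investigation. It relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so the p-value can be written with the complementary error function. The counts are taken from Tables 7 and 9.

```python
import math

def chi_square_stat(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over the categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_df1(stat):
    """P(chi2 > stat) for 1 degree of freedom.
    If Z is standard normal, Z^2 is chi-square with 1 df, so
    P(Z^2 > stat) = P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2))."""
    return math.erfc(math.sqrt(stat / 2))

SIGNIFICANCE = 0.10  # significance level chosen in the text

# (observed, expected) counts as [no good game, good game]
tables = {
    "Table 7 (Wilson, Poisson)": ([6, 10], [5, 11]),
    "Table 9 (Lynch, Poisson)": ([8, 8], [9, 7]),
}

for label, (observed, expected) in tables.items():
    stat = chi_square_stat(observed, expected)
    p = p_value_df1(stat)
    decision = "reject H0" if p < SIGNIFICANCE else "cannot reject H0"
    print(f"{label}: chi2 = {stat:.3f}, p-value = {p:.3f}, {decision}")
```

With only two categories and 16 games, the expected counts are small, so the resulting p-values are best read as a rough comparison between models rather than as strong evidence for either one.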
However, there remains a small flaw with this test. 16 games is a relatively small sample size when considering that both Russell Wilson and Marshawn Lynch have played numerous games together over the course of their careers. Therefore, as an additional validation test, we must check that each player’s performances in the two seasons come from the same population. The t-test will be used for this purpose. It indicates how significant the differences between the two groups (the 2013-2014 and 2014-2015 seasons for each player) are, and whether the observed differences between the two data sets can be attributed to chance. If the calculated t-value is less than the critical t-value from the t-distribution tables, there is insufficient evidence of a significant difference between the 2013-14 and 2014-15 seasons for the player in question. As the data sets for both players are the same size in the 2013-14 and the 2014-15 NFL seasons, the equal-variance t-test will be used.

Calculation 7: T-value test for the Russell Wilson data sets to ensure uniformity of the model.

$$H_0: \hat{\mu}_2 = \hat{\mu}_3 \qquad H_a: \hat{\mu}_2 \ne \hat{\mu}_3$$

$$t = \frac{\hat{\mu}_2 - \hat{\mu}_3}{\sqrt{\dfrac{(N - 1)\,\sigma_2^2 + (N - 1)\,\sigma_3^2}{N + N - 2} \times \left(\dfrac{1}{N} + \dfrac{1}{N}\right)}}$$

$$\text{Degrees of Freedom} = N + N - 2$$

where $N$ is the number of games in each data set.

Substituting the values from the 2013-2014 Normal distribution model and from the 2014-2015 season for the Russell Wilson data sets gives us:

$$t = \frac{100.03 - 95.1}{\sqrt{\dfrac{(16 - 1) \times 886.44 + (16 - 1) \times 641.42}{16 + 16 - 2} \times \left(\dfrac{1}{16} + \dfrac{1}{16}\right)}} \approx 0.50$$

$$\text{Degrees of Freedom} = 16 + 16 - 2 = 30$$

Using 30 degrees of freedom and a significance level of 0.05, the t-distribution tables give us a value of 1.697.

Since 0.50 < 1.697, the calculated t-value is well below the t-value from the distribution tables. We can therefore surmise that the differences between the modelled distribution and what actually occurred during the 2014-2015 season with regard to Russell Wilson’s performances were not due to intrinsic differences between the data sets, and that the two seasons are consistent with having come from the same population.

Calculation 8: T-value test for the Marshawn Lynch data sets to ensure uniformity of the model.

The t-value formula is the same as the one used in Calculation 7, and the degrees of freedom also remain 30, as the Marshawn Lynch data sets are the same size as the Russell Wilson data sets.

$$t = \frac{78.56 - 81.625}{\sqrt{\dfrac{(16 - 1) \times 977.62 + (16 - 1) \times 871.48}{16 + 16 - 2} \times \left(\dfrac{1}{16} + \dfrac{1}{16}\right)}} \approx -0.29, \qquad |t| \approx 0.29$$

As the degrees of freedom and the significance level remain the same as in Calculation 7, the t-distribution tables again give us a value of 1.697. We can once again observe that the magnitude of the calculated t-value is well below the one from the distribution tables. Therefore, we can also surmise that the differences between the modelled distribution and what actually occurred during the 2014-2015 season with regard to Marshawn Lynch’s performances were not due to intrinsic differences between the data sets, and that the two seasons are consistent with having come from the same population.

We have now established that the performances of Marshawn Lynch and Russell Wilson can be modelled best through a normal distribution. However, in our validation tests we have benefitted from the advantage of hindsight, as we were able to compare the actual occurrences with the expected occurrences of our models.
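For readers who want to recompute the test statistic, the following is a minimal sketch of the standard pooled (equal-variance) two-sample t statistic in Python, using only the standard library. The function name `pooled_t` is illustrative and not part of the original investigation. It assumes the means and variances quoted in the text can be plugged in directly as sample statistics, and it compares the magnitude of the result with the critical value of 1.697 quoted from the t-distribution tables.

```python
import math

def pooled_t(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Equal-variance two-sample t statistic computed from summary statistics."""
    # Pooled variance: the two variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    # Standard error of the difference between the two means.
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_error

T_CRITICAL = 1.697  # value quoted in the text for 30 degrees of freedom

# (2013-14 mean, 2013-14 variance, 2014-15 mean, 2014-15 variance), as quoted in the text
players = {
    "Russell Wilson": (100.03, 886.44, 95.1, 641.42),
    "Marshawn Lynch": (78.56, 977.62, 81.625, 871.48),
}

for name, (mean_1, var_1, mean_2, var_2) in players.items():
    t = pooled_t(mean_1, var_1, 16, mean_2, var_2, 16)
    verdict = "below" if abs(t) < T_CRITICAL else "above"
    print(f"{name}: t = {t:.3f}, |t| is {verdict} the critical value {T_CRITICAL}")
```

Working from summary statistics keeps the sketch independent of the full game logs; with the raw per-game values, the same statistic could be computed directly from the two samples.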
To look at the examined phenomenon from a different angle, I have decided to take a perspective similar to that of the Seattle Seahawks coach right after the 2013-14 season, as he would not have had the advantage of hindsight that we currently have. Based on the eye-test, the prevailing NFL opinion was that Russell Wilson’s and Marshawn Lynch’s performances in games were independent of each other because they control different facets of the offense. The calculation below illustrates what independence implies, using the good-game probabilities from the Normal model ($ML$ denotes Marshawn Lynch, $RW$ denotes Russell Wilson, $gg$ a good game and $n\text{-}gg$ no good game).

$$P(A \cap B) = P(A)\,P(B)$$

$$P(ML_{gg} \cap RW_{n\text{-}gg}) = 0.482 \times 0.433 = 0.2087$$

$$P(ML_{n\text{-}gg} \cap RW_{gg}) = 0.518 \times 0.567 = 0.2937$$

$$P(ML_{gg} \cap RW_{gg}) = 0.567 \times 0.482 = 0.2733$$

This information can be represented in the form of a Venn Diagram, as shown below:

$$P(ML_{gg} \cup RW_{gg}) = 0.2087 + 0.2733 + 0.2937 = 0.78$$

The probability of 0.78 is relatively high. Thus, we can say that the probability of the Seattle Seahawks offense, which relies heavily on the performances of these two players, being effective is also relatively high. This allows bookmakers and coaches to forecast the effectiveness of the offense and plan accordingly.

However, is there a link that can be drawn between the performance models of Marshawn Lynch and Russell Wilson derived from the 2013-14 season? To see if this was the case, I decided to calculate Pearson’s correlation coefficient.
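As an illustration of what such a coefficient measures, the following is a minimal sketch in Python using only the standard library. The function name `pearson_r` and the two short lists are hypothetical placeholders chosen for illustration; they are NOT the actual game logs. The sketch uses the standard definition in which the covariance and both standard deviations share the same divisor, so that divisor cancels.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations; the common divisor cancels.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sq_x = sum((x - mean_x) ** 2 for x in xs)
    sq_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(sq_x * sq_y)

# Hypothetical placeholder values, NOT the real game logs.
rushing_yards = [60, 95, 110, 45, 80]
passer_rating = [120.0, 85.5, 70.2, 130.1, 99.0]

r = pearson_r(rushing_yards, passer_rating)
print(f"Pearson r for the placeholder data: {r:.3f}")
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship, so a small negative coefficient would correspond to a weak tendency for one player's numbers to fall when the other's rise.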
Calculation 8: A test for correlation between the data sets

$$\rho_{YZ} = \frac{\mathrm{Cov}(Y, Z)}{\sigma_1 \sigma_2}$$

$$\mathrm{Cov}(Y, Z) = \frac{\sum (Y - \hat{\mu}_1) \times (Z - \hat{\mu}_2)}{N - 1}$$

where:
$\rho_{YZ}$: Pearson’s correlation coefficient
$\mathrm{Cov}(Y, Z)$: covariance of the performances of Marshawn Lynch and Russell Wilson from the 2013-14 season
$\sigma_1$: standard deviation of the Marshawn Lynch data set
$\sigma_2$: standard deviation of the Russell Wilson data set
$\hat{\mu}_1$: mean of the Marshawn Lynch data set
$\hat{\mu}_2$: mean of the Russell Wilson data set
$N$: number of values

The covariance of the data sets was calculated by substituting the values from Tables 3 and 4:

$$\mathrm{Cov}(Y, Z) = -221.996$$

$$\rho_{YZ} = \frac{-221.996}{29.77 \times 31.27} = -0.24$$

Thus, we reach a surprising result. It appears that there is a minor negative correlation between the Marshawn Lynch and Russell Wilson data set models. The result of Calculation 8 tells us that the assumption of a completely independent relationship between the rushing and passing offenses of the Seattle Seahawks is not entirely correct, and therefore the probability given by the Venn Diagram is only an approximation.

Furthermore, we can use this data to construct a bivariate normal distribution for Marshawn Lynch’s and Russell Wilson’s performances, since the models we have adopted for both players suggest that their performances can be approximated by a normal distribution. The bivariate normal distribution in this case is made up of two random variables: rushing yards in a game by Marshawn Lynch ($Y$) and passer rating for Russell Wilson ($Z$), which have been defined earlier in the investigation. I have sourced a paper which gives us the joint probability density function for the bivariate distribution of $Y$ and $Z$.
The form of the equation for the joint probability density function is given below:

$$f_{YZ}(y, z) = \frac{1}{2\pi \sqrt{1 - \rho_{YZ}^2} \times \sigma_1 \times \sigma_2} \times e^{-\frac{a}{2(1 - \rho_{YZ}^2)}}$$

Where:

$$a = \frac{(y - \hat{\mu}_1)^2}{\sigma_1^2} - \frac{2\rho_{YZ}\,(y - \hat{\mu}_1)(z - \hat{\mu}_2)}{\sigma_1 \sigma_2} + \frac{(z - \hat{\mu}_2)^2}{\sigma_2^2}$$

Substituting the values (which have been defined earlier in the investigation) gives us the equation below for the probability density function.

$$f_{YZ}(y, z) = (1.76 \times 10^{-4}) \times e^{-\frac{a}{1.885}}$$

Where:

$$a = \frac{(y - 78.56)^2}{\sigma_1^2} + \frac{0.48\,(y - 78.56)(z - 100.03)}{930.91} + \frac{(z - 100.03)^2}{886.25}$$

However, this only gives us the joint probability density function. We require the joint cumulative distribution function (cdf) to arrive at a meaningful result. A joint cdf requires a calculation of the form below:

$$F(y, z) = \int_{q}^{r} \int_{s}^{t} f_{YZ}(y, z)\, dy\, dz$$

where $s, t$ are the lower and upper limits for the ‘y’ variable and $q, r$ are the lower and upper limits for the ‘z’ variable.

Based on the above formula, we can derive probabilities for when either or both of Marshawn Lynch and Russell Wilson perform in a given game. The use of double integrals was not taught in class and was a difficult concept to grasp, so I decided to use Wolfram Alpha to expedite the process.

Calculation 9: Showing the probability divergence without independence

Case 1: Marshawn Lynch has a ‘good game’ and Russell Wilson doesn’t

$$P(Y \ge 80,\ Z < 95) = \int_{0}^{95} \int_{80}^{296} (1.76 \times 10^{-4}) \times e^{-\frac{a}{1.885}}\, dy\, dz$$

Using Wolfram Alpha, the probability is 0.2443.

Case 2: Russell Wilson has a ‘good game’ and Marshawn Lynch doesn’t

$$P(Y < 80,\ Z \ge 95) = \int_{95}^{158.3} \int_{0}^{80} (1.76 \times 10^{-4}) \times e^{-\frac{a}{1.885}}\, dy\, dz$$

Using Wolfram Alpha, the probability is 0.3104.

Case 3: Both Russell Wilson and Marshawn Lynch have ‘good games’

$$P(Y \ge 80,\ Z \ge 95) = \int_{95}^{158.3} \int_{80}^{296} (1.76 \times 10^{-4}) \times e^{-\frac{a}{1.885}}\, dy\, dz$$

Using Wolfram Alpha, the probability is 0.2299.

The upper bound values of 158.3 and 296 were chosen for logical reasons.
296 is the highest number of rushing yards ever recorded in a game, while 158.3 is the highest possible value of the passer rating statistic. Using the above values, we can construct another Venn Diagram:

If we compare the two Venn diagrams that we have constructed for the performances of Russell Wilson and Marshawn Lynch, we can see that the probabilities of only Marshawn Lynch or only Russell Wilson having a ‘good game’ are slightly higher under the bivariate normal distribution (0.2443 and 0.3104) than under the assumption of complete independence (0.2087 and 0.2937). This is because of the slight negative relationship between the performances of the two players (as modelled from their 2013-2014 performances). However, the overall probability of Marshawn Lynch and/or Russell Wilson performing in a game remains approximately the same, because the negative correlation coefficient is still relatively small in magnitude. Therefore, independence would still give a close approximation, but based on the examination of the data set, a bivariate distribution model yields more in-depth results. Looking at this phenomenon from a different perspective allows us to understand how coaches could examine similar relationships and probabilities across multiple players in order to determine the optimal line-ups for the Seattle Seahawks for the next season based on historical evidence.

Conclusion

Based on our examination of the performances of Russell Wilson and Marshawn Lynch from 2013-2015, we saw that we could use the previous season’s data to model a close approximation of a player’s performance for the next season. Using the chi-square test and the t-test, we validated that the Normal distribution model would be a close fit for their individual performances during the time period.
Subsequently, to look at the data sets from a different angle, we adopted the perspective of coaches in the period right after the 2013-14 season and before the 2014-2015 season. Examining the data set models for the performances led us to determine that Marshawn Lynch’s and Russell Wilson’s performances were not completely independent of each other, but that there was a slight negative correlation between the two. On this basis we were able to construct a bivariate normal cdf, which showed us that the probabilities of both or either player having a ‘good game’ differ from those obtained when complete independence is assumed. The probability of exactly one of the players performing in a game was higher, and the probability of both performing in the same game lower, than under independence.

Significance

However, is there any possible reason why complete independence was not observed, i.e. a result contrary to NFL opinion? If one reflects on it, offenses tend to lean towards the plays that allow the team to advance up the field. When the passing offense (Russell Wilson) isn’t working effectively, teams will tend to favor rushing the ball (Marshawn Lynch), and vice versa. However, the correlation coefficient is likely not larger in magnitude because teams must still vary their offensive approach to prevent predictability. This is why joint performances are still relatively difficult to forecast. The result I obtained was also significant because I realized that more detailed statistical analysis can be conducted with even more players by using multivariate distributions, which would allow coaches to choose the best-performing line-ups for games.

Limitations

We defined the ‘good game’ parameter for this investigation as something rigid. In reality, each football game differs, and as someone who watches games regularly, I know that the definition of a good game can vary from game to game. Therefore, quantifying this parameter is a lot more difficult than our assumption suggests.
In line with this, defining different degrees of a ‘good game’ could have led to more categories in the chi-square goodness-of-fit test (which would increase the accuracy of our test result). Furthermore, probability in sports is extremely difficult to model, as there are many external factors that affect each game, such as the talent level of opponents, the defensive strategies the opponents adopt against the Seattle Seahawks offense, and general intangibles such as team morale, fitness and travel time. It is very difficult to quantify the effects of these factors, which is why models have a relatively high chance of diverging from reality.

Extensions and Links to other fields

We can use other probability distributions. In this investigation, we were limited to the Normal and Poisson distributions. In reality, there may be other distributions that are an even better fit, such as the log-normal distribution (which restricts the minimum possible value of the random variable to zero, thus improving accuracy) or the gamma distribution (whose longer tail accounts for the possibility of higher values such as record-breaking performances). The mathematics we used was predominantly based on probability and statistics, but the data analysed can be extended to the mathematical fields of estimation and hypothesis testing. If we were to look for alternative applications of this area of mathematics, the same methodology could be used by soccer coaches to model player performances over a time period, and to see which line-ups give the highest probability of the whole team performing in the same game. Similarly, it could be applied in other sports by coaches looking to determine their best team before the start of the season, through deeper analysis beyond the eye-test.

Evaluation

As stated before in the limitations section, probability in sports is difficult to forecast due to the varying nature of games.
Until the games are actually played, the best we can do is an approximation that may not always hold true. As stated before, the calculations done in this investigation can be used by coaches when calling either a pass or a rush play, depending on which player appears to be performing and how many games to date they have performed in. I have used standard level statistics such as calculations of the mean and variance, higher level probability through the use of Poisson and Normal distribution models, and mathematics beyond the scope of the syllabus such as the Chi-square test, the t-test and bivariate normal distributions. I also benefitted greatly from online software such as Desmos and from the TI-Nspire CX calculator. Therefore, I can state that I now understand the importance sports statistics can have in making important team decisions (which is perhaps why many sports teams rely on analytics today to optimize team performance). However, sports like gridiron football are, in the end, based on a lot more factors than just numbers, which is what makes them so exciting.

Bibliography

Data and Education Resources
“4.2 - Bivariate Normal Distribution | STAT 505.” Accessed February 10, 2020. https://online.stat.psu.edu/stat505/lesson/4/4.2.
“2017 NFL Rushing | Pro-Football-Reference.Com.” Accessed February 10, 2020. https://www.pro-football-reference.com/years/2017/rushing.htm.
“Bivariate Normal Distribution -- from Wolfram MathWorld.” Accessed February 10, 2020. http://mathworld.wolfram.com/BivariateNormalDistribution.html.
“Chi-Square Goodness of Fit Test - Statistics Solutions.” Accessed February 10, 2020. https://www.statisticssolutions.com/chi-square-goodness-of-fit-test/.
“Marshawn Lynch 2013 Game Log | Pro-Football-Reference.Com.” Accessed February 10, 2020. https://www.pro-football-reference.com/players/L/LyncMa00/gamelog/2013/.
“Marshawn Lynch 2014 Game Log | Pro-Football-Reference.Com.” Accessed February 10, 2020. https://www.pro-football-reference.com/players/L/LyncMa00/gamelog/2014/.
“NFL Player Stats | NFL.Com.” Accessed February 10, 2020. http://www.nfl.com/stats/player.
“Rush (Gridiron Football) - Wikipedia.” Accessed February 10, 2020. https://en.wikipedia.org/wiki/Rush_(gridiron_football).
“Russell Wilson 2013 Game Log | Pro-Football-Reference.Com.” Accessed February 10, 2020. https://www.pro-football-reference.com/players/W/WilsRu00/gamelog/2013/.
“Russell Wilson 2014 Game Log | Pro-Football-Reference.Com.” Accessed February 10, 2020. https://www.pro-football-reference.com/players/W/WilsRu00/gamelog/2014/.
“T-Test Definition.” Accessed February 10, 2020. https://www.investopedia.com/terms/t/t-test.asp.
“What Is Passing Yards? Definition from SportingCharts.Com.” Accessed February 10, 2020. https://www.sportingcharts.com/dictionary/nfl/passing-yards/.
“What Is Receiving Yards? Definition from SportingCharts.Com.” Accessed February 10, 2020. https://www.sportingcharts.com/dictionary/nfl/receiving-yards/.

Online Calculators
“Double Integral Calculator: Wolfram|Alpha.” Accessed February 10, 2020. https://www.wolframalpha.com/calculators/double-integral-calculator.
“Normal Distribution.” Accessed February 10, 2020. https://www.desmos.com/calculator/2kmx0enkkz.
“Poisson Distribution Formula.” Accessed February 10, 2020. https://www.desmos.com/calculator/qo0442muda.
“Quick P Value from Chi-Square Score Calculator.” Accessed February 10, 2020. https://www.socscistatistics.com/pvalues/chidistribution.aspx.
“T-Value Calculator | Good Calculators.” Accessed February 10, 2020. https://goodcalculators.com/studentt-value-calculator/.

Video Tutorials
Maths Resources, director. Joint Probability Distributions for Continuous Random Variables - Worked Example. YouTube, 14 Nov. 2015, www.youtube.com/watch?v=Om68Hkd7pfw.
Stepbil, director. Joint PDF #3 - Deriving Joint Cumulative Distribution Function from Joint PDF. YouTube, 8 Jan. 2011, www.youtube.com/watch?v=EPjnUF952B8.
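
Appendix
The conclusion of this investigation states that a slight negative correlation makes the probability of both players having a ‘good game’ differ from the value obtained under complete independence. The following is a minimal Python sketch of that comparison, not the calculation actually carried out in the investigation (which used a double integral calculator). It assumes both performance measures have been standardised, and the thresholds of 0.5 and the correlation of -0.2 are hypothetical placeholders, not values from the game logs. The function name joint_upper_tail is also only illustrative. The joint probability is approximated by integrating the conditional normal distribution with Simpson's rule.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal: mean 0, standard deviation 1


def joint_upper_tail(z_a, z_b, rho, steps=4000, z_max=8.0):
    """Approximate P(Z1 > z_a and Z2 > z_b) for a standard bivariate normal.

    Given Z1 = x, Z2 is Normal(rho * x, sqrt(1 - rho^2)). The product of the
    density of Z1 and the conditional tail probability of Z2 is integrated
    over x from z_a to z_max with composite Simpson's rule.
    Requires -1 < rho < 1 and an even number of steps.
    """
    cond_sd = sqrt(1.0 - rho * rho)

    def integrand(x):
        # Probability that the second variable exceeds z_b, given Z1 = x.
        p_second_good = 1.0 - STD.cdf((z_b - rho * x) / cond_sd)
        return STD.pdf(x) * p_second_good

    h = (z_max - z_a) / steps
    total = integrand(z_a) + integrand(z_max)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2  # Simpson weights 4, 2, 4, ..., 4
        total += weight * integrand(z_a + i * h)
    return total * h / 3.0


if __name__ == "__main__":
    # HYPOTHETICAL inputs: placeholders, not values from this investigation.
    z_lynch, z_wilson, rho = 0.5, 0.5, -0.2

    p_lynch = 1.0 - STD.cdf(z_lynch)
    p_wilson = 1.0 - STD.cdf(z_wilson)

    both_independent = p_lynch * p_wilson
    both_correlated = joint_upper_tail(z_lynch, z_wilson, rho)
    either_correlated = p_lynch + p_wilson - both_correlated

    print("P(both), independence assumed:", round(both_independent, 4))
    print("P(both), correlation included:", round(both_correlated, 4))
    print("P(either), correlation included:", round(either_correlated, 4))
```

With a correlation of zero the function reduces to the product of the two individual probabilities, which is the complete-independence case. With a negative correlation the probability of both players having a good game falls below that product, while the probability of at least one of them having a good game rises.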
