Statistical Reasoning in Sports: Review for Final (Chapters 1–7)

1. The model we use for athletic PERFORMANCE is:
PERFORMANCE = ABILITY + RANDOM CHANCE

2. Give examples of 2 categorical variables and 2 numerical variables. For each variable, name at least two different ways to graph the distribution of the variable.
Categorical: Outcome of a game, outcome of a shot. Graphs: pie charts, bar charts.
Numerical: points scored, 100m sprint time. Graphs: dotplots, histograms, boxplots.

3. Suppose that a basketball player has the ABILITY to make 78% of his free-throws.
a) Explain what it means for the player to have an ABILITY of 78%.
If the player were to take many, many free-throws, he would make about 78% of them.
b) Which PERFORMANCE would be more surprising for this player, taking 10 shots and making 90% or taking 100 shots and making 90%? Explain.
It would be more surprising to make 90% in 100 shots, because with a larger number of attempts (n = 100) we would expect his PERFORMANCE to be closer to his ABILITY of 78%. In a small number of attempts (n = 10), it isn’t that surprising to have a PERFORMANCE much better than his ABILITY.
c) In his games last week, the player only made 9 of his 17 free-throws. To see if this poor PERFORMANCE gives convincing evidence that his ABILITY to make free-throws has gone down, what hypotheses should we test?
H0: The player’s ABILITY to make free-throws is still 78%.
Ha: The player’s ABILITY to make free-throws is less than 78%.
d) What was this player’s free-throw percentage in his games last week? Use this value as the test statistic.
Test statistic: Percentage of made shots = 9/17 ≈ 53%
e) Describe how to use random numbers to simulate the distribution of the test statistic, assuming that the player’s ABILITY to make free-throws is 78%.
Let 1–78 represent a made free-throw and 79–100 represent a missed free-throw. Randomly generate 17 numbers from 1–100 and calculate what percentage are between 1 and 78.
Record this value on a dotplot and repeat many times.
f) Here are the results of 100 trials of the simulation. Explain what information is provided by the dotplot.
[Dotplot: simulated percentage of made shots; horizontal axis from 50 to 90]
The dotplot shows the possible percentages of made shots in 17 attempts that could occur by RANDOM CHANCE, assuming that the player has an ABILITY of 78%.
g) Use the results of the simulation to estimate the p-value and make an appropriate conclusion.
Because only 2 times in 100 trials did the shooter make 53% or less of his shots, the p-value is approximately 2%.
Conclusion: Since the p-value is small, we have convincing evidence that his ABILITY to make free-throws has decreased from 78%.

4. Suppose you wanted to know if pitchers throw pitches with more speed when they do a full wind-up compared to when they do not pitch from a full wind-up (called pitching from the stretch). Pitching from the stretch usually occurs when the pitcher is trying to get the ball to the catcher as soon as possible so that base runners have a smaller chance of stealing a base. To investigate, you conduct an experiment where each of the 10 pitchers on a baseball team will go through the following routine:
- Warm up normally.
- Throw 20 pitches with a full wind-up and 20 pitches without a full wind-up, in random order.
- Record the average speed with a wind-up and the average speed without a wind-up.
a) Why is it better to investigate this question using an experiment rather than simply using available data?
In an experiment, we can control other sources of variability so that the only thing that changes is the explanatory variable (type of wind-up). This allows us to make cause-and-effect conclusions.
b) What are the explanatory and response variables in this experiment?
Explanatory: type of wind-up. Response: speed of pitch.
c) Why is it important to randomize the order of the pitches?
If a pitcher is more warmed up for the second set of 20 pitches and these are all using the same type of delivery, it may make that type of delivery look better than it really is. Randomizing the order makes sure one type of delivery isn’t favored.
d) Explain the concept of control and how it should be used in this experiment.
Control is keeping every other variable constant except for the type of delivery. For example, the pitching mound, the catcher, the ball, the clothing the pitcher is wearing, the distance from which the pitch is thrown, the radar gun used, etc.
e) Will the pitchers be blind in this experiment? Explain how this could limit the conclusions we can make.
No, since the pitcher will know what type of delivery he is using. If a pitcher thinks he will throw harder with a full wind-up, his belief might actually make him throw faster. Then, we won’t know if it was his belief or the type of delivery that caused the pitches to be faster.
f) Explain how you know this is paired data.
The data are paired since each pitcher does both treatments.
g) State the hypotheses we are interested in testing.
H0: Pitchers have the same ABILITY to throw fast with a full wind-up and without a full wind-up.
Ha: Pitchers have a greater ABILITY to throw fast with a full wind-up than without a full wind-up.
h) Describe a Type I and a Type II error in the context of these hypotheses.
Type I: Deciding that pitchers have a greater ABILITY to throw fast with a full wind-up when they really don’t.
Type II: Not deciding that pitchers have a greater ABILITY to throw fast with a full wind-up when they really do.
i) After the experiment, the difference in average speed was calculated for each pitcher (full – not full) and the mean difference was 2.3 miles per hour. Interpret this value.
On average, pitchers were 2.3 mph faster when throwing from a full wind-up.
j) Describe how to simulate the distribution of the mean difference.
For each pitcher, put his two average speeds on notecards, shuffle them and randomly assign one to the full wind-up and one to the non-wind-up and subtract (full – not full). Then find the simulated difference for each pitcher and the mean simulated difference for all the pitchers. Record and repeat many times.
k) Suppose that we carried out the experiment, tested the hypotheses, and got a p-value of 0.03. Based on this p-value, what would be an appropriate conclusion?
Because the p-value is less than 5%, we have convincing evidence that pitchers have a greater ABILITY to throw fast from a full wind-up.
l) Using the same 10 pitchers, explain how to redesign the experiment so that the resulting data would be unpaired. Is this design better or worse than the experiment described earlier? Explain.
We could randomly choose 5 pitchers to throw using a full wind-up and make the other 5 pitchers throw without a full wind-up. This isn’t as good because we aren’t controlling for the differences in the ABILITIES of the individual pitchers.

5. The table below shows several different variables for the 1997 Dallas Cowboys.

| Game # | Opponent | Outcome | Location | Points Allowed |
|---|---|---|---|---|
| 1 | Pittsburgh Steelers | W | A | 7 |
| 2 | Arizona Cardinals | L | A | 25 |
| 3 | Philadelphia Eagles | W | H | 20 |
| 4 | Chicago Bears | W | H | 3 |
| 5 | New York Giants | L | A | 20 |
| 6 | Washington Redskins | L | A | 21 |
| 7 | Jacksonville Jaguars | W | H | 22 |
| 8 | Philadelphia Eagles | L | A | 13 |
| 9 | San Francisco 49ers | L | A | 17 |
| 10 | Arizona Cardinals | W | H | 6 |
| 11 | Washington Redskins | W | H | 14 |
| 12 | Green Bay Packers | L | A | 45 |
| 13 | Tennessee Oilers | L | H | 27 |
| 14 | Carolina Panthers | L | H | 23 |
| 15 | Cincinnati Bengals | L | A | 31 |
| 16 | New York Giants | L | H | 20 |

a) Use a two-way table to summarize the outcomes of their games (win, loss) at home and away.

| | Home | Away | Total |
|---|---|---|---|
| Win | 5 | 1 | 6 |
| Loss | 3 | 7 | 10 |
| Total | 8 | 8 | 16 |

b) Make a graph to compare the outcomes at home and away. Briefly describe what you see.
The 1997 Cowboys won a much higher percentage of games at home than on the road.
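The two-way table in part (a) can be tallied directly from the game-by-game data. The following is a minimal Python sketch, not part of the original review; the names `outcomes`, `locations`, and `two_way_table` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Outcome (W/L) and location (H/A) for games 1-16, copied from the data above.
outcomes = "W L W W L L W L L W W L L L L L".split()
locations = "A A H H A A H A A H H A H H A H".split()

# Count each (location, outcome) combination, e.g. ("H", "W") for a home win.
counts = Counter(zip(locations, outcomes))

def two_way_table(counts):
    """Return (home, away, total) rows for wins, losses, and the column totals."""
    rows = {}
    for outcome in ("W", "L"):
        home, away = counts[("H", outcome)], counts[("A", outcome)]
        rows[outcome] = (home, away, home + away)
    rows["Total"] = tuple(rows["W"][i] + rows["L"][i] for i in range(3))
    return rows

table = two_way_table(counts)
print(f"{'':6}{'Home':>6}{'Away':>6}{'Total':>6}")
for label, (home, away, total) in table.items():
    print(f"{label:6}{home:>6}{away:>6}{total:>6}")

# Winning percentage at each location (wins divided by games played there).
home_win_pct = 100 * table["W"][0] / table["Total"][0]
away_win_pct = 100 * table["W"][1] / table["Total"][1]
print(f"Home: {home_win_pct}%  Away: {away_win_pct}%")
```

Comparing the two winning percentages side by side is the numerical version of the graph asked for in part (b).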
c) If we wanted to know if the Cowboys have a greater ABILITY to win at home, what hypotheses should we test?
H0: The 1997 Cowboys have the same ABILITY to win at home and on the road.
Ha: The 1997 Cowboys have a greater ABILITY to win at home than on the road.
d) Calculate the difference in their winning percentage at home and on the road. Use this value as the test statistic.
Test statistic: Difference in proportion (or percentage) of wins = 5/8 – 1/8 = 50%
e) If you find convincing evidence that the Cowboys have a greater ABILITY to win at home, can we determine the cause? Explain.
No, since there are many differences between games at home and games on the road. For example, better fans, better facilities, less travel, comfortable beds, etc.
f) Describe how to simulate the distribution of the test statistic, assuming that the Cowboys have the same ABILITY to win at home and on the road.
On 16 note cards, write W on 6 of them and L on 10 of them to represent their 6 wins and 10 losses. Shuffle the cards and divide them at random into two stacks of 8 (for the 8 home games and the 8 road games). Find the winning percentage in each pile and subtract (home – away). Repeat many times.
g) Here are the results of 100 trials of the simulation. What information is provided by the dotplot?
[Dotplot: simulated difference in winning percentage (home – away); horizontal axis from -80 to 80]
The dotplot shows the differences in winning percentage that could occur by RANDOM CHANCE, assuming that the Cowboys had the same ABILITY to win at home and on the road.
h) Use the results to estimate and interpret the p-value and make an appropriate conclusion.
Since there are 7 dots at 50% or higher, the p-value is approximately 7%.
Interpretation: Assuming that the 1997 Cowboys’ ABILITY to win is the same at home and on the road, there is a 7% chance that they would get a difference of 50% or more by RANDOM CHANCE alone.
Conclusion: Since the p-value isn’t less than 5%, we do not have convincing evidence that the 1997 Cowboys have a greater ABILITY to win at home.
i) If your conclusion was in error, which type of error did you commit? Explain.
Type II. We didn’t conclude that the Cowboys’ ABILITY to win was greater at home, when in reality their ABILITY to win could be greater at home.
j) Make a histogram of the Cowboys’ points allowed and briefly describe the shape.
The shape is unimodal and slightly skewed to the right.
k) Make boxplots to compare the points allowed for the 8 home games and the 8 away games. Briefly compare the distributions.
The distribution of points allowed on the road is skewed to the right, while the distribution of points allowed at home is skewed to the left. The median is slightly larger on the road, and the distribution on the road is also slightly more spread out. Neither distribution seems to have outliers.
l) Based on the boxplot, do you think the mean or median points allowed will be larger for the away games? Explain.
Since the away distribution is skewed to the right, the mean should be larger than the median.
m) In one away game, the Cowboys gave up 45 points. Was this game an outlier in the away distribution? Justify.
Points allowed in away games (in order): 7, 13, 17, 20, 21, 25, 31, 45
Q1 = 15, Q3 = 28, IQR = 13, so high outliers are values greater than 28 + 1.5(13) = 47.5.
Since 45 < 47.5, it isn’t an outlier.
n) Explain how you know this isn’t paired data.
The opponents at home weren’t the same as the opponents on the road, so there is no reasonable way to pair the home and away games.
o) Suppose that we wanted to test the following hypotheses. Explain what is meant by the Cowboys’ ABILITY to play defense at home.
H0: The 1997 Cowboys have the same ABILITY to play defense at home and on the road.
Ha: The 1997 Cowboys have a greater ABILITY to play defense at home than on the road.
If the Cowboys could play an infinite number of games at home in the same conditions, their average number of points allowed would equal their ABILITY to play defense at home.
p) Calculate the difference in means (away – home). Use this as the test statistic.
Test statistic: Difference in mean points allowed (away – home) = 22.375 – 16.875 = 5.5
q) Describe how to simulate the test statistic, assuming that the Cowboys have the same ABILITY to play defense at home and on the road.
On 16 note cards, write the points allowed for each of the 16 games. Shuffle the cards and divide them at random into two stacks of 8 (for the 8 home games and the 8 road games). Find the mean points allowed in each pile and subtract (away – home). Repeat many times.
r) After conducting a simulation of the test statistic, it was determined that the p-value was approximately 18%. Decide if each of the following conclusions is correct. If the conclusion is incorrect, explain what is wrong.
1. Because the p-value is large, we have convincing evidence that the Cowboys have the same ABILITY to play defense at home and on the road.
Not correct. Even if the p-value is large, we never have convincing evidence that the null hypothesis is true. We only can have convincing evidence to support the alternative hypothesis.
2. Because the p-value is large, we have convincing evidence that the Cowboys play better defense at home than on the road.
Not correct. If the p-value is large, the difference in PERFORMANCE could be due to RANDOM CHANCE. This means there is not convincing evidence that the Cowboys play better defense at home.
3. Because the p-value is large, we do not have convincing evidence that the Cowboys play better defense at home than on the road.
Correct.
s) It is also possible to use the difference of medians as a test statistic instead of the difference in means. In what circumstances would it be better to use the difference in medians? Why?
If there were strong skewness or outliers, it would be better to use the difference in medians because medians are more resistant to unusual values than means.
t) Explain what it would mean if the outcomes of Cowboys’ games were independent.
If the outcomes of their games were independent, then the outcomes of previous games do not affect the outcomes of future games.
u) To see if the Cowboys have a greater ABILITY to win following a win, what hypotheses should we test?
H0: The Cowboys’ ABILITY to win was the same following a win and following a loss (their outcomes were independent).
Ha: The Cowboys’ ABILITY to win was higher following a win than following a loss (they were a streaky team).
v) Calculate the value of the longest streak. Use this value as the test statistic.
Longest streak = 5.
w) Explain how to simulate the distribution of the test statistic, assuming that the outcomes of their games were independent.
On 16 note cards, write W on 6 of them and L on 10 of them. Shuffle and lay them down in random order. Count the longest streak of W’s or L’s. Record on a dotplot and repeat many times.
x) Here are the results of 100 trials of the simulation. Is the evidence that the Cowboys were streaky statistically significant at the 5% level of significance? Explain how you know and what this means.
[Dotplot: simulated longest streak; horizontal axis from 0 to 8]
Since there are 43 dots that are 5 or greater, the p-value is approximately 43%. Because this is larger than 5%, the evidence is not statistically significant. That is, there isn’t convincing evidence that the Cowboys were a streaky team.
y) Calculate and interpret the MAD for the Cowboys’ points allowed.
Mean = 19.625, MAD = 7.22. On average, the number of points allowed was 7.22 points from the mean of 19.625.
z) At home, the SD of points allowed was 8.5 points. Interpret the value 8.5.
At home, the Cowboys’ PERFORMANCES were typically about 8.5 points from their ABILITY.
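For parts (v) through (x), the longest-streak statistic and the card-shuffling simulation can be sketched in Python. This is an illustrative sketch under stated assumptions, not the software used for the dotplot in the review: the function names `longest_streak` and `simulate_streaks`, the seed, and the number of trials are all hypothetical choices, and the estimated p-value will vary from run to run and from the 43% read off the 100-trial dotplot.

```python
import random

def longest_streak(results):
    """Length of the longest run of identical consecutive outcomes (W's or L's)."""
    if not results:
        return 0
    best = current = 1
    for prev, cur in zip(results, results[1:]):
        current = current + 1 if cur == prev else 1
        best = max(best, current)
    return best

# Game-by-game outcomes for the 1997 Cowboys (6 wins, 10 losses), in order.
outcomes = "W L W W L L W L L W W L L L L L".split()
observed = longest_streak(outcomes)

def simulate_streaks(outcomes, trials=100, seed=None):
    """Shuffle the 16 'cards' and record the longest streak, repeated `trials` times."""
    rng = random.Random(seed)  # seed is only for reproducibility (an assumption)
    cards = list(outcomes)
    sims = []
    for _ in range(trials):
        rng.shuffle(cards)
        sims.append(longest_streak(cards))
    return sims

sims = simulate_streaks(outcomes, trials=10_000, seed=1)

# Estimated p-value: share of shuffles with a longest streak at least as long as observed.
p_value = sum(s >= observed for s in sims) / len(sims)
print(f"observed longest streak = {observed}, estimated p-value = {p_value:.2f}")
```

Shuffling keeps the season total fixed at 6 wins and 10 losses, so any streakiness in the simulated seasons comes from RANDOM CHANCE alone.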
aa) On the road, the SD of points allowed was 11.7 points. Is this the observed or true SD of points allowed on the road for the Cowboys? Explain.
This is the Cowboys’ observed SD. To find their true SD, they would need to play an infinite number of games.
bb) To see if the Cowboys’ defense is more consistent at home, what hypotheses should we test? What test statistic should we use? What is the value of the test statistic?
H0: The Cowboys’ true standard deviation is the same at home and on the road.
Ha: The Cowboys’ true standard deviation is smaller at home than on the road.
Test statistic: Difference of observed standard deviations (away – home) = 11.7 – 8.5 = 3.2
cc) Explain how to simulate the distribution of the test statistic, assuming that the Cowboys were equally consistent at home and on the road.
For each of the 8 home games, write that game’s deviation from the mean points allowed at home on a separate notecard. Do the same for the 8 road games, using the mean points allowed on the road. Shuffle the 16 cards and divide them at random into two stacks of 8 (for the 8 home games and the 8 road games). Find the standard deviation for each pile and subtract (away – home). Repeat many times.
dd) Here are the results of 100 trials of the simulation. Use the results to estimate and interpret the p-value and make an appropriate conclusion.
[Dotplot: simulated difference in standard deviations of points allowed (away – home); horizontal axis from -10 to 10]
Since there are 26 dots that are 3.2 or greater, the p-value is approximately 26%.
Interpretation: Assuming that the true standard deviations for the Cowboys are the same at home and on the road, there is a 26% chance of getting a difference in observed standard deviations of 3.2 or more by RANDOM CHANCE.
Conclusion: Since the p-value is large, we do not have convincing evidence that the Cowboys have a smaller true standard deviation at home.
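The notecard procedure in part (cc) can be sketched in Python as follows. This is a hedged illustration, not the tool that produced the dotplot: the names `deviations` and `simulate_sd_diffs`, the seed, and the trial count are hypothetical, and the sketch assumes the review’s SDs are sample standard deviations (dividing by n – 1), which is what `statistics.stdev` computes and which reproduces 8.5 and 11.7 after rounding. It also compares against the unrounded observed difference rather than the rounded 3.2.

```python
import random
from statistics import mean, stdev

# Points allowed, split by location using the game data for the 1997 Cowboys.
home = [20, 3, 22, 6, 14, 27, 23, 20]
away = [7, 25, 20, 21, 13, 17, 45, 31]

# Observed test statistic: difference of sample SDs (away - home).
observed = stdev(away) - stdev(home)

def deviations(values):
    """Each value's deviation from its own group's mean."""
    m = mean(values)
    return [v - m for v in values]

def simulate_sd_diffs(home, away, trials=100, seed=None):
    """Shuffle the pooled deviations into two piles and record SD(away) - SD(home)."""
    rng = random.Random(seed)  # seed is only for reproducibility (an assumption)
    cards = deviations(home) + deviations(away)
    n_home = len(home)
    diffs = []
    for _ in range(trials):
        rng.shuffle(cards)
        sim_home, sim_away = cards[:n_home], cards[n_home:]
        diffs.append(stdev(sim_away) - stdev(sim_home))
    return diffs

diffs = simulate_sd_diffs(home, away, trials=10_000, seed=1)

# Estimated p-value: share of shuffles with a difference at least as large as observed.
p_value = sum(d >= observed for d in diffs) / len(diffs)
print(f"observed difference = {observed:.2f}, estimated p-value = {p_value:.2f}")
```

Shuffling deviations rather than raw scores removes the difference in mean points allowed between home and away, so the simulation isolates the difference in consistency.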
ee) If you were to remove the 45 from the away distribution, what would happen to the mean and standard deviation? Explain.
Since this was the highest value, removing it would lower the mean. Also, since it was the farthest value from the mean, removing it would make the average deviation from the mean less, so the standard deviation would also decrease.
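The reasoning in part (ee) can be checked numerically. The short Python sketch below is an added illustration; the variable names are hypothetical, and it assumes the sample standard deviation (`statistics.stdev`), which matches the 11.7 reported for away games.

```python
from statistics import mean, stdev

# Points allowed in the 8 away games, from the game data for the 1997 Cowboys.
away = [7, 25, 20, 21, 13, 17, 45, 31]

# The same distribution with the 45-point game removed.
away_without_45 = [x for x in away if x != 45]

for label, data in (("with 45", away), ("without 45", away_without_45)):
    print(f"{label:>10}: mean = {mean(data):.2f}, SD = {stdev(data):.2f}")
```

Both summaries drop when the 45 is removed, which agrees with the explanation above: the value is both the largest and the farthest from the mean.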