Chapter 8
Standardized Scores and Normal Distributions

Objectives
Students will be able to:
1) Understand what a standard score measures and how to calculate it
2) Recognize when standardization can be used to compare values
3) Use the normal distribution model to estimate observations falling within certain standard deviations of the mean
4) Use the standard normal table to find the percentage of data falling below a given value in a normal distribution
5) Use the standard normal table in reverse to determine percentile values of a distribution

Fantasy Sports
• What are fantasy sports?
– In fantasy sports, “owners” in a league draft teams of actual players and use their real-life statistics to measure the PERFORMANCE of each fantasy team.
– Example: In fantasy baseball each “owner” combines the statistics of the players on his or her team using variables such as batting average and home runs for hitters and wins and strikeouts for pitchers.
– Each team is then ranked according to each variable.
• How do you determine which players you should draft onto your fantasy team?
– Obviously you want good players, but it can be a bit tricky because the variables used to measure the players’ PERFORMANCES are on very different scales.
– For example, in 2009 Ichiro (Mariners) had a 0.352 batting average and Adam Dunn (Nationals) hit 38 home runs. Both are outstanding PERFORMANCES, but which was better? And what about their PERFORMANCES in other categories?
• To solve the problem of evaluating PERFORMANCES that are measured on different scales, we need to learn how to standardize these PERFORMANCES so they will be on the same scale.
• Let’s look at how we can do this…

Standardized Scores
• This dotplot shows the batting averages of all MLB players with at least 300 plate appearances in 2009. Ichiro’s batting average is the red dot (0.352).
• This dotplot shows the number of home runs hit by the same group of MLB players. Adam Dunn’s home run total is the red dot (38).
• The average batting average of all players in the sample was 0.270.
• Ichiro’s batting average of 0.352 was 0.082 above average.
• The average number of home runs for those players was 15.3.
• Adam Dunn’s home run total of 38 was 22.7 home runs above average.
• Does this mean Dunn’s PERFORMANCE was better because he was 22.7 above average, as opposed to Ichiro being 0.082 above average?
• We should not compare 22.7 and 0.082 because batting average and home runs are measured in different units and the spreads of their distributions are very different.
• What we can do, though, is convert these values so we use standard deviation as a common unit of measure. This will allow us to make fair comparisons.
• By doing this, we will be able to see how far each PERFORMANCE is above its respective mean, in terms of standard deviations.
• What we will do is measure how far the PERFORMANCE is above the mean, and then divide it by the standard deviation.
• The standard deviation of the batting average distribution was 0.029. As previously mentioned, Ichiro’s PERFORMANCE was 0.082 above the mean.
– 0.082 / 0.029 = 2.83 standard deviations above the mean.
• The standard deviation of the home run distribution was 9.9. As previously mentioned, Dunn’s PERFORMANCE was 22.7 home runs above the mean.
– 22.7 / 9.9 = 2.29 standard deviations above the mean.
• Now that we are on the same scale we can compare.
• Ichiro’s PERFORMANCE was 2.83 standard deviations above the mean. Dunn’s PERFORMANCE was 2.29 standard deviations above the mean. Which PERFORMANCE was better?
• Ichiro’s PERFORMANCE was better since it was farther above the mean than Dunn’s PERFORMANCE.
• Let’s formalize some of what we just talked about…
• A standardized score or z-score measures how many standard deviations a PERFORMANCE is above or below the mean.
– z = (value - mean) / standard deviation
• When a z-score is positive, the PERFORMANCE is above the mean; when a z-score is negative, the PERFORMANCE is below the mean.
• When would it be good for a z-score to be negative?
– One example is when looking at a quarterback’s interceptions. A negative value would indicate that a quarterback throws fewer interceptions than the mean.

Hypothetical situation:
• Let’s say Dave Carucci played high school football and threw 0.5 interceptions per game. Let’s say that the average interceptions per game for BCSL National quarterbacks was 1, and that the standard deviation for this distribution was 0.4. Calculate and interpret Carucci’s standardized score for his interceptions per game.
– z = (0.5 - 1) / 0.4 = -1.25
• Interpretation: Carucci’s interceptions per game PERFORMANCE was 1.25 standard deviations below the mean.
• For this distribution, this is a good thing!!!

Using Standardized Scores to Compare Across Eras
• In 1927 Babe Ruth set the MLB single season home run record with 60 home runs. This record has been broken 3 times:
– 1961 Roger Maris hit 61 home runs
– 1998 Mark McGwire hit 70 home runs
– 2001 Barry Bonds hit 73 home runs
• Is it fair to say Barry Bonds had the best PERFORMANCE? Keep in mind the unit of measure is the same.
• There are many differences among eras in baseball: the quality of batters, the quality of pitchers, dimensions of ballparks, equipment, possible use of PEDs, etc…
• As a result, in certain years it may have been easier to hit a home run than in other years.
• To make a fair comparison, we need to see how these PERFORMANCES compare relative to other hitters in the same era. Let’s find the standardized scores for these record-setting PERFORMANCES.
• All four PERFORMANCES are noteworthy (all are more than 3 standard deviations from the mean!). However, Babe Ruth’s appears to be the most outstanding by a large margin, relative to the players of his era.
• Even though Barry Bonds has his name in the record books, this data suggests Babe Ruth may still be considered the single-season home run champ, relatively speaking.
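
The z-score calculations worked so far can be reproduced with a few lines of Python. This is only a minimal sketch and not part of the original slides: the function name z_score is a hypothetical helper, and the numbers are the means and standard deviations quoted above for Ichiro, Dunn, and Carucci.

```python
def z_score(value, mean, sd):
    """Return how many standard deviations `value` lies above (+) or below (-) `mean`."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / sd


# Values quoted in the slides (2009 MLB hitters; hypothetical quarterback example).
ichiro = z_score(0.352, 0.270, 0.029)   # batting average
dunn = z_score(38, 15.3, 9.9)           # home runs
carucci = z_score(0.5, 1.0, 0.4)        # interceptions per game (lower is better)

print(f"Ichiro:  z = {ichiro:.2f}")
print(f"Dunn:    z = {dunn:.2f}")
print(f"Carucci: z = {carucci:.2f}")
```

Because every result is expressed in standard deviations, the three values can be compared directly even though the original units differ. A negative score is only "bad" when higher values are better, which is not the case for interceptions.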

Back to Fantasy…
• At the beginning of this chapter, we used z-scores to compare Ichiro’s batting average PERFORMANCE to Dunn’s home run PERFORMANCE.
• Even though Ichiro’s PERFORMANCE was better, it doesn’t necessarily mean he would be a better player to draft on our fantasy team.
• Typical fantasy baseball leagues use five variables for hitters:
– Batting average, home runs, RBIs, stolen bases, runs scored
• To estimate the fantasy value for a player, we must measure the player’s PERFORMANCE in each category and combine these measurements in a reasonable way.
• To do this, find a player’s standardized score for each variable and then add those scores together.
• Overall, Ichiro’s sum of 4.39 was higher than Dunn’s sum of 3.79, making him a slightly more valuable fantasy player than Dunn in 2009.
• Here is the distribution of total fantasy values for all 284 eligible players.
• The most valuable player was Pujols, with a value of 11.4.
• Note: This approach of adding the z-scores will only work when being above average is better for each category.
• Example: If you used the number of strikeouts for hitters as a variable, then being above average is not a good thing.
• In situations like this, you should subtract the z-score for any category where being above average is a bad thing.
• What would be considered an unusual z-score?
• Let’s look at the distribution of the z-scores for the different hitting variables we have been discussing… (pg 271)
• As you can see, it is fairly unusual to have a PERFORMANCE that is more than 2 standard deviations away from the mean and quite rare to have one more than 3 standard deviations from the mean, especially in a roughly symmetric distribution.

The 68-95-99.7 Rule
• In general, when a distribution of PERFORMANCES is roughly symmetric, unimodal, and bell-shaped:
– Approximately 68% of the observations will be within 1 standard deviation of the mean.
– Approximately 95% of the observations will be within 2 standard deviations of the mean.
– Approximately 99.7% of the observations will be within 3 standard deviations of the mean.
• Here is a visual summary: (pg 273)
• Note: Distributions that are skewed or bimodal may not follow this rule very well.
• Here is an example of Barry Sanders’ rushing yards per game for his career. (pg 273)

The Normal Distribution
• The Normal distribution is a mathematical model that is often used to describe distributions of data that are symmetric, unimodal, and bell-shaped.
• The graph of a Normal distribution is called a Normal curve.
• All Normal curves are symmetric, unimodal, and bell-shaped.
• Here is a histogram of Peyton Manning’s passing yards in his first 176 regular season games (1998-2008).
• A Normal curve is on top of the histogram.
• The bars of the histogram represent 100% of Manning’s PERFORMANCES.
• Therefore, the area under the Normal curve also needs to represent 100% of Manning’s PERFORMANCES.
• This leads us to two important facts about the Normal curve:
– 1) The total area under any Normal curve is equal to 1 (100%).
– 2) The expected proportion of PERFORMANCES between two values is equal to the area under the Normal curve between the same two values.
• Example: To estimate what proportion of Manning’s PERFORMANCES were less than 180 yards, we would calculate the area under the Normal curve to the left of 180, as illustrated below.

Using the Normal Table
• To estimate what proportion of Manning’s PERFORMANCES were below 180 yards, we first have to standardize the value 180. This distribution has a mean of 259 and a standard deviation of 74.
– z = (180 - 259) / 74 = -1.07
• A z-score of -1.07 means a PERFORMANCE of 180 passing yards is 1.07 standard deviations below the mean.
• Since this PERFORMANCE isn’t exactly 1, 2, or 3 standard deviations below the mean, we cannot use the 68-95-99.7 rule.
• Instead, we have to use the standard Normal table, which lists the proportion of PERFORMANCES that are less than a given standardized score in a Normal distribution.
• This table is in the back of your book.
• We have to look up the z-score of -1.07 on the table.
• The table on the left has the negative scores.
• Since the score starts out as “-1.0”, go to the z column and go down until you reach -1.0.
• Since there is a 7 in the hundredths place (.07), go to the right until you reach the .07 column. You should now be located in the spot that has -1.0 to the left, and .07 on the top.
• This value should be .1423.
• The area under the standard Normal curve to the left of a z-score of -1.07 is equal to 0.1423.
• This means we expect about 14.23% of Manning’s PERFORMANCES to be less than 180 yards.
• Let’s try another one. Estimate what proportion of Manning’s PERFORMANCES were below 350 yards. Remember, the mean is 259 yards and the standard deviation is 74 yards.
– z = (350 - 259) / 74 = 1.23
• Since the z-score is 1.23, look for 1.2 in the z column and .03 on top.
• This should give us an area of .8907, meaning we expect about 89.07% of Manning’s PERFORMANCES to be below 350 yards.
• Now let’s say we want to know in what proportion of games Manning had at least 290 passing yards. We will start this problem the same way. First, get the z-score, then look up that score on the standard Normal table.
• The z-score is (290 - 259) / 74 = .42.
• This gives us an area of .6628 on the standard Normal table.
• .6628 is the proportion of PERFORMANCES that are less than a z-score of .42.
• However, we want the proportion of games that are at least 290 passing yards (meaning 290 yards and above). What should we do?
• Since we know the total area under the curve has to be 1, we need to subtract .6628 from 1. This will give us the proportion of PERFORMANCES that are greater than a z-score of .42.
• 1 - .6628 = .3372.
• This means that we expect about 33.72% of Manning’s PERFORMANCES to be 290 yards or more. Here’s an illustration:
• Let’s try another one.
Find in what proportion of games Manning had at least 150 yards passing. (mean 259 yards; SD 74 yards)
• Z-score: (150 - 259) / 74 = -1.47.
• Area under the standard Normal curve to the left of -1.47: .0708
• 1 - .0708 = .9292
• We expect about 92.92% of Manning’s PERFORMANCES to be at least 150 yards.
• How would we go about finding a proportion that was between two PERFORMANCES?
• Let’s say we want to find the proportion of Manning’s PERFORMANCES that were between 180 yards and 290 yards. What do we do?!?!
• Here is a picture of what we want:
• Here is what we have to do: We need to first calculate the area to the left of 290. Then we need to calculate the area to the left of 180. Then we subtract those values. (still using mean 259, SD 74)
– Left of 290: z = .42, area under the curve: .6628
– Left of 180: z = -1.07, area under the curve: .1423
– .6628 - .1423 = .5205
– We expect about 52.05% of Manning’s PERFORMANCES to be between 180 and 290 yards.
• Let’s try another. What proportion of Manning’s PERFORMANCES were between 125 yards and 275 yards?
– Left of 275: z = (275 - 259) / 74 = .22, area under the curve: .5871
– Left of 125: z = (125 - 259) / 74 = -1.81, area under the curve: .0351
– .5871 - .0351 = .552
– We expect about 55.2% of Manning’s PERFORMANCES to be between 125 yards and 275 yards.

Using the Normal Distribution in Reverse
• In 2008, the distribution of batting averages for MLB players with at least 300 plate appearances was approximately Normal with a mean of 0.272 and a standard deviation of 0.027.
• Suppose a player gets a salary bonus if his batting average is in the top 10% of all players. How well must a player hit for his batting average to be in the top 10%?
• We need to find the boundary between the lowest 90% of the distribution and the highest 10%.
• The boundary value is called the 90th percentile, because 90% of the values fall below it.
• We know that the area under the curve to the left of this boundary is .90. Therefore, we want to look at the interior of the standard Normal table for a proportion closest to 0.9000 and get the z-score associated with this proportion.
• The closest value is 0.8997. This corresponds to a z-score of 1.28.
This means the 90th percentile is 1.28 standard deviations above the mean.
• Now let’s find the batting average associated with this z-score.
– 1.28 = (x - 0.272) / 0.027, so x = 0.272 + 1.28(0.027) ≈ 0.307
– A player needs a batting average of about .307 to be in the top 10%.
• Let’s try another one. In 1970, the rebounding average for all ABA players with a minimum of 15 games played was 5.2 rebounds per game, with a standard deviation of 1.6.
• Say Jackie Moon, owner of the Flint Tropics, wants to release his center, Vakidis, if he finishes in the bottom 15% of the league in rebounding.
• What rebounding average marks the boundary of the bottom 15% of the league?
• Start by looking up .1500 in the table and finding the corresponding z-score.
– The closest value is .1492, which is a z-score of -1.04.
• Now, substitute in and solve for the rebounding average.
– -1.04 = (x - 5.2) / 1.6, so x = 5.2 - 1.04(1.6) ≈ 3.54
– Vakidis finishes in the bottom 15% if he averages fewer than about 3.54 rebounds per game.

Technology and the Normal Distribution
• The TI-84 calculator can be your best friend when it comes to performing calculations involving Normal distributions.
• Let’s say the number of points Anne scores in a basketball game is approximately Normally distributed, with a mean of 13.1 and a standard deviation of 4.2. Let’s use this information to solve a few problems…
• In what proportion of games do you expect Anne to score between 10 and 15 points?
– Let’s do this manually first, and then use the TI-84 calculator…
– Left of 10: z = (10 - 13.1) / 4.2 = -.74, area under the curve: .2296
– Left of 15: z = (15 - 13.1) / 4.2 = .45, area under the curve: .6736
– Proportion = .6736 - .2296 = .444, or 44.4%
• To do this on the TI-84, we use the normalcdf command.
1) Press 2nd-DISTR (VARS key)
2) Select the second option: normalcdf
3) Area = normalcdf(lower boundary, upper boundary, mean, standard deviation)
– normalcdf(10, 15, 13.1, 4.2); press ENTER
– We get essentially the same result as when we calculate it manually.
• Try another one. In what proportion of Anne’s games do you expect her to score between 9 and 18 points?
– normalcdf(9, 18, 13.1, 4.2)
– 71.38%
• If you are just looking for the area below a certain value, use -9999 as your lower boundary.
• If you are just looking for the area above a certain value, use 9999 as your upper boundary.
• Let’s try some of these…
• Remember, Anne’s mean is 13.1 points, with a standard deviation of 4.2.
• In what proportion of games can you expect Anne to score more than 20 points?
– normalcdf(20, 9999, 13.1, 4.2) = 5.02%
• In what proportion of games can you expect Anne to score less than 17 points?
– normalcdf(-9999, 17, 13.1, 4.2) = 82.34%
• On the iPad, myNormal Calculator can perform similar operations.
• You need to select iPhone only when searching for the app.
• When using the app, you need to perform the extra step of calculating the z-scores.

Using the TI-84 to Calculate Percentiles
• To find percentiles in a Normal distribution, use the invNorm command, found in 2nd-DISTR.
• Boundary = invNorm(area to the left of the boundary, mean, standard deviation)
• Let’s find the 25th percentile of Anne’s distribution of points scored.
– invNorm(.25, 13.1, 4.2) = 10.27.
– Meaning, in about 25% of her games, Anne will score less than 10.27 points.
• Let’s try two more.
• In about 60% of games, Anne will score less than how many points?
– invNorm(.60, 13.1, 4.2) = 14.16 points
• How many points will Anne need to score to be in the upper 15% of points scored?
– invNorm(.85, 13.1, 4.2) = 17.45 points
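
For readers without a TI-84, the same Normal calculations can be sketched in Python. This is a minimal sketch and not part of the original slides: it assumes Python 3.8 or later and its standard-library statistics.NormalDist class, and the helper names normalcdf and inv_norm are hypothetical, chosen only to mirror the calculator commands.

```python
from statistics import NormalDist


def normalcdf(lower, upper, mean, sd):
    """Area under the Normal curve between `lower` and `upper` (mirrors TI-84 normalcdf)."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(upper) - dist.cdf(lower)


def inv_norm(area, mean, sd):
    """Value with `area` of the distribution to its left (mirrors TI-84 invNorm)."""
    return NormalDist(mu=mean, sigma=sd).inv_cdf(area)


# Anne's points per game: mean 13.1, standard deviation 4.2 (from the slides).
MEAN, SD = 13.1, 4.2

# Proportions; the slides report 71.38%, 5.02%, and 82.34% for these three.
print(f"Between 9 and 18: {normalcdf(9, 18, MEAN, SD):.2%}")
print(f"More than 20:     {normalcdf(20, 9999, MEAN, SD):.2%}")
print(f"Less than 17:     {normalcdf(-9999, 17, MEAN, SD):.2%}")

# Percentiles; the slides report 10.27, 14.16, and 17.45 points.
for area in (0.25, 0.60, 0.85):
    print(f"{area:.0%} of games below {inv_norm(area, MEAN, SD):.2f} points")
```

The -9999 and 9999 bounds follow the calculator convention used above for "no lower bound" and "no upper bound". Results computed this way can differ slightly in the later decimal places from table lookups, because the table method first rounds the z-score to two decimals.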