Chapter 2: Modeling Distributions of Data

Objectives
SWBAT (students will be able to):
1) Find and interpret the percentile of an individual value within a distribution of data.
2) Find and interpret the standardized score (z-score) of an individual value within a distribution of data.
3) Use the 68-95-99.7 rule to estimate areas (proportions of values) in a Normal distribution.
4) Use Table A or technology to find (i) the proportion of z-values in a specified interval, or (ii) a z-score from a percentile in the standard Normal distribution.
5) Use Table A or technology to find (i) the proportion of values in a specified interval, or (ii) the value that corresponds to a given percentile in any Normal distribution.

The Normal Distribution
• A Normal distribution is described by a Normal density curve (bell-shaped curve).
• Any particular Normal distribution is completely specified by two numbers: its mean µ and its standard deviation σ.
• The mean of the Normal distribution is at the center of the symmetric Normal curve.

Important Properties of a Normal Distribution
• The Normal distribution is symmetric, unimodal, and bell-shaped.
• The mean, median, and mode have the same value, which is located at the center of the distribution.
• The total area under the Normal curve equals 1. The area to the left of the mean is .5 and the area to the right of the mean is .5.
• Data that lie beyond two standard deviations from the mean are rare, and data that lie beyond three standard deviations from the mean are very rare.
• Outliers are considered values falling below a z-score of -2.68 (area to the left is .0037) or above a z-score of 2.68 (area to the left is .9963).
• Many variables approximate the Normal distribution. Moreover, even if a population is skewed, the distribution of sample means drawn from it will be approximately Normal when the samples are large enough.

The 68-95-99.7 Rule (aka the Empirical Rule)
In the Normal distribution with mean µ and standard deviation σ:
• Approximately 68% of the observations fall within σ of µ.
• Approximately 95% of the observations fall within 2σ of µ.
• Approximately 99.7% of the observations fall within 3σ of µ.

Example: Suppose a sample of scores yields a mean of 100 and a standard deviation of 15. Assume that the distribution is Normal. Approximately what percent of scores should fall between 85 and 115? (Hint: Draw a diagram first!)
85 and 115 are each one standard deviation from the mean, so approximately 68% of scores fall between 85 and 115.

Let’s try some more with the same distribution. What percent of scores should fall:
a) Between 70 and 130? 95%
b) Between 55 and 145? 99.7%
c) Between 70 and 115? 13.5% + 68% = 81.5%
d) Greater than 115? 13.5% + 2.5% = 16%
e) Less than 70? 2.5%

• We know approximately what percent of data fall within 1, 2, and 3 standard deviations of the mean. But how can we find a percent if a value does not fall exactly 1, 2, or 3 standard deviations from the mean?
• The first thing we have to do is standardize our score(s). This is referred to as finding the z-score.

The Standard Normal Distribution
All Normal distributions are the same if we measure in units of size σ from the mean µ as center.
The standard Normal distribution is the Normal distribution with mean 0 and standard deviation 1.
If a variable x has any Normal distribution N(µ, σ) with mean µ and standard deviation σ, then the standardized variable
z = (x - µ) / σ
has the standard Normal distribution, N(0, 1).

Some notes on z-scores:
• Z-scores can be negative, positive, or 0.
• A positive z-score indicates that the original value is above the mean. For example, a z-score of 1.24 means that the score is 1.24 standard deviations above the mean.
• A negative z-score indicates that the original value is below the mean. For example, a z-score of -0.27 means that the score is 0.27 standard deviations below the mean.
• A z-score of 0 indicates that the original value is equal to the mean.
• Question: Are all negative z-scores bad?
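
The standardization formula z = (x - µ) / σ and the 68-95-99.7 rule can be checked numerically. The following is a minimal Python sketch and is not part of the original notes. It assumes Python 3.8 or later, where the standard library provides statistics.NormalDist. The helper names z_score and area_within are made up for this illustration.

```python
from statistics import NormalDist


def z_score(x, mean, sd):
    """Standardize x: the number of standard deviations x lies from the mean."""
    return (x - mean) / sd


def area_within(k):
    """Proportion of a Normal distribution within k standard deviations of the mean."""
    std_normal = NormalDist(mu=0, sigma=1)  # the standard Normal distribution N(0, 1)
    return std_normal.cdf(k) - std_normal.cdf(-k)


# Scores example from above: mean 100, standard deviation 15.
print(z_score(85, 100, 15))   # -1.0, one standard deviation below the mean
print(z_score(115, 100, 15))  # 1.0, one standard deviation above the mean

# Areas behind the 68-95-99.7 rule.
for k in (1, 2, 3):
    print(k, round(area_within(k), 4))  # 0.6827, 0.9545, 0.9973
```

The printed areas show that 68%, 95%, and 99.7% are rounded values, which is why the rule is stated with "approximately".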
Example: A distribution is approximately Normal with a mean of 26 and a standard deviation of 7. Calculate and interpret the z-score for a value of 21.
z = (21 - 26) / 7 ≈ -0.71
The value 21 is 0.71 standard deviations below the mean.

Example: Seth recently took two tests in school. On his history test he scored a 75. The class average on the test was 63 and the standard deviation was 3. On his biology test, Seth scored an 81. The class average on the test was 76 and the standard deviation was 7. Relatively speaking, on which test did he perform better?
History: z = (75 - 63) / 3 = 4.00
Biology: z = (81 - 76) / 7 ≈ 0.71
He performed better on his history test, because that score is more standard deviations above its class mean.

Steps to Use the Standard Normal Table
1) Draw the Normal curve. Make sure to identify the mean and standard deviation.
2) Standardize the score(s) of interest.
3) Plot the score(s), draw a vertical line(s), and shade the area of interest.
4) Look up the z-score on the standard Normal table.
• If the area of interest is shaded to the left, the value in the table is the desired area.
• If the area of interest is shaded to the right, we need to subtract the area in the table from 1.
• If the area of interest is shaded between two z-scores, we need to look up the area for both z-scores and subtract the smaller area from the larger.

Note for the AP test: For all Normal distribution problems (and other distributions which we will get to) you need to do 3 things:
1) State the distribution and identify the values of interest.
2) Show work.
3) Answer the question.

Example: A data set is Normally distributed with a mean of 259 and a standard deviation of 74. Find the area under the curve less than a score of 180.
N(259, 74)
z = (180 - 259) / 74 ≈ -1.07
• We have to look up the z-score of -1.07 on the table.
• Since the z-score starts out as “-1.0”, go to the z column and go down until you reach -1.0.
• Since there is a 7 in the hundredths place (.07), go to the right until you reach the .07 column. You should now be located in the spot that has -1.0 to the left, and .07 on the top.
This value should be .1423.

Example: A data set is Normally distributed with a mean of 26 and a standard deviation of 2.4. Find the area under the curve more than a score of 29.
N(26, 2.4)
z = (29 - 26) / 2.4 = 1.25
A z-score of 1.25 on the table gives an area of .8944. However, this is the area to the LEFT of the score. We want the area to the right of the score. Since the area under the Normal curve totals 1, we need to subtract .8944 from 1. This gives us our desired area, which is .1056.

Example: In the 2008 Wimbledon tennis tournament, Rafael Nadal averaged 115 miles per hour on his first-serves. Assuming that the distribution of his first-serve speeds is Normal with a standard deviation of 6 mph, find what proportion of his first-serves you would expect to be between 110 and 125 mph.
N(115, 6)
z = (110 - 115) / 6 ≈ -0.83, with table area .2033
z = (125 - 115) / 6 ≈ 1.67, with table area .9525
Look up both areas and subtract: .9525 - .2033 = .7492

Using the Normal Distribution in Reverse
• In 2008, the distribution of batting averages for MLB players with at least 300 plate appearances was approximately Normal with a mean of 0.272 and a standard deviation of 0.027.
• Suppose a player gets a salary bonus if his batting average is in the top 10% of all players. How well must a player hit for his batting average to be in the top 10%?
• We need to find the boundary between the lowest 90% of the distribution and the highest 10%.
• The boundary value is called the 90th percentile, because 90% of the values fall below it.
• A percentile describes location in a distribution, like quartiles.
• N(.272, .027)
• We know that the area under the curve to the left of the boundary is .90. Therefore, we want to look in the interior of the standard Normal table for the proportion closest to 0.9000 and get the z-score associated with this proportion.
• The closest value is 0.8997. This corresponds to a z-score of 1.28. This means the 90th percentile is 1.28 standard deviations above the mean.
• Now let’s find the batting average associated with this z-score: x = µ + zσ = .272 + 1.28(.027) ≈ .307. A player must hit about .307 or better to be in the top 10%.

How can we do these calculations on the TI-84?
• To find areas: normalcdf(lower, upper, mean, SD)
  • 2nd, DISTR, 2: normalcdf
• To find boundaries: invNorm(area to left, mean, SD)
  • 2nd, DISTR, 3: invNorm
• Note: mean and SD default to 0, 1 if not entered.
• Note: You must show your three steps!
  • State the distribution and identify the values of interest
  • Show work
  • Answer the question

Example: Suppose that Clayton Kershaw of the LA Dodgers throws his fastball with a mean velocity of 94 miles per hour (mph) and a standard deviation of 2 mph, and that the distribution of his fastball speeds can be modeled by a Normal distribution.

a) About what proportion of his fastballs will travel at least 100 mph?
N(94, 2)
normalcdf(100, 100000, 94, 2)
Lower bound: 100
Upper bound: 100000
Mean: 94
SD: 2
Approximately .0013 of his fastballs will travel at least 100 mph.
Note: (a) and (b) in the notes are the same question.

c) About what proportion of his fastballs will travel less than 90 mph?
N(94, 2)
normalcdf(0, 90, 94, 2)
Lower bound: 0 (use 0 because a pitch cannot be negative mph)
Upper bound: 90
Mean: 94
SD: 2
Approximately .0228 of his fastballs will travel less than 90 mph.

d) About what proportion of his fastballs will travel between 93 and 95 mph?
N(94, 2)
normalcdf(93, 95, 94, 2)
Lower bound: 93
Upper bound: 95
Mean: 94
SD: 2
Approximately .3829 of his fastballs will travel between 93 and 95 mph.

e) What is the 30th percentile of Kershaw’s distribution of fastball velocities?
N(94, 2)
invNorm(.3, 94, 2)
Area to the left: .30
Mean: 94
SD: 2
The 30th percentile is 92.9512 mph.

f) What fastball velocities would be considered low outliers for Kershaw?
N(94, 2)
The values would fall below a z-score of -2.68 (area of .0037).
Using the z-score: x = 94 + (-2.68)(2) = 88.64
On the calculator: invNorm(.0037, 94, 2)
Area to the left: .0037
Mean: 94
SD: 2
Both methods give the same answer: fastballs below 88.64 mph would be considered outliers.

g) Suppose that a different pitcher’s fastballs have a mean velocity of 92 mph and 40% of his fastballs go less than 90 mph.
What is the standard deviation of his fastball velocities, assuming his distribution of velocities can be modeled by a Normal distribution?
N(92, ?)
Use your table and work backwards to find the z-score associated with an area of .40: z = -0.25.
Now substitute into our equation z = (x - µ) / σ: -0.25 = (90 - 92) / σ, so σ = -2 / -0.25 = 8 mph.
To check: use invNorm(.4, 92, 8) and you should get approximately 90.

Normal Distribution Calculations
We can answer a question about areas in any Normal distribution by standardizing and using Table A or by using technology.

How To Find Areas In Any Normal Distribution
Step 1: State the distribution and the values of interest. Draw a Normal curve with the area of interest shaded and the mean, standard deviation, and boundary value(s) clearly identified.
Step 2: Perform calculations and show your work! Do one of the following: (i) Compute a z-score for each boundary value and use Table A or technology to find the desired area under the standard Normal curve; or (ii) use the normalcdf command and label each of the inputs.
Step 3: Answer the question.

Working Backwards: Normal Distribution Calculations
Sometimes, we may want to find the observed value that corresponds to a given percentile. There are again three steps.

How To Find Values From Areas In Any Normal Distribution
Step 1: State the distribution and the values of interest. Draw a Normal curve with the area of interest shaded and the mean, standard deviation, and unknown boundary value clearly identified.
Step 2: Perform calculations and show your work! Do one of the following: (i) Use Table A or technology to find the value of z with the indicated area under the standard Normal curve, then “unstandardize” to transform back to the original distribution; or (ii) use the invNorm command and label each of the inputs.
Step 3: Answer the question.

Assessing Normality
The Normal distributions provide good models for some distributions of real data. Many statistical inference procedures are based on the assumption that the population is approximately Normally distributed.
A Normal probability plot provides a good assessment of whether a data set follows a Normal distribution.

Interpreting Normal Probability Plots
If the points on a Normal probability plot lie close to a straight line, the plot indicates that the data are approximately Normal. Systematic deviations from a straight line indicate a non-Normal distribution. Outliers appear as points that are far away from the overall pattern of the plot.

Transforming Data

Effect of Adding (or Subtracting) a Constant
Adding the same number a to (subtracting a from) each observation:
• adds a to (subtracts a from) measures of center and location (mean, median, quartiles, percentiles), but
• does not change the shape of the distribution or measures of spread (range, IQR, standard deviation).

Effect of Multiplying (or Dividing) by a Constant
Multiplying (or dividing) each observation by the same number b:
• multiplies (divides) measures of center and location (mean, median, quartiles, percentiles) by b,
• multiplies (divides) measures of spread (range, IQR, standard deviation) by |b|, but
• does not change the shape of the distribution.
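
The two transformation rules above can be illustrated on a small data set. This is a minimal Python sketch and is not part of the original notes. The data values, the constants a and b, and the helper name summarize are all made up for this illustration, and it uses only the standard library statistics module.

```python
from statistics import mean, median, stdev


def summarize(data):
    """Return measures of center (mean, median) and spread (SD, range)."""
    return {
        "mean": mean(data),
        "median": median(data),
        "sd": stdev(data),
        "range": max(data) - min(data),
    }


# Hypothetical observations, made up for this illustration.
data = [2, 4, 4, 5, 7, 8]
a = 10   # constant added to every observation
b = -3   # constant multiplying every observation

original = summarize(data)
shifted = summarize([x + a for x in data])
scaled = summarize([x * b for x in data])

print("original:", original)
print("plus a:  ", shifted)  # mean and median move by a; SD and range unchanged
print("times b: ", scaled)   # mean and median times b; SD and range times |b|
```

A negative b is used on purpose: the measures of center change sign along with the data, while the measures of spread are multiplied by |b| and stay positive. The sketch does not check shape, which would need a graph of each version of the data.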