MM207 Statistics
Welcome to the Unit 6 Seminar
Wednesday, March 7, 2012, 8 to 9 PM ET

The Normal Shape
• This is a histogram for a distribution of 300 natural births. The left vertical axis shows the number of births for each 4-day bin*. The right vertical axis shows relative frequencies.
* A bin is a group or class.

The Normal Shape
The distribution of the birth data has a fairly distinctive shape, which is easier to see if we overlay the histogram with a smooth curve.

Characteristics of the Normal Curve
• The distribution is single-peaked. Its mode, or most common birth date, is the due date.
• The distribution is symmetric around its single peak; therefore, its median and mean are the same as its mode. The median is the due date because equal numbers of births occur before and after this date. The mean is also the due date because, for every birth before the due date, there is a birth the same number of days after the due date.
• The distribution is spread out in a way that makes it resemble the shape of a bell, so we call it a “bell-shaped” distribution.
• The total area under the curve is equal to 1.00.
• The curve approaches the horizontal axis but never touches it.

Variation in Distributions
Both distributions are normal and have the same mean of 75, but the distribution on the left has a larger standard deviation.

When Can We Expect a Normal Distribution?
A data set that satisfies the following four criteria is likely to be normally distributed:
1. Most data values are clustered near the mean, giving the distribution a well-defined single peak.
2. Data values are spread evenly around the mean, making the distribution symmetric.
3. Larger deviations from the mean become increasingly rare, producing the tapering tails of the distribution.
4. Individual data values result from a combination of many different factors, such as genetic and environmental factors.
An Example of a Normal Distribution
Consider a Consumer Reports survey in which participants were asked how long they owned their last TV set before they replaced it. The variable of interest in this survey is replacement time for television sets.
• Based on the survey, the distribution of replacement times has a mean of about 8.2 years, which we denote as µ (the Greek letter mu).
• The standard deviation of the distribution is about 1.1 years, which we denote as σ (the Greek letter sigma).

Television Replacement Distribution
Making the reasonable assumption that the distribution of TV replacement times is approximately normal, we can picture it as shown.
“mu” = µ = 8.2, “sigma” = σ = 1.1

68-95-99.7 Rule or Empirical Rule
This rule gives guidelines for the percentage of data values that will lie within 1, 2, and 3 standard deviations of the mean for any normal distribution. For the TV example (µ = 8.2, σ = 1.1):
• About 68% of values lie within 1 standard deviation of the mean. That is from 7.1 years to 9.3 years.
• About 95% of values lie within 2 standard deviations of the mean. That is from 6.0 years to 10.4 years.
• About 99.7% of values lie within 3 standard deviations of the mean. That is from 4.9 years to 11.5 years.

Finding a Percentile
On a visit to the doctor’s office, your fourth-grade daughter is told that her height is 1 standard deviation above the mean for her age and sex. What is her percentile for height? Assume that heights of fourth-grade girls are normally distributed.
• Recall that a data value lies in the nth percentile of a distribution if n% of the data values are less than or equal to it (see Section 4.3).
• According to the 68-95-99.7 rule, 68% of the heights are within 1 standard deviation of the mean.
• Therefore, 34% of the heights (half of 68%) are between 0 and 1 standard deviation above the mean.
• We also know that, because the distribution is symmetric, 50% of all heights are below the mean.
• Therefore, 50% + 34% = 84% of all heights are less than 1 standard deviation above the mean (Figure 5.21).
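The interval arithmetic for the TV example and the 50% + 34% percentile step can be checked with a short Python sketch. It assumes Python 3.8 or later for `statistics.NormalDist`; the function names `empirical_intervals` and `percentile_at` are illustrative and do not come from the seminar or from StatCrunch.

```python
from statistics import NormalDist

MU, SIGMA = 8.2, 1.1  # TV replacement times from the survey, in years


def empirical_intervals(mu, sigma):
    """Return {k: (low, high)} for k = 1, 2, 3 standard deviations from the mean."""
    return {k: (round(mu - k * sigma, 1), round(mu + k * sigma, 1)) for k in (1, 2, 3)}


def percentile_at(z):
    """Percentile (0 to 100) of a standard score z under a normal model."""
    return 100 * NormalDist().cdf(z)


for k, (low, high) in empirical_intervals(MU, SIGMA).items():
    print(f"within {k} standard deviation(s): {low} to {high} years")

# The 68-95-99.7 rule gives 84%; the exact normal value is slightly higher.
print(f"z = +1 is at about the {percentile_at(1):.1f}th percentile")
```

The rule-of-thumb answer (84%) and the exact value differ only in the first decimal place, which is why the empirical rule is good enough for quick estimates.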
Your daughter is in the 84th percentile for heights among fourth-grade girls.

Finding a Percentile
Interpretation: Find the percentile for 1 standard deviation above the mean for her age and sex. Assume that heights of fourth-grade girls are normally distributed.
What would her percentile be if she were 1 standard deviation BELOW the mean?

Introduction to Standard Scores
• Remember the Empirical Rule!!!
• Sample Curve
• μ = 500
• σ = 100
• How many standard deviations away from the mean is:
  • 300
  • 800
  • 250
  • 500
  • 650
[Figure: sample normal curve. Horizontal axis values 200, 300, 400, 500, 600, 700, 800 correspond to standard scores -3, -2, -1, 0, +1, +2, +3.]

Computing Standard Scores
• The number of standard deviations a data value lies above or below the mean is called its standard score (or z-score), defined by
  z = standard score = (data value – mean) / standard deviation = (x – µ) / σ
• The standard score is positive for data values above the mean and negative for data values below the mean.

Getting More Precise
Standard Scores and Percentiles
Once we know the standard score of a data value, the properties of the normal distribution allow us to find its percentile in the distribution. This is usually done with a standard score table. (In the eText, see “chapter BM”, for Back Matter, to get Appendix A on pages 446-447.)

Example 1
Example 2
A college admissions test is scaled so that scores have a mean of 500 and a standard deviation of 100. (You will use StatCrunch, but you must understand the theory.)

Finding Z Scores from Percentiles (i.e., working backwards)
Example: Given a mean cholesterol level of 178 and a standard deviation of 41, what cholesterol level corresponds to the 90th percentile?
The 90th percentile would be on the POSITIVE z table, since it is larger than the 50th percentile. Right? Go to that table and SCAN the body looking for the value closest to .9000 (the 90th percentile). Move your finger back to the left to get the x.y part of the z score. Move your finger up to see the .0z part of the score. Now add these values to make the score x.yz.
All z scores have 2 digits to the right of the decimal. So moving to the left we get 1.2; moving up we see .08; adding these gives 1.2 + .08 = 1.28 as our z score.

Now z = 1.28. Thus, the 90th percentile is about 1.28 standard deviations above the mean. Finally, express this z score in terms of the problem application, that is, as an x value. Use the formula z = (x – µ) / σ and solve for x. You can do the algebra, or just trust me that it is x = µ + z * σ.
For our problem, x = 178 + (1.28 * 41) = 230.48.
Therefore, a cholesterol level of about 230.48, or 230, corresponds to the 90th percentile.

The Central Limit Theorem
Suppose we take many random samples of size n for a variable with any distribution (not necessarily a normal distribution) and record the distribution of the means of each sample. Then,
1. The distribution of means will be approximately a normal distribution for large sample sizes. A common rule of thumb is n > 30.
2. The mean of the distribution of means approaches the population mean, µ, for large sample sizes.
3. The standard deviation of the distribution of means approaches σ/√n for large sample sizes, where σ is the standard deviation of the population.

The Interpretation of the Central Limit Theorem
If you have a group of size n instead of one individual selection (like the problems we did earlier), the only difference in working the problem is how you COMPUTE the z score. Use the formula
z = (given sample mean – µ) / (σ/√n)
Also, see Example 1 of Section 5.3 called Predicting Test Score. Be sure to notice the difference between part a and part b. In part a, you have ONE person, and in part b you have a group of 100 people.
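The two calculations above, working backwards from a percentile and computing a z score for a group mean, can be sketched in Python. This is a minimal illustration that assumes Python 3.8 or later for `statistics.NormalDist`; the function names are made up for this sketch, and the sample mean of 510 is a hypothetical value chosen only to show the one-person versus group-of-100 contrast.

```python
from math import sqrt
from statistics import NormalDist


def z_for_percentile(p):
    """Standard score whose area to the left is p (0 < p < 1)."""
    return NormalDist().inv_cdf(p)


def value_from_z(z, mu, sigma):
    """Solve z = (x - mu) / sigma for x."""
    return mu + z * sigma


def z_for_sample_mean(sample_mean, mu, sigma, n):
    """Standard score of a sample mean, using sigma / sqrt(n) as the spread."""
    return (sample_mean - mu) / (sigma / sqrt(n))


# Cholesterol example: mean 178, standard deviation 41, 90th percentile.
z90 = z_for_percentile(0.90)
print(f"z for the 90th percentile: {z90:.2f}")
print(f"cholesterol level: {value_from_z(1.28, 178, 41):.2f}")

# Hypothetical contrast on the admissions-test scale (mean 500, sd 100):
# a score of 510 for ONE person versus a mean of 510 for a group of 100.
print(f"one person:   z = {z_for_sample_mean(510, 500, 100, 1):.1f}")
print(f"group of 100: z = {z_for_sample_mean(510, 500, 100, 100):.1f}")
```

The same value sits ten times as many standard deviations from the mean when it is the average of 100 people, because the spread of sample means shrinks by a factor of √100 = 10.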
CLT Demonstrated (Figure 5.26)

The Value of the Central Limit Theorem
• The Central Limit Theorem allows us to say something about the mean of a group if we know the mean, µ, and the standard deviation, σ, of the entire population. This can be useful, but it turns out that the opposite application is far more important.
• Two major activities of statistics are making estimates of population means and testing claims about population means. Is it possible to make a good estimate of the population mean knowing only the mean of a much smaller sample?
• As you can probably guess, being able to answer this type of question lies at the heart of statistical sampling, especially in polls and surveys. The Central Limit Theorem provides the key to answering such questions.

Computing Probabilities in MSL
EASY!

Example 1
Note: the icon here is not data for the problem but standard scores for the specific distribution given in this problem.

Example 1 - Part a: Find the percentage of scores greater than 1866.
Choose Calculator -> Normal, then put in the mean, the standard deviation, and the value in question, 1866. Be sure to choose => for “greater than or equal”. Click Compute to get the graph and answer shown in the second picture below. The answer is .15865, which as a percentage is 15.865% and rounds to 15.87% with two decimals, as asked for in the question.

Example 1 - Part c: Find the percentage of scores between 1389 and 2184.
There are 3 steps to this one. To get the area between two values, you must subtract the area to the left of the LEFTMOST value FROM the area to the left of the RIGHTMOST value. Compute the percentage less than 2184; compute the percentage less than 1389; then subtract. From below you see .97724986 - .30853754 = .66871232, which is 66.871% and rounded to 2 places is 66.87%.

Example 2 - Part c: Find the probability that the mean blood pressure is less than 111 for a sample of 280 women.
Since n > 1, this is a Central Limit Theorem problem. Be sure to compute the new standard deviation as sigma / sqrt(n) before plugging into StatCrunch.
Standard deviation = 13.22 / sqrt(280) = 13.22 / 16.73 = .79, or about .8

Questions?
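For anyone who wants to check the StatCrunch steps outside of StatCrunch, the three calculations in the examples (an upper-tail area, an area between two values, and the σ/√n adjustment) can be sketched in Python. This is only an illustration, not StatCrunch's own method. It assumes Python 3.8 or later, and the function names are made up. The slides do not reproduce the mean and standard deviation for Example 1, so the standard scores 1, -0.5, and 2 used below are inferred from the areas reported on the slides and should be treated as an assumption.

```python
from math import sqrt
from statistics import NormalDist


def area_greater(x, mu=0.0, sigma=1.0):
    """Area to the right of x under a normal curve."""
    return 1 - NormalDist(mu, sigma).cdf(x)


def area_between(low, high, mu=0.0, sigma=1.0):
    """Area to the left of high minus area to the left of low."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(high) - dist.cdf(low)


def standard_error(sigma, n):
    """Standard deviation of the sample mean: sigma / sqrt(n)."""
    return sigma / sqrt(n)


# Part a: the slide reports .15865, which matches the area above z = 1
# (z = 1 is inferred here, not stated on the slide).
print(f"{area_greater(1.0):.4f}")

# Part c: the slide reports .97724986 - .30853754, which matches the
# areas to the left of z = 2 and z = -0.5 (again inferred).
print(f"{area_between(-0.5, 2.0):.4f}")

# Example 2: new standard deviation for a sample of 280 women.
print(f"{standard_error(13.22, 280):.2f}")
```

With the real mean and standard deviation from the problem, the same functions take them as the `mu` and `sigma` arguments, for example `area_greater(1866, mu, sigma)`.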