Chapter 18: Sampling Distribution Models

How can surveys conducted at essentially the same time by organizations asking the same questions get different results? You probably easily accept that observations vary, but how much variability among polls should we expect to see? The proportions vary from sample to sample because the samples are composed of different people.

The Central Limit Theorem for Sample Proportions

We want to imagine the results from all the random samples of the same size that we didn’t take. What would the histogram of all the sample proportions look like? The center of the histogram will be at the true proportion in the population, and we call that p. How about the shape of the histogram? To figure this out, we could actually simulate a bunch of random samples that we didn’t really draw. It should be no surprise that we don’t get the same proportion for each sample we draw. Each p̂ comes from a different simulated sample.

If you create a histogram of the proportions from all the possible samples, it would be called the sampling distribution of the proportions. It turns out that the sampling distribution of the proportions is centered at the true proportion in the population (p) and is unimodal and symmetric in shape. It looks like we can use the Normal model for the histogram of the sample proportions.

Sample proportions vary from sample to sample, and that is one of the very powerful ideas of this course. A sampling distribution model for how a sample proportion varies from sample to sample allows us to quantify that variation and to talk about how likely it is that we’d observe a sample proportion in any particular interval. (For example, if you took several different 10-person random samples asking whether or not people believed in ghosts, how likely is it that one group would get 10% while another would get 90%? Is there a reason why these results occurred? Was the survey biased? What’s going on?)
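The simulation idea described above can be sketched in a few lines of Python. This is only an illustrative sketch: the true proportion (0.30), the sample size (100), the number of samples (2000), and the function name simulate_sample_proportions are all assumptions chosen for the example, not values from these notes.

```python
import random
import statistics
from collections import Counter


def simulate_sample_proportions(p, n, num_samples, seed=0):
    """Return the sample proportion (p-hat) from each of num_samples simulated samples."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    p_hats = []
    for _ in range(num_samples):
        # Each of the n sampled individuals is a "success" with probability p.
        successes = sum(1 for _ in range(n) if rng.random() < p)
        p_hats.append(successes / n)
    return p_hats


# Hypothetical values, not taken from the notes.
p_true = 0.30
p_hats = simulate_sample_proportions(p_true, n=100, num_samples=2000)

print("center of simulated p-hats:", round(statistics.mean(p_hats), 3))
print("spread of simulated p-hats:", round(statistics.pstdev(p_hats), 3))

# Crude text histogram: bins are 0.02 wide, one '#' per 10 samples.
counts = Counter(round(ph * 100) // 2 for ph in p_hats)
for b in sorted(counts):
    print(f"{b * 0.02:.2f} {'#' * (counts[b] // 10)}")
```

If the sketch behaves as expected, the text histogram should pile up around the true proportion and tail off roughly evenly on both sides, which is the unimodal, symmetric shape described above.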
To use the Normal model we need two parameters: mean and standard deviation. We will put the mean, μ, at the center of the histogram, p. Once we know the mean p, we can find the standard deviation with the formula $\sigma(\hat{p}) = SD(\hat{p}) = \sqrt{\frac{pq}{n}}$, where q = 1 - p. As long as the sample size is reasonably large, we can model the distribution of sample proportions with a probability model that is $N\left(p, \sqrt{\frac{pq}{n}}\right)$.

And don’t forget that once you have a Normal model, you can use the 68-95-99.7 rule or look up other probabilities using a table or calculator.

This sample-to-sample variation is what we mean by sampling error. It’s not really an error at all! It’s just variability you’d expect to see from one sample to another. A better term would be sampling variability.

The Normal model becomes a better and better representation of the distribution of the sample proportions as the sample size gets bigger. (Formally, we say the claim is true in the limit as n grows.)

Assumptions & Conditions

To use a model, we usually must make some assumptions. To use the sampling distribution model for sample proportions, we need two assumptions:
1. The Independence Assumption: the sampled values must be independent of each other.
2. The Sample Size Assumption: the sample size n must be large enough.

Conditions:
1. Randomization Condition: if your data come from an experiment, subjects should have been randomly assigned to treatments. If you have a survey, your sample should be a simple random sample of the population. The sampling method should not be biased and the data should be representative of the population.
2. 10% Condition: the sample size n must be no larger than 10% of the population.
3. Success/Failure Condition: the sample size has to be big enough so that we expect at least 10 successes and at least 10 failures (np ≥ 10 and nq ≥ 10).

The last two conditions seem to contradict each other: the Success/Failure Condition asks for a large enough sample, while the 10% Condition caps its size. As always, you would be correct to say that a larger sample is better.
It’s just that if the sample were more than 10% of the population we’d need to use different methods to analyze the data. Luckily, this is not usually a problem in practice.

A Sampling Distribution Model for a Proportion

We’ve simulated repeated samples and looked at a histogram of the sample proportions. We modeled that histogram with a Normal model. We model it because this model will give us insight into how much the sample proportion can vary from sample to sample. The fact that the sample proportion is a random variable taking on a different value in each random sample, and that we can say something specific about the distribution of those values, is a fundamental insight that will be used in each of the next four chapters.

No longer is a proportion something we just compute for a set of data. We now see it as a random variable that has a probability distribution, and thanks to Laplace we have a model for that distribution. We call that the sampling distribution model for the proportion.

SAMPLING DISTRIBUTION MODEL FOR A PROPORTION
Provided that the sampled values are independent and the sample size is large enough, the sampling distribution of p̂ is modeled by a Normal model with mean $\mu(\hat{p}) = p$ and standard deviation $\sigma(\hat{p}) = SD(\hat{p}) = \sqrt{\frac{pq}{n}}$, where q = 1 - p.

Sampling models are what make statistics work. They inform us about the amount of variation we should expect when we sample. The sampling model quantifies the variability, telling us how surprising any sample proportion is. Rather than thinking about the sample proportion as a fixed quantity calculated from our data, we now think of it as a random variable: our value is just one of the many we might have seen had we chosen a different random sample.

Example: The Centers for Disease Control and Prevention report that 22% of 18-year-old women in the US have a body mass index of 25 or higher, a value considered by the National Heart, Lung, and Blood Institute to be associated with increased health risk.
As part of a routine health check at a large college, the physical education department usually requires students to come in to be measured and weighed. This year, the department decided to try out a self-report system. It asked 200 randomly selected female students to report their heights and weights. Only 31 of these students had BMIs greater than 25. Is this proportion of high-BMI students unusually small?

Example: Suppose in a large class of introductory statistics students, the professor has each person toss a coin 16 times and calculate the proportion of his or her tosses that were heads. Suppose the class repeats the coin-tossing experiment.
a) The students toss the coins 25 times each now. Use the 68-95-99.7 rule to describe the sampling distribution model.
b) Confirm that you can use a Normal model here.
c) They increase the number of tosses to 64 each. Draw and label the appropriate sampling distribution model.
d) Explain how the sampling distribution model changes as the number of tosses increases.

End of Day 1: Complete pages 432-433 #2, 5, 6, 8, 9

Example: Suppose that about 13% of the population is left-handed. A 200-seat auditorium has been built with 15 left-handed seats, that is, seats that have the built-in desk on the left rather than the right arm of the chair.
Question: In a class of 90 students, what’s the probability that there will not be enough seats for the left-handed students?

What About Quantitative Data?

The good news is that we can do something similar when we talk about quantitative data!

Simulating the Sampling Distribution of a Mean

Please read/consult pages 420-421 for the simple simulation of dice tosses and their averages. Remember that the Law of Large Numbers says that as the sample size gets larger, each sample average is more likely to be closer to the population mean. The shape of the distribution becomes bell-shaped, and not just bell-shaped: it’s approaching the Normal model.
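For readers without the textbook pages at hand, a dice simulation of this kind might look like the sketch below. It is not the book's own simulation: the numbers of dice (1, 2, 5, 20), the 5000 trials, and the function name dice_averages are assumptions made for illustration.

```python
import random
import statistics


def dice_averages(num_dice, num_trials, seed=1):
    """Roll num_dice fair dice num_trials times; return the average of each roll."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        sum(rng.randint(1, 6) for _ in range(num_dice)) / num_dice
        for _ in range(num_trials)
    ]


results = {}
for num_dice in (1, 2, 5, 20):
    averages = dice_averages(num_dice, num_trials=5000)
    # Store the center and spread of the simulated averages.
    results[num_dice] = (statistics.mean(averages), statistics.pstdev(averages))
    center, spread = results[num_dice]
    print(f"{num_dice:>2} dice: mean of averages = {center:.2f}, "
          f"SD of averages = {spread:.2f}")
```

The averages should stay centered near 3.5, the mean of a single fair die, while their spread shrinks as more dice are averaged.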
But can we count on this happening for situations other than dice throws? It turns out that Normal models work well amazingly often.

The Fundamental Theorem of Statistics

The dice simulation may look like a special situation, but it turns out that what we saw with dice is true for means of repeated samples for almost every situation. And there are almost no conditions at all! The sampling distribution of any mean becomes more nearly Normal as the sample size grows. All we need is for the observations to be independent and collected with randomization. We don’t even care about the shape of the population distribution!

This surprising fact is the result Laplace proved in a fairly general form in 1810. At the time, Laplace’s theorem caused quite a stir because it is so unintuitive. Laplace’s result is called the Central Limit Theorem (CLT).

Not only does the distribution of means of many random samples get closer and closer to a Normal model as the sample size grows, this is true regardless of the shape of the population distribution! It works better and faster the closer the population distribution is to a Normal model. And it works better for larger samples.

THE CENTRAL LIMIT THEOREM (CLT)
The mean of a random sample is a random variable whose sampling distribution can be approximated by a Normal model. The larger the sample, the better the approximation will be.
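One way to see the theorem at work is to draw samples from a population that is clearly not Normal and watch the skewness of the sample means fade as n grows. The sketch below assumes a right-skewed exponential population with mean 1; that population, the sample sizes, and the helper names skewness and sample_means are illustrative choices, not part of the notes.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: the average cubed z-score (near 0 for a symmetric distribution)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def sample_means(n, num_samples, rng):
    """Means of num_samples random samples of size n from a right-skewed population."""
    # Assumed population: exponential with mean 1 (strongly right-skewed).
    return [
        sum(rng.expovariate(1.0) for _ in range(n)) / n
        for _ in range(num_samples)
    ]


rng = random.Random(2)  # fixed seed so the run is repeatable
skew_by_n = {}
for n in (1, 5, 30, 100):
    means = sample_means(n, num_samples=4000, rng=rng)
    skew_by_n[n] = skewness(means)
    print(f"n = {n:>3}: mean of sample means = {statistics.mean(means):.2f}, "
          f"skewness = {skew_by_n[n]:.2f}")
```

With n = 1 the "means" are just individual observations and inherit the population's strong skew; for larger n the skewness should move toward 0, consistent with a more nearly Normal shape.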
Assumptions & Conditions

Independence Assumption: the sampled values must be independent of each other.
Sample Size Assumption: the sample size must be sufficiently large.
Randomization Condition: the data values must be sampled randomly or the concept of a sampling distribution makes no sense.
10% Condition: when the sample is drawn without replacement, the sample size, n, should be no more than 10% of the population.
Large Enough Sample Condition: although the CLT tells us that a Normal model is useful in thinking about the behavior of sample means when the sample size is large enough, it doesn’t tell us how large a sample we need. The truth is, it depends; there’s no one-size-fits-all rule. If the population is unimodal and symmetric, even a fairly small sample is okay. If the population is strongly skewed, it can take a pretty large sample to allow use of a Normal model to describe the distribution of sample means. For now you’ll just need to think about your sample size in the context of what you know about the population, and then tell whether you believe the Large Enough Sample Condition has been met.

But Which Normal Model Should We Use?

For means, our center is at the population mean. What about standard deviations though? Means vary less than the individual observations. Which would be more surprising: having one person in your stat class who is over 6’9” tall, or having the mean height of 100 students taking the course be over 6’9”? The second situation just won’t happen because means have smaller standard deviations than individuals.

THE SAMPLING DISTRIBUTION MODEL FOR A MEAN (CLT)
When a random sample is drawn from any population with mean μ and standard deviation σ, its sample mean, ȳ, has a sampling distribution with the same mean μ but whose standard deviation is $\frac{\sigma}{\sqrt{n}}$. No matter what population the random sample comes from, the shape of the sampling distribution is approximately Normal as long as the sample size is large enough.
The larger the sample used, the more closely the Normal approximates the sampling distribution for the mean.

So, when we have categorical data, we calculate a sample proportion, p̂. The sampling distribution of this random variable has a Normal model with a mean at the true proportion p and a standard deviation of $SD(\hat{p}) = \sqrt{\frac{pq}{n}}$. (We’ll use this model in chapters 19-22.)

When we have quantitative data, we calculate a sample mean, ȳ. The sampling distribution of this random variable has a Normal model with a mean at the true mean, μ, and a standard deviation of $SD(\bar{y}) = \frac{\sigma}{\sqrt{n}}$. We’ll use this model in chapters 23-25.

These are standard deviations of the statistics p̂ and ȳ. They both have a square root of n in the denominator. That tells us that the larger the sample, the less either statistic will vary.

Example: A college physical education department asked a random sample of 200 female students to self-report their heights and weights, but the percentage of students with BMIs over 25 seemed suspiciously low. One possible explanation may be that the respondents ‘shaded’ their weights down a bit. The CDC reports that the mean weight of 18-year-old women is 143.74 lbs, with a standard deviation of 51.54 lbs, but these 200 randomly selected women reported a mean weight of only 140 lbs.
Question: Based on the Central Limit Theorem and the 68-95-99.7 Rule, does the mean weight in this sample seem exceptionally low, or might this just be random sample-to-sample variation?

Example: The Centers for Disease Control and Prevention reports that the mean weight of adult men in the US is 190 lbs with a standard deviation of 59 lbs.
Question: An elevator in our building has a weight limit of 10 persons or 2500 lbs. What’s the probability that if 10 men get on the elevator, they will overload its weight limit?

Means vary less than individual data values. Averages should be more consistent for larger groups.
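To make the role of the square root of n concrete, the two standard deviation formulas above can be evaluated for a few sample sizes. The sketch below reuses numbers quoted in the examples (σ = 51.54 lbs, p = 0.22, and the reported mean of 140 lbs against a population mean of 143.74 lbs); the sample sizes other than 200 and the function names are assumptions for illustration.

```python
import math


def sd_of_sample_mean(sigma, n):
    """SD(y-bar) = sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


def sd_of_sample_proportion(p, n):
    """SD(p-hat) = sqrt(p * q / n), with q = 1 - p."""
    return math.sqrt(p * (1 - p) / n)


sigma = 51.54  # SD of weights (lbs) quoted in the CDC example above
p = 0.22       # proportion quoted in the earlier BMI example

# Sample sizes are arbitrary; each is four times the previous one.
for n in (50, 200, 800, 3200):
    print(f"n = {n:>4}: SD(y-bar) = {sd_of_sample_mean(sigma, n):.2f} lbs, "
          f"SD(p-hat) = {sd_of_sample_proportion(p, n):.4f}")

# z-score of the reported mean weight (140 lbs) under the model for n = 200
z = (140 - 143.74) / sd_of_sample_mean(sigma, 200)
print(f"z = {z:.2f}")
```

Each fourfold increase in n only halves either standard deviation. The z-score printed at the end is the quantity to compare against the 68-95-99.7 Rule when judging whether the reported mean weight is unusual.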
It’s harder to get the standard deviation down, though, because that nasty square root limits how much we can make a sample tell about the population. This is an example of something that’s known as the Law of Diminishing Returns.

The Real World and the Model World

BE CAREFUL! We have been slipping smoothly between the real world, in which we draw random samples of data, and a magical mathematical model world, in which we describe how the sample means and proportions we observe in the real world behave as random variables in all the random samples that we might have drawn. Now we have two distributions to deal with. The first is the real-world distribution of the sample, which we might display with a histogram (for quantitative data) or a bar chart or table (for categorical data). The second is the math-world sampling distribution model of the statistic, a Normal model based on the Central Limit Theorem. Don’t confuse the two.

For example, don’t mistakenly think the CLT says that the data are Normally distributed as long as the sample is large enough. In fact, as samples get larger, we expect the distribution of the data to look more and more like the population from which they are drawn (skewed, bimodal, whatever) but not necessarily Normal. You can collect a sample of CEO salaries for the next 1000 years but the histogram will never look Normal. The Central Limit Theorem doesn’t talk about the distribution of the data from the sample. It talks about the sample means and sample proportions of many different random samples drawn from the same population.

Just Checking:
1. Human gestation times have a mean of about 266 days, with a standard deviation of about 16 days. If we record the gestation times of a sample of 100 women, do we know that a histogram of the times will be well modeled by a Normal model?
2. Suppose we look at the average gestation times for a sample of 100 women.
If we imagined all the possible random samples of 100 women we could take and looked at the histogram of all the sample means, what shape would it have?
3. Where would the center of that histogram be?
4. What would be the standard deviation of that histogram?

Sampling Distribution Models

At the heart is the idea that the statistic itself is a random variable. We can’t know what our statistic will be because it comes from a random sample. This sample-to-sample variability is what generates the sampling distribution. The sampling distribution shows us the distribution of possible values that the statistic could have had.

The two basic truths about sampling distributions are:
1. Sampling distributions arise because samples vary. Each random sample will contain different cases and, so, a different value of the statistic.
2. Although we can always simulate a sampling distribution, the Central Limit Theorem saves us the trouble for means and proportions.

Example: A tire manufacturer designed a new tread pattern for its all-weather tires. Repeated tests were conducted on cars of approximately the same weight traveling at 60 miles per hour. The tests showed that the new tread pattern enables the cars to stop completely in an average distance of 125 feet with a standard deviation of 6.5 feet and that the stopping distances are approximately normally distributed.
(a) What is the 70th percentile of the distribution of stopping distances?
(b) What is the probability that at least 2 cars out of 5 randomly selected cars in the study will stop in a distance that is greater than the distance calculated in part (a)?
(c) What is the probability that a randomly selected sample of 5 cars in the study will have a mean stopping distance of at least 130 feet?

Example: