Chapter 18 Sampling Distribution Models

VOCABULARY
Parameter – a number that describes the population. Its value is usually not known.
Statistic – a number that can be computed from the sample data without making use of any unknown parameters.
Sampling distribution – the distribution of values taken by the statistic in all possible samples of the same size from the same population.

VOCABULARY CONTINUED
Population proportion – the proportion for the entire population, written p.
Sample proportion – the proportion calculated for the sample taken, written p̂.

SAMPLING DISTRIBUTION OF A SAMPLE PROPORTION
Choose an SRS of size n from a large population with population proportion p having some characteristic of interest. Let p̂ be the proportion of the sample having that characteristic. Then:
The sampling distribution of p̂ is approximately Normal.
The mean of the sampling distribution is p.
The standard deviation of the sampling distribution is $\sqrt{\dfrac{p(1-p)}{n}}$.

Assumptions and Conditions
The Normal model gets better as the sample size gets bigger.
Most models are useful only when specific assumptions are true.
There are two assumptions in the case of the model for the distribution of sample proportions:
1. The sampled values must be independent of each other.
2. The sample size, n, must be large enough.

Assumptions and Conditions (cont.)
Assumptions are hard, often impossible, to check. That’s why we assume them.
Still, we need to judge whether the assumptions are reasonable by checking conditions that provide information about the assumptions.
The corresponding conditions to check before using the Normal model for the distribution of sample proportions are the 10% Condition and the Success/Failure Condition.

RULES
10% Condition: Use the formula for the standard deviation of p̂ only when the population is at least 10 times as large as the sample.
Success/Failure Condition: Use the Normal approximation to the sampling distribution of p̂ only for values of n and p that satisfy
$np \ge 10$ and $n(1-p) \ge 10$.

EXAMPLE
You ask an SRS of 1500 first-year college students whether they applied to any other college. There are over 1.7 million first-year college students, and 35% of all first-year students applied to other colleges. What is the probability that your sample will give a result within 2 percentage points of this true value? That is, find
$P(0.33 \le \hat{p} \le 0.37)$.
Both conditions hold here: the population of 1.7 million is far more than 10 × 1500 = 15,000, and np = 525 and n(1 − p) = 975 are both at least 10.

EXAMPLE CONTINUED
Step 1: Calculate the mean and standard deviation.
mean $= p = 0.35$
standard deviation $= \sqrt{\dfrac{(0.35)(1 - 0.35)}{1500}} \approx 0.0123$
Step 2: Standardize the scores.
$z = \dfrac{0.33 - 0.35}{0.0123} \approx -1.626$
$z = \dfrac{0.37 - 0.35}{0.0123} \approx 1.626$

EXAMPLE CONTINUED
Step 3: Find the probability.
$P(-1.626 < z < 1.626) = P(z < 1.626) - P(z < -1.626) = 0.9484 - 0.0516 = 0.8968$
So almost 90% of all samples will give a result within 2 percentage points of the true population proportion.

VOCABULARY
Parameters – the mean and standard deviation of a population, μ and σ.
Statistics – the mean and standard deviation computed from the sample data, x̄ and s.

MEAN AND STANDARD DEVIATION OF A SAMPLE MEAN
Suppose that x̄ is the mean of an SRS of size n drawn from a large population with mean μ and standard deviation σ. Then the mean of the sampling distribution of x̄ is μ and its standard deviation is $\dfrac{\sigma}{\sqrt{n}}$.

The Fundamental Theorem of Statistics
The sampling distribution of any mean becomes more nearly Normal as the sample size grows.
All we need is for the observations to be independent and collected with randomization.
We don’t even care about the shape of the population distribution!
The Fundamental Theorem of Statistics is called the Central Limit Theorem (CLT).
The CLT works better (and faster) the closer the population model is to a Normal itself. It also works better for larger samples.
http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/
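The claims about the sample mean (its sampling distribution is centered at μ, has standard deviation σ/√n, and looks more Normal as n grows) can be illustrated with a small simulation. The sketch below is illustrative only: the exponential population, the sample sizes, the number of repetitions, the seed, and the helper names `simulate_sample_means` and `skewness` are all assumptions chosen for this demonstration, not part of the chapter.

```python
import random
from math import sqrt
from statistics import fmean, pstdev


def skewness(values):
    """Average cubed z-score; values near 0 indicate a symmetric shape."""
    m = fmean(values)
    s = pstdev(values, mu=m)
    return fmean(((v - m) / s) ** 3 for v in values)


def simulate_sample_means(n, reps, rng):
    """Draw `reps` samples of size n and return the mean of each sample.

    Assumed population: exponential with rate 1, so mu = 1 and sigma = 1.
    It is strongly right-skewed, which makes it a hard case for the CLT.
    """
    return [fmean(rng.expovariate(1.0) for _ in range(n)) for _ in range(reps)]


rng = random.Random(18)  # fixed seed so the run is repeatable
results = {}
for n in (2, 10, 50):
    means = simulate_sample_means(n, reps=20000, rng=rng)
    results[n] = (fmean(means), pstdev(means), skewness(means))
    center, spread, skew = results[n]
    # Compare the simulated spread with the theoretical sigma / sqrt(n).
    print(f"n={n:2d}  mean={center:.3f}  sd={spread:.3f}  "
          f"sigma/sqrt(n)={1 / sqrt(n):.3f}  skewness={skew:.3f}")
```

In a run of this sketch, the simulated means should stay close to 1, the simulated standard deviations should track σ/√n, and the skewness should shrink toward 0 as n increases, which is the sense in which the sampling distribution approaches a Normal shape.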
The Central Limit Theorem (CLT)
The mean of a random sample has a sampling distribution whose shape can be approximated by a Normal model. The larger the sample, the better the approximation will be.

Assumptions and Conditions
The CLT requires remarkably few assumptions, so there are few conditions to check:
1. Random Sampling Condition: The data values must be sampled randomly, or the concept of a sampling distribution makes no sense.
2. Independence Assumption: The sample values must be mutually independent. (When the sample is drawn without replacement, check the 10% Condition…)
3. Large Enough Sample Condition: There is no one-size-fits-all rule.

Standard Error
When we don’t know p or σ, we’re stuck, right?
Nope. We will use sample statistics to estimate these population parameters.
Whenever we estimate the standard deviation of a sampling distribution, we call it a standard error.

Standard Error (cont.)
For a sample proportion, the standard error is
$SE(\hat{p}) = \sqrt{\dfrac{\hat{p}\hat{q}}{n}}$, where $\hat{q} = 1 - \hat{p}$.
For the sample mean, the standard error is
$SE(\bar{y}) = \dfrac{s}{\sqrt{n}}$.

What Can Go Wrong?
Don’t confuse the sampling distribution with the distribution of the sample.
When you take a sample, you look at the distribution of the values, usually with a histogram, and you may calculate summary statistics.
The sampling distribution is an imaginary collection of the values that a statistic might have taken for all random samples: the one you got and the ones you didn’t get.

What Can Go Wrong? (cont.)
Beware of observations that are not independent.
The CLT depends crucially on the assumption of independence.
You can’t check this with your data; you have to think about how the data were gathered.
Watch out for small samples from skewed populations.
The more skewed the population distribution, the larger the sample size we need for the CLT to work.
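The two standard error formulas from this section, for a sample proportion and for a sample mean, can be sketched in a few lines of code. This is a minimal illustration, not a prescribed method: the function names `se_proportion` and `se_mean` and all of the data values below are hypothetical, made up only to show the arithmetic.

```python
from math import sqrt
from statistics import stdev


def se_proportion(successes, n):
    """Standard error of a sample proportion: sqrt(p_hat * q_hat / n)."""
    p_hat = successes / n
    q_hat = 1 - p_hat
    return sqrt(p_hat * q_hat / n)


def se_mean(values):
    """Standard error of a sample mean: s / sqrt(n), with s the sample SD."""
    return stdev(values) / sqrt(len(values))


# Hypothetical survey result: 540 of 1500 sampled students said yes.
se_p = se_proportion(540, 1500)

# Hypothetical sample of eight measurements (for example, hours of sleep).
hours = [6.5, 7.0, 8.0, 5.5, 7.5, 6.0, 7.0, 8.5]
se_y = se_mean(hours)

print(f"SE(p-hat) = {se_p:.4f}")
print(f"SE(y-bar) = {se_y:.4f}")
```

Both functions use only sample statistics (p̂, q̂, and s), which is what separates a standard error from the standard deviation formulas that require the unknown p or σ.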