Chapter 18 – Sampling Distribution Models

How accurate is our sample?
Sometimes different polls show different results for the same question.
Since each poll samples a different group of people, we should expect some variation in the results.
We could try drawing lots of samples and looking at the variation amongst those samples.

Experiment: Simulating a sample
A recent US Census Bureau study (source) reports that about 30% of Americans 25 or older have a Bachelor’s degree.
Open up a blank Minitab worksheet and let’s generate some random data:
  Calc > Random Data > Bernoulli
  Number of rows to generate: 200
  Store in columns: C1-C20
  Event probability: 0.3

Proportion estimates for samples of size 5
We can treat each row as a sample. Because every value is 0 or 1, the mean of a row is the proportion of successes in that sample.
Samples of size 5:
  Calc > Row Statistics > Mean
  Input variables: C1-C5
  Store result in: C21
Look at these sample proportions. Are they close to the population proportion of 30%?
Draw a histogram of the sample proportions in C21.

Proportion estimates for samples of size 10
Samples of size 10:
  Calc > Row Statistics > Mean
  Input variables: C1-C10
  Store result in: C22
Look at these sample proportions. Are they close to the population proportion of 30%?
Draw a histogram of the sample proportions in C22.

Proportion estimates for samples of size 20
Samples of size 20:
  Calc > Row Statistics > Mean
  Input variables: C1-C20
  Store result in: C23
Look at these sample proportions. Are they close to the population proportion of 30%?
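For readers without Minitab, the same experiment can be sketched in Python. This is only an illustrative sketch, not the Minitab procedure itself: the function name simulate_sample_proportions, the fixed seed, and the choice of the standard library's random module are all assumptions made for this example. It generates 200 rows of 20 Bernoulli(0.3) values and takes row means over the first 5, 10, and 20 columns.

```python
import random
import statistics


def simulate_sample_proportions(p, n_samples, max_size, sizes, seed=0):
    """Mimic the worksheet: n_samples rows of max_size Bernoulli(p) values,
    then the row mean over the first k columns for each k in sizes.

    All names and the seed are hypothetical choices for illustration.
    """
    rng = random.Random(seed)
    rows = [
        [1 if rng.random() < p else 0 for _ in range(max_size)]
        for _ in range(n_samples)
    ]
    # The mean of 0/1 values is the proportion of successes in the sample.
    return {k: [sum(row[:k]) / k for row in rows] for k in sizes}


proportions = simulate_sample_proportions(
    p=0.3, n_samples=200, max_size=20, sizes=(5, 10, 20)
)

for k, props in proportions.items():
    print(
        f"n = {k:2d}: mean of sample proportions = {statistics.mean(props):.3f}, "
        f"SD of sample proportions = {statistics.pstdev(props):.3f}"
    )
```

The spread of the sample proportions should generally shrink as the sample size grows, which is the pattern the histograms are meant to show.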
Draw a histogram of the sample proportions in C23.

Sampling Distribution Model for a Proportion
Our histograms of the sample proportions started to look like a Normal model.
The larger our sample size gets, the better the Normal model works.
Assumptions:
  Independence: sampled values must be independent of each other
  Sample Size: n must be large enough
Conditions to check for assumptions:
  Randomization Condition: Experiments should have treatments randomly assigned. Survey samples should be a simple random sample, or otherwise a representative, unbiased sample.
  10% Condition: Sample size n must be no more than 10% of the population
  Success/Failure Condition: Sample size needs to be large enough to expect at least 10 successes and 10 failures ($np \ge 10$ and $nq \ge 10$)
If the sampled values are independent and the sample size is large enough, the sampling distribution of $\hat{p}$ is modeled by a Normal model with:
  Mean: $\mu(\hat{p}) = p$
  Standard deviation: $SD(\hat{p}) = \sqrt{\frac{pq}{n}}$, where $q = 1 - p$

Example: Proportion of Vegetarians
7% of the US population is estimated to be vegetarian. If a random sample of 200 people resulted in 20 people reporting themselves as vegetarians, is this an unusually high proportion?
Conditions:
  Randomization
  10% Condition
  Success/Failure

Vegetarians Example continued
Since our conditions were met, it’s OK to use a Normal model.
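The Success/Failure check for this example can be sketched in a few lines of Python. This is a minimal sketch: the function name success_failure_check and its minimum parameter are hypothetical, introduced here only for illustration. It computes the expected number of successes and failures for p = 0.07 and n = 200 and compares each with 10.

```python
def success_failure_check(p, n, minimum=10):
    """Return expected successes, expected failures, and whether both
    reach the minimum count (hypothetical helper for illustration)."""
    expected_successes = n * p
    expected_failures = n * (1 - p)
    ok = expected_successes >= minimum and expected_failures >= minimum
    return expected_successes, expected_failures, ok


# Vegetarians example: p = 0.07, n = 200
successes, failures, ok = success_failure_check(p=0.07, n=200)
print(f"expected successes: {successes:.1f}")
print(f"expected failures:  {failures:.1f}")
print(f"Success/Failure Condition met: {ok}")
```

The 10% Condition also holds here, since 200 people is a tiny fraction of the US population. The Randomization Condition rests on the sample being described as random.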
$\hat{p} = 20/200 = 0.10$
$E(\hat{p}) = p = 0.07$
$SD(\hat{p}) = \sqrt{\frac{pq}{n}} = \sqrt{\frac{(0.07)(0.93)}{200}} \approx 0.018$
$z = \frac{0.10 - 0.07}{0.018} \approx 1.67$
This result is within 2 SDs of the mean, so it is not unusual.

68-95-99.7 Rule with Vegetarians
[Figure: Normal curve centered at p with axis marks at -3σ, -2σ, -1σ, p, 1σ, 2σ, 3σ; the bands are labeled 68%, 95%, and 99.7%]

Sampling Distribution of a Mean
Rolling dice simulation (figures from DeVeaux, Intro to Stats):
  1 die: 10,000 individual rolls recorded
  2 dice: rolled 10,000 times, averaging the dice
  3 dice: rolled 10,000 times, averaging the dice
  5 dice: rolled 10,000 times, averaging the dice
  20 dice: rolled 10,000 times, averaging the dice
Once again, as the sample size increases, the Normal model appears.

Central Limit Theorem
The sampling distribution of any mean becomes more nearly Normal as the sample size grows.
The larger the sample, the better the approximation will be.
Observations need to be independent and collected with randomization.

CLT Assumptions
Assumptions:
  Independence: sampled values must be independent
  Sample Size: sample size must be large enough
Conditions:
  Randomization
  10% Condition
  Large enough sample

Which Normal Model to use?
The Normal model depends on a mean and an SD.

Sampling Distribution Model for a Mean
When a random sample is drawn from any population with mean $\mu$ and standard deviation $\sigma$, its sample mean $\bar{y}$ has a sampling distribution with:
  Mean: $\mu$
  Standard deviation: $SD(\bar{y}) = \frac{\sigma}{\sqrt{n}}$

Example: CEO compensation
800 CEOs
  Mean (in thousands) = 10,307.31
  SD (in thousands) = 17,964.62
Samples of size 50 were drawn with:
  Mean = 10,343.93
  SD = 2,483.84
Samples of size 100 were drawn with:
  Mean = 10,329.94
  SD = 1,779.18
According to the CLT, what should the theoretical mean and SD be?
Example from DeVeaux, Intro to Stats
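One way to work the CEO question is to apply the formula for the standard deviation of a sample mean directly. The sketch below is illustrative only: the function name sampling_sd_of_mean and the variable names are assumptions, while the numbers are the ones given in the example. It treats the 800 CEOs as the population, so the theoretical mean is the population mean and the theoretical SD is the population SD divided by the square root of n.

```python
import math


def sampling_sd_of_mean(sigma, n):
    """Theoretical SD of the sample mean: sigma / sqrt(n).
    (Hypothetical helper name, used only for this illustration.)"""
    return sigma / math.sqrt(n)


# Population values from the example (in thousands)
population_mean = 10_307.31
population_sd = 17_964.62

# (sample size, SD of the simulated sample means reported in the example)
reported = ((50, 2_483.84), (100, 1_779.18))

for n, simulated_sd in reported:
    theoretical_sd = sampling_sd_of_mean(population_sd, n)
    print(
        f"n = {n:3d}: theoretical mean = {population_mean:,.2f}, "
        f"theoretical SD = {theoretical_sd:,.2f}, "
        f"simulated SD = {simulated_sd:,.2f}"
    )
```

The simulated values are close to, but not exactly equal to, the theoretical ones, which is what we would expect from a finite number of simulated samples.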