Chapter 18 Sampling distribution models
math2200

Sample proportion
• Kerry vs. Bush in 2004
  – A Gallup Poll
    • 49% for Kerry
  – A Rasmussen Poll
    • 45.9% for Kerry
  – Why are the answers different?
• The sample proportion estimates the population proportion
• There is randomness due to sampling

Modeling the Distribution of Sample Proportions
• Imagine what would happen to the sample proportions if we were to actually draw many samples.
• What would the histogram of all the sample proportions look like?
  – We would expect the histogram of the sample proportions to center at the true proportion, p, in the population
  – The histogram is unimodal, symmetric, and centered at p.
  – A normal model?

Model
• Let X be the number of people voting for Bush in a sample of size n
• Then X has a binomial model, Binomial(n, p)
  – p: the proportion of people for Bush in the entire population
• When n is large, we can use the normal approximation
  – Normal model with mean np and variance npq, where q = 1 − p

Modeling sample proportion
• The sample proportion is X/n
  – Normal model with mean p and variance pq/n, that is, standard deviation sqrt(pq/n):
    $N\left(p, \sqrt{\frac{pq}{n}}\right)$

Example
• Back to Kerry vs. Bush
  – Assume that the population proportion voting for Kerry is 49%
  – The sample proportion for Kerry has a normal model with mean 0.49 and standard deviation sqrt(0.49 × 0.51/1000) ≈ 0.0158 (n = 1000)
  – Then both 49% and 45.9% are plausible sample proportions: 45.9% is about 2 standard deviations below 0.49

Conditions
• The normal model is an approximation to the exact model
  – Use it only when n is large
  – For example, if n = 2, then X/n can only be 0, 0.5, or 1
1. Randomization Condition: The sample should be a simple random sample of the population.
2. 10% Condition: If sampling is done without replacement, then the sample size, n, must be no larger than 10% of the population.
3. Success/Failure Condition: The sample size has to be big enough so that both np and nq are greater than 10.

A Sampling Distribution Model for a Proportion
• Before we observe the value of the sample proportion, it is a random variable that has a distribution due to sampling variation.
  – This distribution is called the sampling distribution model for sample proportions.
  – We never actually take repeated samples from the same population and make a histogram. We only imagine or simulate them.
  – Still, sampling distribution models are important because
    • they act as a bridge from the real world of data to the imaginary model of the statistic and
    • enable us to say something about the population when all we have is data from the real world.

An example
• 13% of the population is left-handed.
• A 200-seat school auditorium was built with 15 “leftie seats”
• In a class of n = 90 students, what’s the probability that there will NOT be enough seats for the left-handed students?
• Let X be the number of left-handed students in the class
• We want to find P(X > 15) = P(X/n > 0.167)
• Check the conditions
  – n is large enough
  – Randomization
  – 10% condition
    • The population should have at least 900 students
  – Success/failure condition
    • np = 11.7 > 10, nq = 78.3 > 10
• Normal model for X/n
  – Mean = p = 0.13
  – SD = sqrt(pq/n) = sqrt(0.13 × 0.87/90) ≈ 0.035
• P(X/n > 0.167) = P(Z > (0.167 − 0.13)/0.035) ≈ P(Z > 1.06) = 0.1446

Sample Mean
• The distribution of sample means tends toward a normal model when n is large

Central limit theorem (CLT)
• If the observations are drawn
  – independently
  – from the same population (distribution)
  then the sampling distribution of the sample mean becomes approximately normal as the sample size increases.
• We do not need to know the population distribution.

CLT
• Suppose the population distribution has mean μ and standard deviation σ
• The sample mean has mean μ and standard deviation σ/sqrt(n)
• Let X1, …, Xn be n independent and identically distributed random variables
  – E(X1) = μ
  – Var(X1) = σ²
• Then as n increases, the distribution of (X1 + … + Xn)/n tends to a normal model with mean μ and standard deviation σ/sqrt(n)

The Fundamental Theorem of Statistics
The Central Limit Theorem (CLT)
The mean of a random sample has a sampling distribution whose shape can be approximated by a Normal model. The larger the sample, the better the approximation will be.
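The CLT statement above can be explored by simulation. The following is a minimal sketch, not part of the original slides: it assumes an Exponential(1) population (strongly right-skewed, with μ = 1 and σ = 1), and the sample size of 30, the 20,000 repetitions, and the helper name `simulate_sample_means` are all illustrative choices.

```python
import math
import random
import statistics


def simulate_sample_means(n, reps, seed=0):
    """Draw `reps` samples of size `n` from an Exponential(1) population
    (an assumed, illustrative population) and return the sample means."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]


# Exponential(1) has mean mu = 1 and standard deviation sigma = 1.
mu, sigma = 1.0, 1.0
n, reps = 30, 20000

means = simulate_sample_means(n, reps)
clt_sd = sigma / math.sqrt(n)  # standard deviation the CLT predicts

print(f"mean of sample means: {statistics.fmean(means):.4f} (CLT: {mu})")
print(f"sd of sample means:   {statistics.stdev(means):.4f} (CLT: {clt_sd:.4f})")

# Fraction of sample means within one predicted sd of mu;
# a Normal model would put roughly 68% of them there.
within_1sd = sum(abs(m - mu) <= clt_sd for m in means) / reps
print(f"fraction within 1 sd: {within_1sd:.3f}")
```

Changing `n` to a small value such as 2 should make the agreement with the Normal model visibly worse for this skewed population, which is the behavior the theorem's "larger sample, better approximation" clause describes.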
Example
• Suppose the population distribution of adult weights has mean 175 pounds and sd 25 pounds
  – the shape is unknown
• An elevator has a weight limit of 10 persons or 2000 pounds
• What’s the probability that the 10 people who get on the elevator overload its weight limit?
• Let Xi, i = 1, 2, …, 10, be the weight of the ith person in the elevator
• Then we want to know
  $P(X_1 + \dots + X_{10} > 2000) = P(\bar{X} > 200)$
• From the CLT (check the requirement first), we know the distribution of $\bar{X}$ is approximately normal with mean 175 pounds and standard deviation $25/\sqrt{10} \approx 7.91$ pounds
• Then
  $P(\bar{X} > 200) = P\left(Z > \frac{200 - 175}{7.91}\right) \approx P(Z > 3.16) \approx 0.0008$

Standard error
• Using the CLT, we know the distribution of the sample proportion is approximately
  $N\left(p, \sqrt{\frac{pq}{n}}\right)$
• However, we do not know p in practice.
• Using the CLT, we know the distribution of the sample mean is approximately
  $N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$
• However, we do not know μ and σ.

Standard Error
• When we don’t know p or σ, we’re stuck, right?
• Nope. We will use sample statistics to estimate these population parameters.
• Whenever we estimate the standard deviation of a sampling distribution, we call it a standard error.

Standard Error (cont.)
• For a sample proportion, the standard error is
  $SE(\hat{p}) = \sqrt{\frac{\hat{p}\hat{q}}{n}}$
• For the sample mean, the standard error is
  $SE(\bar{y}) = \frac{s}{\sqrt{n}}$

The Process Going Into the Sampling Distribution Model

What Can Go Wrong?
• Don’t confuse the sampling distribution with the distribution of the sample.
  – When you take a sample, you look at the distribution of the values, usually with a histogram, and you may calculate summary statistics.
  – The sampling distribution is an imaginary collection of the values that a statistic might have taken for all random samples: the one you got and the ones you didn’t get.

What Can Go Wrong? (cont.)
• Beware of observations that are not independent.
  – The CLT depends crucially on the assumption of independence.
  – You can’t check this with your data; you have to think about how the data were gathered.
• Watch out for small samples from skewed populations.
  – The more skewed the distribution, the larger the sample size we need for the CLT to work.
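The elevator calculation and the two standard error formulas above can be sketched in a few lines. This is an illustration under stated assumptions rather than part of the original slides: the helper names `se_proportion` and `se_mean` and the five sample weights are hypothetical, and the elevator probability relies on the Normal model being adequate for n = 10, which the slides ask you to check first.

```python
import math
from statistics import NormalDist, stdev

# Elevator example: population mean 175 lb, sd 25 lb, n = 10 riders.
mu, sigma, n, limit = 175.0, 25.0, 10, 2000.0

# Total weight > 2000 is the same event as sample mean > 2000 / 10 = 200.
mean_threshold = limit / n
sd_of_mean = sigma / math.sqrt(n)           # sigma / sqrt(n)
z = (mean_threshold - mu) / sd_of_mean      # standardized threshold
p_overload = 1.0 - NormalDist().cdf(z)      # upper-tail Normal probability

print(f"sd of sample mean: {sd_of_mean:.2f} lb, z = {z:.2f}")
print(f"P(overload) under the Normal model: {p_overload:.5f}")


def se_proportion(p_hat, n):
    """Standard error of a sample proportion: sqrt(p_hat * q_hat / n)."""
    return math.sqrt(p_hat * (1.0 - p_hat) / n)


def se_mean(sample):
    """Standard error of a sample mean: s / sqrt(n), with s the sample sd."""
    return stdev(sample) / math.sqrt(len(sample))


# Poll from the chapter: 49% for Kerry, with n = 1000 as assumed earlier.
print(f"SE(p-hat) = {se_proportion(0.49, 1000):.4f}")

# Hypothetical weights (lb), made up purely to exercise se_mean.
weights = [170, 180, 160, 190, 175]
print(f"SE(y-bar) = {se_mean(weights):.2f}")
```

Note that `stdev` divides by n − 1, which matches the usual definition of the sample standard deviation s in the standard error of the mean.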
Summary
• Sample proportions and sample means are statistics
  – They are random because samples vary
  – Their distributions can be approximated by a normal model using the CLT
• Be aware of when the CLT can be used
  – n is large
  – If the population distribution is not symmetric (skewed), a much larger n is needed
• The CLT is about the distribution of the sample mean, not the distribution of the sample itself
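As a numerical check on the Normal approximation for proportions, the left-handed seats example from earlier in the chapter (n = 90, p = 0.13, 15 seats) can be computed both ways. This is a hedged sketch, not the slides' own method: the exact binomial sum is added here for comparison, and the variable names are illustrative. The Normal figure printed here uses unrounded inputs, so it may differ slightly from the 0.1446 quoted in the example, which comes from rounding to 0.167 and 0.035.

```python
import math
from statistics import NormalDist

n, p, seats = 90, 0.13, 15
q = 1.0 - p

# Success/failure condition from the chapter: np and nq both above 10.
assert n * p > 10 and n * q > 10

# Exact binomial model: P(X > 15) = 1 - P(X <= 15).
p_exact = 1.0 - sum(
    math.comb(n, k) * p**k * q**(n - k) for k in range(seats + 1)
)

# Normal model for the sample proportion X/n: mean p, sd sqrt(pq/n).
sd = math.sqrt(p * q / n)
p_normal = 1.0 - NormalDist(mu=p, sigma=sd).cdf(seats / n)

print(f"sd of X/n:            {sd:.4f}")
print(f"exact binomial P:     {p_exact:.4f}")
print(f"Normal approximation: {p_normal:.4f}")
```

Printing the two probabilities side by side gives a rough sense of how much accuracy is traded away by replacing the binomial model with the Normal one at this sample size.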