Chapter 18 - WordPress.com

Chapter 18
Sampling distribution models
math2200
Sample proportion
• Kerry vs. Bush in 2004
– A Gallup Poll
• 49% for Kerry
– A Rasmussen Poll
• 45.9% for Kerry
– Why are the answers different?
• Sample proportion estimates population proportion
• There is randomness due to sampling
Modeling the Distribution of
Sample Proportions
• Imagine what would happen to the sample
proportions if we were to actually draw many
samples.
• What would the histogram of all the sample
proportions look like?
– We expect the histogram of the sample proportions to center at
the true proportion, p, in the population
– The histogram is unimodal, symmetric, and centered
at p.
– A normal model?
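The thought experiment above can be tried numerically. The following is a minimal sketch, not part of the original slides: it assumes a true proportion of 0.49 and samples of 1000 voters (illustrative values), and the function name `simulate_sample_proportions` is hypothetical.

```python
import random
import statistics

def simulate_sample_proportions(p, n, num_samples, seed=0):
    """Draw num_samples samples of size n; return each sample's proportion of successes."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(num_samples):
        # Each individual is a "success" with probability p.
        successes = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(successes / n)
    return proportions

# Assumed, illustrative settings: p = 0.49, n = 1000, 1000 repeated samples.
props = simulate_sample_proportions(p=0.49, n=1000, num_samples=1000)
print("mean of sample proportions:", round(statistics.mean(props), 4))
print("sd of sample proportions:  ", round(statistics.stdev(props), 4))
```

A histogram of `props` should look roughly unimodal and symmetric around p, which is what motivates trying a normal model.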
Model
• Let X be the number of people voting for
Bush in a sample of size n
• Then X has a binomial model,
Binomial(n,p)
– p: the proportion of people for Bush in the
entire population
• When n is large, we can use a normal
approximation
– Normal model with mean np and variance npq, where q = 1 − p
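To see how close the normal approximation is to the binomial model, one can compare the two directly. This is a sketch under assumed values (n = 1000, p = 0.49, and a cutoff of 500 chosen only for illustration); the helper names `binomial_cdf` and `normal_cdf` are hypothetical.

```python
import math

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def normal_cdf(x, mean, sd):
    """P(Y <= x) for Y ~ Normal(mean, sd), computed with the error function."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

n, p = 1000, 0.49            # assumed, illustrative values
q = 1 - p
mean = n * p                 # mean of X
sd = math.sqrt(n * p * q)    # standard deviation of X (square root of the variance npq)

k = 500                      # illustrative cutoff
exact = binomial_cdf(k, n, p)
approx = normal_cdf(k, mean, sd)
print("exact binomial P(X <= 500):", round(exact, 4))
print("normal approximation:      ", round(approx, 4))
```

The two values should agree to within a few hundredths for an n this large; the gap tends to widen as n shrinks.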
Modeling sample proportion
• Sample proportion is X/n
– Normal model with mean p and variance pq/n, that is, standard deviation sqrt(pq/n):
$$N\left(p,\ \sqrt{\frac{pq}{n}}\right)$$
Example
• Back to Kerry vs. Bush
– Assume that the population proportion voting
for Kerry is 49%
– X/n has a normal model with mean 0.49 and
standard deviation 0.0158 (n=1000)
– Then both 49% and 45.9% are plausible sample results:
45.9% is about 1.96 standard deviations below 0.49
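The numbers in this example can be reproduced with a few lines. This is a minimal sketch using only the values stated above (p = 0.49, n = 1000, and the two poll results); the variable names are illustrative.

```python
import math

p = 0.49    # assumed population proportion for Kerry (from the example)
n = 1000    # sample size used in the example
q = 1 - p

# Standard deviation of the sample proportion X/n under the normal model.
sd = math.sqrt(p * q / n)

# How many standard deviations each poll result lies from p.
z_gallup = (0.49 - p) / sd
z_rasmussen = (0.459 - p) / sd

print("sd of X/n:", round(sd, 4))
print("z for 49%:", round(z_gallup, 2))
print("z for 45.9%:", round(z_rasmussen, 2))
```

A result within about two standard deviations of the mean is not unusual under a normal model, which is why both polls are compatible with the same population proportion.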
Conditions
• The normal model is an approximation to the exact model
– Use it only when n is large
– For example, if n=2, then X/n = 0, 0.5 or 1
1. Randomization Condition: The sample should be a
simple random sample of the population.
2. 10% Condition: If sampling has not been made with
replacement, then the sample size, n, must be no
larger than 10% of the population.
3. Success/Failure Condition: The sample size has to be
big enough so that both np and nq are greater than
10.
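The two numeric conditions can be checked mechanically. The sketch below is illustrative only: the function name `check_proportion_conditions` and its parameters are hypothetical. The Randomization Condition cannot be verified by arithmetic, since it depends on how the sample was collected.

```python
def check_proportion_conditions(n, p, population_size=None, threshold=10):
    """Check the 10% and Success/Failure conditions for a sample proportion.

    n: sample size; p: assumed population proportion;
    population_size: optional, needed only for the 10% condition.
    The Randomization Condition is NOT checked here.
    """
    q = 1 - p
    results = {
        # Both expected successes (np) and failures (nq) must exceed the threshold.
        "success_failure": n * p > threshold and n * q > threshold,
    }
    if population_size is not None:
        # Sample must be no larger than 10% of the population.
        results["ten_percent"] = n <= 0.10 * population_size
    return results

# Illustrative calls with assumed values.
print(check_proportion_conditions(n=90, p=0.13, population_size=1000))
print(check_proportion_conditions(n=2, p=0.5))
```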
A Sampling Distribution Model
for a Proportion
• Before we observe the value of the sample
proportion, it is a random variable that has a
distribution due to sampling variations.
– This distribution is called the sampling distribution model
for sample proportions.
– We never actually take repeated samples from the same
population and make a histogram. We only imagine or
simulate them.
– Still, sampling distribution models are important because
• they act as a bridge from the real world of data to the imaginary
model of the statistic and
• enable us to say something about the population when all we have
is data from the real world.
An example
• 13% of the population is left-handed.
• A 200-seat school auditorium was built
with 15 “leftie seats”
• In a class of n=90 students, what’s the
probability that there will NOT be enough
seats for the left-handed students?
• Let X be the number of left-handed
students in the class
• We want to find P(X>15) = P(X/n>0.167)
• Check the conditions
– n is large enough
– randomization
– 10% condition
• The population should have at least 900
students
– Success/failure condition
• np=11.7>10, nq=78.3>10
• Normal model for X/n
– Mean = 0.13
– Sd = sqrt(pq/n) = 0.035
• z = (0.167 − 0.13)/0.035 ≈ 1.06
• P(X/n>0.167) = P(Z>1.06) = 0.1446
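The left-handed seating calculation can be sketched in code. It uses the values from the example (n = 90, p = 0.13, 15 seats) and also computes the exact binomial probability for comparison, which the slides do not do. Because the code keeps unrounded inputs, the normal-model answer comes out near 0.15 rather than the 0.1446 obtained from the rounded values 0.167 and 0.035.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, p, seats = 90, 0.13, 15
q = 1 - p

# Normal model for the sample proportion X/n.
sd = math.sqrt(p * q / n)
z = (seats / n - p) / sd
prob_normal = 1 - normal_cdf(z)

# Exact binomial probability P(X > 15) = 1 - P(X <= 15), for comparison.
prob_exact = 1 - sum(math.comb(n, k) * p**k * q**(n - k) for k in range(seats + 1))

print("sd of X/n:", round(sd, 4))
print("z:", round(z, 3))
print("normal model P(X/n > 15/90):", round(prob_normal, 4))
print("exact binomial P(X > 15):   ", round(prob_exact, 4))
```

The two probabilities need not match closely here, since np = 11.7 only just satisfies the Success/Failure Condition.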
Sample Mean
• The distribution of the sample mean tends toward a
normal model when n is large
Central limit theorem (CLT)
• If the observations are drawn
– independently
– from the same population (distribution)
the sampling distribution of the sample
mean becomes normal as the sample size
increases.
• We do not need to know the population
distribution.
CLT
• Suppose the population distribution has mean
μ and standard deviation σ
• The sample mean has mean μ and standard
deviation σ/sqrt(n)
• Let X1, …, Xn be n independently and identically
distributed random variables
– E(X1) = μ
– Var(X1) = σ²
• Then as n increases, the distribution of
(X1+…+Xn)/n tends to a normal model with mean
μ and standard deviation σ/sqrt(n)
The Fundamental Theorem of
Statistics
The Central Limit Theorem (CLT)
The mean of a random sample has a
sampling distribution whose shape can be
approximated by a Normal model. The
larger the sample, the better the
approximation will be.
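A small simulation can illustrate the CLT. The sketch below is an assumption-laden example, not from the slides: it uses an exponential population with rate 1 (so μ = 1 and σ = 1), samples of size 50, and 5000 repetitions. The function name `simulate_sample_means` is hypothetical.

```python
import math
import random
import statistics

def simulate_sample_means(n, reps, seed=0):
    """Return reps sample means, each from n independent Exponential(1) draws."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]

n, reps = 50, 5000           # assumed, illustrative settings
mu, sigma = 1.0, 1.0         # mean and sd of the Exponential(1) population

means = simulate_sample_means(n, reps)
print("mean of sample means:", round(statistics.mean(means), 4), "(theory:", mu, ")")
print("sd of sample means:  ", round(statistics.stdev(means), 4),
      "(theory:", round(sigma / math.sqrt(n), 4), ")")
```

Even though the exponential population is strongly right-skewed, the simulated sample means should cluster around μ with spread close to σ/sqrt(n).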
Example
• Suppose the population distribution of
adult weights has mean 175 pounds and
sd 25 pounds
– the shape is unknown
• An elevator has a weight limit of 10
persons or 2000 pounds
• What’s the probability that the 10 people
who get on the elevator overload its weight
limit?
• Let Xi,i=1,2,…,10 be the weight of the ith
person in the elevator
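• Then we want to know
P(X1+…+X10 > 2000) = P((X1+…+X10)/10 > 200)
• From the CLT (check the requirement first),
we know the distribution of (X1+…+X10)/10
is approximately normal
with mean 175 pounds and standard
deviation 25/sqrt(10) ≈ 7.91 pounds
• Then P((X1+…+X10)/10 > 200) = P(Z > (200 − 175)/7.91) = P(Z > 3.16) ≈ 0.0008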
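The elevator calculation can be sketched as follows, using the values from the example (mean 175 pounds, sd 25 pounds, 10 people, 2000-pound limit). It assumes the CLT approximation is adequate for n = 10, which the slide asks the reader to check.

```python
import math

mu, sigma = 175, 25      # population mean and sd of adult weights (pounds)
n, limit = 10, 2000      # number of riders and total weight limit (pounds)

# Total weight > 2000 is the same event as mean weight > 200.
mean_limit = limit / n

# Standard deviation of the sample mean, from the CLT.
sd_mean = sigma / math.sqrt(n)

# Standardize and take the upper-tail probability of the standard normal.
z = (mean_limit - mu) / sd_mean
prob_overload = 0.5 * math.erfc(z / math.sqrt(2))

print("sd of the sample mean:", round(sd_mean, 3))
print("z:", round(z, 3))
print("P(overload):", round(prob_overload, 5))
```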
Standard error
• Using the CLT, we know the distribution of
sample proportion is
$$N\left(p,\ \sqrt{\frac{pq}{n}}\right)$$
• However, we do not know p in practice.
• Using the CLT, we know the distribution of
sample mean is
$$N\left(\mu,\ \frac{\sigma}{\sqrt{n}}\right)$$
• However, we do not know μ and σ
Standard Error
• When we don’t know p or σ, we’re stuck,
right?
• Nope. We will use sample statistics to
estimate these population parameters.
• Whenever we estimate the standard
deviation of a sampling distribution, we call
it a standard error.
Standard Error (cont.)
• For a sample proportion, the standard
error is
$$SE(\hat{p}) = \sqrt{\frac{\hat{p}\hat{q}}{n}}$$
• For the sample mean, the standard error is
$$SE(\bar{y}) = \frac{s}{\sqrt{n}}$$
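Both standard errors are straightforward to compute from sample data. In this minimal sketch the function names `se_proportion` and `se_mean` are hypothetical, and the five weights are made-up illustrative data, not from the slides.

```python
import math
import statistics

def se_proportion(p_hat, n):
    """Standard error of a sample proportion: sqrt(p_hat * q_hat / n)."""
    q_hat = 1 - p_hat
    return math.sqrt(p_hat * q_hat / n)

def se_mean(values):
    """Standard error of a sample mean: s / sqrt(n), with s the sample sd."""
    return statistics.stdev(values) / math.sqrt(len(values))

# Sample proportion from the Rasmussen poll, with an assumed n of 1000.
print("SE(p_hat):", round(se_proportion(0.459, 1000), 4))

# Made-up weights (pounds), purely for illustration.
weights = [170, 180, 160, 190, 175]
print("SE(y_bar):", round(se_mean(weights), 3))
```

Note that `statistics.stdev` uses the n − 1 denominator, which matches the usual definition of the sample standard deviation s.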
The Process Going Into the
Sampling Distribution Model
What Can Go Wrong?
• Don’t confuse the sampling distribution
with the distribution of the sample.
– When you take a sample, you look at the
distribution of the values, usually with a
histogram, and you may calculate summary
statistics.
– The sampling distribution is an imaginary
collection of the values that a statistic might
have taken for all random samples—the one
you got and the ones you didn’t get.
What Can Go Wrong? (cont.)
• Beware of observations that are not
independent.
– The CLT depends crucially on the assumption
of independence.
– You can’t check this with your data—you have
to think about how the data were gathered.
• Watch out for small samples from skewed
populations.
– The more skewed the distribution, the larger
the sample size we need for the CLT to work.
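The effect of skewness on small samples can be illustrated by simulation. This sketch assumes an exponential population (strongly right-skewed) and compares the skewness of simulated sample means for n = 2 and n = 50; the function names and settings are hypothetical choices for illustration.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: average cubed deviation divided by the cubed sd."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)

def sample_means(n, reps, rng):
    """Return reps sample means, each from n independent Exponential(1) draws."""
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]

rng = random.Random(1)
skew_small = skewness(sample_means(n=2, reps=4000, rng=rng))
skew_large = skewness(sample_means(n=50, reps=4000, rng=rng))

print("skewness of sample means, n=2: ", round(skew_small, 2))
print("skewness of sample means, n=50:", round(skew_large, 2))
```

A skewness near zero indicates a roughly symmetric sampling distribution; the n = 2 means should remain visibly skewed, while the n = 50 means should be much closer to symmetric.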
Summary
• Sample proportions or sample means are
statistics
– They are random because samples vary
– Their distribution can be approximated by a normal model
using the CLT
• Be aware of when the CLT can be used
– n is large
– If the population distribution is not symmetric, a much
larger n is needed
• The CLT is about the distribution of the sample
mean, not the sample itself