Inference with Proportions

Inference with Proportions I
Sampling Distributions
and
Confidence Intervals
Parameter
• A number that describes the
population
• Symbols we will use for parameters
include
μ – mean
σ – standard deviation
p – proportion
Statistic
• A number that can be computed from
sample data
• Some statistics we will use include
x̄ – sample mean
s – standard deviation
p̂ – sample proportion
• The observed value of the statistic
depends on the particular sample selected
from the population and it will vary from
sample to sample.
This variability is called sampling variability.
Let’s explore what happens with distributions
of sample proportions (p̂). Have students
perform the following experiment.
• Toss a penny 20 times and record the number
of heads.
• Calculate the proportion of heads and mark it on
the dot plot on the board.
This is a statistic! In this case, we will use
$\hat{p} = \dfrac{\text{the number of successes in the sample}}{n}$
The dotplot is a partial graph of the sampling
distribution of all sample proportions of sample size 20.
What would happen to the dotplot if we flipped the
penny 50 times and recorded the proportion of heads?
What shape do you think the dot plot will
have?
Sampling Distribution of p̂
The distribution that would be formed by
considering the value of a sample statistic for
every possible different sample of a given size
from a population.
Suppose we have a population of six students:
Alice, Ben, Charles, Denise, Edward, & Frank
We will keep the population small so that we can
find ALL the possible samples of a given size.
We are interested in the proportion of females.
This is called the parameter of interest.
What is the proportion of females?
p = 1/3
Let’s select samples of two from this population.
How many different samples are possible?
6C2 = 15
Find the 15 different samples that are possible
and find the sample proportion of the number
of females in each sample.
Sample              p̂
Alice & Ben         .5
Alice & Charles     .5
Alice & Denise      1
Alice & Edward      .5
Alice & Frank       .5
Ben & Charles       0
Ben & Denise        .5
Ben & Edward        0
Ben & Frank         0
Charles & Denise    .5
Charles & Edward    0
Charles & Frank     0
Denise & Edward     .5
Denise & Frank      .5
Edward & Frank      0
How does the mean of the sampling distribution
compare to the population parameter (p)?
Find the mean and standard deviation of these
sample proportions.
$\mu_{\hat{p}} = \dfrac{1}{3}$ and $\sigma_{\hat{p}} = 0.29814$
Six Students Continued . . .
Let’s select samples of three from this
population.
How many different samples are possible?
6C3 = 20
Find the mean and standard deviation of these
sample proportions.
$\mu_{\hat{p}} = \dfrac{1}{3}$ and $\sigma_{\hat{p}} = 0.2108$
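The two enumerations above can be checked by brute force. The sketch below is only an illustration: the dictionary coding (1 for female, 0 for male) and the function name `sampling_distribution` are assumptions made here, not part of the original slides.

```python
from itertools import combinations
from statistics import mean, pstdev

# Six-student population from the slides: 1 marks a female, 0 a male.
population = {"Alice": 1, "Ben": 0, "Charles": 0,
              "Denise": 1, "Edward": 0, "Frank": 0}

def sampling_distribution(sample_size):
    """Sample proportion of females for every possible sample of this size."""
    return [
        sum(population[name] for name in sample) / sample_size
        for sample in combinations(population, sample_size)
    ]

for n in (2, 3):
    p_hats = sampling_distribution(n)
    # pstdev (population standard deviation) is used because the list holds
    # every possible sample, so it is the full sampling distribution.
    print(n, len(p_hats), round(mean(p_hats), 4), round(pstdev(p_hats), 5))
```

For both sample sizes the mean of the sample proportions should equal the population proportion 1/3, while the standard deviation shrinks as the sample size grows.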
General Properties for Sampling
Distributions of p̂
Rule 1: $\mu_{\hat{p}} = p$
Rule 2: $\sigma_{\hat{p}} = \sqrt{\dfrac{p(1-p)}{n}}$
Let’s verify Rule 2.
Does the formula equal the standard deviation
for samples of size 2 ($\sigma_{\hat{p}} = .29814$)?
NO -
$\sqrt{\dfrac{\frac{1}{3}\left(\frac{2}{3}\right)}{2}} \approx 0.3333 \ne 0.29814$
WHY?
We are sampling more than 10% of our population!
Correction factor – multiply by $\sqrt{\dfrac{N-n}{N-1}}$
If we use the correction factor, we will see that
we are correct.
$\sigma_{\hat{p}} = \sqrt{\dfrac{\frac{1}{3}\left(\frac{2}{3}\right)}{2}} \cdot \sqrt{\dfrac{6-2}{6-1}} = 0.29814$
So in order to use the formula $\sigma_{\hat{p}} = \sqrt{\dfrac{p(1-p)}{n}}$ to
calculate the standard deviation of the sampling distribution,
we MUST be sure that our sample size is less than 10% of the
population!
General Properties for Sampling
Distributions of p̂
Rule 1: $\mu_{\hat{p}} = p$
Rule 2: $\sigma_{\hat{p}} = \sqrt{\dfrac{p(1-p)}{n}}$
This rule is exact if the population is infinite, and is
approximately correct if the population is finite and
no more than 10% of the population is included in
the sample.
Chip Activity:
• Select three samples of size 5, 10, and
20 and record the number of blue chips.
• Place your proportions on the
appropriate dotplots.
What do you notice about
these distributions?
In the fall of 2008, there were 18,516 students
enrolled at California Polytechnic State University,
San Luis Obispo. Of these students, 8091 (43.7%)
were female. We will use a statistical software
package to simulate sampling from this Cal Poly
population.
We will generate 500 samples of each of the
following sample sizes: n = 10, n = 25, n = 50, n = 100
and compute the proportion of females for each
sample.
The following histograms display
the distributions of the sample
proportions for the 500 samples of
each sample size.
What do you notice about the shape of
these distributions?
What do you notice about the standard deviation of
these distributions?
Are these histograms centered around the true
proportion p = .437?
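The slides do not name the statistical software used. As a stand-in, the simulation can be sketched with the Python standard library; the seed, the 0/1 coding of the population, and the names `simulate` and `results` are assumptions made for this illustration.

```python
import random
from statistics import mean, pstdev

random.seed(2008)  # arbitrary seed so the run is repeatable

# Cal Poly population from the slides: 8091 females out of 18,516 students.
population = [1] * 8091 + [0] * (18516 - 8091)

def simulate(n, reps=500):
    """Draw `reps` samples of size n without replacement; return the p-hats."""
    return [sum(random.sample(population, n)) / n for _ in range(reps)]

results = {}
for n in (10, 25, 50, 100):
    p_hats = simulate(n)
    results[n] = (mean(p_hats), pstdev(p_hats))
    print(n, round(results[n][0], 3), round(results[n][1], 3))
```

Each line of output gives the sample size, the mean of the 500 sample proportions, and their standard deviation, which is the numerical counterpart of the center and spread of each histogram.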
The development of viral hepatitis after a blood
transfusion can cause serious complications for
a patient. The article “Lack of Awareness
Results in Poor Autologous Blood Transfusions” (Health
Care Management, May 15, 2003) reported that
hepatitis occurs in 7% of patients who receive blood
transfusions during heart surgery. We will simulate
sampling from a population of blood recipients.
We will generate 500 samples of each of the following
sample sizes: n = 10, n = 25, n = 50, n = 100 and
compute the proportion of people who contract
hepatitis for each sample.
The following histograms display the distributions of
the sample proportions for the 500 samples of each
sample size.
Are these histograms centered around the true
proportion p = .07?
What happens to the shape of these histograms
as the sample size increases?
General Properties Continued . . .
Rule 3: When n is large and p is not too
near 0 or 1, the sampling distribution of p̂
is approximately normal.
The farther the value of p is from 0.5, the larger n must
be for the sampling distribution of p̂ to be approximately
normal.
A conservative rule of thumb:
If np > 10 and n(1 – p) > 10, then a normal
distribution provides a reasonable
approximation to the sampling distribution of p̂.
Why does np > 10 ensure an approximate normal
distribution?
In a binomial distribution, we will investigate what happens
to the probability histogram as the sample size increases.
Suppose n = 10 and p = 0.1.
Then let n = 20, 30, 40, 50, 60, 70, 80, 90, and 100, keeping p = 0.1.
What does np equal?
Let’s draw a normal curve over the histogram.
Why do we need to also check n(1 – p)?
Consider what the histogram looks like
when n = 10 and p = .9.
We must also check that the upper tail
will spread out into an approximate normal
curve.
Here’s an algebraic proof . . .
If a binomial distribution can be approximated by a
normal curve, then the minimum and maximum values of 0
and n MUST lie within 3 standard deviations of the
mean.
Recall: $\mu = np$ and $\sigma = \sqrt{np(1-p)}$
Therefore:
$0 \le np - 3\sqrt{np(1-p)}$ AND $np + 3\sqrt{np(1-p)} \le n$
Let’s simplify this inequality.
$3\sqrt{np(1-p)} \le np$ AND $3\sqrt{np(1-p)} \le n(1-p)$
Square both sides.
$9np(1-p) \le n^2p^2$ AND $9np(1-p) \le n^2(1-p)^2$
Divide both sides by np (left inequality) and by n(1 – p) (right inequality).
$9(1-p) \le np$ AND $9p \le n(1-p)$
Since 0 < p < 1, we can substitute the values 0 and 1 into these
inequalities to find the largest value needed to be
within 3 standard deviations of the mean.
Simplifying this inequality gives us the following:
$9 \le np$ AND $9 \le n(1-p)$
The conservative approach uses 10 instead of 9.
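The starting condition of the proof can be tested directly for particular values of n and p. The sketch below is illustrative only; the function name `within_three_sd` is an assumption made here.

```python
from math import sqrt

def within_three_sd(n, p):
    """True when both 0 and n lie within 3 standard deviations of the
    binomial mean np, the condition the proof starts from."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    return 0 <= mu - 3 * sigma and mu + 3 * sigma <= n

# Cases discussed in the slides.
for n, p in [(10, 0.1), (100, 0.1), (10, 0.9)]:
    print(n, p, "np =", n * p, "n(1-p) =", n * (1 - p), within_three_sd(n, p))
```

With n = 10 the lower tail (p = 0.1) or the upper tail (p = 0.9) is cut off, so the condition fails; with n = 100 and p = 0.1 both ends fit.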
Blood Transfusions Revisited . . .
Let p = proportion of patients who contract
hepatitis after a blood transfusion
p = .07
Suppose a new blood screening procedure is
believed to reduce the incidence rate of hepatitis.
Blood screened using this procedure is given to
n = 200 blood recipients. Only 6 of the 200
patients contract hepatitis.
p̂ = 6/200 = .03
Does this result
indicate that the true proportion of patients who
contract hepatitis when the new screening is
used is less than 7%?
To answer this question, we must consider the sampling
distribution of p̂.
Blood Transfusions Revisited . . .
Let p = .07
p̂ = 6/200 = .03
Is the sampling distribution approximately
normal?
np = 200(.07) = 14 > 10
n(1 – p) = 200(.93) = 186 > 10
Yes, we can use a normal approximation.
What is the mean and standard deviation of the
sampling distribution?
$\mu_{\hat{p}} = .07$
$\sigma_{\hat{p}} = \sqrt{\dfrac{.07(.93)}{200}} = .018$
Blood Transfusions Revisited . . .
Let p = .07
p̂ = 6/200 = .03
$\mu_{\hat{p}} = .07$
$\sigma_{\hat{p}} = \sqrt{\dfrac{.07(.93)}{200}} = .018$
Does this result indicate that the true
proportion of patients who contract hepatitis
when the new screening is used is less than 7%?
P(p̂ < .03) = Normalcdf(-10^99, .03, .07, .018) = .0132
This small probability tells us that it is
unlikely that a sample proportion of .03 or
smaller would be observed.
This new screening procedure appears to yield a smaller
incidence rate for hepatitis.
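The calculator command has a close analogue in Python's standard library. This sketch assumes `statistics.NormalDist` as a replacement for the calculator's Normalcdf; note that it uses the unrounded standard deviation, so the result differs slightly from the slide's .0132.

```python
from math import sqrt
from statistics import NormalDist

p = 0.07          # hypothesized population proportion
n = 200           # number of blood recipients
p_hat = 6 / 200   # observed sample proportion

# Standard deviation of the sampling distribution of p-hat.
sigma = sqrt(p * (1 - p) / n)

# P(p-hat < .03) under the normal approximation.
prob = NormalDist(mu=p, sigma=sigma).cdf(p_hat)
print(round(sigma, 3), round(prob, 4))
```

The probability comes out at roughly 0.013, in line with the value on the slide.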
Confidence Intervals
Suppose we wanted to estimate the
proportion of blue candies in a VERY large
bowl.
How might we go about estimating this
proportion?
We could take a sample of candies and
compute the proportion of blue candies
in our sample.
We would have a sample proportion or a
statistic – a single value for the estimate.
Point Estimate
• A single number (a statistic) based
on sample data that is used to
estimate a population characteristic
• But not always close to the
population characteristic due to
sampling variation
“Point” refers to the single value on a number line.
Different samples may produce different statistics.
Suppose we wanted to estimate the
proportion of blue candies in a VERY large
bowl.
We could take a sample of candies and
compute the proportion of blue candies in
our sample.
How much confidence do you have in
your point estimate?
Would you have more confidence if the
answer were an interval?
Confidence intervals
A confidence interval (CI) for a population
characteristic is an interval of plausible values
for the characteristic.
It is constructed so that, with a chosen degree
of confidence, the actual value of the
characteristic will be between the lower and
upper endpoints of the interval.
The primary goal of a confidence interval
is to estimate an unknown population
characteristic.
Rate your confidence (%)
0 – 100%
How confident are you that you
can ...
What does it mean to be within 10 years?
Guess my age within 10 years?
. . . within 5 years?
. . . within 1 year?
What happened to
your level of
confidence as the
interval became
smaller?
Perform CI Activity . . .
Question for after the activity:
• What proportion of all possible CI’s contain
the true proportion p?
• This is called the confidence level.
Let’s develop the equation for the
large-sample confidence interval.
To begin, we will use a 95% confidence level.
Use the table of standard normal curve areas to
determine the value of z* such that a central area
of .95 falls between –z* and z*.
[Figure: standard normal curve. Central area = .95, lower tail area = .025, upper tail area = .025, z* = 1.96]
About 95% of the values are within 1.96 standard
deviations of the mean.
We can generalize this to normal distributions other than
the standard normal distribution – about 95% of the values
are within 1.96 standard deviations of the mean.
For large random samples, the sampling distribution of
p̂ is approximately normal. So about 95% of the possible
p̂ will fall within
$1.96\sqrt{\dfrac{p(1-p)}{n}}$
of p.
Developing a Confidence Interval Continued . . .
[Figure: approximate sampling distribution of p̂. Here p is the mean of the sampling distribution, and two lines mark 1.96 standard deviations below and above the mean.]
Suppose we get this p̂ and create an interval
around it:
$\hat{p} \pm 1.96\sqrt{\dfrac{p(1-p)}{n}}$
This p̂ fell within 1.96 standard deviations of the mean
AND its confidence interval “captures” p.
Suppose we get this p̂ and create an interval.
This p̂ fell within 1.96 standard deviations of the mean
AND its confidence interval “captures” p.
Suppose we get this p̂ and create an interval.
This p̂ doesn’t fall within 1.96 standard deviations of
the mean AND its confidence interval does NOT “capture” p.
Notice that the length of each half of the interval equals
$1.96\sqrt{\dfrac{p(1-p)}{n}}$
Using this method of calculation, the confidence
interval will not capture p 5% of the time.
When n is large, a 95% confidence interval for p is
$\hat{p} \pm 1.96\sqrt{\dfrac{p(1-p)}{n}}$
Developing a Confidence Interval Continued . . .
If p̂ is within $1.96\sqrt{\dfrac{p(1-p)}{n}}$ of p,
this means the interval
$\hat{p} - 1.96\sqrt{\dfrac{p(1-p)}{n}}$ to $\hat{p} + 1.96\sqrt{\dfrac{p(1-p)}{n}}$
will capture p.
And this will happen for 95% of
all possible samples!
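The 95% claim can be explored with a small simulation. This is a sketch under stated assumptions: the values p = 0.437 and n = 100 are borrowed from the Cal Poly example, the seed and number of repetitions are arbitrary, and the function name `coverage` is hypothetical. As in the slide, the interval half-width uses the true p.

```python
import random
from math import sqrt

random.seed(1)  # arbitrary seed so the run is repeatable

def coverage(p, n, reps=2000, z=1.96):
    """Fraction of simulated samples whose interval
    p-hat +/- z * sqrt(p(1-p)/n) captures the true p."""
    margin = z * sqrt(p * (1 - p) / n)
    hits = 0
    for _ in range(reps):
        # One sample of n independent trials, each a success with probability p.
        successes = sum(random.random() < p for _ in range(n))
        p_hat = successes / n
        if p_hat - margin <= p <= p_hat + margin:
            hits += 1
    return hits / reps

rate = coverage(p=0.437, n=100)
print(rate)
```

The printed capture rate should land near 0.95, though it will not be exactly 95% because the sample proportion can take only a discrete set of values and the simulation is finite.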
Confidence level
The confidence level associated with a
confidence interval estimate is the success rate
of the method used to construct the interval.
If this method was used to generate an
interval estimate over and over again from
different samples, in the long run 95% (or
whatever confidence level we use) of the
resulting intervals would include the actual value
of the characteristic being estimated.
Our confidence is in the method –
NOT in any ONE particular interval!
The most common confidence levels are
90%, 95%, and 99% confidence.
The diagram to the
right is 100
confidence intervals
for p computed from
100 different random
samples.
Note that the ones
with asterisks do not
capture p.
If we were to
compute 100 more
confidence intervals
for p from 100
different random
samples, would we get
the same results?
Recall the General Properties for
Sampling Distributions of p̂
1. $\mu_{\hat{p}} = p$
2. $\sigma_{\hat{p}} = \sqrt{\dfrac{p(1-p)}{n}}$, as long as the sample size is
less than 10% of the population
3. As long as n is large (np > 10 and
n(1 – p) > 10) the sampling
distribution of p̂ is approximately
normal.
These are the conditions that must be true in order to
calculate a large-sample confidence interval for p.
The Large-Sample Confidence
Interval for p
The general formula for a confidence interval
for a population proportion p . . . is
statistic ± critical value (standard deviation of the statistic)
$\hat{p} \pm (z \text{ critical value})\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
In real life, we often do not know the population
proportion. What value can we use to estimate it?
The point estimate, p̂.
$\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$ is an estimate of the
standard deviation of p̂, or the standard error.
The standard error of a statistic is the estimated
standard deviation of the statistic.
The Large-Sample Confidence
Interval for p
The general formula for a confidence interval
for a population proportion p . . . is
$\hat{p} \pm (z \text{ critical value})\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
The quantity $(z \text{ critical value})\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$ is called the margin of error.
The 95% confidence interval is based on the
fact that, for approximately 95% of all random
samples, p̂ is within the margin of error
estimation of p.
Critical value (z*)
• Found from the confidence level
• The upper z-score with probability p lying
to its right under the standard normal
curve
Confidence level    tail area    z*
90%                 .05          1.645
95%                 .025         1.96
99%                 .005         2.576
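The critical values in the table can be recovered from the standard normal distribution instead of a printed table. This minimal sketch assumes Python's `statistics.NormalDist`; the function name `z_critical` is a hypothetical name chosen here.

```python
from statistics import NormalDist

def z_critical(confidence_level):
    """Upper z-score that leaves (1 - confidence level) / 2 in each tail."""
    tail_area = (1 - confidence_level) / 2
    return NormalDist().inv_cdf(1 - tail_area)

for level in (0.90, 0.95, 0.99):
    tail = (1 - level) / 2
    print(f"{level:.0%}  tail area = {tail:.3f}  z* = {z_critical(level):.3f}")
```

Rounded to three decimals, the results agree with the tabled values 1.645, 1.96, and 2.576.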
The Large-Sample Confidence
Interval for p
The general formula for a confidence
interval for a population proportion p applies
when (assumptions) (STEP 1)
• p̂ is the sample proportion from a random sample
• the sample size n is large (np̂ > 10 and
n(1 – p̂) > 10), and
• if the sample is selected without replacement,
the sample size is small relative to the
population size (at most 10% of the population)
What are the steps for
performing a confidence interval?
1. Assumptions
• Data from a random sample
• Sample size is large enough
• Sample size is small relative to population
size
2. Calculations
3. Conclusion
Conclusion: (memorize!!)
We are ________% confident that
the true proportion of [context] is
between ______ and ______.
The article “How Well Are U.S.
Colleges Run?” (USA Today, February 17,
2010) describes a survey of 1031 adult
Americans. The survey was carried out by the
National Center for Public Policy and the sample
was selected in a way that makes it reasonable to
regard the sample as representative of adult
Americans. Of those surveyed, 567 indicated
that they believe a college education is essential
for success.
What is a 95% confidence interval for the
population proportion of adult Americans who
believe that a college education is essential for
success?
The point estimate is
$\hat{p} = \dfrac{567}{1031} = .55$
Before computing the confidence interval, we
need to verify the conditions.
College Education Continued . . .
What is a 95% confidence interval for the
population proportion of adult Americans who
believe that a college education is essential for success?
Conditions:
1) np̂ = 1031(.55) = 567 and n(1 – p̂) = 1031(.45) = 464,
since both of these are greater than 10, the sample
size is large enough to proceed.
2) The sample size of n = 1031 is much smaller than
10% of the population size (adult Americans).
3) The sample was selected in a way designed to
produce a representative sample. So we can regard
the sample as a random sample from the population.
All our conditions are verified so it is safe to proceed
with the calculation of the confidence interval.
College Education Continued . . .
What is a 95% confidence interval for the
population proportion of adult Americans who
believe that a college education is essential for success?
Calculation:
$\hat{p} \pm (z \text{ critical value})\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
$.55 \pm 1.96\sqrt{\dfrac{.55(.45)}{1031}} = (.52, .58)$
What does this interval mean in the context of this problem?
Conclusion:
We are 95% confident that the population proportion
of adult Americans who believe that a college education
is essential for success is between 52% and 58%.
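The calculation step can be sketched as a small function. This is an illustration, not a definitive implementation: the name `proportion_ci` is hypothetical, the critical value comes from `statistics.NormalDist` instead of a table, and the function checks only the sample-size condition (random sampling and the 10% condition must be judged from how the data were collected).

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence_level=0.95):
    """Large-sample confidence interval for a population proportion."""
    p_hat = successes / n
    # Sample-size condition from the slides: both counts must exceed 10.
    if not (n * p_hat > 10 and n * (1 - p_hat) > 10):
        raise ValueError("sample too small for the large-sample interval")
    z_star = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)  # margin of error
    return p_hat - margin, p_hat + margin

low, high = proportion_ci(567, 1031)
print(round(low, 2), round(high, 2))
```

Calling the function with `confidence_level=0.90` or `0.99` shows how the width of the interval changes with the confidence level.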
College Education Revisited . . .
A 95% confidence interval for the population
proportion of adult Americans who believe that a
college education is essential for success is:
$.55 \pm 1.96\sqrt{\dfrac{.55(.45)}{1031}} = (.52, .58)$
Compute a 90% confidence interval for this
proportion.
$.55 \pm 1.645\sqrt{\dfrac{.55(.45)}{1031}} = (.524, .575)$
Compute a 99% confidence interval for this
proportion.
$.55 \pm 2.58\sqrt{\dfrac{.55(.45)}{1031}} = (.510, .590)$
Recall the “Rate your Confidence” Activity.
What do you notice about the relationship
between the confidence level of an interval
and the width of the interval?
A May 2000 Gallup Poll found that 38% of
a random sample of 1012 adults said that
they believe in ghosts. Find a 95%
confidence interval for the true
proportion of adults who believe in ghosts.
Step 1: check assumptions!
Assumptions:
• Have an SRS of adults
• np̂ = 1012(.38) = 384.56 & n(1 – p̂) = 1012(.62) = 627.44
Since both are greater than 10, the distribution can be
approximated by a normal curve
• Population of adults is at least 10,120.
Step 2: make calculations
$\hat{p} \pm z^*\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}} = .38 \pm 1.96\sqrt{\dfrac{.38(.62)}{1012}} = (.35, .41)$
Step 3: conclusion in context
We are 95% confident that the true proportion of
adults who believe in ghosts is between 35% and
41%.
Choosing a Sample Size
The margin of error estimation for a confidence
interval is
$m = z^*\sqrt{\dfrac{p(1-p)}{n}}$
Before collecting any data, an
investigator may wish to determine a
sample size needed to achieve a
certain margin of error estimation.
What value should be used for the unknown p?
Sometimes, it is feasible to perform a
preliminary study to estimate the value for p.
In other cases, prior knowledge may
suggest a reasonable estimate for p.
If there is no prior knowledge and a
preliminary study is not feasible, then the
conservative estimate for p is 0.5.
Why is the conservative
estimate for p = 0.5?
.1(.9) = .09
.2(.8) = .16
.3(.7) = .21
.4(.6) = .24
.5(.5) = .25
By using .5 for p, we
are using the largest
value for p(1 – p) in
our calculations.
Recall
the activity where we
graphed the histograms for
binomials with different
probabilities of success –
which had the largest
standard deviation?
In spite of the potential safety hazards,
some people would like to have an internet
connection in their car. Determine the
sample size required to estimate the
proportion of adult Americans who would like an
internet connection in their car to within 0.03
with 95% confidence.
What value should be used for p?
$m = z^*\sqrt{\dfrac{p(1-p)}{n}}$
$.03 = 1.96\sqrt{\dfrac{.5(.5)}{n}}$
Here .03 is the value for the margin of error estimate m.
$n = \left(\dfrac{1.96}{.03}\right)^2 (.5)(.5) = 1067.111$
n = 1068 people
Always round the sample size up to the
next whole number.
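Solving the margin of error formula for n gives n = (z*)² p(1 – p) / m², which can be sketched in code. The function name `required_sample_size` and its defaults (z* = 1.96 and the conservative p = 0.5) are assumptions made for this illustration.

```python
from math import ceil

def required_sample_size(margin, z_star=1.96, p=0.5):
    """Smallest whole n with z* * sqrt(p(1-p)/n) <= margin.

    p = 0.5 is the conservative choice because it maximizes p(1 - p).
    """
    n_exact = z_star ** 2 * p * (1 - p) / margin ** 2
    return ceil(n_exact)  # always round up to the next whole number

print(required_sample_size(0.03))
```

Passing a different `p` (from a preliminary study or prior knowledge) gives a smaller required sample size, since p(1 – p) is largest at p = 0.5.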