Sampling Distribution and Confidence Intervals (annotated)

8 - Sampling Distributions and Confidence Intervals for Single
Population Parameters




Sampling Distribution of the Sample Mean (X̄)
Sampling Distribution of the Sample Proportion (p̂)
Confidence Interval for the Population Mean (μ)
Confidence Interval for the Population Proportion (p)
Introduction:
When we take a sample of size n from a population and calculate summary statistics like the
sample mean (X̄), the sample median (Med), the sample variance (s²), the sample
standard deviation (s), or the sample proportion (p̂), we must realize that these quantities
will _________________________________________________________________
and hence are themselves ________________________________________.
Any random variable in statistics has a probability distribution. We have been talking
about three common probability distributions in statistics. When X = # of “successes” in
n independent trials we used the binomial distribution to talk about X probabilistically,
when X = # of occurrences in a fixed time/space unit we used a Poisson distribution, and
finally when X was continuous and had an approximate bell-shaped distribution we used
the normal distribution to calculate probabilities and quantiles associated with X.
Because the summary statistics discussed above are random variables, they also have a
probability distribution that determines the likelihood of certain values of these statistics
being obtained. The distribution of a summary statistic, e.g. the sample mean (X̄), is
called the ______________________________________.
In this handout we explore the sampling distributions of the sample mean (X̄) and the
sample proportion (p̂).
Sampling Distribution of X̄
The sample mean (X̄) is a random quantity that varies from sample to sample. The
probability distribution the sample mean follows is called the sampling distribution of X̄.
The sampling distribution demo I showed in class is found at the following web address:
http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/
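The idea behind the demo can also be sketched in a few lines of Python. The simulation below is a minimal illustration and not part of the original handout: the population (exponential with mean 10), the sample size, and the number of repetitions are all arbitrary assumptions chosen for the example.

```python
import random
import statistics

def simulate_sample_means(n, reps, seed=1):
    """Draw `reps` samples of size n from a skewed population and return their means.

    Assumed population: exponential with mean 10 (so its standard deviation is also 10).
    """
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        sample = [rng.expovariate(1 / 10) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means

means = simulate_sample_means(n=30, reps=5000)
print("mean of the sample means:     ", round(statistics.fmean(means), 2))
print("std. dev. of the sample means:", round(statistics.stdev(means), 2))
print("population sd / sqrt(n):      ", round(10 / 30 ** 0.5, 2))
```

Rerunning with a larger n and plotting a histogram of `means` is a quick way to see how the shape and spread of the sampling distribution change with sample size.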
The Central Limit Theorem (CLT) tells us about the sampling distribution of
the sample mean (X̄). There is also a version (which we will see later) that tells us about
the sampling distribution of the sample proportion (p̂).
The CLT for X̄ says the following:
1.
2.
3. The sampling distribution will be ___________ if either of the conditions
below are met:

or if

We now consider applications of the central limit theorem (CLT).
Applications to Decision Making
Example 1: Cholesterol levels of adult males (50-60 yrs. old)
The mean blood cholesterol level of adult males (50-60 yrs. old) is 200 mg/dl with a
standard deviation of 20 mg/dl. Assume also that blood cholesterol levels are
approximately normally distributed in this population.
a) What is the probability that a sample of size n = 25 would yield a
sample mean greater than 225 mg/dl?
b) Give a range of values within which we would expect the sample mean to fall
approximately 95% of the time.
c) Suppose we took a sample of adult males between the ages of 50 and 60 who are also
strict vegetarians and obtained a sample mean of X̄ = 188 mg/dl. Does this provide
evidence that the subpopulation of vegetarians has a lower mean cholesterol level than
the greater population of men in this age group? Explain.
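Parts (a) and (b) can be checked numerically. The sketch below uses Python's standard-library `statistics.NormalDist` and relies on the fact, stated later in this handout, that the standard error of the sample mean is σ/√n. The variable names are illustrative only, and part (c) is left out because the size of the vegetarian sample is not given.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 200, 20, 25          # population mean, population sd, sample size
se = sigma / sqrt(n)                # standard error of the sample mean
xbar_dist = NormalDist(mu, se)      # sampling distribution of the sample mean

# (a) P(sample mean > 225)
p_above_225 = 1 - xbar_dist.cdf(225)

# (b) central range containing the sample mean about 95% of the time
lower, upper = xbar_dist.inv_cdf(0.025), xbar_dist.inv_cdf(0.975)

print(f"SE = {se}")
print(f"P(sample mean > 225) = {p_above_225:.2e}")
print(f"95% range: {lower:.2f} to {upper:.2f} mg/dl")
```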
Example 2: S/R Ratio
The objectives of a study by Skjelbo et al. (1996) were to examine (a) the relationship
between chloroguanide metabolism and efficacy in malaria prophylaxis and (b) the
mephenytoin metabolism and its relationship to chloroguanide metabolism among
Tanzanians. From information provided by urine specimens from the n = 216 subjects,
the investigators computed the ratio of unchanged S-mephenytoin to R-mephenytoin (S/R
ratio).
Is there evidence that the S/R ratio of vaccinated Tanzanians is greater than .275?
Confidence Intervals for the Population Mean (μ)
Example: Suppose we are trying to estimate the mean birth weight of infants born to women
who smoke during pregnancy. A sample of n = 73 women who smoked during
pregnancy was obtained, and the birth weights of their babies yielded a sample mean
of X̄ = 6.08 lbs.
This is called a _____________________ for the population mean (μ) because it yields a
single value for this unknown quantity.
A better estimate might be 6.08 lbs. give or take _____ lbs., i.e. ______ up to _______.
This is called an __________________________ as it gives a range or interval of
plausible values for the population mean.
How do we know if this is a good interval estimate?
What properties should a good interval estimate have?
1)
2)
The central limit theorem states that if our sample size (n) is sufficiently large, then
\[ \bar{X} \sim N\!\left(\mu, \frac{\sigma}{\sqrt{n}}\right), \]
which also says by standardizing that
\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1). \]
This means that when we collect our data the probability our observed sample mean will
fall within two standard errors of the mean is approximately .95 or a 95% chance, or
more precisely
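\[
P(-2 < Z < 2)
= P\!\left(-2 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 2\right)
= P\!\left(-2\frac{\sigma}{\sqrt{n}} < \bar{X} - \mu < 2\frac{\sigma}{\sqrt{n}}\right)
= P\!\left(\mu - 2\frac{\sigma}{\sqrt{n}} < \bar{X} < \mu + 2\frac{\sigma}{\sqrt{n}}\right)
= .9544
\]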
To make this 95% exactly, we simply use 1.96 in place of 2.00 in the expression above,
because P(-1.96 < Z < 1.96) = .9500. For 99% confidence we use ________ and for 90%
we use ________ in place of 1.96.
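The multipliers for other confidence levels come from the quantile function of the standard normal distribution. As a sketch (using the standard-library `statistics.NormalDist`; the helper name `z_critical` is made up for this example):

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided critical value z such that P(-z < Z < z) = confidence."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)

std_normal = NormalDist()
print("P(-2 < Z < 2) =", round(std_normal.cdf(2) - std_normal.cdf(-2), 4))
for confidence in (0.90, 0.95, 0.99):
    print(f"{confidence:.0%} confidence: z = {z_critical(confidence):.3f}")
```

The exact two-standard-error probability is about .9545; the .9544 quoted above reflects rounding in the normal table.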
Starting with the statement,
\[ P\!\left(-1.96 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = .9500 \]
we can perform similar algebraic manipulations to those above to isolate the population
mean in the middle of the inequality instead. By doing this we will obtain an interval
that has an approximate 95% chance of covering the true population mean (μ).
This says that the interval from
\[ \bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \quad \text{up to} \quad \bar{X} + 1.96\frac{\sigma}{\sqrt{n}} \]
has a 95% chance of covering the true population mean μ. This interval is simply the
sample mean plus or minus roughly two standard errors. However, this interval cannot be
calculated in practice! WHY?
A “simple fix” to this would be to replace ____ by the estimated standard deviation from
our data _____.
The problem with our “simple fix” is that the distribution of
\[ \frac{\bar{X} - \mu}{s/\sqrt{n}} \]
is not a standard normal, i.e. N(0,1)!!!
FACT: If the population we are sampling from is approximately normal then
\[ \frac{\bar{X} - \mu}{s/\sqrt{n}} \]
has a t-distribution with degrees of freedom df = n – 1.
What does a t-distribution look like?
Facts about the t-distribution:



Examples: Using the t-table to find confidence intervals
a) n = 20 and 95% confidence t =
b) n = 20 and 99% confidence t =
c) n = 50 and 90% confidence t =
d) n = 10 and 95% confidence t =
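The table lookups above can be cross-checked without a printed table. Python's standard library has no t-distribution, so the sketch below integrates the t density numerically (Simpson's rule) and inverts the CDF by bisection. The function names are hypothetical helpers written for this illustration; with SciPy installed, `scipy.stats.t.ppf` would do the same job.

```python
from math import exp, lgamma, log, pi

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log(1 + x * x / df))

def t_cdf(x, df, steps=1000):
    """P(T <= x) for x >= 0: 0.5 for the left half plus Simpson's rule on [0, x]."""
    h = x / steps
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(confidence, n):
    """Two-sided critical value for a CI from a sample of size n (df = n - 1)."""
    df = n - 1
    target = 1 - (1 - confidence) / 2      # e.g. 0.975 for 95% confidence
    lo, hi = 0.0, 100.0
    for _ in range(50):                    # bisection on the CDF
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for n, confidence in [(20, 0.95), (20, 0.99), (50, 0.90), (10, 0.95)]:
    print(f"n = {n}, {confidence:.0%} confidence: t = {t_critical(confidence, n):.3f}")
```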
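The basic form of most confidence intervals is:
(estimate) ± (table value) × (SE of estimate)
where (table value) × (SE of estimate) is the MARGIN OF ERROR.
General Form for a Confidence Interval for the Mean
For the population mean we have,
\[ \bar{X} \pm (t\text{-table value}) \cdot SE(\bar{X}) \quad \text{or} \quad \bar{X} \pm t\,\frac{s}{\sqrt{n}} \]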
The appropriate columns in Table A.4 (t-distribution table) for the different confidence
intervals are as follows:
90% Confidence look in the .05 column (if n is “large” we can use 1.645)
95% Confidence look in the .025 column (if n is “large” we can use 1.960)
99% Confidence look in the .005 column (if n is “large” we can use 2.576)
Example: Suppose we are trying to estimate the mean birth weight of infants born to women
who smoke during pregnancy. A sample of n = 73 women from Baltimore who smoked
during pregnancy was obtained, and the birth weights of their babies yielded a sample mean
of X̄ = 6.08 lbs. with a sample standard deviation of s = 1.45 lbs.
Use this information to find a 95% CI for the mean birth weight of infants born to
mothers who smoked during pregnancy, assuming that birth weights for this
population are normally distributed.
Suppose a sample of n = 113 Baltimore mothers who did not smoke during pregnancy
was obtained, yielding a sample mean birth weight of X̄ = 6.71 lbs. with a standard deviation
of s = 1.66 lbs.
a) Find a 95% confidence interval for the mean birth weight of infants born to
nonsmoking mothers.
b) Does this interval in conjunction with the interval obtained for mothers who smoked
during pregnancy provide evidence that infants born to smoking mothers have a lower
mean birth weight?
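Both intervals in this example can be computed with a small helper. The sketch below uses the large-sample table value 1.960 noted above for 95% confidence. That is an assumption of convenience: the exact t values for df = 72 and df = 112 are slightly larger and give slightly wider intervals. The function name `mean_ci` is made up for illustration.

```python
from math import sqrt

def mean_ci(xbar, s, n, table_value=1.960):
    """Confidence interval for a mean: xbar +/- (table value) * s / sqrt(n)."""
    margin = table_value * s / sqrt(n)
    return xbar - margin, xbar + margin

smokers = mean_ci(xbar=6.08, s=1.45, n=73)
nonsmokers = mean_ci(xbar=6.71, s=1.66, n=113)

print("smokers:    %.2f to %.2f lbs" % smokers)
print("nonsmokers: %.2f to %.2f lbs" % nonsmokers)
print("intervals overlap:", smokers[1] >= nonsmokers[0])
```

Passing a t-table value as `table_value` reproduces the t-based interval from the general form above.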
Sampling Distribution of the Sample Proportion ( p̂ )
As with the sample mean (X̄), the sample proportion (p̂) is also random, as it too varies
from sample to sample. The sampling distribution of p̂ has the following properties:
1. The mean of the sampling distribution is the population proportion (p)
2. The standard deviation of the sampling distribution, or the standard error of
p̂, is given by:
\[ SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}} \]
where p = population proportion (unknown) and n = sample size.
3. The sampling distribution is approx. normal provided n is “sufficiently large”:
np ≥ 5 and n(1 – p) ≥ 5
* Note: some recommend using 10 in place of 5.
Note: When estimating proportions large sample sizes are generally used (e.g. n > 100).
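The standard error formula in property 2 can be checked by simulation. In the sketch below, the values p = 0.3 and n = 100 are arbitrary assumptions for illustration, not figures from the handout.

```python
import random
from math import sqrt
from statistics import fmean, stdev

def simulate_p_hats(p, n, reps, seed=2):
    """Return `reps` simulated sample proportions, each from n independent trials."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]

p, n = 0.3, 100                      # assumed values for illustration
p_hats = simulate_p_hats(p, n, reps=5000)

print("np =", n * p, " n(1 - p) =", n * (1 - p))   # large-sample conditions
print("mean of the simulated proportions:", round(fmean(p_hats), 3))
print("sd of the simulated proportions:  ", round(stdev(p_hats), 3))
print("sqrt(p(1 - p)/n):                 ", round(sqrt(p * (1 - p) / n), 3))
```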
APPLICATIONS TO DECISION MAKING
Example: New Method for Treating a Certain Illness/Disease
Suppose the current treatment method for a certain disease has a 70% success rate. A new
method has been proposed that will hopefully have a higher success rate. The new
method is administered to a sample of n = 50 patients, and 40 have successful treatment.
Can we conclude on the basis of this result that the new method has a higher success
rate?
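One way to approach this question is to ask how likely a sample proportion of 40/50 or higher would be if the new method were no better than the current one (p = 0.70). The sketch below uses the normal approximation described above; it illustrates that reasoning and is not necessarily the calculation done in class.

```python
from math import sqrt
from statistics import NormalDist

p0, n, successes = 0.70, 50, 40
p_hat = successes / n

# Large-sample conditions for the normal approximation
print("np =", round(n * p0, 1), " n(1 - p) =", round(n * (1 - p0), 1))

se = sqrt(p0 * (1 - p0) / n)                 # SE of p-hat if p really is 0.70
z = (p_hat - p0) / se
prob = 1 - NormalDist().cdf(z)               # P(p-hat >= 0.80) when p = 0.70

print(f"p-hat = {p_hat:.2f}, SE = {se:.4f}, z = {z:.2f}")
print(f"P(p-hat >= {p_hat:.2f} when p = {p0:.2f}) is about {prob:.3f}")
```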
CONFIDENCE INTERVALS FOR THE POPULATION PROPORTION (p)
Motivating Example: In a study conducted to investigate the non-clinical factors
associated with the method of surgical treatment received for early-stage breast cancer,
some patients underwent a modified radical mastectomy while others had a partial
mastectomy accompanied by radiation therapy. We are interested in determining whether
the age of the patient affects the type of treatment she receives. In particular, we want to
know whether the proportions of women under 55 are identical in the two treatment
groups.
A sample of n = 658 women who underwent a partial mastectomy and subsequent
radiation therapy contains 292 women under 55, which is a sample percentage of 44%.
A better estimate might be 44% give or take 4%, i.e. estimating that the actual percentage
of women under the age of 55 among those who receive this form of treatment is between 40% and
48%. This is called an “interval estimate”, as it gives a range or interval of plausible
values for the population proportion/percentage. As with the population mean discussed
earlier, we wish this interval to be narrow enough to provide useful information about
this unknown percentage, yet have a high probability or chance of covering the actual
percentage of women under 55 amongst those opting for this course of treatment for
early-stage breast cancer.
The central limit theorem for proportions states that if our sample size (n) is sufficiently
large, then
\[ \hat{p} \sim N\!\left(p, \sqrt{\frac{p(1-p)}{n}}\right). \]
This means that when we take our sample and find our
sample proportion, p̂, the probability our observed sample proportion will fall within
approximately two standard errors of the population proportion is roughly 95%, or more
precisely
\[ P\!\left(p - 1.96\sqrt{\frac{p(1-p)}{n}} < \hat{p} < p + 1.96\sqrt{\frac{p(1-p)}{n}}\right) = .9500 \]
(Recall: P(-1.96 < Z < 1.96) = .9500.)
Starting with this statement we can perform some algebraic manipulations to isolate the
population proportion, p, in the middle of the inequality above. By doing this we will see
that the resulting interval will have a 95% chance of covering the true population
proportion (p).
After a Wonderfully Simple Mathematical Derivation:

This says that the interval from
\[ \hat{p} - 1.96\sqrt{\frac{p(1-p)}{n}} \quad \text{up to} \quad \hat{p} + 1.96\sqrt{\frac{p(1-p)}{n}} \]
has a 95% chance of covering the true population proportion p. This interval is simply the sample
proportion plus or minus roughly two standard errors, i.e. p̂ ± 1.96 × SE(p̂). However,
this interval cannot be calculated in practice! WHY?
A simple fix is to replace ______ by our sample based estimate ________. Provided the
sample size is sufficiently large, the resulting interval will still have an approximate 95%
chance of covering the true population proportion. This gives what we should technically
call the estimated standard error of the proportion, but when we say “standard error of the
proportion” it is assumed this estimated version is the one we are talking about because in
reality the population proportion p is NOT known. If p were known we would not be
conducting a study in the first place!
General Form for a C for Population Proportion (p)
estimate  (table value)  (estimated standard error of estimate)
pˆ  (normal table value) 
Margin of Error  z
pˆ (1  pˆ )
n
or
pˆ  z
pˆ (1  pˆ )
n
pˆ (1  pˆ )
n
Normal Table Values:
95% Confidence we use z = 1.96
90% Confidence we use z = 1.645
99% Confidence we use z = 2.576
Again we see the confidence interval has the basic form:
ESTIMATE ± (TABLE VALUE) × (STANDARD ERROR OF THE ESTIMATE)
where (TABLE VALUE) × (STANDARD ERROR OF THE ESTIMATE) is the MARGIN OF ERROR.
In other words we take our estimate plus or minus a certain number of standard errors to
obtain the confidence interval, i.e. plus or minus the margin of error.
Example: Early-Stage Breast Cancer Treatment Method and Age (cont’d)
A sample of n = 658 women who underwent a partial mastectomy and subsequent
radiation therapy contains 292 women under 55, which is a sample percentage of 44.4%.
Find a 95% CI for the true proportion of women under 55 in this population.
In a sample of n = 1580 women who received a modified radical mastectomy 397 women
were under 55, which is a sample percentage of 25.1%. Find a 95% CI for the true
proportion of women under 55 in this population.
Do these intervals suggest that the proportion of women under the age of 55 differs
significantly for these two courses of treatment of early-stage breast cancer?
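A short helper makes the two intervals easy to compare. The sketch below applies the general form p̂ ± z·sqrt(p̂(1 – p̂)/n) with z = 1.96; the name `proportion_ci` is a made-up helper, not something from the course materials.

```python
from math import sqrt

def proportion_ci(successes, n, z=1.96):
    """Large-sample CI for a proportion: p-hat +/- z * sqrt(p-hat * (1 - p-hat) / n)."""
    p_hat = successes / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

partial = proportion_ci(292, 658)      # partial mastectomy + radiation
radical = proportion_ci(397, 1580)     # modified radical mastectomy

print("partial mastectomy: %.3f to %.3f" % partial)
print("radical mastectomy: %.3f to %.3f" % radical)
print("intervals overlap:", partial[0] <= radical[1] and radical[0] <= partial[1])
```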
One-Sided Confidence Intervals
One-Sided CI’s for the Population Mean (μ)
Lower Bound for μ
\[ \bar{X} - t\,\frac{s}{\sqrt{n}} \]
Upper Bound for μ
\[ \bar{X} + t\,\frac{s}{\sqrt{n}} \]
Where t comes from the t-distribution with df = n – 1. The appropriate columns in Table
A.4 for the different confidence intervals are as follows:
90% Confidence look in the .10 column
95% Confidence look in the .05 column
99% Confidence look in the .01 column
One-Sided CI’s for the Population Proportion (p)
Lower Bound for p
\[ \hat{p} - z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]
Upper Bound for p
\[ \hat{p} + z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]
Where z comes from the standard normal distribution. The appropriate values for the
different confidence intervals are as follows:
90% Confidence use z = 1.282
95% Confidence use z = 1.645
99% Confidence use z = 2.326
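For a one-sided bound the whole error probability sits in one tail, so the critical value is simply the quantile at the confidence level itself. As a sketch, the hypothetical helper below computes a lower bound for p and applies it to the partial-mastectomy sample from earlier in the handout (292 of 658).

```python
from math import sqrt
from statistics import NormalDist

def lower_bound_for_p(successes, n, confidence=0.95):
    """One-sided lower confidence bound: p-hat - z * sqrt(p-hat * (1 - p-hat) / n)."""
    z = NormalDist().inv_cdf(confidence)     # one-sided critical value
    p_hat = successes / n
    return p_hat - z * sqrt(p_hat * (1 - p_hat) / n)

for confidence in (0.90, 0.95, 0.99):
    z = NormalDist().inv_cdf(confidence)
    print(f"{confidence:.0%} one-sided: z = {z:.3f}")

print("95% lower bound for p:", round(lower_bound_for_p(292, 658), 3))
```

An upper bound follows the same pattern with the sign of the margin flipped.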