Sampling Distributions and Confidence Intervals

9 - Sampling Distributions and Confidence Intervals for μ & p
Introduction:
When we take a sample of size n from a population and calculate summary statistics like the
sample mean (X̄), the sample median (med), the sample variance (s²), the sample
standard deviation (s), or the sample proportion (p̂), we must realize that these quantities
will __________________________________________ and hence are themselves
___________________.
Any random variable in statistics has a probability distribution. We have been talking
about three common probability distributions in statistics. When X = # of “successes” in
n independent trials we used the binomial distribution to talk about X probabilistically,
when X = # of occurrences in a fixed time/space unit we used the Poisson distribution,
and finally when X was continuous and had an approximate bell-shaped distribution we
used the normal distribution to calculate probabilities and quantiles associated with X.
Because the summary statistics discussed above are random variables they also have a
probability distribution that determines the likelihood of certain values of these statistics
being obtained. The distribution of a summary statistic, e.g. the sample mean (X̄), is
called the ______________________________________.
In this handout we explore the sampling distributions of the sample mean (X̄) and the
sample proportion (p̂).
Sampling Distribution of X̄
The sample mean (X̄) is a random quantity that varies from sample to sample. The
probability distribution the sample mean follows is called the sampling distribution of X̄.
The sampling distribution demo I showed in class is found at the following web address:
http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/
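A small simulation can mimic what the demo shows. The sketch below is only an illustration: the skewed population (exponential with mean 10), the sample size, the number of repetitions, and the function names are all assumptions chosen for the example, not values from the handout.

```python
import random
import statistics


def draw_exponential(rng):
    """One observation from a hypothetical skewed population (mean 10, SD 10)."""
    return rng.expovariate(1 / 10)


def simulate_sample_means(draw, n, reps, seed=0):
    """Draw `reps` samples of size n and return the list of their sample means."""
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(reps)]


means = simulate_sample_means(draw_exponential, n=25, reps=20000)
center = statistics.fmean(means)   # should land near the population mean, 10
spread = statistics.stdev(means)   # should land near 10 / sqrt(25) = 2
print(round(center, 2), round(spread, 2))
```

Even though the population here is far from bell-shaped, a histogram of `means` would look roughly normal, which is the behavior the demo illustrates.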
The Central Limit Theorem for the Sample Mean (CLT) tells us about
the sampling distribution of the sample mean (X̄). There is also a version (which we
will see later) that tells us about the sampling distribution of the sample proportion (p̂).
The CLT for X̄ says the following:
1.
2.
3. The sampling distribution will be ___________ if either of the conditions
below are met:

or if

We now consider applications of the central limit theorem (CLT).
Applications to Decision Making
Example 1: Cholesterol levels of adult males (50-60 yrs. old)
The mean blood cholesterol level of adult males (50-60 yrs. old) is 200 mg/dl with a
standard deviation of 20 mg/dl. Assume also that blood cholesterol levels are
approximately normally distributed in this population.
a) What is the probability that a sample of size n = 25 would yield a
sample mean greater than 225 mg/dl?
b) Give a range of values within which we would expect the sample mean to fall approximately
95% of the time.
c) Suppose we took a sample of adult males between the ages of 50-60 who are also
strict vegetarians and obtained a sample mean of X̄ = 188 mg/dl. Does this provide
evidence that the subpopulation of vegetarians has a lower mean cholesterol level than
the greater population of men in this age group? Explain.
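Parts (a) and (b) of Example 1 can be checked numerically. This is a sketch using Python's standard-library `NormalDist`, assuming n = 25 for both parts; part (c) is not computed because the sample size for the vegetarian sample is not stated.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 200, 20, 25        # population mean, SD, and sample size from Example 1
se = sigma / sqrt(n)              # standard error of the sample mean: 20 / 5 = 4
xbar_dist = NormalDist(mu, se)    # sampling distribution of the sample mean

# (a) P(sample mean > 225): 225 is 6.25 standard errors above the mean
p_above_225 = 1 - xbar_dist.cdf(225)

# (b) central 95% range: mu plus or minus 1.96 standard errors
low, high = mu - 1.96 * se, mu + 1.96 * se
print(p_above_225, (round(low, 2), round(high, 2)))
```

The probability in (a) is essentially zero, and the range in (b) runs from about 192 to 208 mg/dl.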
Example 2: Mercury Levels Found in Boulder Reservoir Walleyes
Fish consumption guidelines suggest you should limit the number of fish you eat with Hg
levels above .25 ppm. Is there evidence to suggest that walleyes from Boulder Reservoir
have a mean Hg content exceeding .25 ppm?
Confidence Intervals for the Population Mean μ
Motivating Example: Suppose we are trying to estimate the mean protein content of
zebra mussels, which are becoming an increased part of the diet for ducks on the
Mississippi River. A sample of n = 25 zebra mussels is analyzed for protein
content, giving a sample mean of X̄ = 9.14 units.
This is called a _____________________ for the population mean (μ) because it yields a
single value for this unknown quantity.
A better estimate might be 9.14 give or take _____ units, i.e. ______ up to _______.
This is called an __________________________ as it gives a range or interval of
plausible values for the population mean.
How do we know if this is a good interval estimate? __________________
What properties should a good interval estimate have?
 It

The central limit theorem states that if our sample size (n) is sufficiently large, then
$\bar{X} \sim N(\mu, \sigma/\sqrt{n})$, which also implies that after standardizing,
$Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$.
This means that when we collect our data the probability our observed sample mean will
fall within two standard errors of the mean is approximately .95 or a 95% chance, or
being more precise we could use ±1.96 standard errors because
$P(-1.96 \le Z \le 1.96) = .9500$
which gives
$P\left(\mu - 1.96\,\dfrac{\sigma}{\sqrt{n}} \le \bar{X} \le \mu + 1.96\,\dfrac{\sigma}{\sqrt{n}}\right) = .9500$
For a 99% chance we use _______ and for 90% we use ________ in place of 1.96.
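The multipliers for other confidence levels come from the same standard normal calculation. The sketch below uses the standard library's `NormalDist`; the helper names `central_prob` and `z_multiplier` are made up for this illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal, N(0, 1)


def central_prob(z):
    """P(-z <= Z <= z) for a standard normal Z."""
    return Z.cdf(z) - Z.cdf(-z)


def z_multiplier(confidence):
    """Value z such that P(-z <= Z <= z) equals `confidence`."""
    tail = (1 - confidence) / 2        # area left in each tail
    return Z.inv_cdf(1 - tail)


print(round(central_prob(1.96), 4))
for conf in (0.90, 0.95, 0.99):
    print(conf, round(z_multiplier(conf), 3))
```

Each multiplier leaves half of the leftover probability in each tail, which is why 95% confidence pairs with the .025 tail area.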
Starting with the statement,
$P(-1.96 \le Z \le 1.96) = P\left(-1.96 \le \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = .9500$
we will perform algebraic manipulations to isolate the population mean μ in the middle
of this inequality instead. By doing this we will obtain an interval that has a 95% chance
of covering the true population mean.
Algebraic Manipulations of the Inequality Above:
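One way to carry out the manipulation is sketched below. These are the standard steps (multiply through by the standard error, subtract the sample mean, then multiply by −1 and reverse the inequalities); the route taken in class may differ in detail.

```latex
\begin{align*}
-1.96 \le \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \le 1.96
&\iff -1.96\,\frac{\sigma}{\sqrt{n}} \le \bar{X}-\mu \le 1.96\,\frac{\sigma}{\sqrt{n}} \\
&\iff -\bar{X}-1.96\,\frac{\sigma}{\sqrt{n}} \le -\mu \le -\bar{X}+1.96\,\frac{\sigma}{\sqrt{n}} \\
&\iff \bar{X}-1.96\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X}+1.96\,\frac{\sigma}{\sqrt{n}}
\end{align*}
```

Because every step is an equivalence, the final statement has the same probability, .9500, as the first.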
This says that the interval from $\bar{X} - 1.96\,\dfrac{\sigma}{\sqrt{n}}$ up to $\bar{X} + 1.96\,\dfrac{\sigma}{\sqrt{n}}$ has a 95% chance of
covering the true population mean μ. This interval is simply the sample mean plus or
minus roughly two standard errors. However, this interval cannot be calculated in
practice! WHY?
A “simple fix” to this would be to replace ____ by the estimated standard deviation from
our data _____.
The problem with our “simple fix” is that the distribution of
$\dfrac{\bar{X} - \mu}{s/\sqrt{n}}$ is not standard
normal, i.e. N(0,1); therefore the 1.96 value will not necessarily produce the desired level
of confidence.
FACT: If the population we are sampling from is approximately normal then
$\dfrac{\bar{X} - \mu}{s/\sqrt{n}}$
has a t-distribution with degrees of freedom df = n – 1.
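A simulation makes the difference between the two multipliers visible. The sketch below assumes a normal population with arbitrary illustrative parameters (mean 50, SD 8, n = 10), and the value 2.262 is the standard t-table entry for df = 9 in the .025 column; none of these numbers come from the handout's examples.

```python
import random
from math import sqrt


def coverage(multiplier, n=10, mu=50.0, sigma=8.0, reps=10000, seed=1):
    """Fraction of intervals xbar +/- multiplier * s / sqrt(n) that cover mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        s = sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))  # sample SD
        half_width = multiplier * s / sqrt(n)
        hits += (xbar - half_width <= mu <= xbar + half_width)
    return hits / reps


cover_z = coverage(1.96)    # normal multiplier combined with s
cover_t = coverage(2.262)   # t multiplier, df = 9
print(cover_z, cover_t)
```

With small n, the intervals built from 1.96 cover the true mean noticeably less often than 95% of the time, while the t-based intervals come out close to 95%.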
What does a t-distribution look like?
Facts about the t-distribution:



Examples: Using the t-table to find confidence intervals
a) n = 20 and 95% confidence    t = ______
b) n = 20 and 99% confidence    t = ______
c) n = 50 and 90% confidence    t = ______
d) n = 10 and 95% confidence    t = ______
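Table lookups like these can be cross-checked numerically. The sketch below is not how the printed table was produced; it is one possible stdlib-only approach that integrates the t density with Simpson's rule and finds the multiplier by bisection. The function names are hypothetical.

```python
from math import exp, lgamma, log, pi


def t_pdf(x, df):
    """Density of the t-distribution with `df` degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log(1 + x * x / df))


def t_cdf(x, df, steps=1000):
    """P(T <= x) for x >= 0, using Simpson's rule on [0, x]."""
    h = x / steps
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_table_value(confidence, n):
    """Multiplier t with central area `confidence` and df = n - 1."""
    target = 1 - (1 - confidence) / 2   # e.g. 0.975 for 95% confidence
    lo, hi = 0.0, 20.0
    for _ in range(50):                 # bisection on the CDF
        mid = (lo + hi) / 2
        if t_cdf(mid, n - 1) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


for n, conf in [(20, 0.95), (20, 0.99), (50, 0.90), (10, 0.95)]:
    print(n, conf, round(t_table_value(conf, n), 3))
```

The printed values should agree with the t-table to the table's rounding; a printed table may not list df = 49 exactly, in which case the nearest row is typically used.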
The basic form of most confidence intervals is:
(estimate) ± (table value) × (SE of estimate)
where (table value) × (SE of estimate) is the MARGIN OF ERROR.
General Form for a Confidence Interval for the Mean
For the population mean μ we have,
$\bar{X} \pm (t\text{-table value}) \times SE(\bar{X})$ or $\bar{X} \pm t\,\dfrac{s}{\sqrt{n}}$
The appropriate columns in the t-distribution table for the different confidence intervals are
as follows:
90% Confidence look in the .05 column (if n is “large” we can use 1.645)
95% Confidence look in the .025 column (if n is “large” we can use 1.960)
99% Confidence look in the .005 column (if n is “large” we can use 2.576)
Example: Suppose we are trying to estimate the mean protein content of zebra mussels,
which are becoming an increased part of the diet for ducks on the Mississippi River. A
sample of n = 25 zebra mussels is analyzed for protein content, giving a sample mean
of X̄ = 9.14 units with a sample standard deviation of s = 2.98 units.
a) Use this information to find a 95% CI for the mean protein content found in the tissues
of zebra mussels, assuming that protein content of zebra mussels has a normal
distribution.
Suppose a sample of n = 25 freshwater clams was obtained and similar protein analysis
was conducted, resulting in a sample mean X̄ = 26.66 units with a standard deviation of s
= 12.12 units.
b) Find a 95% confidence interval for the mean protein content found in the tissue of
freshwater clams.
c) Does this interval in conjunction with the interval obtained for zebra mussels provide
evidence that freshwater clams are richer in protein than zebra mussels?
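The two intervals in parts (a) and (b) can be computed as follows. This is a sketch: the multiplier 2.064 is assumed to be the value read from a standard t-table for df = 24 in the .025 column, and the helper name `t_interval` is made up for the example.

```python
from math import sqrt


def t_interval(xbar, s, n, t_value):
    """Return (lower, upper) for xbar +/- t_value * s / sqrt(n)."""
    margin = t_value * s / sqrt(n)
    return xbar - margin, xbar + margin


T_24 = 2.064  # assumed t-table value: df = 25 - 1 = 24, .025 column

mussels = t_interval(9.14, 2.98, 25, T_24)
clams = t_interval(26.66, 12.12, 25, T_24)
overlap = clams[0] <= mussels[1]

print([round(v, 2) for v in mussels])
print([round(v, 2) for v in clams])
print("intervals overlap:", overlap)
```

For part (c), intervals that do not overlap are consistent with clams having the higher mean protein content, though comparing two separate intervals is an informal check rather than a formal test.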
Sampling Distribution of the Sample Proportion ( p̂ )
Just like the sample mean (X̄), the sample proportion (p̂) is random, as it too varies
from sample to sample. The sampling distribution of p̂ has the following properties:
1. The mean of the sampling distribution is the population proportion (p).
2. The standard deviation of the sampling distribution, or the standard error of
p̂, is given by:
$\sigma_{\hat{p}} = \sqrt{\dfrac{p(1-p)}{n}} = SE(\hat{p})$ where
p = population proportion (unknown)
n = sample size
3. The sampling distribution is approx. normal provided n is “sufficiently large”:
$np \ge 5$ and $n(1-p) = nq \ge 5$
Note: When estimating proportions large sample sizes are generally used
(e.g. n > 100)
APPLICATIONS TO DECISION MAKING
Example: New Method for Treating a Certain Illness/Disease
Suppose the current treatment method for a certain disease has a 70% success rate. A new
method has been proposed that will hopefully have a higher success rate. The new
method is administered to a sample of n = 50 patients and 40 have successful treatment.
Can we conclude on the basis of this result that the new method has a higher success
rate?
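One way to approach this example is with the normal approximation to the sampling distribution of p̂, assuming the new method is no better than the current one (p = 0.70). The sketch below uses only the numbers given in the example; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p0, n, successes = 0.70, 50, 40
p_hat = successes / n                 # observed sample proportion, 0.80
se = sqrt(p0 * (1 - p0) / n)          # standard error of p-hat if p = 0.70

# Check the "sufficiently large" conditions np >= 5 and n(1 - p) >= 5
large_enough = n * p0 >= 5 and n * (1 - p0) >= 5

# Chance of a sample proportion of 0.80 or more if p really is 0.70
p_value = 1 - NormalDist(p0, se).cdf(p_hat)
print(large_enough, round(se, 4), round(p_value, 4))
```

A small probability here would suggest that a result of 40 out of 50 is unlikely under a 70% success rate; how small is small enough is a judgment left to the discussion in class.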
Example (cont’d)
Using the Binomial Table (this is called the Binomial Exact Test, see Section 7)
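The exact binomial probability can also be computed directly instead of being read from the table. This is a sketch using `math.comb`; the function name `binom_upper_tail` is hypothetical.

```python
from math import comb


def binom_upper_tail(k, n, p):
    """P(X >= k) when X is binomial with n trials and success probability p."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k, n + 1))


# Chance of 40 or more successes in 50 patients if the success rate is still 70%
p_exact = binom_upper_tail(40, 50, 0.70)
print(round(p_exact, 4))
```

This adds up the binomial probabilities for 40, 41, ..., 50 successes, which is the same sum one would form from the binomial table.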
CONFIDENCE INTERVALS FOR THE POPULATION PROPORTION
Motivating Example: A study of 200 rainbow trout caught on baited size 8 barbed
hooks and released with the line cut at the hook (but the hook not removed from the fish)
showed that 58 fish died (from the National Symposium on Catch and Release Fishing).
An estimate of the proportion of trout that die when caught and released in this fashion is
.29 or 29%. A better estimate might be 29% give or take 4%, i.e. estimating that the
actual percentage that will die to be somewhere between 25% and 33%. This is called
an “interval estimate”, as it gives a range or interval of plausible values for the
population proportion/percentage. As with the population mean discussed earlier, we
wish this interval to be narrow enough to provide useful information about this unknown
percentage, yet have a high probability or chance of covering the actual percentage of
trout that will die under this catch and release strategy.
The central limit theorem for proportions states that if our sample size (n) is sufficiently
large, then $\hat{p} \sim N\left(p, \sqrt{\dfrac{p(1-p)}{n}}\right)$. This means that when we take our sample and find our
sample proportion, p̂, the probability our observed sample proportion will fall within
approximately two standard errors of the population proportion is roughly 95%, or more
precisely
$P\left(p - 1.96\sqrt{\dfrac{p(1-p)}{n}} \le \hat{p} \le p + 1.96\sqrt{\dfrac{p(1-p)}{n}}\right) = .9500$
(Recall: $P(-1.96 \le Z \le 1.96) = .9500$)
Starting with this statement we can perform some algebraic manipulations to isolate the
population proportion, p, in the middle of the inequality above. By doing this we will see
that the resulting interval will have a 95% chance of covering the true population
proportion (p).
After a wonderful algebraic manipulation of the inequality above:

This says that the interval from $\hat{p} - 1.96\sqrt{\dfrac{p(1-p)}{n}}$ up to $\hat{p} + 1.96\sqrt{\dfrac{p(1-p)}{n}}$ has a 95%
chance of covering the true population proportion p. This interval is simply the sample
proportion plus or minus roughly two standard errors, i.e. $\hat{p} \pm 1.96\,SE(\hat{p})$. However, this
interval cannot be calculated in practice! WHY?
A simple fix is to replace ______ by our sample based estimate ________. Provided the
sample size is sufficiently large the resulting interval will still have an approximate 95%
chance of covering the true population proportion. This gives what we should technically
call the estimated standard error of the proportion, but when we say “standard error of the
proportion” it is assumed this estimated version is the one we are talking about because in
reality the population proportion p is NOT known. If p were known we would not be
conducting a study in the first place!
General Form for a CI for the Population Proportion (p)
estimate ± (table value) × (estimated standard error of estimate)
$\hat{p} \pm (\text{normal table value}) \times \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$ or $\hat{p} \pm z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
Margin of Error = $z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
Normal Table Values:
95% Confidence we use z = 1.96
90% Confidence we use z = 1.645
99% Confidence we use z = 2.576
Example: Mortality of Released Rainbow Trout with Barbed Hooks (cont’d)
A study of 200 rainbow trout caught on baited size 8 barbed hooks and released with the
line cut at the hook (but the hook not removed from the fish) showed that 58 fish died
(from the National Symposium on Catch and Release Fishing). Use this information to
construct a 95% confidence interval for the percentage of rainbow trout that will die
when caught and released using baited size 8 barbed hooks.
Mortality of Released Rainbow Trout Caught with Barbless Hooks
In a similar study of 855 rainbow trout caught using barbless hooks it was found that 26
of them died. Use this information to construct a 95% confidence interval for the
percentage of rainbow trout that will die when caught and released using barbless hooks.
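Both trout intervals follow the same recipe, p̂ ± z × (estimated standard error). The sketch below applies it to the two studies using z = 1.96; the function name `proportion_ci` is made up for the example.

```python
from math import sqrt


def proportion_ci(successes, n, z=1.96):
    """Large-sample CI for a proportion: p-hat +/- z * sqrt(p-hat(1 - p-hat) / n)."""
    p_hat = successes / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


barbed = proportion_ci(58, 200)      # 58 of 200 died with barbed hooks
barbless = proportion_ci(26, 855)    # 26 of 855 died with barbless hooks

print([round(100 * v, 1) for v in barbed])     # interval in percent
print([round(100 * v, 1) for v in barbless])   # interval in percent
```

Multiplying the endpoints by 100 expresses each interval as a percentage, matching the wording of the questions.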
Comparing the Mortality Rates
Does this interval suggest that the mortality rate of rainbow trout is lower when caught
and released using barbless hooks as opposed to barbed? Explain.
Construct a 2 X 2 contingency table from the combined results of these studies. Then
find the RR and OR associated with using barbed hooks when catching and releasing
rainbow trout.
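The 2 X 2 table and the two measures can be laid out as below. This sketch treats barbed hooks as the risk factor and death as the outcome, and it assumes the counts from the two separate studies may simply be combined into one table, as the exercise suggests.

```python
# Rows = hook type, columns = (died, survived)
barbed_died, barbed_survived = 58, 200 - 58        # 58 and 142
barbless_died, barbless_survived = 26, 855 - 26    # 26 and 829

# Relative risk: P(died | barbed) / P(died | barbless)
rr = (barbed_died / 200) / (barbless_died / 855)

# Odds ratio: odds of dying with barbed hooks / odds of dying with barbless hooks
odds_ratio = (barbed_died * barbless_survived) / (barbed_survived * barbless_died)

print(round(rr, 2), round(odds_ratio, 2))
```

The odds ratio comes out larger than the relative risk here because death is not a rare outcome in the barbed-hook group.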
10 – Confidence Intervals for the RR and OR
                        Disease Present    Disease Absent
Risk factor present            a                  b
Risk factor absent             c                  d

Recall that the
$RR = \dfrac{P(\text{Disease} \mid \text{Risk Factor Present})}{P(\text{Disease} \mid \text{Risk Factor Absent})} = \dfrac{a/(a+b)}{c/(c+d)}$
The RR can only be calculated when the numbers of individuals with and without the disease
in the study are random. If a case-control study is used, where these numbers are fixed, it
is inappropriate to calculate the conditional probabilities needed to find the RR.
The
$OR = \dfrac{P(\text{Disease} \mid \text{Risk}) \,/\, [1 - P(\text{Disease} \mid \text{Risk})]}{P(\text{Disease} \mid \text{NoRisk}) \,/\, [1 - P(\text{Disease} \mid \text{NoRisk})]} = \dfrac{ad}{bc}$
and gives a measure of the risk associated with the risk factor in terms of a multiplicative
statement regarding the odds of having the disease. We now examine confidence
intervals for these quantities.
CI for RR:
1) Take natural logarithm of RR to obtain ln(RR).
2) Compute $SE(\ln(RR)) = \sqrt{\dfrac{b}{a(a+b)} + \dfrac{d}{c(c+d)}}$
3) Find $\ln(RR) \pm 1.96 \times SE(\ln(RR))$ to obtain (LCL, UCL)
4) 95% CI for RR is then given by $(e^{LCL}, e^{UCL})$
Smoking and Birthweight Example:

Smoking Status    Normal Birth Weight    Low Birth Weight    Row Totals
Nonsmoker         368 (96.59%)           13 (3.41%)          381
Smoker            271 (90.64%)           28 (9.36%)          299
Column Totals     639                    41                  680
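The four steps for the RR interval can be applied to the smoking data as sketched below. The mapping of table cells to a, b, c, d is an assumption made for this illustration: low birth weight is treated as the "disease" and smoking as the risk factor, so a = 28, b = 271, c = 13, d = 368.

```python
from math import exp, log, sqrt

# Assumed mapping: risk factor = smoker, disease = low birth weight
a, b = 28, 271    # smokers: low birth weight, normal birth weight
c, d = 13, 368    # nonsmokers: low birth weight, normal birth weight

rr = (a / (a + b)) / (c / (c + d))

# Step 1: natural log of RR
ln_rr = log(rr)
# Step 2: standard error of ln(RR)
se_ln_rr = sqrt(b / (a * (a + b)) + d / (c * (c + d)))
# Step 3: limits on the log scale
lcl, ucl = ln_rr - 1.96 * se_ln_rr, ln_rr + 1.96 * se_ln_rr
# Step 4: back-transform to the RR scale
rr_ci = (exp(lcl), exp(ucl))

print(round(rr, 2), [round(v, 2) for v in rr_ci])
```

An interval that lies entirely above 1 would indicate a higher risk of low birth weight among smokers.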
95% CI for OR:
1) Take natural logarithm of OR to obtain ln(OR).
2) Compute $SE(\ln(OR)) = \sqrt{\dfrac{1}{a} + \dfrac{1}{b} + \dfrac{1}{c} + \dfrac{1}{d}}$
3) Find $\ln(OR) \pm 1.96 \times SE(\ln(OR))$ to obtain (LCL, UCL)
4) 95% CI for OR is then given by $(e^{LCL}, e^{UCL})$

Smoking and Birthweight Example:

Smoking Status    Normal Birth Weight    Low Birth Weight    Row Totals
Nonsmoker         368 (96.59%)           13 (3.41%)          381
Smoker            271 (90.64%)           28 (9.36%)          299
Column Totals     639                    41                  680
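The same four steps give the OR interval for the smoking data. As an assumption for this sketch, low birth weight is again treated as the "disease" and smoking as the risk factor, so a = 28, b = 271, c = 13, d = 368.

```python
from math import exp, log, sqrt

# Assumed mapping: risk factor = smoker, disease = low birth weight
a, b = 28, 271    # smokers: low birth weight, normal birth weight
c, d = 13, 368    # nonsmokers: low birth weight, normal birth weight

odds_ratio = (a * d) / (b * c)

# Step 1: natural log of OR
ln_or = log(odds_ratio)
# Step 2: standard error of ln(OR)
se_ln_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
# Step 3: limits on the log scale
lcl, ucl = ln_or - 1.96 * se_ln_or, ln_or + 1.96 * se_ln_or
# Step 4: back-transform to the OR scale
or_ci = (exp(lcl), exp(ucl))

print(round(odds_ratio, 2), [round(v, 2) for v in or_ci])
```

Working on the log scale and then exponentiating is what makes the interval asymmetric around the estimated OR.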
t-Distribution Table