MATH 2441
Probability and Statistics for Biological Sciences
Interval Estimates of the Difference of Two Population Means
Independent Samples -- Large Sample Case
This and the next three sections of the course address one of the most important applications of statistics:
the comparison of characteristics of two populations.
In this section we consider constructing confidence interval estimates for the difference, $\mu_1 - \mu_2$, of the
means of two populations. The idea is that a random sample of size $n_1$ is selected from population 1, to
give a sample mean $\bar{x}_1$ and a sample standard deviation $s_1$. Similarly, a second sample of size $n_2$, with
mean $\bar{x}_2$ and standard deviation $s_2$, is selected from population 2. For the formulas presented below to be
valid, two distinct types of conditions must be satisfied:
(i) The two samples must be independent. That is, the particular elements included in one of the
samples are in no way determined by which particular elements were previously selected for the
other sample. (You will see later in the course that there are important applications involving
dependent samples -- in fact dependent samples are often used to get around some of the
shortcomings of this independent sample approach! See the document on the topic of "Paired
Differences".)
(ii) In principle, each of the two sample sizes, $n_1$ and $n_2$, must be 30 or larger. If this isn't the case,
then you need to use formulas appropriate for the small sample case, which are described in the
next document in this series (and those formulas are subject to additional conditions).
Under these conditions (based on results already described in our coverage of "sampling distributions"), we
can use the fact that

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \qquad \text{(DML-1)}$$

is an approximately standard normally distributed random variable to derive a formula for a $100(1 - \alpha)\%$
confidence interval estimator for the difference, $\mu_1 - \mu_2$:

$$\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \quad @\ 100(1 - \alpha)\% \qquad \text{(DML-2)}$$
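The step from (DML-1) to (DML-2) can be sketched as follows. This assumes the usual convention that $z_{\alpha/2}$ is the standard normal value with an upper-tail area of $\alpha/2$. The abbreviation $\mathrm{SE}$ is introduced here only for brevity and is not part of the original notation.

```latex
\begin{align*}
\mathrm{SE} &= \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \\
1 - \alpha &\approx P\left(-z_{\alpha/2} \le
    \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\mathrm{SE}}
    \le z_{\alpha/2}\right) \\
&= P\left((\bar{x}_1 - \bar{x}_2) - z_{\alpha/2}\,\mathrm{SE}
    \le \mu_1 - \mu_2 \le
    (\bar{x}_1 - \bar{x}_2) + z_{\alpha/2}\,\mathrm{SE}\right)
\end{align*}
```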
If you have values for $\sigma_1$ and $\sigma_2$, you can use this formula directly. If not (which is more likely the case), then
use $s_1$ and $s_2$, respectively, as estimates, getting the working formula

$$\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \quad @\ 100(1 - \alpha)\% \qquad \text{(DML-3)}$$
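As a rough illustration, formula (DML-3) could be coded along the following lines. This is only a sketch: the function name `diff_means_ci` and the numbers in the usage line are made up for illustration, and $z_{\alpha/2}$ is taken from Python's standard-library normal distribution rather than from a printed table.

```python
from math import sqrt
from statistics import NormalDist


def diff_means_ci(n1, xbar1, s1, n2, xbar2, s2, confidence=0.95):
    """Large-sample confidence interval for mu1 - mu2, following (DML-3)."""
    if n1 < 30 or n2 < 30:
        raise ValueError("(DML-3) assumes both sample sizes are 30 or larger")
    alpha = 1.0 - confidence
    # z_{alpha/2}: standard normal value with upper-tail area alpha/2
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    point = xbar1 - xbar2                       # estimate of mu1 - mu2
    margin = z * sqrt(s1**2 / n1 + s2**2 / n2)  # the '±' part
    return point - margin, point + margin


# Illustrative (made-up) summary statistics, not taken from any data set here
lower, upper = diff_means_ci(40, 10.0, 2.0, 50, 9.0, 3.0)
print(f"{lower:.3f} <= mu1 - mu2 <= {upper:.3f}")
```

The sketch uses the unrounded value of $z_{\alpha/2}$ (about 1.95996 for 95%), so results may differ in the last digit from hand calculations that use 1.96.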
Example 1: (Cholesterol)
Consider the standard data set "Cholesterol", resulting from a study in which a technologist was interested in
determining whether the amount of cholesterol in eggs could be influenced by the diet of the chicken. The
cholesterol content of 35 eggs selected at random from chickens fed Diet #1 was determined (giving a mean
value of 197.68 mg and a standard deviation of 46.750 mg), and similarly for 45 eggs selected at random
from those laid by chickens fed on Diet #2 (giving a mean value of 201.43 mg and a standard deviation of
22.640 mg). Compute a 95% confidence interval estimate of the difference between mean cholesterol
content of eggs from these two sources.
© David W. Sabo (1999)
Estimating the Difference of Population Means: Large Samples
Solution:
One of the clues that these two samples of eggs are independent is that the two samples contain different
numbers of eggs. Since both sample sizes are larger than 30, and the samples are independent, we are
justified in using formula (DML-3) to solve this problem.
In the notation of the formalism, we have

$n_1 = 35$, $\bar{x}_1 = 197.68$ mg, $s_1 = 46.750$ mg

and

$n_2 = 45$, $\bar{x}_2 = 201.43$ mg, $s_2 = 22.640$ mg

Thus, letting $\mu_1$ represent the mean cholesterol content of all eggs laid by chickens on Diet #1, and $\mu_2$ be the
mean cholesterol content of all eggs laid by chickens on Diet #2, we get using (DML-3) that

$$\begin{aligned}
\mu_1 - \mu_2 &= (\bar{x}_1 - \bar{x}_2) \pm z_{0.025} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \quad @\ 95\% \\
&= (197.68 - 201.43) \pm 1.96 \sqrt{\frac{46.750^2}{35} + \frac{22.640^2}{45}} \quad @\ 95\% \\
&= -3.75 \text{ mg} \pm 16.84 \text{ mg} \quad @\ 95\%
\end{aligned}$$

Written as an interval, this result is

$$-20.59 \text{ mg} \le \mu_1 - \mu_2 \le 13.09 \text{ mg} \quad @\ 95\%.$$
Thus there is a 95% probability that the interval of values from -20.59 mg to +13.09 mg contains the true
difference between the two mean values, written in the order indicated.
If you think about it for a minute, this is kind of a "non-result" as far as these two varieties of eggs are
concerned. Because the interval estimate of the difference contains the value zero, it is consistent with
either $\mu_1$ or $\mu_2$ being the larger of the two means, so we aren't able to say anything about the relative
values of the two population means.
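The arithmetic in Example 1 can be checked with a few lines of Python. This sketch uses only the summary statistics quoted in the example and the rounded value $z_{0.025} = 1.96$; the variable names are arbitrary.

```python
from math import sqrt

# Summary statistics quoted in Example 1 (Cholesterol)
n1, xbar1, s1 = 35, 197.68, 46.750   # Diet #1
n2, xbar2, s2 = 45, 201.43, 22.640   # Diet #2
z = 1.96                             # z_{0.025}, as used in the text

term1 = s1**2 / n1                   # contribution of sample 1
term2 = s2**2 / n2                   # contribution of sample 2
margin = z * sqrt(term1 + term2)     # the '±' part of (DML-3)
point = xbar1 - xbar2                # estimate of mu1 - mu2

print(f"s1^2/n1 = {term1:.2f}, s2^2/n2 = {term2:.2f}")
print(f"{point:.2f} mg ± {margin:.2f} mg")
print(f"{point - margin:.2f} mg <= mu1 - mu2 <= {point + margin:.2f} mg")
```

Printing the two terms under the square root separately shows that the Diet #1 sample, with its larger standard deviation and smaller size, contributes most of the uncertainty.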

Example 2: (JonApples)
Compute a 95% confidence interval for the difference in the mean weight of Jonagold apples obtained
during the first harvest (data set JonApples1) and the mean weight of those obtained two weeks later (data
set JonApples2).
Solution:
From the description of this data in the "Example Data Sets" document and subsequent calculations
reported in Tutorial #2, we have the following information about these two samples:
JonApples1: $n_1 = 60$, $\bar{x}_1 = 219.73$ g, $s_1 = 43.879$ g

JonApples2: $n_2 = 55$, $\bar{x}_2 = 257.27$ g, $s_2 = 52.351$ g
Both sample sizes are larger than 30, and again there is no reason to believe that the samples might be
dependent in some way. So it would appear that the conditions for (DML-3) to be valid are met. We thus
get

$$\begin{aligned}
\mu_1 - \mu_2 &= (219.73 - 257.27) \pm 1.96 \sqrt{\frac{43.879^2}{60} + \frac{52.351^2}{55}} \quad @\ 95\% \\
&= -37.54 \text{ g} \pm 17.74 \text{ g} \quad @\ 95\%
\end{aligned}$$
In interval form, this becomes

$$-55.28 \text{ g} \le \mu_1 - \mu_2 \le -19.80 \text{ g} \quad @\ 95\%$$

If you don't like the negative values, you can reverse the order of the difference, and write this as

$$19.80 \text{ g} \le \mu_2 - \mu_1 \le 55.28 \text{ g} \quad @\ 95\%$$
Notice that this interval does not contain the value zero. Thus, at a 95% level of confidence, we are able to
state that the mean weight of these apples at the second harvest date is at least 19.80 g greater than the
mean weight at first harvest. Or, we could say that there is a 95% probability that the mean weight of apples
from the second harvest exceeds the mean weight of apples from the first harvest by an amount between
19.80 g and 55.28 g.
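The effect of reversing the order of the difference can be seen in a short sketch. The helper name `large_sample_ci` is hypothetical; it applies (DML-3) with $z = 1.96$ to two samples given in either order.

```python
from math import sqrt


def large_sample_ci(sample_a, sample_b, z=1.96):
    """Interval for mu_a - mu_b via (DML-3); each sample is (n, mean, sd)."""
    (na, xa, sa), (nb, xb, sb) = sample_a, sample_b
    margin = z * sqrt(sa**2 / na + sb**2 / nb)
    return xa - xb - margin, xa - xb + margin


jon1 = (60, 219.73, 43.879)   # JonApples1: first harvest
jon2 = (55, 257.27, 52.351)   # JonApples2: two weeks later

lo_12, hi_12 = large_sample_ci(jon1, jon2)   # interval for mu1 - mu2
lo_21, hi_21 = large_sample_ci(jon2, jon1)   # interval for mu2 - mu1

print(f"mu1 - mu2: {lo_12:.2f} g to {hi_12:.2f} g")
print(f"mu2 - mu1: {lo_21:.2f} g to {hi_21:.2f} g")
```

Swapping the samples leaves the '±' part unchanged and simply negates and exchanges the two endpoints.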

Sample Size Considerations
Notice the form of the '±' part of formula (DML-3):

$$E = z_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \qquad \text{(DML-4)}$$

This depends on a probability factor, $z_{\alpha/2}$, as expected. The square root term depends in a direct sort of way
on the variances of the two samples, and in an inverse sort of way on the two sample sizes. If both
variances are small numbers, and both sample sizes are large numbers, then both terms in this square root
will be small numbers, their sum will be a small number, and the resultant square root itself will be a small
number, hence leading to a small value of $E$, the uncertainty in the confidence interval estimate. Notice,
however, that both of the terms in the square root must be small to get a small square root -- if one of the
terms is large, it doesn't matter how small the other term is, the square root will still have a relatively large
value.
This makes a bit of sense if you think about it. The goal here is to estimate the difference between two
quantities, $\mu_1$ and $\mu_2$. Suppose the data allows us to estimate, say, $\mu_1$ fairly accurately (that is, $s_1^2/n_1$ is a
fairly small value). But suppose that data from the second population gives a much larger value of $s_2^2/n_2$,
meaning that the precision with which we could estimate $\mu_2$ is very much poorer. Formula (DML-4) says in
this case, the precision with which we can estimate the difference, $\mu_1 - \mu_2$, is more or less as poor as the
precision with which we can estimate $\mu_2$ (or at least, we can't estimate the difference $\mu_1 - \mu_2$ any more
precisely than we can estimate either $\mu_1$ or $\mu_2$ individually). This is what you would expect, of course. What
it tells you is that to improve the precision of an estimate of $\mu_1 - \mu_2$, you should really put your resources into
increasing the sample size for the population with the greater variance.
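A small numerical experiment illustrates this advice. The sketch below reuses the sample standard deviations from the Cholesterol example and compares the '±' part of (DML-3) when only one of the two sample sizes is doubled. The doubled sample sizes are hypothetical, and the sketch assumes the standard deviations would stay the same in larger samples.

```python
from math import sqrt


def margin(n1, s1, n2, s2, z=1.96):
    """The '±' part of (DML-3): z * sqrt(s1^2/n1 + s2^2/n2)."""
    return z * sqrt(s1**2 / n1 + s2**2 / n2)


s1, s2 = 46.750, 22.640              # sample standard deviations, Example 1

baseline = margin(35, s1, 45, s2)    # the original study
double_n1 = margin(70, s1, 45, s2)   # twice as many eggs from Diet #1 only
double_n2 = margin(35, s1, 90, s2)   # twice as many eggs from Diet #2 only

print(f"original sizes: ±{baseline:.2f} mg")
print(f"n1 doubled:     ±{double_n1:.2f} mg")
print(f"n2 doubled:     ±{double_n2:.2f} mg")
```

Doubling the sample with the larger standard deviation (Diet #1) shrinks the '±' part noticeably, while doubling the other sample barely changes it.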
Because of the presence of two independent sample sizes in (DML-4), the overall method of determining
appropriate sample sizes based on selected confidence levels and desired precision of the estimate is not
as straightforward here as in the single population examples we've considered earlier. Simply specifying a
desired value of the uncertainty, $E$, in (DML-4) leaves us with a single equation and two unknowns. The way most textbooks
solve this problem is to plan to use samples of equal size: $n_1 = n_2 = n$, say. Then there is just one unknown
in (DML-4), and we get the guideline

$$n \approx \frac{z_{\alpha/2}^2 \left(s_1^2 + s_2^2\right)}{E^2} \qquad \text{(DML-5)}$$
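For readers who want the intermediate algebra, the step from (DML-4) to (DML-5) can be sketched as follows. Here $E$ stands for the desired value of the '±' part (the letter $E$ is a labeling choice made for this sketch), and $n_1 = n_2 = n$:

```latex
\begin{align*}
E &= z_{\alpha/2} \sqrt{\frac{s_1^2}{n} + \frac{s_2^2}{n}}
   = z_{\alpha/2} \sqrt{\frac{s_1^2 + s_2^2}{n}} \\
E^2 &= \frac{z_{\alpha/2}^2 \left(s_1^2 + s_2^2\right)}{n} \\
n &= \frac{z_{\alpha/2}^2 \left(s_1^2 + s_2^2\right)}{E^2}
\end{align*}
```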
This is probably a satisfactory approach if the two variances are approximately equal. However, it is
probably not a very efficient approach when one of the variances is much larger than the other. In such a
case, it would probably be prudent to use a larger sample size for the population with the larger variance.
Just how one might go about deciding on sample sizes in such a case is a topic that is probably best left for
more advanced discussions than we have time for in this course. There's a suspicion that comparison of
means of populations that have greatly different variances is probably a relatively rare situation in practical
applications -- it's not clear what the value of such an estimate would be in such a situation.
Example: Use formula (DML-5) to suggest a common sample size in the Cholesterol experiment which
would result in a 95% confidence interval estimate precise to within 2 mg per egg.
Solution
Presumably the thinking here is that if the '±' part of the 95% confidence interval estimate could be reduced
to 2 mg per egg, then we might get a result reflecting a meaningful difference between the eggs from
chickens on the two diets. Simply substituting values into formula (DML-5), we get

$$n \approx \frac{1.96^2 \left(46.750^2 + 22.640^2\right)}{2^2} \approx 2591.29$$
Thus, to achieve this sort of precision, the technologist would have to select random samples of almost 2600
eggs each! This means increasing the amount of work by a factor of about 60 - 80 over the original study --
probably not a practical alternative. (There's a philosophical problem associated with going wild with sample
sizes in order to detect small differences between two means -- we'll discuss this issue in more detail when
we deal with hypothesis testing. Essentially, one might ask: if the difference between the two mean
cholesterol levels is so small that you need to analyze thousands of eggs to detect it, does that difference
have any practical significance?)
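The sample size calculation can be reproduced, and checked by substituting back into (DML-4), with the short sketch below. It assumes, as the formula itself does, that the sample standard deviations from the original study are reasonable planning values; the variable names are arbitrary.

```python
from math import ceil, sqrt

s1, s2 = 46.750, 22.640   # sample standard deviations, Cholesterol data
z = 1.96                  # z_{0.025} for 95% confidence
target = 2.0              # desired '±' part, in mg per egg

n_raw = z**2 * (s1**2 + s2**2) / target**2   # formula (DML-5)
n = ceil(n_raw)                              # whole eggs, rounded up

# Substitute n1 = n2 = n back into (DML-4) to confirm the target is met
achieved = z * sqrt(s1**2 / n + s2**2 / n)

print(f"(DML-5) gives n = {n_raw:.2f}, so use n = {n} eggs per diet")
print(f"resulting '±' part: {achieved:.4f} mg")
```

Rounding up rather than to the nearest whole number ensures the achieved '±' part does not exceed the 2 mg target.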
