MATH 2441 Probability and Statistics for Biological Sciences

Interval Estimates of the Difference of Two Population Means
Independent Samples: Large Sample Case

This and the next three sections of the course address one of the most important applications of statistics: the comparison of characteristics of two populations. In this section we consider constructing confidence interval estimates for the difference, $\mu_1 - \mu_2$, of the means of two populations. The idea is that a random sample of size $n_1$ is selected from population #1, to give a sample mean $\bar{x}_1$ and a sample standard deviation $s_1$. Similarly, a second sample of size $n_2$, with mean $\bar{x}_2$ and standard deviation $s_2$, is selected from population #2.

For the formulas presented below to be valid, two distinct types of conditions must be satisfied:

(i) The two samples must be independent. That is, the particular elements included in one of the samples are in no way determined by which particular elements were previously selected for the other sample. (You will see later in the course that there are important applications involving dependent samples. In fact, dependent samples are often used to get around some of the shortcomings of this independent sample approach! See the document on the topic of "Paired Differences".)

(ii) In principle, each of the two sample sizes, $n_1$ and $n_2$, must be 30 or larger. If this isn't the case, then you need to use formulas appropriate for the small sample case, which are described in the next document in this series (and those formulas are subject to additional conditions).
Under these conditions (based on results already described in our coverage of "sampling distributions"), we can use the fact that

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \qquad \text{(DML-1)}$$

is an approximately standard normally distributed random variable to derive a formula for a $100(1 - \alpha)\%$ confidence interval estimator for the difference, $\mu_1 - \mu_2$:

$$\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \quad \text{@ } 100(1 - \alpha)\% \qquad \text{(DML-2)}$$

If you have values for $\sigma_1$ and $\sigma_2$, you can use this formula directly. If not (which is more likely the case), then use $s_1$ and $s_2$, respectively, as estimates, getting the working formula

$$\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \quad \text{@ } 100(1 - \alpha)\% \qquad \text{(DML-3)}$$

Example 1: (Cholesterol)

Consider the standard data set "Cholesterol", resulting from a study in which a technologist was interested in determining whether the amount of cholesterol in eggs could be influenced by the diet of the chicken. The cholesterol content of 35 eggs selected at random from chickens fed Diet #1 was determined (giving a mean value of 197.68 mg and a standard deviation of 46.750 mg), and similarly for 45 eggs selected at random from those laid by chickens fed on Diet #2 (giving a mean value of 201.43 mg and a standard deviation of 22.640 mg). Compute a 95% confidence interval estimate of the difference between mean cholesterol content of eggs from these two sources.

© David W. Sabo (1999) Estimating the Difference of Population Means: Large Samples

Solution:

One of the clues that these two samples of eggs are independent is that the two samples contain different numbers of eggs. Since both sample sizes are larger than 30, and the samples are independent, we are justified in using formula (DML-3) to solve this problem.
In the notation of the formulas above, we have

$n_1 = 35$, $\bar{x}_1 = 197.68$ mg, $s_1 = 46.750$ mg

and

$n_2 = 45$, $\bar{x}_2 = 201.43$ mg, $s_2 = 22.640$ mg

Thus, letting $\mu_1$ represent the mean cholesterol content of all eggs laid by chickens on Diet #1, and $\mu_2$ the mean cholesterol content of all eggs laid by chickens on Diet #2, we get using (DML-3) that

$$\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm z_{0.025} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \quad \text{@ } 95\%$$

$$= (197.68 - 201.43) \pm 1.96 \sqrt{\frac{46.750^2}{35} + \frac{22.640^2}{45}} \quad \text{@ } 95\%$$

$$= -3.75 \text{ mg} \pm 16.84 \text{ mg} \quad \text{@ } 95\%$$

Written as an interval, this result is

$$-20.59 \text{ mg} \le \mu_1 - \mu_2 \le 13.09 \text{ mg} \quad \text{@ } 95\%$$

Thus we can be 95% confident that the interval of values from -20.59 mg to +13.09 mg contains the true difference between the two mean values, written in the order indicated.

If you think about it for a minute, this is kind of a "non-result" as far as these two varieties of eggs are concerned. Because the interval contains the value zero, it is consistent with either $\mu_1$ or $\mu_2$ being the larger of the two means, so we aren't able to say anything about the relative values of the two population means.

Example 2: (JonApples)

Compute a 95% confidence interval for the difference in the mean weight of Jonagold apples obtained during the first harvest (data set JonApples1) and the mean weight of those obtained two weeks later (data set JonApples2).

Solution:

From the description of this data in the "Example Data Sets" document and subsequent calculations reported in Tutorial #2, we have the following information about these two samples:

JonApples1: $n_1 = 60$, $\bar{x}_1 = 219.73$ g, $s_1 = 43.879$ g
JonApples2: $n_2 = 55$, $\bar{x}_2 = 257.27$ g, $s_2 = 52.351$ g

Both sample sizes are larger than 30, and again there is no reason to believe that the samples might be dependent in some way. So it would appear that the conditions for (DML-3) to be valid are met.
We thus get

$$\mu_1 - \mu_2 = (219.73 - 257.27) \pm 1.96 \sqrt{\frac{43.879^2}{60} + \frac{52.351^2}{55}} \quad \text{@ } 95\%$$

$$= -37.54 \text{ g} \pm 17.74 \text{ g} \quad \text{@ } 95\%$$

In interval form, this becomes

$$-55.28 \text{ g} \le \mu_1 - \mu_2 \le -19.80 \text{ g} \quad \text{@ } 95\%$$

If you don't like the negative values, you can reverse the order of the difference, and write this as

$$19.80 \text{ g} \le \mu_2 - \mu_1 \le 55.28 \text{ g} \quad \text{@ } 95\%$$

Notice that this interval does not contain the value zero. Thus, at a 95% level of confidence, we are able to state that the mean weight of these apples at the second harvest date is at least 19.80 g greater than the mean weight at first harvest. Or, we could say that we are 95% confident that the mean weight of apples from the second harvest exceeds the mean weight of apples from the first harvest by an amount between 19.80 g and 55.28 g.

Sample Size Considerations

Notice the form of the '±' part of formula (DML-3), which we will call $E$:

$$E = z_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \qquad \text{(DML-4)}$$

This depends on a probability factor, $z_{\alpha/2}$, as expected. The square root term increases with the variances of the two samples and decreases as the two sample sizes increase. If both variances are small numbers, and both sample sizes are large numbers, then both terms in this square root will be small numbers, their sum will be a small number, and the resultant square root itself will be a small number, hence leading to a small value of $E$, the uncertainty in the confidence interval estimate. Notice, however, that both of the terms in the square root must be small to get a small square root. If one of the terms is large, it doesn't matter how small the other term is: the square root will still have a relatively large value.

This makes a bit of sense if you think about it. The goal here is to estimate the difference between two quantities, $\mu_1$ and $\mu_2$. Suppose the data allows us to estimate, say, $\mu_1$ fairly accurately (that is, $s_1^2/n_1$ is a fairly small value).
But suppose that data from the second population gives a much larger value of $s_2^2/n_2$, meaning that the precision with which we could estimate $\mu_2$ is very much poorer. Formula (DML-4) says that in this case, the precision with which we can estimate the difference, $\mu_1 - \mu_2$, is more or less as poor as the precision with which we can estimate $\mu_2$ (or at least, we can't estimate the difference $\mu_1 - \mu_2$ any more precisely than we can estimate either $\mu_1$ or $\mu_2$ individually). This is what you would expect, of course. What it tells you is that to improve the precision of an estimate of $\mu_1 - \mu_2$, you should really put your resources into increasing the sample size for the population with the greater variance.

Because of the presence of two independent sample sizes in (DML-4), the overall method of determining appropriate sample sizes based on selected confidence levels and desired precision of the estimate is not as straightforward here as in the single population examples we've considered earlier. Simply specifying a desired value, say $E$, for the '±' part given by (DML-4) leaves us with a single equation and two unknowns. The way most textbooks solve this problem is to plan to use samples of equal size: $n_1 = n_2 = n$, say. Then there is just one unknown in (DML-4), and solving for $n$ we get the guideline

$$n = \left( \frac{z_{\alpha/2}}{E} \right)^2 \left( s_1^2 + s_2^2 \right) \qquad \text{(DML-5)}$$

This is probably a satisfactory approach if the two variances are approximately equal. However, it is probably not a very efficient approach when one of the variances is much larger than the other. In such a case, it would probably be prudent to use a larger sample size for the population with the larger variance. Just how one might go about deciding on sample sizes in such a case is a topic that is probably best left for more advanced discussions than we have time for in this course. There's a suspicion that comparison of means of populations that have greatly different variances is probably a relatively rare situation in practical applications; it's not clear what the value of such an estimate would be in such a situation.

Example:

Use formula (DML-5) to suggest a common sample size in the Cholesterol experiment which would result in a 95% confidence interval estimate precise to within 2 mg per egg.

Solution:

Presumably the thinking here is that if the '±' part of the 95% confidence interval estimate could be reduced to 2 mg per egg, then we might get a result reflecting a meaningful difference between the eggs from chickens on the two diets. Simply substituting values into formula (DML-5), we get

$$n = \left( \frac{1.96}{2} \right)^2 \left( 46.750^2 + 22.640^2 \right) \approx 2591.29$$

Thus, to achieve this sort of precision, the technologist would have to select random samples of almost 2600 eggs each! This means increasing the amount of work by a factor of roughly 60 to 75 over the original study, which is probably not a practical alternative. (There's a philosophical problem associated with going wild with sample sizes in order to detect small differences between two means. We'll discuss this issue in more detail when we deal with hypothesis testing. Essentially, one might ask: if the difference between the two mean cholesterol levels is so small that you need to analyze thousands of eggs to detect it, does that difference have any practical significance?)
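
The arithmetic in the worked examples of this document can be checked with a few lines of code. What follows is a minimal Python sketch, assuming the large-sample working formula (DML-3) and the equal-sample-size guideline (DML-5). The function names (`z_critical`, `diff_means_ci`, `common_sample_size`) are illustrative choices and not part of the course materials. One assumption to note: the code uses the unrounded critical value from the standard normal distribution rather than the rounded 1.96 used in the hand calculations.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_critical(conf_level):
    """Return z_{alpha/2} for a two-sided interval at the given confidence level."""
    alpha = 1.0 - conf_level
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def diff_means_ci(n1, xbar1, s1, n2, xbar2, s2, conf_level=0.95):
    """Large-sample interval for mu1 - mu2 in the form of (DML-3).

    Returns (point_estimate, margin); the interval is
    point_estimate - margin <= mu1 - mu2 <= point_estimate + margin.
    Assumes independent samples with n1 and n2 both at least 30.
    """
    point_estimate = xbar1 - xbar2
    margin = z_critical(conf_level) * sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return point_estimate, margin


def common_sample_size(s1, s2, precision, conf_level=0.95):
    """Common sample size n1 = n2 = n suggested by (DML-5), rounded up."""
    z = z_critical(conf_level)
    return ceil((z / precision) ** 2 * (s1 ** 2 + s2 ** 2))


if __name__ == "__main__":
    # Example 1 (Cholesterol): Diet #1 versus Diet #2, values in mg
    d, m = diff_means_ci(35, 197.68, 46.750, 45, 201.43, 22.640)
    print(f"Cholesterol: {d:.2f} mg +/- {m:.2f} mg @ 95%")

    # Example 2 (JonApples): first harvest versus second harvest, values in g
    d, m = diff_means_ci(60, 219.73, 43.879, 55, 257.27, 52.351)
    print(f"JonApples:   {d:.2f} g +/- {m:.2f} g @ 95%")

    # Sample size example: precision of 2 mg per egg at 95% confidence
    print("Common sample size:", common_sample_size(46.750, 22.640, 2.0))
```

Rounded to two decimal places, the margins agree with the hand calculations (16.84 mg and 17.74 g), and the suggested common sample size rounds up to 2592 eggs per sample, consistent with the "almost 2600" quoted in the text.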