MATH 2441 Probability and Statistics for Biological Sciences

Interval Estimates of the Difference of Two Population Means:
Independent Samples -- Small Sample Case

This document continues the discussion of methods for estimating the difference between two population mean values. The preceding document presented a method that is valid when

(i.) the two samples are independent, and
(ii.) each sample has a size of 30 or larger.

If the second condition doesn't hold, that is, the two samples available are not both of size 30 or larger, the estimation of the difference of two population means becomes somewhat more dicey. Unfortunately, this is quite a common situation in technical work.

When condition (ii) above does not hold, the consensus seems to be that the most favorable situation is one in which two additional conditions hold:

(iia) the populations are approximately normally distributed, and
(iib) the population variances are equal: σ₁² = σ₂² = σ².

When these two conditions hold, along with condition (i) above, we have the small sample estimation of the difference of two population means: the equal variances case. When these three conditions hold, the random variable

    t = [(x̄₁ - x̄₂) - (μ₁ - μ₂)] / [σ √(1/n₁ + 1/n₂)]          (DMS-1)

is approximately t-distributed with ν = n₁ + n₂ - 2 degrees of freedom. In the usual fashion, this leads to a confidence interval formula for μ₁ - μ₂:

    μ₁ - μ₂ = (x̄₁ - x̄₂) ± t_{α/2,ν} σ √(1/n₁ + 1/n₂)   @ 100(1 - α)%          (DMS-2)

If you happen to know this common value of σ, then you can use formula (DMS-2) directly. In the much more common situation that σ is not known, we need to use the observed sample variances to estimate it. Now, s₁² is an unbiased estimator of σ₁² = σ², and s₂² is an unbiased estimator of σ₂² = σ². Thus, both s₁² and s₂² are estimating the same parameter σ². An even better estimate of this common σ² is an average of s₁² and s₂² weighted by the respective degrees of freedom, n₁ - 1 and n₂ - 1.
The formula for this so-called pooled sample variance is

    sₚ² = [(n₁ - 1) s₁² + (n₂ - 1) s₂²] / (n₁ + n₂ - 2)          (DMS-3)

sₚ is then substituted for σ in formula (DMS-2) to get the formula

    μ₁ - μ₂ = (x̄₁ - x̄₂) ± t_{α/2,ν} sₚ √(1/n₁ + 1/n₂)   @ 100(1 - α)%          (DMS-4)

To repeat: in both (DMS-2) and (DMS-4), the number of degrees of freedom for the t-statistic is ν = n₁ + n₂ - 2.

© David W. Sabo (1999), Estimation of Difference of Two Means: Small Samples

Example 1: (PAH)

One of the standard data sets we're using involves comparison of levels of polycyclic aromatic hydrocarbons (PAH) in river sediments at different times of the year. A sample of 8 specimens of sediment collected in April gave a sample mean of 194.63 μg/g with a sample standard deviation of 64.66 μg/g. A second sample of 12 specimens was collected in July, giving a sample mean of 134.69 μg/g and a sample standard deviation of 66.98 μg/g. Estimate the difference of the mean PAH levels in these sediments for April and July.

Solution: In summary, and with standard notation, we are being asked to estimate μ_April - μ_July given the following information:

    n_April = 8      x̄_April = 194.63 μg/g    s_April = 64.66 μg/g
    n_July  = 12     x̄_July  = 134.69 μg/g    s_July  = 66.98 μg/g

Clearly, with sample sizes of 8 and 12 respectively, we are dealing with a small sample situation. We will assume that the two samples are independent, because it is unlikely that exactly the same locations of the riverbed were sampled on the two occasions. This leaves two additional conditions to be met before we should be comfortable with using (DMS-4): are the samples consistent with approximately normally distributed populations, and are the population standard deviations equal? It is relatively common practice to consider the issue of normality essentially unanswerable for such small samples, and so simply assume that condition (iia) is met.
However, we can take a brief look at the normal probability plots for both sets of data:

[Figure: normal probability plots of the April and July PAH levels; vertical axes show PAH level (0 to 300 μg/g), horizontal axes show normal scores from -2 to 2.]

One might think there is cause for concern in the April data, with an apparent curvature in the pattern of points. However, that curvature is caused by just the two most extreme observations (granted, out of a total of only eight), and so we are probably justified in not giving the apparent non-normality much weight. The same thing happens in the July data, with the two most extreme values giving the sense of an upward curvature to the probability plot. Other than those two points, the July plot seems to be quite a good straight line. So it appears, at least, that there is no strong reason to doubt the normality of the population distributions.

We can dispense with condition (iib) quite quickly here. Since s_April = 64.66 and s_July = 66.98 are so close in value, it would be perverse to assert a strong suspicion that the two populations have vastly different standard deviations or variances.

The number of degrees of freedom is ν = 8 + 12 - 2 = 18. Thus, to write down a 95% confidence interval estimate, we need t_{0.025,18} = 2.101. The pooled variance is

    sₚ² = [(8 - 1)(64.66)² + (12 - 1)(66.98)²] / (8 + 12 - 2) = 4367.55

or sₚ = 66.09 μg/g.

(Notice that since sₚ² is a weighted average of the two original sample variances, its value must be between the values of those two variances, and so sₚ itself must have a value between the two sample standard deviations. If you find this is not so, you've made an arithmetic mistake!)

So, finally,

    μ_April - μ_July = 194.63 - 134.69 ± (2.101)(66.09) √(1/8 + 1/12)   @ 95%
                     = 59.94 ± 63.38 μg/g   @ 95%

or, in interval form,

    -3.44 μg/g ≤ μ_April - μ_July ≤ 123.32 μg/g   @ 95%

Unfortunately, this confidence interval estimate just catches the value zero, reducing its meaningfulness rather drastically.
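The arithmetic of Example 1 is easy to script. Below is a minimal Python sketch of formulas (DMS-3) and (DMS-4); the function and variable names are our own for illustration, and the critical value t_{0.025,18} = 2.101 is hardcoded from a t-table so that only the standard library is needed.

```python
from math import sqrt

def pooled_ci(n1, xbar1, s1, n2, xbar2, s2, t_crit):
    """Equal-variance interval for mu1 - mu2, via formulas (DMS-3) and (DMS-4)."""
    # (DMS-3): pooled variance, an (n-1)-weighted average of s1^2 and s2^2
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    sp = sqrt(sp2)
    # (DMS-4): point estimate and margin of error
    diff = xbar1 - xbar2
    margin = t_crit * sp * sqrt(1 / n1 + 1 / n2)
    return diff, margin, sp

# Example 1 (PAH), with t_{0.025, 18} = 2.101 taken from a t-table
diff, margin, sp = pooled_ci(8, 194.63, 64.66, 12, 134.69, 66.98, 2.101)
print(f"sp = {sp:.2f}, interval: {diff:.2f} +/- {margin:.2f} ug/g")
# sp = 66.09, interval: 59.94 +/- 63.38 ug/g
```

Note that sₚ = 66.09 lands between the two sample standard deviations, as the parenthetical remark above says it must.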
At a 95% confidence level, we cannot rule out the possibility that the two mean values are identical.

The two new conditions, (iia) and (iib), in this small sample case can be a bit problematic. We've illustrated how to deal with them in the rather detailed example above. The following general comments have some support in statistical theory.

First, the method above does not seem to be unduly sensitive to small departures from normality in the population distributions (this seems to be a general characteristic of methods based on the t-distribution). Rough checks of normality should be adequate, either by constructing normal probability plots as was done above, or by just looking at frequency histograms or stemplots for the data. Results seem to be ok as long as the population distribution has a single major peak and is not too asymmetric.

The requirement of equal variances in the two populations is much more problematic, both from the point of view of confirming that it has been met, and also in devising an interval estimator which is valid when it appears that the populations do not come close to sharing a common variance. First, how can you tell if it is reasonable to assume that σ₁² = σ₂² = σ²? Almost always, this question will have to be answered by looking at the values of s₁² and s₂². Several suggestions have been made in standard textbooks:

(i.) Most authors simply say something along the lines of "as long as s₁² and s₂² are not too different, you're probably all right in assuming σ₁² = σ₂²." Since they don't specify what they mean by "too different", this advice is rather useless.

(ii.) Some authors (for example, Jarrell, p. 468) actually give a rule of thumb, along the lines of "the larger of the two variances is not more than double the smaller one" as meaning "not too different."
This is a good kind of rule of thumb, because it's easy to use, and many teachers and practitioners in statistics seem to have a vague impression that it is a reasonable rule of thumb. It would be comforting to know that at some time in the past someone has done some research to demonstrate the practicality of this rule.

(iii.) Some authors suggest simply looking at a frequency histogram or stemplot of the two sets of data to see if they have approximately the same degree of spread. If so, it is reasonable to assume σ₁² = σ₂².

(iv.) You may find occasionally that an author suggests performing a hypothesis test procedure to determine if the evidence allows you to reject the claim that σ₁² = σ₂². (This is the so-called F-test for two population variances; you can find the details in most statistics textbooks if we don't get to it in this course.) However, almost everyone agrees that this is a rather dicey approach, since the F-test is known to be rather sensitive to departures from normality in the populations. In fact, to quote Neil Weiss (Introductory Statistics, 4th edition, p. 588): "As the noted statistician George E. P. Box remarked: 'To make a preliminary test on variances is rather like putting to sea in a rowing boat to find out whether conditions are sufficiently calm for an ocean liner to leave port!'" [This is statistics humor at its best, folks; enjoy it while it lasts!] The point here is that the conclusion you might get from the F-test is even more questionable than the validity of the result from (DMS-4) when the assumption that σ₁² = σ₂² is not valid.

(v.) It appears that the error introduced by erroneously assuming σ₁² = σ₂² is least when the two sample sizes are approximately equal. Thus, by using samples of approximately equal size, you make the validity of the assumption that σ₁² = σ₂² a less pressing issue.
(vi.) Finally, a number of authors advise that if there is any doubt about the validity of assuming σ₁² = σ₂², one should resort to a procedure which does not rely on this assumption. As you'll see in the next section, there is no general consensus about which method is best under those circumstances, but there is a sense that the most commonly used approaches (we'll describe three similar ones) give about the same quality of results, and results which are not too different from (DMS-4) when σ₁² = σ₂² is approximately correct.

Sample Variances Unequal

For the construction of confidence interval estimates of μ₁ - μ₂ when one or both sample sizes are less than 30 and there is good reason to doubt that σ₁² = σ₂² = σ², there are really three very similar modifications of (DMS-4) in common use. All involve modification of the probability factor, t_{α/2,ν}. (For testing hypotheses about μ₁ - μ₂, there is also an additional non-parametric approach that seems to be recommended highly in situations such as this: the Mann-Whitney test.) The basic formula in this case is

    μ₁ - μ₂ = (x̄₁ - x̄₂) ± t_{α/2} √(s₁²/n₁ + s₂²/n₂)   @ 100(1 - α)%          (DMS-5)

where we've deliberately left the subscript denoting degrees of freedom off of the t-factor. Then, Weiss and others suggest calculating the effective degrees of freedom for this t-factor using the formula

    ν = (s₁²/n₁ + s₂²/n₂)² / [ (s₁²/n₁)²/(n₁ - 1) + (s₂²/n₂)²/(n₂ - 1) ]          (DMS-6)

This looks a bit frightening, but notice that most of the subexpressions are repetitions of s²/n for each sample. If (DMS-6) doesn't give a whole number result, then round down. Formula (DMS-6) is apparently an approximation developed by Satterthwaite, to replace more complex approaches that required specialized tables.

A second suggestion, which seems to be favored by authors of statistics textbooks oriented towards business applications, is to simply use as the effective degrees of freedom in (DMS-5) the smaller of n₁ - 1 and n₂ - 1.
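Formula (DMS-6) is less frightening as code. Here is a short Python sketch (the function name is our own); as an illustrative aside, we apply it to the Example 1 (PAH) summary statistics, where it gives ν = 15 rather than the ν = 18 used by the pooled-variance method.

```python
def satterthwaite_df(s1_sq, n1, s2_sq, n2):
    """Effective degrees of freedom for the unequal-variance t, formula (DMS-6)."""
    v1 = s1_sq / n1   # s1^2 / n1, repeated throughout (DMS-6)
    v2 = s2_sq / n2   # s2^2 / n2
    nu = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return int(nu)    # round down, as the text instructs

# Example 1 (PAH) data: compare with nu = 18 for the pooled method
print(satterthwaite_df(64.66**2, 8, 66.98**2, 12))  # 15
```

Since (DMS-6) always gives ν at most n₁ + n₂ - 2, this approach never claims more degrees of freedom than the pooled method.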
This suggestion has some plausibility; it really amounts to restating what we've mentioned above: you can't expect greater precision in the estimate of a difference between two means than you would be able to get in the estimation of either of the two means separately.

A final suggestion, attributed by Wayne Daniel (in his textbook, Biostatistics, 6th edition, p. 168) to Cochran, is to compute the t-factor as a weighted average of two values from the t-table. That is, use

    t' = (w₁ t_{α/2,n₁-1} + w₂ t_{α/2,n₂-1}) / (w₁ + w₂)          (DMS-7a)

where the weights, w₁ and w₂, are given by

    w₁ = s₁²/n₁   and   w₂ = s₂²/n₂          (DMS-7b)

There is little more to be said about these formulas. We'll illustrate them with a quick example and then leave this topic.

Example 2: (Peas)

In the standard data sets is a description of an experiment performed to compare amounts of vitamin C in peas. For the seven specimens of frozen peas that a technologist analyzed (dataset CpeasFrozen), the amounts of vitamin C were

    25.9   23.4   21.2   12.3   18.4   18.0   19.8

and, for the twelve specimens of canned peas that she analyzed (dataset CpeasCanned), the amounts of vitamin C were

    9.7   7.0   8.2   9.5   6.6   5.0   6.5   8.2   6.5   7.3   6.8   10.6

These numbers are in units of mg of vitamin C per 100 g of peas. Compute 95% confidence interval estimates of the difference in mean vitamin C content of frozen and canned peas, based on this data, using each of the three variations on the basic estimation method described above.

Solution: We have two independent samples of peas here. In the notation of the subject, the relevant sample characteristics are:

    n_frozen = 7      x̄_frozen = 19.86    s_frozen = 4.350    s²_frozen = 18.926
    n_canned = 12     x̄_canned = 7.66     s_canned = 1.623    s²_canned = 2.634

Here, the larger variance is over seven times as large as the smaller variance -- clear evidence, even for such small samples, that the population variances are unequal.
From formula (DMS-6), we get

    ν = (18.926/7 + 2.634/12)² / [ (18.926/7)²/6 + (2.634/12)²/11 ] = 6.988

which rounds down to ν = 6. Thus, in formula (DMS-5), we need t_{0.025,6} = 2.447. The resultant confidence interval estimate is:

    μ_frozen - μ_canned = 19.86 - 7.66 ± 2.447 √(18.926/7 + 2.634/12)   @ 95%
                        = 12.20 ± 4.18 mg/100 g   @ 95%

Using the second suggestion above, we note that n_frozen - 1 = 7 - 1 = 6 and n_canned - 1 = 12 - 1 = 11, so we would again use t_{0.025,6} in formula (DMS-5). This gives exactly the same result.

Finally, for the third approach, we need the two weight factors:

    w₁ = s₁²/n₁ = 18.926/7 = 2.704   and   w₂ = s₂²/n₂ = 2.634/12 = 0.219

Thus, since t_{0.025,6} = 2.447 and t_{0.025,11} = 2.201, we get from formula (DMS-7a)

    t' = [(2.704)(2.447) + (0.219)(2.201)] / (2.704 + 0.219) = 2.429

Using this in formula (DMS-5) then gives the interval estimate

    μ_frozen - μ_canned = 19.86 - 7.66 ± 2.429 √(18.926/7 + 2.634/12)   @ 95%
                        = 12.20 ± 4.15 mg/100 g   @ 95%

This is just slightly narrower than the interval estimate given by the other two approaches. For all practical purposes, the three approaches have yielded the same results in this example.

(Note that if we had ignored the obvious signs of unequal population variances and employed the procedure described in the first part of this document, we would have obtained sₚ ≈ 2.895. Then, using t_{0.025,17} = 2.110, the confidence interval estimate would have turned out to be

    μ_frozen - μ_canned = 12.20 ± 2.91 mg/100 g   @ 95%

a considerably narrower estimate. The considerable apparent difference in population variances results in a much less precise estimate of the difference between the two population means.)
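All of the Example 2 arithmetic above can be checked with a few lines of Python. This is a sketch using our own variable names, with the critical values t_{0.025,6} = 2.447, t_{0.025,11} = 2.201, and t_{0.025,17} = 2.110 hardcoded from a t-table.

```python
from math import sqrt

# Example 2 (peas) summary statistics from the text
n1, xbar1, s1_sq = 7, 19.86, 18.926    # frozen
n2, xbar2, s2_sq = 12, 7.66, 2.634     # canned

diff = xbar1 - xbar2                   # point estimate, 12.20
se = sqrt(s1_sq / n1 + s2_sq / n2)     # standard error term in (DMS-5)

# Approaches 1 (Satterthwaite) and 2 (smaller of n1-1, n2-1): both give nu = 6
t_6 = 2.447                            # t_{0.025, 6}
m_welch = t_6 * se
print(f"{diff:.2f} +/- {m_welch:.2f}")  # 12.20 +/- 4.18

# Approach 3 (Cochran): weighted-average t-factor, (DMS-7a) and (DMS-7b)
w1, w2 = s1_sq / n1, s2_sq / n2
t_c = (w1 * t_6 + w2 * 2.201) / (w1 + w2)   # t_{0.025, 11} = 2.201
m_cochran = t_c * se
print(f"{diff:.2f} +/- {m_cochran:.2f}")    # 12.20 +/- 4.15

# Pooled-variance interval from the closing note (nu = 17, t = 2.110)
sp = sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2))
m_pooled = 2.110 * sp * sqrt(1 / n1 + 1 / n2)
print(f"{diff:.2f} +/- {m_pooled:.2f}")     # 12.20 +/- 2.91
```

The three valid intervals agree to within a few hundredths, while the (invalid) pooled interval is noticeably narrower, just as the discussion above concludes.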