MATH 2441 Probability and Statistics for Biological Sciences

Confidence Interval Estimates of the Population Variance (Large Sample Case)

The population variance (and its square root, the population standard deviation) is a measure of the variability (or, if you like, the uniformity) of the population. It is not as common to require an estimate of $\sigma$ or $\sigma^2$ in statistical work as it is to require an estimate of the population mean or population proportion. In fact, if anything, the comparison of two population variances is a more common requirement than the estimation of the variance of a single population (and we will deal with that comparison of variances later as part of our study of hypothesis testing). Still, of the population parameters that require a sampling distribution which is non-symmetric, the variance is the most important. So, we will look briefly at the estimation of the population variance so you know how to compute interval estimates of the population variance and population standard deviation, and so you know how to work with interval estimates when the sampling distribution involved is not symmetric.

We know from the brief comment in the document on sampling distributions that when the population is approximately normally distributed, the variable

$$\chi^2 = \frac{(n-1)s^2}{\sigma^2} \tag{OV-1}$$

has the so-called $\chi^2$-distribution with $\nu = n - 1$ degrees of freedom. The $\chi^2$-distribution is not symmetric about its mean value of $\nu$, and so the sort of '$\pm$' approach we used to construct confidence interval estimates of the population mean and proportion can't really be used here. However, retaining the notion of $\alpha$ as the probability that the interval estimate fails to capture the true value of $\sigma^2$, we can still come up with a useful interval estimate formula. Refer to the figure.

[Figure: the $\chi^2$ density curve, with a central area of $1 - \alpha$ between $\chi^2_{1-\alpha/2}$ and $\chi^2_{\alpha/2}$ and an area of $\alpha/2$ in each tail.]
We have a probability $\alpha$ that the true value of $\sigma^2$ is outside the interval estimate, and we'll set up the interval so that there is a probability of $\alpha/2$ that the interval misses the true value of $\sigma^2$ on the left and a probability of $\alpha/2$ that it misses on the right. The starting probability equation becomes

$$\Pr\left(\chi^2_{1-\alpha/2} \le \chi^2 \le \chi^2_{\alpha/2}\right) = \Pr\left(\chi^2_{1-\alpha/2} \le \frac{(n-1)s^2}{\sigma^2} \le \chi^2_{\alpha/2}\right) = 1 - \alpha \tag{OV-2}$$

Here $\chi^2_{\alpha/2}$ and $\chi^2_{1-\alpha/2}$ are the values with areas of $\alpha/2$ and $1 - \alpha/2$, respectively, to their right under the $\chi^2$ density curve.

Thus, at the left edge, we have that

$$\frac{(n-1)s^2}{\sigma^2} \ge \chi^2_{1-\alpha/2} \quad \text{which gives} \quad \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}$$

Similarly, at the right edge, we have

$$\frac{(n-1)s^2}{\sigma^2} \le \chi^2_{\alpha/2} \quad \text{which gives} \quad \sigma^2 \ge \frac{(n-1)s^2}{\chi^2_{\alpha/2}}$$

© David w. Sabo (1999) Interval Estimates of the Population Variance Page 1 of 4

Combining these two inequalities for $\sigma^2$, we get the $100(1-\alpha)\%$ confidence interval estimate for the population variance:

$$\frac{(n-1)s^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}} \quad @\ 100(1-\alpha)\% \tag{OV-3}$$

Note that the numerators of both fractions here are identical. In the denominator on the left side, the value $\chi^2_{\alpha/2}$ is larger than the value $\chi^2_{1-\alpha/2}$ on the right side, and so the quotient on the left of the inequality gives a smaller value than the quotient on the right, which is what we need here, of course. Although we haven't shown the symbol $\nu$ explicitly in this formula, you need to use $\chi^2$ values from the row $\nu = n - 1$ of the $\chi^2$-distribution tables.

Formula (OV-3) gives a confidence interval estimate of $\sigma^2$. To get a corresponding interval estimate of $\sigma$, you simply take square roots.

Example 1: The cholesterol concentrations in the yolks of a sample of 18 randomly selected eggs laid by genetically engineered chickens were found to have a mean value, $\bar{x}$, of 9.38 mg/g of yolk and a standard deviation, $s$, of 1.62 mg/g. Use this information to construct a confidence interval estimate of the true variance and standard deviation of the cholesterol concentration in these egg yolks.

Solution: This requires a straightforward application of formula (OV-3). We have $n = 18$, so $\nu = 18 - 1 = 17$. We also have $s = 1.62$ mg/g, so $s^2 = (1.62 \text{ mg/g})^2 = 2.6244\ (\text{mg/g})^2$.
No confidence level is specified, so we use the conventional default of 95%, meaning that $\alpha = 0.05$. Thus $\alpha/2 = 0.025$ and $1 - \alpha/2 = 0.975$. Referring to the row $\nu = 17$ of the $\chi^2$-distribution table, we find that $\chi^2_{0.025} = 30.191$ and $\chi^2_{0.975} = 7.564$. Thus, we have

$$\frac{17 \times 2.6244}{30.191} \le \sigma^2 \le \frac{17 \times 2.6244}{7.564}$$

or

$$1.478 \le \sigma^2 \le 5.898 \quad @\ 95\%$$

The units here are still $(\text{mg/g})^2$. This last line is the required confidence interval estimate of the population variance, $\sigma^2$. To get the 95% confidence interval estimate of the population standard deviation, $\sigma$, we just take square roots:

$$\sqrt{1.478} \le \sigma \le \sqrt{5.898}$$

or

$$1.216 \text{ mg/g} \le \sigma \le 2.429 \text{ mg/g} \quad @\ 95\%$$

(Before moving on, perhaps a remark about the implications of this result for the way we calculate confidence interval estimates of the population mean is in order. Recall that we used the value of $s$ as a point estimate of $\sigma$ in that formula. The rationale was that this was the best estimate available for $\sigma$. Now, for this example, we see that at a 95% confidence level, the true value of $\sigma$ might be as much as 25% less than $s$ or as much as 50% more than $s$. This means that the true width of the confidence interval estimate for the population mean from this data might be of the order of 25% less to 50% more than the width of the interval we actually compute when the value of $s$ is used as an estimate of the required value of $\sigma$. This is quite a bit of difference! Of course, we are working with a relatively small sample here, which tends to accentuate the effect.)

Large Sample Approximations

Without trying to prolong this section too much, we note the transition to a large sample approximation, which results in somewhat simpler formulas, and allows you to use the standard normal probability table instead of the $\chi^2$ table.

Formula (OV-3) is valid for all values of $\nu$ greater than or equal to 1 (or none of them, for that matter, when you cannot support the assumption that the population is approximately normally distributed).
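As a rough illustration of formula (OV-3), the following Python sketch repeats the arithmetic of Example 1. The function name `variance_ci_exact` and its parameter names are invented for this illustration and are not part of the original notes. The two $\chi^2$ critical values are the table values quoted in Example 1, typed in by hand rather than computed.

```python
import math


def variance_ci_exact(n, s, chi2_upper, chi2_lower):
    """Interval estimate of sigma^2 from formula (OV-3).

    n          : sample size
    s          : sample standard deviation
    chi2_upper : table value chi^2_{alpha/2} for nu = n - 1 (the larger value)
    chi2_lower : table value chi^2_{1 - alpha/2} for nu = n - 1 (the smaller value)

    Assumes the population is approximately normally distributed.
    """
    if chi2_lower <= 0 or chi2_upper <= chi2_lower:
        raise ValueError("expected 0 < chi2_lower < chi2_upper")
    numerator = (n - 1) * s ** 2
    # Dividing by the larger critical value gives the lower limit.
    return numerator / chi2_upper, numerator / chi2_lower


# Example 1: n = 18, s = 1.62 mg/g, table row nu = 17 at 95% confidence.
var_low, var_high = variance_ci_exact(18, 1.62, chi2_upper=30.191, chi2_lower=7.564)

# Interval for sigma: take square roots of both limits.
sd_low, sd_high = math.sqrt(var_low), math.sqrt(var_high)

print(f"variance: {var_low:.3f} to {var_high:.3f} (mg/g)^2")
print(f"std dev : {sd_low:.3f} to {sd_high:.3f} mg/g")
```

Passing the critical values in as arguments keeps the sketch close to the table-lookup procedure used in these notes. A statistics library could supply the quantiles instead, but that would be a substitution, not the method described here.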
For sample sizes larger than $n = 30$ or so, we also know that the distribution of the $\chi^2$ random variable becomes more and more like the distribution of a normal random variable with a mean of $\mu = n - 1$ and a variance of $\sigma^2 = 2(n - 1) = 2\nu$. (In this sentence only, $\mu$ and $\sigma^2$ describe the $\chi^2$ variable itself, not the sampled population.) This means that formula (OV-2) can be replaced by the formula

$$\Pr\left((n-1) - z_{\alpha/2}\sqrt{2(n-1)} \le \frac{(n-1)s^2}{\sigma^2} \le (n-1) + z_{\alpha/2}\sqrt{2(n-1)}\right) = 1 - \alpha$$

from which we get either

$$\sigma^2 = \frac{(n-1)s^2}{(n-1) \pm z_{\alpha/2}\sqrt{2(n-1)}} \quad @\ 100(1-\alpha)\% \tag{OV-4a}$$

or, in the form of an interval,

$$\frac{(n-1)s^2}{(n-1) + z_{\alpha/2}\sqrt{2(n-1)}} \le \sigma^2 \le \frac{(n-1)s^2}{(n-1) - z_{\alpha/2}\sqrt{2(n-1)}} \quad @\ 100(1-\alpha)\% \tag{OV-4b}$$

This might look pretty bad, but all the numbers in the formula are quite simple, and so it is quite easy to implement this formula when appropriate.

Example 2: A technologist is developing a new method for processing a food material. It is known that for best quality, it is important to control moisture content in the final product. So, as one part of determining the practicality of the new method, the technologist must estimate the variability of water content in the resulting product. He collects 50 specimens of product from the new process, and determines the percent water in each. These 50 specimens give a sample mean water content of 43.24% and a sample standard deviation of 7.93%. Compute a 95% confidence interval estimate of the true standard deviation of the percentage water for this new process.

Solution: Here $n = 50$, $s^2 = 7.93^2$ and $\alpha = 0.05$, so $z_{\alpha/2} = z_{0.025} = 1.96$. Substituting these numbers into (OV-4b) gives

$$\frac{49 \times 7.93^2}{49 + 1.96\sqrt{2 \times 49}} \le \sigma^2 \le \frac{49 \times 7.93^2}{49 - 1.96\sqrt{2 \times 49}}$$

or

$$45.047 \le \sigma^2 \le 104.111 \quad @\ 95\%$$

as the 95% confidence interval estimate for $\sigma^2$. Taking square roots, we get the following confidence interval for the standard deviation $\sigma$:

$$6.712\% \le \sigma \le 10.203\% \quad @\ 95\%$$

(As a matter of interest, for $\nu = 49$, $\chi^2_{0.975} = 31.555$ and $\chi^2_{0.025} = 70.222$. Thus, the "exact" 95% confidence interval estimate for the variance here is $43.880 \le \sigma^2 \le 97.650$, and so for the standard deviation is $6.624\% \le \sigma \le 9.882\%$ @ 95%. Thus, for $\sigma$, the relative error in the large sample approximation is about 1% at the lower limit and about 3% at the upper limit.)
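The large-sample formula (OV-4b) can be sketched in Python as follows, using the numbers from Example 2. This is only an illustrative sketch: the function name `variance_ci_large_sample` and its `confidence` parameter are hypothetical names chosen here. The value of $z_{\alpha/2}$ comes from the standard library's `statistics.NormalDist`, which gives roughly 1.95996 rather than the rounded table value 1.96, so the last digit or two may differ slightly from the hand calculation.

```python
import math
from statistics import NormalDist


def variance_ci_large_sample(n, s, confidence=0.95):
    """Approximate interval estimate of sigma^2 from formula (OV-4b).

    Assumes an approximately normal population and a sample size
    larger than about 30, as stated in the notes.
    """
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # z_{alpha/2}
    nu = n - 1
    half_width = z * math.sqrt(2 * nu)           # z_{alpha/2} * sqrt(2(n-1))
    if nu - half_width <= 0:
        # The upper limit's denominator would be zero or negative.
        raise ValueError("sample too small for this approximation")
    numerator = nu * s ** 2
    return numerator / (nu + half_width), numerator / (nu - half_width)


# Example 2: n = 50 specimens, s = 7.93 (percent water), 95% confidence.
approx_var_low, approx_var_high = variance_ci_large_sample(50, 7.93)
approx_sd_low = math.sqrt(approx_var_low)
approx_sd_high = math.sqrt(approx_var_high)

print(f"variance: {approx_var_low:.3f} to {approx_var_high:.3f}")
print(f"std dev : {approx_sd_low:.3f}% to {approx_sd_high:.3f}%")
```

The guard on `nu - half_width` is an addition made for this sketch. It flags sample sizes for which the denominator of the upper limit in (OV-4b) would not be positive.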
Devore, Freund, and other elementary statistics textbooks give one more large-sample approximation, based on the observation that for large samples from an approximately normally distributed population, the sample standard deviation, $s$, itself is approximately normally distributed with a mean of $\sigma$ and a variance of $\sigma^2/2n$. This means that

$$z = \frac{s - \sigma}{\sigma/\sqrt{2n}} \tag{OV-5}$$

has an approximately standard normal distribution, and so

$$\Pr\left(-z_{\alpha/2} \le \frac{s - \sigma}{\sigma/\sqrt{2n}} \le z_{\alpha/2}\right) = 1 - \alpha$$

This leads directly to a $100(1-\alpha)\%$ confidence interval for $\sigma$:

$$\frac{s}{1 + \dfrac{z_{\alpha/2}}{\sqrt{2n}}} \le \sigma \le \frac{s}{1 - \dfrac{z_{\alpha/2}}{\sqrt{2n}}} \tag{OV-6}$$

(You can actually get this formula from (OV-4b) by a bit of rearrangement and approximation: first, replace $n - 1$ by $n$, then take the square roots, and finally, in the denominator, use the approximation $\sqrt{1 + x} \approx 1 + \tfrac{1}{2}x$.)

Example 3: Repeating Example 2, using (OV-6) gives

$$\frac{7.93}{1 + \dfrac{1.96}{\sqrt{2 \times 50}}} \le \sigma \le \frac{7.93}{1 - \dfrac{1.96}{\sqrt{2 \times 50}}} \quad @\ 95\%$$

or

$$6.630\% \le \sigma \le 9.863\% \quad @\ 95\%$$

You can see that this approximation agrees very well with the results given by the exact and earlier approximate formula.

With this, we end our discussion of the estimation of variances and standard deviations of a single population.
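For readers who want to check Example 3 numerically, here is a minimal Python sketch of formula (OV-6). The function name `sd_ci_large_sample` and its `confidence` parameter are made up for this illustration. As an assumption of the sketch, $z_{\alpha/2}$ is taken from `statistics.NormalDist` (about 1.95996) rather than from the rounded table value 1.96, so the results may differ from the hand calculation in the last decimal place.

```python
import math
from statistics import NormalDist


def sd_ci_large_sample(n, s, confidence=0.95):
    """Approximate interval estimate of sigma from formula (OV-6).

    Relies on s being approximately normal with mean sigma and
    variance sigma^2 / (2n), which the notes state holds for large
    samples from an approximately normal population.
    """
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # z_{alpha/2}
    ratio = z / math.sqrt(2 * n)                 # z_{alpha/2} / sqrt(2n)
    if ratio >= 1:
        # The upper limit's denominator would be zero or negative.
        raise ValueError("sample too small for this approximation")
    return s / (1 + ratio), s / (1 - ratio)


# Example 3: the data of Example 2 (n = 50, s = 7.93 percent water).
direct_sd_low, direct_sd_high = sd_ci_large_sample(50, 7.93)

print(f"std dev: {direct_sd_low:.3f}% to {direct_sd_high:.3f}%")
```

Unlike (OV-4b), this formula yields the interval for $\sigma$ directly, with no square roots of the limits required.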