7. Confidence Intervals

7.1 Introduction

Suppose the high temperature forecast for tomorrow is 50 degrees Fahrenheit. This does not mean it will be exactly 50 degrees; it could be slightly more or slightly less. It would be useful for the forecast to also state how much the actual temperature may deviate from 50 degrees. One way to do that is for the forecast to state, “There is a 95% chance that the high temperature tomorrow will be $50 \pm 3$ degrees.” In this case, we say that $50 \pm 3$ is the 95% confidence interval for tomorrow’s high temperature. We can also write this interval as [47, 53]. The central value 50 is called the point estimate and the 3 is the half-width of the confidence interval.

Compare the above forecast with the next: “There is a 96% chance that the high temperature tomorrow will be $50 \pm 3.2$ degrees.” This forecast is better in that the confidence has increased, but worse in that the interval has widened. Consider the next forecast, where the interval is widened to $50 \pm 30$: “There is a 99.99% chance that the high temperature tomorrow will be $50 \pm 30$ degrees.” Although there is almost 100% confidence, this forecast is useless because the range of possible values is so large that it gives no reasonable idea of what the weather is going to be or how to prepare for it.

We would like a confidence interval to be narrow and at the same time have a high confidence level. In the estimation of confidence intervals, the objective is to minimize the half-width and maximize the confidence level at the same time. Since these two goals pull against each other, this is usually achieved by fixing the confidence level at 90%, 95% or 99% and then calculating the narrowest interval that has that confidence level. In what follows, we shall study how to calculate confidence intervals for population means and proportions based on sample statistics.
7.2 Confidence Intervals for Population Means

7.2.1 The basic case

The symbol used for confidence level is $(1-\alpha)$, so that a 95% confidence level means an $\alpha$ of 5%. In a standard normal distribution (z-distribution), if we want the narrowest interval that captures an area of $(1-\alpha)$, it should be symmetric about the center because the distribution peaks at the center. We should leave an area of $\alpha/2$ in each tail so that the central interval covers an area of $(1-\alpha)$. The z value that has an area of $\alpha/2$ to its right is denoted by the symbol $z_{\alpha/2}$. See Figure 7.2.1. The $(1-\alpha)$ confidence interval is thus $[-z_{\alpha/2},\ z_{\alpha/2}]$.

[Figure: standard normal curve with an area of $\alpha/2$ in each tail, beyond $-z_{\alpha/2}$ and $z_{\alpha/2}$; the central interval is the narrowest interval that covers an area of $(1-\alpha)$.]
Figure 7.2.1. A $(1-\alpha)$ Confidence Interval

7.2.2 One-sided Confidence Intervals

At times, we want to be confident only about the maximum or the minimum possible value. We then do not care about the width of the interval. In this case, we need a one-sided confidence interval. We leave the whole area $\alpha$ in one of the two tails and the rest of the z-axis becomes the $(1-\alpha)$ confidence interval, which is either $(-\infty,\ z_{\alpha}]$ or $[-z_{\alpha},\ \infty)$.

7.2.3 Finite Population Correction

In most cases, the population is large or infinite and the variance of the sample mean, $V(\bar{x})$, equals $\sigma^2/n$. This value is used in confidence interval calculations. When the population is finite, with size $N$, $V(\bar{x})$ is slightly less than $\sigma^2/n$. The difference grows as $N$ decreases and becomes comparable to the sample size $n$. Indeed, if $N = n$, then the whole population is sampled and the sample mean must equal the population mean. The sample mean is no longer random, and $V(\bar{x})$ is zero. In general, if $n/N$ is more than 1%, a finite population correction is necessary. $V(\bar{x})$ reduces by a factor of $(N-n)/(N-1)$ and the width of the confidence interval decreases by a factor of $\sqrt{(N-n)/(N-1)}$. This correction also applies to confidence intervals for population proportions.
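The critical values and the correction factor described above can be computed directly. The following is a minimal Python sketch using only the standard library; the function names and the values of N and n are hypothetical and are not taken from the Estimation.xls templates.

```python
from math import sqrt
from statistics import NormalDist


def z_critical(alpha, two_sided=True):
    """z value leaving area alpha/2 (two-sided) or alpha (one-sided) in the right tail."""
    tail = alpha / 2 if two_sided else alpha
    return NormalDist().inv_cdf(1 - tail)


def fpc_variance_factor(N, n):
    """Factor (N - n)/(N - 1) by which V(x-bar) shrinks for a finite population."""
    return (N - n) / (N - 1)


alpha = 0.05
z_two = z_critical(alpha)                   # z_{alpha/2}, about 1.96
z_one = z_critical(alpha, two_sided=False)  # z_{alpha}, about 1.645

# Hypothetical sizes: n/N = 10%, so the correction matters.
N, n = 500, 50
var_factor = fpc_variance_factor(N, n)  # multiplies V(x-bar)
width_factor = sqrt(var_factor)         # multiplies the interval width

print(f"z_alpha/2 = {z_two:.4f}, z_alpha = {z_one:.4f}")
print(f"variance factor = {var_factor:.4f}, width factor = {width_factor:.4f}")
```

When N equals n the variance factor is zero, which matches the observation that the sample mean is then no longer random.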
7.2.4 When Population Standard Deviation is Known

The calculation of confidence intervals for population means from sample means depends upon whether or not the population standard deviation $\sigma$ is known. Most of the time, $\sigma$ is not known; but if it is known, it helps us decrease the half-width for a given confidence level. We shall first see the case where $\sigma$ is known.

From the Central Limit Theorem, we know that if either the population is normally distributed, or the sample size $n$ is at least 30, then the sample mean is (at least approximately) distributed as $\bar{x} \sim N(\mu,\ \sigma^2/n)$. Therefore a $(1-\alpha)$ confidence interval for $\bar{x}$ is $\mu \pm z_{\alpha/2}\,\sigma/\sqrt{n}$. This means that with $(1-\alpha)$ confidence, we can say the difference between $\mu$ and $\bar{x}$ is not more than $z_{\alpha/2}\,\sigma/\sqrt{n}$. Therefore, a $(1-\alpha)$ confidence interval for $\mu$ is $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$. The template to use in this case is shown in Figure 7.2.2.

Figure 7.2.2. Confidence Interval for $\mu$ [Workbook: Estimation.xls; Sheet: CI for mu]

When the data for $\bar{x}$, $\sigma$ and $n$ are entered in cells B3, B4 and B5, the required confidence interval is displayed in the range B8:D8. The same template contains one-sided confidence intervals as well. To apply finite population correction to confidence intervals, the population size N must be entered in cell O4. The correction factor is computed in cell O6. The corrected confidence intervals appear in the range L8:P10.

7.2.5 When Population Standard Deviation is Unknown

Most of the time, the population standard deviation $\sigma$ is not known, and the population may be known to have a normal distribution. We then substitute the sample standard deviation $s$ in its place and use the t-distribution to calculate confidence intervals. The t-distribution was discovered by W. S. Gosset, who wrote under the pen name Student. It is therefore called Student’s t-distribution. Gosset proved that when a random sample is drawn from a normally distributed population, the statistic $(\bar{x}-\mu)/(s/\sqrt{n})$ follows the t-distribution with $(n-1)$ degrees of freedom, which helps us calculate confidence intervals for $\mu$ once $\bar{x}$ and $s$ are known.
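For the known-standard-deviation case of Section 7.2.4, the interval for the population mean can be sketched in Python as follows. This is an illustration under stated assumptions, not the template's implementation: the function name, the optional N argument for the finite population correction, and the input values are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def ci_mean_known_sigma(x_bar, sigma, n, conf=0.95, N=None):
    """Two-sided confidence interval for the mean when sigma is known.

    If a finite population size N is supplied, the half-width is multiplied
    by sqrt((N - n)/(N - 1)) as described in Section 7.2.3.
    """
    alpha = 1 - conf
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    half_width = z * sigma / sqrt(n)
    if N is not None:
        half_width *= sqrt((N - n) / (N - 1))
    return x_bar - half_width, x_bar + half_width


# Hypothetical inputs: sample mean 50, known sigma 6, sample size 36.
lo, hi = ci_mean_known_sigma(x_bar=50.0, sigma=6.0, n=36)
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")

# Same data drawn from a hypothetical finite population of 200 items.
lo_fpc, hi_fpc = ci_mean_known_sigma(x_bar=50.0, sigma=6.0, n=36, N=200)
print(f"95% CI with correction: [{lo_fpc:.3f}, {hi_fpc:.3f}]")
```

The unknown-standard-deviation case has the same shape with the t value in place of the z value; the Python standard library has no t quantile function, so that case is left to the spreadsheet template or a statistics package.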
The t-distribution is similar to the standard normal distribution (the z-distribution) in that it has a mean of zero and is symmetric; but it is flatter, with larger variance (> 1) and heavier tails, that is, positive (relative) kurtosis. Unlike the z-distribution, of which there is only one, there is one t-distribution for each number of degrees of freedom, df = 1, 2, ... . As the df increases, the variance approaches 1, the (relative) kurtosis approaches zero, and the t-distribution approaches the z-distribution. In the workbook Estimation.xls, there is a sheet for the t-distribution. It is very similar to the one for the normal distribution, with an input cell for df. All calculations similar to those for the normal distribution are possible on this template.

The half-width of the confidence interval for $\mu$ when $\sigma$ is not known is given by $t_{\alpha/2}\,s/\sqrt{n}$, where the df for the t value is $(n-1)$. On the bottom half of the sheet named “CI for mu” in Estimation.xls, the confidence interval for $\mu$ when $\sigma$ is unknown is calculated. See Figure 7.2.3. When the input data are entered in the range B14:B17, the confidence interval is displayed in the range B19:D19.

Figure 7.2.3. CI for $\mu$ When $\sigma$ is Unknown [Workbook: Estimation.xls; Sheet: CI for mu]

In many textbooks, including Bowerman/O'Connell, the t values with large df are approximated by z values. Thus the approach is classified as "large sample" and "small sample" estimation. Such approximation is not necessary on a spreadsheet. In the templates, the approach is classified as "$\sigma$ known, normal population or n >= 30" and "$\sigma$ unknown, normal population" cases. This classification is more accurate as it avoids the t to z approximation.

7.3 Confidence Intervals for Population Proportions

When $np$ and $n(1-p)$ are both at least 5, the $(1-\alpha)$ confidence interval for the population proportion $p$ is given by

$$\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$$

where $\hat{p}$ is the sample proportion. The template for this is shown in Figure 7.3.1.
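The proportion interval above can be sketched in Python as follows. The function name and the counts are hypothetical. Because the population proportion is unknown, the check on np and n(1 - p) is applied here to the observed counts of successes and failures, which is an assumption about how the rule is used in practice.

```python
from math import sqrt
from statistics import NormalDist


def ci_proportion(successes, n, conf=0.95):
    """Two-sided normal-approximation confidence interval for a proportion."""
    # Rule of thumb from the text, applied to the observed counts.
    if successes < 5 or n - successes < 5:
        raise ValueError("need at least 5 successes and 5 failures")
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # z_{alpha/2}
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Hypothetical data: 40 items with the attribute in a sample of 200.
lo, hi = ci_proportion(40, 200)
print(f"95% CI for p: [{lo:.4f}, {hi:.4f}]")
```

Raising an error when the rule of thumb fails is a design choice made for this sketch; a caller might instead prefer a warning.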
When the input data are entered in the range B3:B5, the confidence interval appears in the range B8:D8 and in the range F8:J8. One-sided confidence intervals appear in the range F9:J9. If finite population correction is needed, the population size N should be entered in cell O4, and the confidence intervals appear in the range L8:P10. The data currently entered are for the Cheese Spread Case of Bowerman/O'Connell.

Figure 7.3.1. Confidence Interval for Population Proportion [Workbook: Estimation.xls; Sheet: CI for p]

7.4 Confidence Intervals for Population Variance

A well-known result in sampling theory is that if random samples are drawn from a normal population, the statistic $(n-1)S^2/\sigma^2$, where $S^2$ is the sample variance and $\sigma^2$ is the population variance, follows a chi-square distribution with $(n-1)$ degrees of freedom. Letting $z$ denote a standard normal random variable, a chi-square variable with $k$ degrees of freedom is the sum of $k$ independent $z^2$ values. A chi-square distribution has its range from zero to infinity, mean equal to the degrees of freedom, variance equal to twice the degrees of freedom, and is positively skewed. The sheet named “Chi-Square Distn.” in Estimation.xls is similar to the t-distribution sheet and can be used similarly.

Figure 7.4.1. Confidence Interval for Population Variance [Workbook: Estimation.xls; Sheet: CI for Popn. Variance]

When samples are drawn from a normal population, the $(1-\alpha)$ confidence interval for $\sigma^2$ is given by

$$\left[\frac{(n-1)s^2}{\chi^2_{\alpha/2}},\ \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}\right]$$

where $\chi^2_{a}$ denotes the chi-square value with an area of $a$ to its right and the chi-square has $(n-1)$ degrees of freedom. Figure 7.4.1 shows the sheet named “CI for Popn. Variance” in Estimation.xls, which implements the above formula. After the input data are entered in the range B3:B5, the confidence intervals appear in the range D7:H9.

7.5 Sample Size Determination

An important practical decision in confidence interval estimation is how large the sample should be.
Too small a sample will yield too wide an interval; too large a sample will require too much effort. The objective in sample size determination is to find the smallest sample that will yield the desired width for the confidence interval.

In the case of the population mean, we have seen that the half-width of the confidence interval (denote it by $B$) is given by $B = z_{\alpha/2}\,\sigma/\sqrt{n}$. Rearranging this to bring $n$ to the left-hand side yields the formula for minimum $n$ as

$$\text{Minimum } n = \left\lceil \left(\frac{z_{\alpha/2}\,\sigma}{B}\right)^{2} \right\rceil$$

where the symbol $\lceil\ \rceil$ means rounding up. For example, $\lceil 48.2 \rceil = 49$. Rounding up is necessary since $n$ cannot be a fraction.

Note that the formula for minimum $n$ calls for $\sigma$. Since the population mean itself is not known, it is unlikely that $\sigma$ will be known. One should make a guess, or use past experience with similar data, to enter a reasonably good value for $\sigma$.

Figure 7.5.1 shows the sheet named “Sample Size Calc.” in Estimation.xls. When the input data are entered in the range B5:B7, the minimum sample size appears in cell B9. The currently input values are from the Car Mileage case of Bowerman/O'Connell. The answer 25 found in cell B10 differs from the one in the textbook, because the textbook uses $t_{\alpha/2}$ in place of $z_{\alpha/2}$ to take care of the uncertainty in the population standard deviation (guessed using a small preliminary sample).

Figure 7.5.1. Minimum Sample Size Calculation [Workbook: Estimation.xls; Sheet: Sample Size Calc.]

In the case of the population proportion, the half-width is $B = z_{\alpha/2}\sqrt{p(1-p)/n}$. Rearranging this to get $n$ on the left-hand side yields

$$\text{Minimum } n = \left\lceil p(1-p)\left(\frac{z_{\alpha/2}}{B}\right)^{2} \right\rceil.$$

In the template shown in Figure 7.5.1, when the input data are entered in the range E5:E7, the minimum sample size appears in cell E9.

7.6 Survey Sampling

7.6.1 Stratified Sampling

Often, the population consists of strata that have their own individual characteristics in terms of their mean, variance or cost of sampling.
By spreading the sample appropriately over the different strata, we can increase accuracy and/or decrease cost. Figure 7.6.1 shows the sheet named “Stratified for mu”, which can be used for estimating the population mean with a stratified sample. Let the notations be:

  $k$ is the number of strata
  $N_i$ is the size of the i-th stratum
  $n_i$ is the size of the sample from the i-th stratum
  $N$ and $n$ are the respective sums
  $W_i$ is the population proportion of the i-th stratum, given by $N_i/N$
  $f_i$ is the sample proportion for the i-th stratum, given by $n_i/n$
  $\bar{x}_i$ is the mean of the sample from the i-th stratum
  $s_i^2$ is the variance of the sample from the i-th stratum.

With these notations, the estimate of the population mean $\bar{X}$ and its variance $S^2(\bar{X})$ are given by

$$\bar{X} = \sum_{i=1}^{k} W_i\,\bar{x}_i$$

$$S^2(\bar{X}) = \sum_{i=1}^{k} W_i^2\, s_i^2\,(1 - n_i/N_i)/n_i$$

These formulas have been employed in the template. The confidence interval has been calculated assuming large samples.

Figure 7.6.1. Stratified Sampling for Population Mean [Workbook: Estimation.xls; Sheet: Stratified for mu]

In the case of the population proportion, let $p_i$ denote the proportion of the sample from the i-th stratum. The estimate of the population proportion $P$ and its variance $S^2(P)$ are given by

$$P = \sum_{i=1}^{k} W_i\,p_i$$

$$S^2(P) = \sum_{i=1}^{k} W_i^2\, p_i(1-p_i)/n_i$$

If the strata are small, a finite population correction is necessary. With that correction, the formula is

$$S^2(P) = \frac{1}{N^2}\sum_{i=1}^{k} N_i^2\,\frac{(N_i - n_i)}{(N_i - 1)}\,\frac{p_i(1-p_i)}{n_i}$$

These formulas have been implemented in the sheet named “Stratified for p” in Estimation.xls. It is shown in Figure 7.6.2.

Figure 7.6.2. Stratified Sampling for Population Proportion [Workbook: Estimation.xls; Sheet: Stratified for p]

7.6.2 Optimum Allocation

When the unit sampling cost for each stratum, along with the size and variance of each stratum, is known, it is possible to select $n_i$ values that maximize accuracy and minimize cost. This method of deciding the optimal $n_i$ values is called optimum allocation. Figure 7.6.3 shows the sheet named “Allocation” in Estimation.xls.
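The stratified estimate of the population mean from Section 7.6.1 can be sketched in Python as follows. This is an illustration only: the function name and the two-stratum data are hypothetical, and the finite population factor in the variance is taken to be one minus the per-stratum sampling fraction, 1 - n_i/N_i, which is an assumption about how that term is meant. The interval uses the normal distribution, in line with the large-sample remark in the text.

```python
from math import sqrt
from statistics import NormalDist


def stratified_mean(N_sizes, n_sizes, means, variances, conf=0.95):
    """Stratified estimate of the population mean, its variance, and a CI.

    N_sizes:   stratum sizes N_i
    n_sizes:   sample sizes n_i
    means:     sample means for each stratum
    variances: sample variances s_i^2 for each stratum
    """
    N = sum(N_sizes)
    # Estimate: sum of W_i * x-bar_i with W_i = N_i / N.
    est = sum(Ni / N * xi for Ni, xi in zip(N_sizes, means))
    # Variance: sum of W_i^2 * s_i^2 * (1 - n_i/N_i) / n_i  (assumed factor).
    var = sum((Ni / N) ** 2 * s2 * (1 - ni / Ni) / ni
              for Ni, ni, s2 in zip(N_sizes, n_sizes, variances))
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    half_width = z * sqrt(var)
    return est, var, (est - half_width, est + half_width)


# Hypothetical two-stratum population.
est, var, ci = stratified_mean(
    N_sizes=[600, 400],
    n_sizes=[60, 40],
    means=[10.0, 20.0],
    variances=[4.0, 9.0],
)
print(f"estimate = {est:.3f}, variance = {var:.4f}")
print(f"95% CI: [{ci[0]:.3f}, {ci[1]:.3f}]")
```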
The formula for optimal $f_i$ is

$$f_i = \frac{W_i\,\sigma_i/\sqrt{C_i}}{\sum_{i} W_i\,\sigma_i/\sqrt{C_i}}$$

where $\sigma_i$ is the standard deviation of the i-th stratum and $C_i$ is the cost of sampling one item from the i-th stratum. The optimal $n_i$ values are then calculated using $n_i = n f_i$, where $n$ is the desired total sample size.

Figure 7.6.3. Optimum Allocation [Workbook: Estimation.xls; Sheet: Allocation]

7.6.3 Cluster Sampling

Suppose we want to sample the students at a university. A convenient way is to randomly select a class, go to that class and sample the whole class. If a desired number of classes (without enrollment overlap) are sampled, we then have a fairly representative sample. This method is known as cluster sampling, where each class is a cluster. Sampling all the elements in a cluster at once can save time and money. Let the notations be

  $n_i$ = the size of the i-th cluster
  $m$ = number of clusters sampled
  $M$ = the total number of clusters in the population
  $\bar{n}$ = average cluster size = $\sum n_i / m$
  $\bar{x}_i$ = average of the i-th cluster.

With these notations, the formulas for the estimate of the population mean $\bar{X}$ and its variance $S^2(\bar{X})$ are

$$\bar{X} = \frac{\sum n_i\,\bar{x}_i}{\sum n_i}$$

$$S^2(\bar{X}) = \frac{M - m}{M\,m\,\bar{n}^2}\cdot\frac{\sum n_i^2\,(\bar{x}_i - \bar{X})^2}{m - 1}$$

These formulas have been implemented in the template shown in Figure 7.6.4.

Figure 7.6.4. Cluster Sampling for Population Mean [Workbook: Estimation.xls; Sheet: Cluster for mu]

The case of cluster sampling for population proportion is quite similar. With $p_i$ denoting the sample proportion in the i-th cluster, the formulas for the estimate $P$ and its variance are

$$P = \frac{\sum n_i\,p_i}{\sum n_i}$$

$$S^2(P) = \frac{M - m}{M\,m\,\bar{n}^2}\cdot\frac{\sum n_i^2\,(p_i - P)^2}{m - 1}$$

These formulas have been implemented in the sheet named “Cluster for p” in Estimation.xls, which is shown in Figure 7.6.5.

Figure 7.6.5. Cluster Sampling for Population Proportion [Workbook: Estimation.xls; Sheet: Cluster for p]

7.6.4 Systematic Sampling

In systematic sampling every k-th item in the population is sampled, where k is selected suitably.
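The optimum-allocation formula of Section 7.6.2 and the cluster-sampling estimate of Section 7.6.3 can be sketched in Python as follows. The function names and all input values are hypothetical, and the allocation returns unrounded n_i values, leaving any rounding to the user.

```python
from math import sqrt


def optimum_allocation(W, sigma, cost, n):
    """Optimal fractions f_i proportional to W_i * sigma_i / sqrt(C_i), and n_i = n * f_i."""
    weights = [w * s / sqrt(c) for w, s, c in zip(W, sigma, cost)]
    total = sum(weights)
    f = [w / total for w in weights]
    return f, [n * fi for fi in f]


def cluster_mean(sizes, means, M):
    """Cluster-sampling estimate of the population mean and its variance.

    sizes: cluster sizes n_i for the m sampled clusters
    means: cluster averages for the sampled clusters
    M:     total number of clusters in the population
    """
    m = len(sizes)
    total_n = sum(sizes)
    n_bar = total_n / m  # average cluster size
    est = sum(ni * xi for ni, xi in zip(sizes, means)) / total_n
    spread = sum(ni ** 2 * (xi - est) ** 2
                 for ni, xi in zip(sizes, means)) / (m - 1)
    var = (M - m) / (M * m * n_bar ** 2) * spread
    return est, var


# Hypothetical strata: equal weights, the second is more variable but costlier.
f, n_i = optimum_allocation(W=[0.5, 0.5], sigma=[2.0, 4.0], cost=[1.0, 4.0], n=100)
print("fractions:", f, "sample sizes:", n_i)

# Hypothetical clusters: 3 sampled out of 30.
est, var = cluster_mean(sizes=[10, 20, 30], means=[5.0, 6.0, 7.0], M=30)
print(f"cluster estimate = {est:.4f}, variance = {var:.4f}")
```

The same cluster function applies to proportions if the cluster averages are replaced by the cluster sample proportions.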
If the data are already entered on a spreadsheet, such sampling can be carried out using the Sampling command under the Data Analysis command of the Tools menu. The advantage is that the sample is spread evenly over the population, which may increase the accuracy of the estimate. In this method, the sample mean is the estimate of the population mean. The variance of the estimate is given by

$$S^2(\bar{X}) = \frac{N - n}{N\,n}\,s^2$$

This formula applies when the population has not been arranged in any particular order. This is what has been implemented in the sheet named “Systematic” in Estimation.xls, shown in Figure 7.6.6. When the population has been ordered, say, in ascending order, more complicated formulas are applicable. Such formulas are not presented here.

Figure 7.6.6. Systematic Sampling [Workbook: Estimation.xls; Sheet: Systematic]

7.7 Exercises

1. The lengths of pins produced by an automatic lathe are normally distributed. A random sample of 20 pins gives a sample mean of 0.992” and a sample standard deviation of 0.013”.
   i. Give a 95% confidence interval for the average length of all pins produced.
   ii. If it is claimed that the lathe has been set to have $\mu = 1 \pm 0.002$”, could you reject the claim with 95% confidence?
   iii. Give a 99% confidence interval for the average length of all pins produced.
2. You take a random sample of 100 pins from the lot supplied by a vendor, and test them. You find 3 of them defective.
   i. What is the 95% confidence interval for % defective in the lot?
   ii. If the vendor claims that the lot contains not more than 5% defectives, can the claim be rejected with 95% confidence?
   iii. What is the maximum % defective in the lot, with 99% confidence?
3. Solve exercises 7-6 to 7-12 and 7-20 to 7-22 in Bowerman/O'Connell. [Note: It is better to go by whether the population standard deviation is known or unknown rather than whether the sample is small or large.]
4. Solve exercises 7-39 to 7-50 in the textbook.
5. Solve exercises 7-70 to 7-72 in the textbook.
7.8 Projects

1. It is desired to estimate the average length of pins produced by an automatic lathe to within 0.002” with 95% confidence level. $\sigma$ is guessed to be 0.015”.
   i. What is the minimum sample size?
   ii. If the value of $\sigma$ may be anywhere between 0.010” and 0.018”, tabulate the minimum sample size required for $\sigma$ values from 0.010” to 0.018” in steps of 0.002”.
   iii. If the cost of sampling and testing n pins is (25 + 6n) dollars, tabulate the costs for the same $\sigma$ range as in question ii. above.
2. It is desired to estimate the % defective in a lot of pins supplied by a vendor to within 1% with 90% confidence level. The actual % defective is guessed to be 4%.
   i. What is the minimum sample size?
   ii. If the actual % defective may be anywhere between 3% and 6%, tabulate the minimum sample size required for actual % defective from 3% to 6% in steps of 0.5%.
   iii. If the cost of sampling and testing n pins is (25 + 6n) dollars, tabulate the costs for the same % defective range as in question ii. above.
3. A company wants to conduct a telephone poll to estimate the % of voters who favor a particular candidate in a presidential election, to within 2% with 95% confidence. It is guessed that the proportion is 53%.
   i. What is the minimum sample size?
   ii. The actual proportion may be anywhere from 40% to 60%. Construct a 2-dimensional table for the minimum sample size required, with half-width ranging from 1% to 3% in steps of 1% along the rows, and actual proportion ranging from 40% to 60% in steps of 5% along the columns.
   iii. Inspect the table produced in question ii. above. Comment on the relative sensitivity of the minimum sample size to the actual proportion and to the desired half-width. Also find the worst case value for the actual proportion.
   iv. If the cost of polling n people is (250 + 0.6n), tabulate the cost as in question ii. above.
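The minimum-sample-size formulas of Section 7.5, on which the projects above rely, can be sketched in Python as follows. The function names and the example inputs are hypothetical and deliberately different from the project data. As in the “Sample Size Calc.” template, the z value is used rather than the t value.

```python
from math import ceil
from statistics import NormalDist


def min_n_mean(sigma, B, conf=0.95):
    """Smallest n giving half-width B for a mean, with sigma guessed."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # z_{alpha/2}
    return ceil((z * sigma / B) ** 2)


def min_n_proportion(p, B, conf=0.95):
    """Smallest n giving half-width B for a proportion, with p guessed."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil(p * (1 - p) * (z / B) ** 2)


# Hypothetical inputs.
n_mean = min_n_mean(sigma=10.0, B=2.0)    # 97
n_prop = min_n_proportion(p=0.5, B=0.05)  # 385
print(n_mean, n_prop)

# A small table of the kind the projects ask for: n against the guessed sigma.
for sigma in (8.0, 10.0, 12.0):
    print(sigma, min_n_mean(sigma, B=2.0))
```

Looping over a range of guessed values, as in the last three lines, is one way to build the tables requested in the projects without re-entering inputs by hand.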