7. Confidence Intervals

7.1 Introduction
Suppose the high temperature forecast for tomorrow is 50 degrees Fahrenheit. This does not mean
it will be exactly 50 degrees; it could be slightly more or slightly less. What would be useful is to have the
forecast also state how much the actual temperature may deviate from 50 degrees. One way to do that is to
make the forecast state,
“There is a 95% chance that the high temperature tomorrow will be 50 ± 3 degrees.”
In this case, we say that 50 ± 3 is the 95% confidence interval for tomorrow’s high temperature. We can
also write this interval as [47, 53]. The central value 50 is called the point estimate and the 3 is the half-width of the confidence interval. Compare the above forecast with the next:
“There is a 96% chance that the high temperature tomorrow will be 50 ± 3.2 degrees.”
This forecast is better in that the confidence has increased, but worse in that the interval has widened.
Consider the next forecast where the interval is widened to 50 ± 30.
“There is a 99.99% chance that the high temperature tomorrow will be 50 ± 30 degrees.”
Although there is almost 100% confidence, this forecast is useless because the range of possible values is so
large that it does not help us get a reasonable idea about what the weather is going to be and how to prepare
for that weather. We would like a confidence interval to be narrow and at the same time have a large
confidence level. In the estimation of confidence intervals, the objective is to minimize the half-width and
maximize the confidence level at the same time. Usually, that is achieved by fixing the confidence level at
90%, 95% or 99% and then calculating the narrowest interval that has that confidence level.
In what follows, we shall study how to calculate confidence intervals for population means and
proportions based on sample statistics.
7.2 Confidence Intervals for Population Means
7.2.1 The basic case
The symbol used for confidence level is (1 − α), so that a 95% confidence level will mean an α of
5%. In a standard normal distribution (z-distribution), if we want the narrowest interval that captures an
area of (1 − α), it should be symmetric about the center because the distribution peaks at the center. We
should leave an area of α/2 on each tail so that the central interval would cover an area of (1 − α). The z
value that has an area of α/2 to the right is denoted by the symbol z_{α/2}. See Figure 7.2.1. The (1 − α)
confidence interval is thus [−z_{α/2}, z_{α/2}].
[Figure: standard normal curve over the z-axis, centered at 0. The interval from −z_{α/2} to z_{α/2} is the narrowest interval that covers an area of (1 − α); the area in each tail is α/2.]
Figure 7.2.1. A (1 − α) Confidence Interval
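As a rough illustration of the z values involved, the following Python sketch computes z_{α/2} for the usual confidence levels using the standard library's `statistics.NormalDist`. The function name `z_critical` is an assumption made for this example; it is not part of the Estimation.xls templates.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Return z_{alpha/2}: the z value with an area of alpha/2 to its right."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

for level in (0.90, 0.95, 0.99):
    z = z_critical(level)
    # The symmetric interval [-z, z] covers an area equal to the confidence level.
    print(f"{level:.0%} interval on the z-axis: [-{z:.3f}, {z:.3f}]")
```

For a 95% confidence level this gives a z value of about 1.960.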
7.2.2 One-sided Confidence Intervals
At times, we want to be confident only about the maximum or the minimum possible value. We
then do not care about the width of the interval. In this case, we need a one-sided confidence interval. We
leave the whole area α on one of the two tails and the rest of the z-axis becomes the (1 − α) confidence
interval, which would be either (−∞, z_α] or [−z_α, ∞).
7.2.3 Finite Population Correction
In most cases, the population is large or infinite and the variance of the sample mean, V(x̄),
equals σ²/n. This value is used in confidence interval calculations. When the population is finite, with size N, V(x̄) is
slightly less than σ²/n. The difference grows as N decreases and becomes comparable to n.
Indeed, if N = n, then the whole population is sampled and the sample mean must equal the population
mean. The sample mean is no longer random, and V(x̄) is zero. In general, if n/N is more than 1%, a finite
population correction is necessary. V(x̄) is reduced by a factor of (N − n)/(N − 1) and the width of the
confidence interval decreases by a factor of √((N − n)/(N − 1)). This correction also applies to confidence
intervals for population proportions.
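The size of the correction can be checked with a few lines of Python. This is only a sketch; the function name `fpc_factors` and the population and sample sizes are hypothetical.

```python
import math

def fpc_factors(N, n):
    """Return (variance factor, width factor) for a sample of n from a population of N."""
    if N < 2 or not 1 <= n <= N:
        raise ValueError("need N >= 2 and 1 <= n <= N")
    variance_factor = (N - n) / (N - 1)
    return variance_factor, math.sqrt(variance_factor)

# Hypothetical case: 50 items sampled from a population of 1000 (n/N = 5%).
var_factor, width_factor = fpc_factors(N=1000, n=50)
print(f"V(x-bar) is multiplied by {var_factor:.4f}")
print(f"CI width is multiplied by {width_factor:.4f}")
```

When n equals N both factors are zero, which matches the observation that the sample mean is then no longer random.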
7.2.4 When Population Standard Deviation is Known
The calculation of confidence intervals for population means from sample means depends upon
whether or not the population standard deviation σ is known. Most of the time, σ is not known; but if it is
known, it helps us decrease the half-width for a given confidence level. We shall first see the case
where σ is known.
From the Central Limit Theorem, we know that if either the population is normally distributed, or
the sample size n is at least 30, then the sample mean x̄ ~ N(μ, σ²/n) (exactly in the first case, approximately in the second). Therefore a (1 − α) confidence
interval for x̄ is μ ± z_{α/2}σ/√n. This means that with (1 − α) confidence, we can say the difference
between μ and x̄ is not more than z_{α/2}σ/√n. Therefore, a (1 − α) confidence interval for μ is
x̄ ± z_{α/2}σ/√n.
The template to use in this case is shown in Figure 7.2.2.
Figure 7.2.2. Confidence Interval for μ
[Workbook: Estimation.xls; Sheet: CI for mu]
When the data for x̄, σ and n are entered in cells B3, B4 and B5, the required confidence interval is
displayed in the range B8:D8. The same template contains one-sided confidence intervals as well.
To apply finite population correction to confidence intervals, the population size N must be entered
in cell O4. The correction factor is computed in cell O6. The corrected confidence intervals appear in the
range L8:P10.
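The same calculation can be sketched outside the spreadsheet. The Python function below is an illustration, not the template's implementation; the name `mean_ci_sigma_known`, its parameters and the input values are assumptions. It returns x̄ ± z_{α/2}σ/√n and, when a population size N is supplied, multiplies the half-width by the finite population correction √((N − n)/(N − 1)).

```python
import math
from statistics import NormalDist

def mean_ci_sigma_known(xbar, sigma, n, confidence=0.95, N=None):
    """Two-sided CI for the population mean when sigma is known."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * sigma / math.sqrt(n)
    if N is not None:
        # Finite population correction on the width of the interval.
        half_width *= math.sqrt((N - n) / (N - 1))
    return xbar - half_width, xbar + half_width

# Hypothetical inputs: sample mean 50, known sigma 3, sample size 36.
low, high = mean_ci_sigma_known(xbar=50.0, sigma=3.0, n=36)
print(f"95% CI for mu: [{low:.3f}, {high:.3f}]")
```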
7.2.5 When Population Standard Deviation is Unknown
Most of the time, the population standard deviation σ is not known, but the population may be
known to have a normal distribution. We then substitute the sample standard deviation s in its place and use
the t-distribution to calculate confidence intervals. The t-distribution was discovered by W. S. Gosset, who
wrote under the pen name Student. It is therefore called Student’s t-distribution. Gosset showed that when
a random sample is drawn from a normally distributed population, the statistic
$$ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} $$
follows the t-distribution with (n − 1) degrees of freedom, which helps us calculate confidence intervals for μ after
knowing x̄.
The t-distribution is similar to the standard normal distribution (the z-distribution) in that it has a
mean of zero and is symmetric; but it is flatter at the center with heavier tails, and so has larger variance (> 1) and positive (relative) kurtosis.
Unlike the z-distribution, of which there is only one, there is one t-distribution for each number of degrees
of freedom, df = 1, 2, ... . As the df increases, the variance approaches 1, the (relative) kurtosis approaches
zero, and the t-distribution approaches the z-distribution.
In the workbook, Estimation.xls, there is a sheet for t-distribution. It is very similar to the one for
normal distribution, with an input cell for df. All calculations similar to those of normal distribution are
possible on this template.
The half-width of the confidence interval for μ when σ is not known is given by t_{α/2}s/√n, where
the df for the t value is (n − 1). On the bottom half of the sheet named “CI for mu” in Estimation.xls the
confidence interval for μ when σ is unknown is calculated. See Figure 7.2.3. When the input data are
entered in the range B14:B17, the confidence interval is displayed in the range B19:D19.
Figure 7.2.3. CI for μ When σ is Unknown
[Workbook: Estimation.xls; Sheet: CI for mu]
In many textbooks, including Bowerman/O'Connell, the t values with large df are approximated by
z values. Thus the approach is classified as "large sample" and "small sample" estimation. Such
approximation is not necessary on a spreadsheet. In the templates, the approach is classified as "σ known,
normal population or n >= 30" and "σ unknown, normal population" cases. This classification is more
accurate as it avoids the t to z approximation.
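A minimal Python sketch of the σ-unknown interval follows. Python's standard library has no t quantile function, so the t value is passed in as an argument; here it is read from a standard t table. The function name and the sample figures are hypothetical.

```python
import math

def mean_ci_sigma_unknown(xbar, s, n, t_crit):
    """CI for the mean as xbar ± t_crit * s / sqrt(n); t_crit must use n - 1 df."""
    half_width = t_crit * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# Hypothetical sample: n = 16, sample mean 10.0, sample standard deviation 2.0.
# 2.131 is the tabulated t value for alpha/2 = 0.025 with 15 degrees of freedom.
low, high = mean_ci_sigma_unknown(xbar=10.0, s=2.0, n=16, t_crit=2.131)
print(f"95% CI for mu: [{low:.4f}, {high:.4f}]")
```

With the z value 1.960 in place of 2.131 the interval would be narrower, which is the t to z approximation the templates avoid.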
7.3 Confidence Intervals for Population Proportions
When np and n(1 − p) are both at least 5, the (1 − α) confidence interval for population proportion
p is given by
$$ \hat{p} \pm z_{\alpha/2} \sqrt{\hat{p}(1 - \hat{p})/n} $$
where p̂ is the sample proportion. The template for this is shown in Figure 7.3.1.
When the input data are entered in the range B3:B5, the confidence interval appears in the range
B8:D8 and in the range F8:J8. One-sided confidence intervals appear in the range F9:J9. If finite
population correction is needed, the population size N should be entered in cell O4, and the confidence
intervals appear in the range L8:P10. The data currently entered are for the Cheese Spread Case of
Bowerman/O'Connell.
Figure 7.3.1. Confidence Interval for Population Proportion
[Workbook: Estimation.xls; Sheet: CI for p]
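The proportion interval can be sketched in Python as follows. The function name `proportion_ci` and the inputs are hypothetical, and the "at least 5" condition is checked with the sample proportion, which is an assumption of this sketch since the population proportion is not known in practice.

```python
import math
from statistics import NormalDist

def proportion_ci(p_hat, n, confidence=0.95):
    """Two-sided CI for a population proportion from a sample proportion."""
    if n * p_hat < 5 or n * (1 - p_hat) < 5:
        raise ValueError("normal approximation condition not met")
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Hypothetical sample: 80 successes out of 400, so p_hat = 0.20.
low, high = proportion_ci(p_hat=0.20, n=400)
print(f"95% CI for p: [{low:.4f}, {high:.4f}]")
```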
7.4 Confidence Intervals for Population Variance
A well known result in sampling theory is that if random samples are drawn from a normal
population, the statistic (n − 1)S²/σ², where S² is the sample variance and σ² is the population variance,
follows a chi-square distribution with (n − 1) degrees of freedom. Letting z denote a standard normal
random variable, chi-square with k degrees of freedom is the distribution of the sum of k independent z² values. A chi-square
distribution has its range from zero to infinity, mean equal to the degrees of freedom, variance equal to
twice the degrees of freedom and is positively skewed. The sheet named “Chi-Square Distn.” in
Estimation.xls is similar to the t-distribution sheet and can be used similarly.
Figure 7.4.1. Confidence Interval for Population Variance
[Workbook: Estimation.xls; Sheet: CI for Popn. Variance]
When samples are drawn from a normal population, the (1 − α) confidence interval for σ² is given
by
$$ \left[ \frac{(n-1)s^2}{\chi^2_{\alpha/2}},\ \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}} \right] $$
where χ²_{α/2} and χ²_{1−α/2} are the chi-square values with areas of α/2 and (1 − α/2) to their right, and the chi-square has (n − 1) degrees of freedom. Figure 7.4.1 shows the sheet named “CI for Popn.
Variance” in Estimation.xls which implements the above formula. After the input data are entered in the
range B3:B5, the confidence intervals appear in the range D7:H9.
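The interval for σ² can be sketched in Python as below. The standard library has no chi-square quantile function, so the two chi-square values are passed in; in this example they are read from a standard chi-square table for 19 degrees of freedom. The function name and the sample figures are hypothetical.

```python
def variance_ci(s2, n, chi2_upper, chi2_lower):
    """CI for the population variance.

    chi2_upper has area alpha/2 to its right, chi2_lower has area 1 - alpha/2
    to its right; both must use n - 1 degrees of freedom.
    """
    df = n - 1
    return df * s2 / chi2_upper, df * s2 / chi2_lower

# Hypothetical sample: n = 20, sample variance 4.0, 95% confidence.
# Tabulated chi-square values for 19 df: 32.852 (area 0.025 to the right)
# and 8.907 (area 0.975 to the right).
low, high = variance_ci(s2=4.0, n=20, chi2_upper=32.852, chi2_lower=8.907)
print(f"95% CI for the variance: [{low:.3f}, {high:.3f}]")
```

Note that the interval is not symmetric about the sample variance, because the chi-square distribution is positively skewed.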
7.5 Sample Size Determination
An important practical decision in confidence interval estimation is how large the sample should
be. Too small a sample will yield too wide an interval; too large a sample will require too much effort. The
objective in sample size determination is to find the smallest sample that will yield the desired width for the
confidence interval.
In the case of population mean, we have seen that the half-width for the confidence interval
(denote it by B) is given by B = z_{α/2}σ/√n. Rearranging this to bring n to the left hand side yields the
formula for minimum n as,
$$ \text{Minimum } n = \left\lceil \left( \frac{z_{\alpha/2}\,\sigma}{B} \right)^{2} \right\rceil $$
where the symbol ⌈ ⌉ means rounding up. For example, ⌈48.2⌉ = 49. Rounding up is necessary since n
cannot be a fraction. Note that the formula for minimum n calls for σ. Since the population mean itself is
not known, it is unlikely that σ will be known. One should make a guess, or use past experience with
similar data to enter a reasonably good value for σ. Figure 7.5.1 shows the sheet named “Sample Size
Calc.” in Estimation.xls. When the input data are entered in the range B5:B7, the minimum sample size
appears in cell B9. The currently input values are from the Car Mileage case of Bowerman/O'Connell. The
answer 25 found in cell B10 differs from the one in the textbook, because the textbook uses t_{α/2} in place of
z_{α/2} to take care of the uncertainty in population standard deviation (guessed using a small preliminary
sample).
Figure 7.5.1. Minimum Sample Size Calculation
[Workbook: Estimation.xls; Sheet: Sample Size Calc.]
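The minimum sample size for a mean can be sketched in Python as follows. The function name and the input values are hypothetical; they are not the Car Mileage case data.

```python
import math
from statistics import NormalDist

def min_n_for_mean(sigma, B, confidence=0.95):
    """Smallest n whose half-width z * sigma / sqrt(n) does not exceed B."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return math.ceil((z * sigma / B) ** 2)

# Hypothetical inputs: guessed sigma of 10, desired half-width of 2.
print(min_n_for_mean(sigma=10.0, B=2.0))  # (1.96 * 10 / 2)^2 is about 96.04, so 97
```

Because n grows with the square of σ/B, halving the desired half-width roughly quadruples the required sample size.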
In the case of population proportion, the half-width B = z_{α/2}√(p(1 − p)/n). Rearranging this to get n
on the left hand side yields,
$$ \text{Minimum } n = \left\lceil p(1 - p) \left( \frac{z_{\alpha/2}}{B} \right)^{2} \right\rceil . $$
In the template shown in Figure 7.5.1, when the input data are entered in the range E5:E7 the minimum
sample size appears in cell E9.
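A matching Python sketch for proportions is given below. The function name and the guessed proportion are hypothetical.

```python
import math
from statistics import NormalDist

def min_n_for_proportion(p, B, confidence=0.95):
    """Smallest n whose half-width z * sqrt(p(1 - p)/n) does not exceed B."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return math.ceil(p * (1 - p) * (z / B) ** 2)

# Hypothetical inputs: guessed proportion 0.30, desired half-width 0.05.
print(min_n_for_proportion(p=0.30, B=0.05))  # 0.21 * (1.96 / 0.05)^2 is about 322.7, so 323
```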
7.6 Survey Sampling
7.6.1 Stratified Sampling
Often, the population consists of strata that have their own individual characteristics in terms of
their mean, variance or cost of sampling. By spreading the sample appropriately over the different strata,
we can increase accuracy and/or decrease cost.
Figure 7.6.1 shows the sheet named “Stratified for mu” which can be used for estimating
population mean with a stratified sample. Let the notations be:
k is the number of strata
N_i is the size of the i-th stratum
n_i is the size of the sample from the i-th stratum
N and n are the respective sums
W_i is the population proportion of the i-th stratum given by N_i/N
f_i is the sample proportion for the i-th stratum given by n_i/n
x̄_i is the mean of the sample from the i-th stratum
s_i² is the variance of the sample from the i-th stratum.
With these notations, the estimate of population mean X̄ and its variance S²(X̄) are given by
$$ \bar{X} = \sum_{i=1}^{k} W_i \bar{x}_i $$
$$ S^2(\bar{X}) = \sum_{i=1}^{k} W_i^2 \, \frac{s_i^2}{n_i} \left( 1 - \frac{n_i}{N_i} \right) $$
where n_i/N_i is the fraction of the i-th stratum that is sampled.
These formulas have been employed in the template. The confidence interval has been calculated assuming
large samples.
Figure 7.6.1. Stratified Sampling for Population Mean
[Workbook: Estimation.xls; Sheet: Stratified for mu]
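The stratified estimate of the mean can be sketched in Python as below. This is an illustration under stated assumptions, not the template's implementation: the function name and the data are hypothetical, and the variance uses the stratum sampling fraction n_i/N_i as the finite population factor (1 − n_i/N_i).

```python
def stratified_mean(N_sizes, n_sizes, means, variances):
    """Return (estimate of the population mean, estimated variance of that estimate)."""
    N = sum(N_sizes)
    estimate = sum((Ni / N) * xi for Ni, xi in zip(N_sizes, means))
    variance = sum(
        (Ni / N) ** 2 * s2 / ni * (1 - ni / Ni)  # W_i^2 * s_i^2 / n_i * (1 - n_i / N_i)
        for Ni, ni, s2 in zip(N_sizes, n_sizes, variances)
    )
    return estimate, variance

# Hypothetical population with two strata.
est, var = stratified_mean(
    N_sizes=[600, 400], n_sizes=[60, 40], means=[10.0, 20.0], variances=[4.0, 9.0]
)
print(f"estimate = {est:.3f}, variance = {var:.4f}")
```

A large-sample confidence interval then follows as the estimate ± z_{α/2} times the square root of the variance.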
In the case of population proportion, let p_i denote the proportion in the sample from the i-th stratum. The
estimate P of the population proportion and its variance S²(P) are given by
$$ P = \sum_{i=1}^{k} W_i p_i $$
$$ S^2(P) = \sum_{i=1}^{k} W_i^2 \, \frac{p_i (1 - p_i)}{n_i} $$
If the strata are small, a finite population correction is necessary. With that correction, the
formula is
$$ S^2(P) = \frac{1}{N^2} \sum_{i=1}^{k} N_i^2 \, \frac{N_i - n_i}{N_i - 1} \, \frac{p_i (1 - p_i)}{n_i} $$
These formulas have been implemented in the sheet named “Stratified for p” in Estimation.xls. It is shown
in Figure 7.6.2.
Figure 7.6.2. Stratified Sampling for Population Proportion
[Workbook: Estimation.xls; Sheet: Stratified for p]
7.6.2 Optimum Allocation
When the unit sampling cost for each stratum, along with the size and variance of each stratum, is
known, it is possible to select n_i values that maximize accuracy for a given cost, or minimize cost for a given accuracy. This method of
deciding the optimal n_i values is called optimum allocation.
Figure 7.6.3 shows the sheet named “Allocation” in Estimation.xls. The formula for optimal f_i is
$$ f_i = \frac{W_i \sigma_i / \sqrt{C_i}}{\sum_{j=1}^{k} W_j \sigma_j / \sqrt{C_j}} $$
where σ_i is the standard deviation of the i-th stratum and C_i is the cost of sampling one item from the i-th stratum. The optimal n_i values are then calculated
using n_i = n f_i where n is the desired total sample size.
Figure 7.6.3. Optimum Allocation
[Workbook: Estimation.xls; Sheet: Allocation]
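The allocation formula can be sketched in Python as follows. The function name and the stratum data are hypothetical, and the n_i values are returned unrounded; how to round them to whole numbers is left open here.

```python
import math

def optimum_allocation(N_sizes, sigmas, costs, n):
    """Return (allocation fractions f_i, sample sizes n_i = n * f_i)."""
    N = sum(N_sizes)
    # W_i * sigma_i / sqrt(C_i) for each stratum.
    terms = [(Ni / N) * s / math.sqrt(c) for Ni, s, c in zip(N_sizes, sigmas, costs)]
    total = sum(terms)
    fractions = [t / total for t in terms]
    return fractions, [n * f for f in fractions]

# Hypothetical strata: sizes, standard deviations and unit sampling costs.
fractions, sizes = optimum_allocation(
    N_sizes=[500, 300, 200], sigmas=[10.0, 20.0, 30.0], costs=[4.0, 9.0, 16.0], n=120
)
print([round(f, 4) for f in fractions])
print([round(s, 2) for s in sizes])
```

A stratum receives a larger share of the sample when it is larger, more variable, or cheaper to sample.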
7.6.3 Cluster Sampling
Suppose we want to sample the students at a university. A convenient way is to randomly select a
class, go to that class and sample the whole class. If a desired number of classes (without enrollment
overlap) are sampled, we then have a fairly representative sample. This method is known as cluster
sampling where each class is a cluster. Sampling all the elements in a cluster at once can save time and
money.
Let the notations be
n_i = the size of the i-th cluster
m = number of clusters sampled
M = the total number of clusters in the population
n̄ = average cluster size = Σ n_i / m
x̄_i = average of the i-th cluster.
With these notations, the formulas for the estimate of population mean X̄ and its variance S²(X̄)
are
$$ \bar{X} = \frac{\sum_{i=1}^{m} n_i \bar{x}_i}{\sum_{i=1}^{m} n_i} $$
$$ S^2(\bar{X}) = \frac{M - m}{M m \bar{n}^2} \cdot \frac{\sum_{i=1}^{m} n_i^2 (\bar{x}_i - \bar{X})^2}{m - 1} $$
These formulas have been implemented in the template shown in Figure 7.6.4.
Figure 7.6.4. Cluster Sampling for Population Mean
[Workbook: Estimation.xls; Sheet: Cluster for mu]
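The cluster-sample estimate of the mean can be sketched in Python as below. The function name and the cluster data are hypothetical; the code follows the estimate Σ n_i x̄_i / Σ n_i and the variance with the factor (M − m)/(M m n̄²).

```python
def cluster_mean(sizes, means, M):
    """Return (estimate of the population mean, estimated variance of that estimate)."""
    m = len(sizes)
    n_bar = sum(sizes) / m  # average cluster size
    estimate = sum(ni * xi for ni, xi in zip(sizes, means)) / sum(sizes)
    spread = sum(ni ** 2 * (xi - estimate) ** 2 for ni, xi in zip(sizes, means)) / (m - 1)
    variance = (M - m) / (M * m * n_bar ** 2) * spread
    return estimate, variance

# Hypothetical survey: 3 clusters sampled out of M = 30.
est, var = cluster_mean(sizes=[10, 20, 30], means=[5.0, 6.0, 7.0], M=30)
print(f"estimate = {est:.4f}, variance = {var:.4f}")
```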
Figure 7.6.5. Cluster Sampling for Population Proportion
[Workbook: Estimation.xls; Sheet: Cluster for p]
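The case of cluster sampling for population proportion is quite similar. With p_i denoting the proportion in the i-th cluster, the formulas for the
estimate P and its variance are
$$ P = \frac{\sum_{i=1}^{m} n_i p_i}{\sum_{i=1}^{m} n_i} $$
$$ S^2(P) = \frac{M - m}{M m \bar{n}^2} \cdot \frac{\sum_{i=1}^{m} n_i^2 (p_i - P)^2}{m - 1} $$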
These formulas have been implemented in the sheet named “Cluster for P” in Estimation.xls, which is
shown in Figure 7.6.5.
7.6.4 Systematic Sampling
In systematic sampling every k-th item in the population is sampled, where k is selected suitably.
If the data is already entered on a spreadsheet, such sampling can be carried out using the Sampling
command under the Data Analysis command of the Tools menu. The advantage is that the sample is
spread over the population evenly, which may increase the accuracy of the estimate.
In this method, the sample mean is the estimate of the population mean. The variance of the
estimate is given by
$$ S^2(\bar{X}) = \frac{N - n}{N n}\, s^2 $$
where s² is the sample variance. This formula applies when the population is not arranged in any particular order. This is what has
been implemented in the sheet named “Systematic” in Estimation.xls, shown in Figure 7.6.6. When the
population has been ordered, say, in ascending order, more complicated formulas are applicable. Such
formulas are not presented here.
Figure 7.6.6. Systematic Sampling
[Workbook: Estimation.xls; Sheet: Systematic]
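Systematic sampling and the variance formula for an unordered population can be sketched in Python as follows. The function names are hypothetical, the random starting position is an assumption of this sketch, and the population is made up for illustration.

```python
import random
from statistics import mean, variance

def systematic_sample(population, k, start=None):
    """Take every k-th item, beginning at index `start` (random if not given)."""
    if start is None:
        start = random.randrange(k)
    return population[start::k]

def systematic_estimate(sample, N):
    """Return (sample mean, (N - n)/(N n) * s^2) for a population of size N."""
    n = len(sample)
    return mean(sample), (N - n) / (N * n) * variance(sample)

# Hypothetical unordered population of 200 values; sample every 10th item.
random.seed(1)
population = [random.gauss(50, 5) for _ in range(200)]
sample = systematic_sample(population, k=10)
est, var = systematic_estimate(sample, N=len(population))
print(f"n = {len(sample)}, estimate = {est:.2f}, variance = {var:.4f}")
```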
7.7 Exercises
1. The lengths of pins produced by an automatic lathe are normally distributed. A random sample of 20
pins gives a sample mean of 0.992” and a sample standard deviation of 0.013”.
i. Give a 95% confidence interval for average lengths of all pins produced.
ii. If it is claimed that the lathe has been set to have μ = 1 ± 0.002”, could you reject it with 95%
confidence?
iii. Give a 99% confidence interval for the average lengths of all pins produced.
2. You take a random sample of 100 pins from the lot supplied by a vendor, and test them. You find 3 of
them defective.
i. What is the 95% confidence interval for % defective in the lot?
ii. If the vendor claims that the lot contains not more than 5% defectives, can the claim be rejected
with 95% confidence?
iii. What is the maximum % defective in the lot, with 99% confidence?
3. Solve exercises 7-6 to 7-12 and 7-20 to 7-22 in Bowerman/O'Connell. [Note: It is better to go by
whether the population standard deviation σ is known or unknown rather than whether the sample is small
or large.]
4. Solve exercises 7-39 to 7-50 in the textbook.
5. Solve exercises 7-70 to 7-72 in the textbook.
7.8 Projects
1. It is desired to estimate the average length of pins produced by an automatic lathe to within 0.002” with
95% confidence level. σ is guessed to be 0.015”.
i. What is the minimum sample size?
ii. If the value of σ may be anywhere between 0.010” and 0.018”, tabulate the minimum sample
size required for σ values from 0.010” to 0.018” in steps of 0.002”.
iii. If the cost of sampling and testing n pins is (25 + 6*n) dollars, tabulate the costs for the same σ
range as in question ii. above.
2. It is desired to estimate the % defective in a lot of pins supplied by a vendor to within 1% with 90%
confidence level. The actual % defective is guessed to be 4%.
i. What is the minimum sample size?
ii. If the actual % defective may be anywhere between 3% and 6%, tabulate the minimum sample
size required for actual % defective from 3% to 6% in steps of 0.5%.
iii. If the cost of sampling and testing n pins is (25 + 6n) dollars, tabulate the costs for the same %
defective range as in question ii. above.
3. A company wants to conduct a telephone poll to estimate the % of voters who favor a particular
candidate in a presidential election, to within 2% with 95% confidence. It is guessed that the proportion is
53%.
i. What is the minimum sample size?
ii. The actual proportion may be anywhere from 40% to 60%. Construct a 2-dimensional table for
the minimum sample size required with half-width ranging from 1% to 3% in steps of 1% along the rows,
and actual proportion ranging from 40% to 60% in steps of 5% along the columns.
iii. Inspect the table produced in question ii. above. Comment on the relative sensitivity of the
minimum sample size to the actual proportion and to the desired half-width. Also find the worst case value
for the actual proportion.
iv. If the cost of polling n people is (250 + 0.6n), tabulate the cost as in question ii. above.