Chapter 9 Estimation Using a Single Sample

In many practical problems, we want to estimate some population characteristic, for example, the population mean μ, the population standard deviation σ, or the proportion π of S's in a population. In this chapter, we introduce two estimation techniques: point estimation and interval estimation.

9.1 Point Estimation

Definition 9.1: A point estimate of a population characteristic is a single number that is computed from sample data and represents a plausible value of the characteristic.

Note:
(1) The adjective "point" reflects the fact that the estimate corresponds to a single point on the number line.
(2) A point estimate is obtained by (i) selecting an appropriate statistic and (ii) computing the value of the statistic for the given sample. For example, the computed value of the sample mean x̄ provides a point estimate of a population mean μ.

Sometimes several statistics can reasonably be used to obtain a point estimate of a specified population characteristic. For example, to obtain a point estimate of a population mean μ, we can use the sample mean x̄, a trimmed mean, or the sample median. Which one should we choose for computing an estimate?

Criteria for choosing among competing statistics

Generally, we choose the statistic that tends, on average, to produce an estimate closest to the true value, that is, the most accurate estimate. Information about the accuracy of estimation for a particular statistic is provided by the statistic's sampling distribution.

(a) If a statistic whose sampling distribution is centered to the right of the true value is used to compute an estimate, the estimate will tend to be larger than the true value.
(b) If a statistic whose sampling distribution is centered to the left of the true value is used to compute an estimate, the estimate will tend to be smaller than the true value.
(c) When a statistic whose sampling distribution is centered at the true value is used to compute an estimate, there will be no long-run tendency to over- or underestimate the true value.

Definition 9.2: A statistic whose mean is equal to the value of the population characteristic being estimated is said to be an unbiased statistic. A statistic that is not unbiased is said to be biased.

Questions: (1) Is x̄ an unbiased statistic for estimating μ? (2) Is p an unbiased statistic for estimating a population proportion π?

Generally, we prefer an unbiased statistic. If several unbiased statistics could be used for estimating a population characteristic, the best statistic to use is the one with the smallest standard deviation.

Consider the problem of estimating a population mean μ. When the population distribution is symmetric, the sample mean x̄, the sample median, and any trimmed mean are all unbiased statistics. Which statistic should be used? The following facts are helpful in making a choice.

1. If the population distribution is normal, then x̄ has a smaller standard deviation than any other unbiased statistic for estimating μ.
2. When the population is symmetric with heavier tails than the normal curve, a trimmed mean has a smaller standard deviation than x̄.

Thus, when the population distribution is normal, we should use x̄ to estimate μ. With a heavy-tailed distribution, however, a trimmed mean is a better statistic than x̄ for estimating μ.

Generally, we estimate a population characteristic by the corresponding sample characteristic, as summarized in the following table.

Population characteristic to be estimated    Statistic to use    Unbiasedness
Population proportion, π                     p                   Unbiased
Population mean, μ                           x̄                   Unbiased
Population variance, σ²                      s²                  Unbiased
Population standard deviation, σ             s                   Biased

Table 9.1 Statistics used to estimate some important population characteristics

Note: s is not an unbiased statistic for estimating σ. It tends to underestimate slightly the true value of σ.
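The slight downward bias of s, alongside the unbiasedness of s², can be seen in a small simulation. The following is only a sketch under assumed settings that do not come from the notes: a normal population with σ = 1, samples of size n = 5, and 20,000 repetitions. The function name simulate_s_bias is hypothetical.

```python
import random


def simulate_s_bias(n=5, sigma=1.0, reps=20000, seed=0):
    """Return (average s^2, average s) over many simulated normal samples.

    Assumed setup for illustration only: normal population with mean 0
    and standard deviation sigma, sample size n, reps repetitions.
    """
    rng = random.Random(seed)
    total_var = 0.0
    total_sd = 0.0
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(sample) / n
        # Sample variance s^2 uses the divisor n - 1.
        var = sum((x - mean) ** 2 for x in sample) / (n - 1)
        total_var += var
        total_sd += var ** 0.5
    return total_var / reps, total_sd / reps


mean_var, mean_sd = simulate_s_bias()
print(f"average s^2 = {mean_var:.3f}  (true variance = 1)")
print(f"average s   = {mean_sd:.3f}  (true standard deviation = 1)")
```

In runs of this kind the average of s² should land close to σ², while the average of s should fall noticeably below σ for such a small n. The gap shrinks as n grows.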
Since unbiasedness is not the only criterion for judging a statistic, and there are other good reasons for using s to estimate σ, we will use s to estimate σ in the following discussion.

9.2 A Large-Sample Confidence Interval for a Population Proportion

In Section 9.1, we saw how to use a statistic to produce a point estimate of a population characteristic. However, because of sampling variability, the point estimate from a sample is rarely exactly equal to the true value of the population characteristic. Although a point estimate may represent our best single-number guess for the value of the population characteristic, it is not the only plausible value. Thus we need to indicate in some way how precisely the population characteristic has been estimated. A point estimate by itself does not provide this information.

As an alternative to a point estimate, we report an interval of reasonable values based on the sample data. Then we can have some "confidence" in the interval estimate.

Definition 9.3: A confidence interval for a population characteristic is an interval of plausible values for the characteristic. It is constructed so that, with a chosen degree of confidence, the value of the characteristic will be captured inside the interval.

Definition 9.4: The confidence level associated with a confidence interval estimate is the success rate of the method used to construct the interval.

Note: The confidence level provides information on how much "confidence" we can have in the method used to construct the interval, not our confidence in any one particular interval. See Figure 9.4 on page 381 for interpretation.

We first consider a large-sample confidence interval for a population proportion. Let
π = proportion of individuals in the population that possess the property of interest,
p = (number of individuals in the sample that possess the property of interest) / n, the sample proportion.
We know that the sampling distribution of the statistic p has the following properties:

(1) The sampling distribution of p is centered at π; that is, μ_p = π. Therefore, p is an unbiased statistic for estimating π.
(2) The standard deviation of p is σ_p = √(π(1 - π)/n).
(3) When nπ ≥ 10 and n(1 - π) ≥ 10, the sampling distribution of p is approximately normal with mean π and standard deviation √(π(1 - π)/n).

The development of a confidence interval for π with confidence level 95%

(i) Use Appendix Table 2 to determine a value z* such that P(-z* < z < z*) = 0.95. This gives z* = 1.96.

(ii) Since

-1.96 < (p - π)/√(π(1 - π)/n) < 1.96

is equivalent to

p - 1.96√(π(1 - π)/n) < π < p + 1.96√(π(1 - π)/n),

we have

P(p - 1.96√(π(1 - π)/n) < π < p + 1.96√(π(1 - π)/n)) = P(-1.96 < (p - π)/√(π(1 - π)/n) < 1.96) ≈ 0.95.

This implies that in repeated sampling, about 95% of the time the interval from p - 1.96√(π(1 - π)/n) to p + 1.96√(π(1 - π)/n) will contain π.

(iii) Since π is unknown, the value of √(π(1 - π)/n) must be estimated. When the sample size is large, √(p(1 - p)/n) should be close to √(π(1 - π)/n) and can be used in its place.

Thus when n is large, a 95% confidence interval for π is

(p - 1.96√(p(1 - p)/n), p + 1.96√(p(1 - p)/n)).

An abbreviated formula for the interval is

p ± 1.96√(p(1 - p)/n),

where + gives the upper endpoint of the interval and - gives the lower endpoint. The interval can be used as long as np ≥ 10 and n(1 - p) ≥ 10.

The formula given for a 95% confidence interval can easily be adapted for other confidence levels.

The large-sample confidence interval for π

When
1. p is the sample proportion from a random sample, and
2. the sample size n is large (np ≥ 10 and n(1 - p) ≥ 10),
the general formula for a confidence interval for a population proportion π is

p ± (z critical value)√(p(1 - p)/n).

The desired confidence level determines the z critical value. The three most commonly used confidence levels, 90%, 95%, and 99%, use z critical values 1.645, 1.96, and 2.58, respectively.

Note: Some z critical values can be found in Appendix Table 3 on page 708.
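The general formula can be sketched in a few lines of Python. This is an illustration, not part of the notes: the function name proportion_ci and the data (120 S's in a sample of 200) are hypothetical, and the z critical value is computed from the standard normal distribution rather than read from the appendix table, so it is slightly more precise than the rounded values quoted above (for 99%, about 2.576 instead of 2.58).

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, level=0.95):
    """Large-sample confidence interval for a population proportion.

    Returns (lower, upper) for p +/- z * sqrt(p(1 - p)/n).
    """
    # Large-sample check: np >= 10 and n(1 - p) >= 10, written with counts
    # to avoid floating-point rounding at the boundary.
    if successes < 10 or n - successes < 10:
        raise ValueError("need np >= 10 and n(1 - p) >= 10")
    p = successes / n
    # z critical value: central area `level` under the standard normal curve.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin


# Hypothetical data: 120 S's in a random sample of n = 200.
low, high = proportion_ci(120, 200, level=0.95)
print(f"95% CI: ({low:.4f}, {high:.4f})")
```

Raising the confidence level, shrinking n, or moving p toward 0.5 each makes the margin larger, which is a quick way to explore how those three factors affect the width of the interval.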
Exercise in class: Discuss how each of the following factors affects the width of the confidence interval for π:
(1) The confidence level;
(2) The sample size n;
(3) The value of p.

Generally, higher reliability of an interval (where "reliability" is specified by the confidence level) entails a loss in precision (as indicated by a wider interval). For example, the width of the 99% interval is 2(2.58√(p(1 - p)/n)), which is greater than the width of the 95% interval, 2(1.96√(p(1 - p)/n)). In the opinion of many investigators, a 95% interval gives a reasonable compromise between reliability and precision.

The general form of a confidence interval

Many confidence intervals have the same general form as the large-sample interval for π:

1. If the standard deviation of the statistic is known:
(point estimate using a specified statistic) ± (critical value)(standard deviation of the statistic)
2. If the standard deviation of the statistic is unknown:
(point estimate using a specified statistic) ± (critical value)(estimated standard deviation of the statistic)

Definition 9.5: The standard error of a statistic is the estimated standard deviation of the statistic.

Choosing the sample size

Definition 9.6: If the sampling distribution of a statistic is (approximately) normal, the bound on error of estimation, B, associated with a confidence interval is (z critical value)(standard deviation of the statistic).

When we use p to construct a 95% confidence interval for π, the bound is

B = 1.96√(π(1 - π)/n).

Sometimes we may wish to determine a sample size such that a particular value of the bound B is achieved. Solving the equation B = 1.96√(π(1 - π)/n) for n, we obtain

n = π(1 - π)(1.96/B)².

Generally, the sample size required to estimate a population proportion π to within an amount B with a given confidence level is

n = π(1 - π)((z critical value)/B)².

The value of π may be estimated using prior information. In the absence of any such information, using π = .5 in this formula gives a conservatively large value for the required sample size.
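The sample-size formula translates directly into code. The sketch below is illustrative only: the function name sample_size_for_proportion and the inputs (B = 0.05, a prior guess of π = 0.2) are assumptions, not values from the notes. The result is rounded up because n must be a whole number at least as large as the formula's value.

```python
from math import ceil


def sample_size_for_proportion(bound, z, pi=0.5):
    """Smallest n with z * sqrt(pi(1 - pi)/n) <= bound.

    pi defaults to 0.5, the conservative choice when no prior
    information about the population proportion is available.
    """
    return ceil(pi * (1 - pi) * (z / bound) ** 2)


# Hypothetical target: estimate pi to within B = 0.05 with 95% confidence.
print(sample_size_for_proportion(0.05, 1.96))          # conservative, pi = 0.5
print(sample_size_for_proportion(0.05, 1.96, pi=0.2))  # with a prior guess pi = 0.2
```

Comparing the two printed values shows how much a credible prior estimate of π can reduce the required sample size relative to the conservative choice.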
Note: For any π,

(π - 1/2)² ≥ 0
⟹ π² - π + 1/4 ≥ 0
⟹ π(1 - π) ≤ 1/4
⟹ π(1 - π)(1.96/B)² ≤ (1/4)(1.96/B)².

Example 9.1 A survey designed to obtain information on π = the proportion of registered voters who are in favor of a constitutional amendment requiring a balanced budget results in a sample of size n = 400. Of the 400 voters sampled, 272 are in favor of a constitutional amendment requiring a balanced budget.

a) Give a point estimate of π.
b) Determine the estimated standard deviation of your estimate in part a).
c) Calculate a 99% confidence interval for π and interpret the confidence interval.
d) Based on this confidence interval, do the majority of registered voters favor the constitutional amendment?
e) How large would n have needed to be in order to have estimated π to within .03 with 95% confidence?

a) The point estimate of π is p = 272 / 400 = 0.68.

b) The estimated standard deviation of p is

√(p(1 - p)/n) = √(0.68(1 - 0.68)/400) = .0233.

c) Since np = 400 × 0.68 = 272 > 10 and n(1 - p) = 400(1 - 0.68) = 128 > 10, we can use the formula for a large-sample confidence interval to obtain a 99% confidence interval for π:

p ± (z critical value)√(p(1 - p)/n) = 0.68 ± 2.58√(0.68(1 - 0.68)/400) = 0.68 ± 2.58 × 0.0233 = 0.68 ± 0.0601 = (0.6199, 0.7401).

We are 99% confident that π is between 0.6199 and 0.7401.

d) Yes, since the entire interval is above 0.5.

e) Using a conservative value of π = .5 in the formula for required sample size gives

n = π(1 - π)(1.96/B)² = 0.5(1 - 0.5)(1.96/0.03)² = 1067.11.

Thus, n would need to be at least 1068 in order to estimate π to within .03 with 95% confidence.

Question: Are the following statements correct?
(1) Since (0.6199, 0.7401) is a 99% confidence interval for π, P((0.6199, 0.7401) contains π) = 99%.
(2) If the process of selecting a sample of size 400 and then computing the corresponding 99% confidence interval is repeated 100 times, 99 of the resulting intervals will include π.
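The "success rate" reading of the confidence level (Definition 9.4) can be explored with a simulation. The sketch below is an illustration under stated assumptions, not a result from the notes: it pretends the true proportion is π = 0.68, repeatedly draws samples of size n = 400, builds the 99% interval each time with z = 2.58, and records how often the interval captures π. The function name coverage_rate and the number of repetitions are hypothetical choices.

```python
import random
from math import sqrt


def coverage_rate(pi=0.68, n=400, z=2.58, reps=2000, seed=0):
    """Fraction of simulated large-sample intervals that contain pi.

    Assumption for illustration: the true population proportion is pi.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Draw one sample of size n and count the S's.
        successes = sum(rng.random() < pi for _ in range(n))
        p = successes / n
        margin = z * sqrt(p * (1 - p) / n)
        if p - margin < pi < p + margin:
            hits += 1
    return hits / reps


print(f"observed coverage: {coverage_rate():.3f}")
```

The observed coverage should sit near the nominal level but changes with the seed, which suggests that the confidence level describes the long-run behavior of the method rather than a fixed count in any particular set of repetitions.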
9.3 A Confidence Interval for a Population Mean

In this section, we consider how to use information from a random sample to construct a confidence interval estimate for a population mean.

Recall the four properties of the sampling distribution of x̄:

1. The mean of x̄ is μ_x̄ = μ.
2. The standard deviation of x̄ is σ_x̄ = σ/√n.
3. When the population distribution is normal, the sampling distribution of x̄ is also normal.
4. When n is sufficiently large (generally n ≥ 30), the sampling distribution of x̄ is approximately normal.

The one-sample z confidence interval for μ

When
1. x̄ is the sample mean of a random sample from a population,
2. the population distribution is normal OR the sample size n is large (generally n ≥ 30), and
3. the population standard deviation σ is known,
the formula for a confidence interval for the population mean μ is

x̄ ± (z critical value)(σ/√n).

Example 9.2 The McClatchy News Service reported on a sample of prime-time television hours. The following table summarizes the information reported for two networks.

Network    Mean Number of Violent Acts per Hour
ABC        15.6
FOX        11.7

Suppose that each of these sample means was computed on the basis of viewing n = 50 randomly selected prime-time hours and that the population standard deviation for each of the two networks is known to be σ = 5.

a) Compute a 95% confidence interval for μ_ABC, the true mean number of violent acts per prime-time hour for ABC.
b) Compute a 95% confidence interval for μ_FOX, the true mean number of violent acts per prime-time hour for FOX.
c) The National Coalition on Television Violence claims that shows on ABC are more violent than shows on FOX. Based on the confidence intervals from parts a) and b), do you agree with this conclusion? Explain.

Since n = 50 > 30 and σ = 5 is known, we can use the one-sample z confidence interval formula.
a) The 95% confidence interval for μ_ABC is

x̄_ABC ± (z critical value)(σ/√n) = 15.6 ± (1.96)(5/√50) = 15.6 ± 1.39 = (14.21, 16.99).

b) The 95% confidence interval for μ_FOX is

x̄_FOX ± (z critical value)(σ/√n) = ? ± ? × ? = ? ± ? = (?, ?).

c) Yes, because the plausible values for μ_ABC are at least 14.21, while the plausible values for μ_FOX are no greater than 13.09.

The one-sample t confidence interval for μ

Let us look at the development of the 95% confidence interval for μ when σ is known. When the population distribution is normal, the sampling distribution of x̄ is normal. Thus,

z = (x̄ - μ)/(σ/√n)

has the standard normal distribution. Since

-1.96 < (x̄ - μ)/(σ/√n) < 1.96

is equivalent to

x̄ - 1.96(σ/√n) < μ < x̄ + 1.96(σ/√n),

we have

P(x̄ - 1.96(σ/√n) < μ < x̄ + 1.96(σ/√n)) = P(-1.96 < (x̄ - μ)/(σ/√n) < 1.96) = 0.95.

Then a 95% confidence interval for μ is x̄ ± 1.96(σ/√n).

When σ is unknown, we must use the sample data to estimate σ. A natural estimate of σ is s. Now we use

t = (x̄ - μ)/(s/√n).

To use t to develop a confidence interval for μ, we must know the probability distribution of t.

Let x1, x2, ..., xn be a random sample from a normal population distribution. Then the probability distribution of the standardized variable

t = (x̄ - μ)/(s/√n)

is the t distribution with n - 1 df.

When
1. x̄ is the sample mean of a random sample from a population,
2. the population distribution is normal OR the sample size n is large (generally n ≥ 30), and
3. the population standard deviation σ is unknown,
the formula for a confidence interval for the population mean μ is

x̄ ± (t critical value)(s/√n),

where the t critical value is based on n - 1 df and can be found in Appendix Table 3 on page 708.

Note: Appendix Table 3 jumps from 30 df to 40 df, then 60 df, then 120 df, and finally to the row of z critical values. If we need a critical value for a number of degrees of freedom between those tabulated, we just use the critical value for the closest df. For df > 120, we use the z critical values.
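The table-lookup rule and the t interval can be sketched together. This is a minimal illustration, not the appendix table itself: it hard-codes only the standard two-decimal 95% critical values for the sparse rows (30, 40, 60, and 120 df), which are assumed to match Appendix Table 3, and it does not cover df below 30. The names T_95, t_critical_95, and t_interval_95 and the sample summary (x̄ = 20.0, s = 4.0, n = 64) are hypothetical.

```python
from math import sqrt

# 95% t critical values for the sparse rows of a typical t table.
# Assumption: these two-decimal values match Appendix Table 3.
T_95 = {30: 2.04, 40: 2.02, 60: 2.00, 120: 1.98}
Z_95 = 1.96  # z critical value, used for df > 120


def t_critical_95(df):
    """95% t critical value using the 'closest tabulated df' rule."""
    if df < 30:
        raise ValueError("this sketch only covers df >= 30; use the full table")
    if df > 120:
        return Z_95
    # Pick the tabulated row closest to df. On a tie (for example df = 50)
    # the smaller df is used, which gives the slightly wider interval.
    closest = min(T_95, key=lambda row: abs(row - df))
    return T_95[closest]


def t_interval_95(xbar, s, n):
    """95% one-sample t interval: xbar +/- t * s / sqrt(n), with n - 1 df."""
    margin = t_critical_95(n - 1) * s / sqrt(n)
    return xbar - margin, xbar + margin


# Hypothetical sample summary: n = 64, sample mean 20.0, sample sd 4.0.
low, high = t_interval_95(20.0, 4.0, 64)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```

With statistical software available, an exact critical value for the actual df would normally replace the nearest-row lookup. The lookup is shown here only because it mirrors the procedure described in the note.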
Example 9.3 A medical researcher from the National Institute of Health has collected samples on the life expectancies of people who are long-time smokers and those who are nonsmokers. The sample data are summarized in the table below.

Group         Sample Size    Sample Mean    Sample Standard Deviation
Smokers       50             67.6           5
Nonsmokers    60             74.5           3.5

a) Compute a 95% confidence interval for the mean life expectancy of a smoker.
b) Compute a 95% confidence interval for the mean life expectancy of a nonsmoker.
c) Do the confidence intervals in parts a) and b) provide convincing evidence that nonsmokers live longer on average than do smokers? Explain.

a) Since n1 = 50 > 30 and σ is unknown, we can use the one-sample t confidence interval formula. With 50 - 1 = 49 df, the closest tabulated df is 40, which gives a t critical value of 2.02.

x̄_S ± (t critical value)(s1/√n1) = 67.6 ± 2.02(5/√50) = 67.6 ± 1.4284 = (66.1716, 69.0284)

b) Since n2 = 60 > 30 and σ is unknown, we can use the one-sample t confidence interval formula with 60 - 1 = 59 df.

x̄_N ± (t critical value)(s2/√n2) = ? ± ? × ? = ? ± ? = (?, ?)

c) The confidence intervals in parts a) and b) do provide convincing evidence that nonsmokers live longer than long-time smokers, since the largest value in the confidence interval for smokers is roughly 4.5679 years less than the smallest value in the confidence interval for nonsmokers.

Choosing the sample size

When we use x̄ to construct a 95% confidence interval for μ, the bound on error of estimation is

B = 1.96(σ/√n).

Before collecting any data, an investigator may wish to determine a sample size for which a particular value of the bound is achieved. Solving B = 1.96(σ/√n) for n, we obtain

n = (1.96σ/B)².

Generally, we have the following result. The sample size required to estimate a population mean μ to within an amount B with a given confidence level is

n = ((z critical value)σ/B)².
If σ is unknown, it may be estimated based on previous information or, for a population that is not too skewed, by using (range)/4.

Example 9.4 The financial aid office wishes to estimate the mean cost of textbooks per semester for students at a university. For the estimate to be useful, it should be within $20 of the true population mean. How large a sample should be used to be 95% confident of achieving this level of accuracy?

To determine the required sample size, we must have a value for σ. The financial aid office is pretty sure that the amount spent on books varies widely, with most values between $50 and $450. A reasonable estimate of σ is then

(range)/4 = (450 - 50)/4 = 400/4 = 100.

The required sample size is

n = (1.96σ/B)² = ((1.96)(100)/20)² = (9.8)² = 96.04.

Rounding up, a sample size of 97 or larger is required.
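The calculation in Example 9.4 can be checked with a short script. This is only a sketch: the function names sigma_from_range and sample_size_for_mean are hypothetical, and the (range)/4 rule is the rough estimate described above, not a general-purpose estimator of σ.

```python
from math import ceil


def sigma_from_range(low, high):
    """Rough estimate of sigma as (range)/4, for a not-too-skewed population."""
    return (high - low) / 4


def sample_size_for_mean(bound, sigma, z=1.96):
    """Smallest n with z * sigma / sqrt(n) <= bound (rounded up)."""
    return ceil((z * sigma / bound) ** 2)


# Example 9.4: most textbook costs between $50 and $450, bound B = $20, 95% confidence.
sigma = sigma_from_range(50, 450)        # 100.0
print(sample_size_for_mean(20, sigma))   # 97
```

Because n grows with the square of σ/B, halving the bound B roughly quadruples the required sample size, which is worth checking before committing to a very tight bound.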