CHAPTER 5
STATISTICAL INFERENCE: ESTIMATION

1. Statistical Inference—an Introduction
2. Statistical Estimation
   2.1. Point Estimation
   2.2. Interval Estimation
3. Confidence Interval for the Population Mean
   3.1. Interval Estimate of μ
   3.2. The Margin of Error and Precision of a Confidence Interval
      3.2.1. The Impact of Changing the Error Probability α
      3.2.2. The Impact of Changes in σ
      3.2.3. The Impact of Changing the Sample Size n
   3.3. Confidence Interval for μ when the Population Standard Deviation is Unknown
      3.3.1. The t-Distribution
         3.3.1.1. How to Find the t Score Corresponding to a Given Tail Area
            3.3.1.1.1. Using the t-Table
            3.3.1.1.2. Using Excel
      3.3.2. The Option of Using z in Place of t for Large Sample Sizes
   3.4. Confidence Intervals for μ when the Population is not Normal
   3.5. Minimum Sample Size Needed to Estimate μ to Within a Desired Margin of Error
4. Confidence Interval for the Population Proportion π
   4.1. Minimum Sample Size Needed to Estimate π to Within a Desired Margin of Error

1. Statistical Inference—an Introduction

In the previous chapters it was shown that features of a data set can be described by summary characteristics that are measures of location (for example, the mean) or measures of dispersion (such as the standard deviation). If the data set represents a population, then these summary characteristics, the mean μ and the standard deviation σ, each define a population parameter. These parameters are constants whose values are most often unknown. If the data set represents a sample, then the summary characteristics, the mean x̄ and the standard deviation s, each define a sample statistic.

The parameters of a population data set provide very useful information about that population, so statistical analysis focuses on them. For example, a population may consist of the number of hours that light bulbs manufactured in a plant last. The manufacturer may be interested in the average life, the population parameter μ. Theoretically, the value of μ could be computed by measuring the lifetime of each unit in the population. Practically, however, this would not be feasible. Instead, a sample of light bulbs could be selected and their lifetimes used to estimate the value of μ. Here the sample mean x̄, a sample statistic, is used as the estimator of the population mean μ, the population parameter.

In many cases like the above it may be impossible or impractical to determine the value of a population parameter by analyzing every value in the population. The process of determining the parameter may destroy the population units, or it may simply be too expensive in money or resources to analyze each unit. In these cases we use statistical inference to obtain information about the values of the population parameters. The objective of statistical inference is to draw conclusions about a population based on a sample drawn from that population.

In this course we will consider statistical inference for two population parameters: the population mean μ and the population proportion π. There are two basic, closely related, ways to draw inferences about the value of these two parameters.

• The first way is to estimate the value of the parameter, as mentioned above. Here the idea is to select a sample statistic whose value will be used as an estimate of the population parameter. This process involves point estimation and interval estimation (building a confidence interval).
• The second way is called hypothesis testing. In this process we hypothesize a value for the population parameter and use the sample information to decide whether or not to reject the hypothesis.

In this chapter estimation is discussed. Hypothesis testing is explained in the next chapter.

2. Statistical Estimation

Statistical estimation involves two main categories: point estimation and interval estimation.

2.1. Point Estimation

In point estimation a single number computed from the sample data is used as an estimate of the population parameter. It is called a "point" estimate because one point on the real number line is used to estimate the population parameter. In interval estimation two points are used, defining an interval on the real number line which, we hope, will contain the value of the population parameter. For example, if the parameter is the population mean lifetime of light bulbs, based on sample information we might arrive at a point estimate of μ of, say, x̄ = 923 hours, and an interval estimate of μ of from 913 hours to 933 hours.

The point estimate is obtained as the mean of a single sample of size n randomly selected from the population. In the previous chapter it was shown that, because an infinite number of samples of a given size may be obtained from the population, the sample mean x̄ is a random variable with an infinite number of possible values. A given point estimate is only one of these values, and it may or may not be close to the population parameter. For example, given that the distribution of x̄ is normal, 95% of all x̄'s fall within ±1.96 standard errors of the population mean, that is,

P(μ − 1.96·se(x̄) ≤ x̄ ≤ μ + 1.96·se(x̄)) = 0.95

Therefore, one cannot tell where a single sample mean falls on the number line relative to the population mean. If we provide only a point estimate of the parameter μ, we have no information about the reliability of the estimation process.

2.2. Interval Estimation

To overcome the drawback associated with point estimation, we can obtain an interval, or range of values, which may contain the population parameter. Associated with this interval is a percentage reflecting the level of confidence that the interval contains the actual population parameter. For example, we may state that we are 95% confident that the population mean lifetime of light bulbs is between 913 and 933 hours. This is why such interval estimates are called confidence intervals. Confidence intervals provide specific information about the accuracy of the estimation.

Building confidence intervals for the population parameters μ and π is routine and involves simple formulas. The following explains how these formulas are obtained and what they imply.

3. Confidence Interval for the Population Mean

To explain confidence intervals, let us refer back to the sampling distribution of x̄. In the last chapter it was shown that x̄ is a normally distributed random variable¹ with a mean of μ_x̄ = μ and a standard error of se(x̄) = σ/√n.
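Before the worked example, a minimal simulation sketch can make these two facts concrete. It is not part of the original text; it assumes the light-bulb values used in Example 1 below (μ = 920, σ = 20, n = 25), and the number of simulated samples (100,000) and the seed are arbitrary choices.

```python
# Sketch: sample means are approximately normal with mean mu and
# standard error sigma/sqrt(n).
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 920, 20, 25                   # light-bulb example values (assumed here)
means = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)

print(means.mean())                          # close to mu = 920
print(means.std(ddof=1))                     # close to sigma/sqrt(n) = 4
```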
Now consider the following example involving the sampling distribution of x̄.

Example 1
Assume we know the population mean lifetime of light bulbs manufactured in a plant to be μ = 920 hours. The standard deviation of the population is also known to be σ = 20 hours. The population is normally distributed, and we can take infinitely many samples of size n from it. If n = 25, then the sample mean values are normally distributed with a mean of μ_x̄ = μ = 920 and a standard error of se(x̄) = σ/√n = 4. Given this information, find the percent of all x̄ values that are between 916 and 924. That is, find:

P(916 < x̄ < 924)

For x̄ = 916, z = (x̄ − μ)/se(x̄) = −1.00, and for x̄ = 924, z = 1.00. Using the z table, we find that P(−1.00 < z < 1.00) = 0.6827. This means that approximately 68% of x̄ values fall within one standard error of the population mean. This deviation from the mean is ±z·se(x̄) = ±1.00 × 4 = ±4 hours. That is, 68% of all x̄ values deviate from the population mean by no more than ±4 hours. The deviation ±z·se(x̄) = ±4 is the margin of error (MOE) or, to be precise, the 68% MOE. Next you will be asked to find the 95% MOE.

Example 2
Find the two boundaries of the interval, symmetric about the mean, that contains 95% of all possible x̄'s.

Since the interval contains 95% of all sample means, 5% of the x̄ values fall outside it. This 5% is the "error probability": α = 0.05. Here you must first find the z scores that bound an area of 2.5% in each tail of the z curve. These z scores are ±z_{α/2} = ±z_0.025 = ±1.96. Thus, the 95% MOE is

MOE = z_0.025·se(x̄) = 1.96 × 4 = 7.84

and the lower and upper boundaries are, respectively:

x̄₁ = μ − MOE = 920 − 7.84 = 912.16
x̄₂ = μ + MOE = 920 + 7.84 = 927.84

Therefore, P(912.16 < x̄ < 927.84) = 0.9500, which means that 95 percent of all sample mean values fall within the interval (912.16, 927.84) formed by MOE = 7.84.

P(μ − z_0.025·se(x̄) ≤ x̄ ≤ μ + z_0.025·se(x̄)) = 0.95

[Figure: the normal curve of x̄ centered at μ; 95% of all x̄ values fall within the interval from μ − z_0.025·se(x̄) to μ + z_0.025·se(x̄).]

¹ The sampling distribution of x̄ is normal if the population is normal or, per the Central Limit Theorem, approximately normal if the sample size is 30 or more.

3.1. Interval Estimate (Confidence Interval) for μ

Now suppose a single sample of size n = 25 is selected and its mean is x̄ = 925 hours. Using the MOE computed above, ±1.96 × 4 = ±7.84, we can build an interval around this sample mean with the following lower and upper boundaries:

L = x̄ − MOE = 925 − 7.84 = 917.16
U = x̄ + MOE = 925 + 7.84 = 932.84

Note that this interval (917.16, 932.84) captures the population mean μ = 920.

[Figure: an interval estimate x̄ ± MOE built around x̄ = 925, running from 917.16 to 932.84 and capturing μ = 920; the boundaries μ ± MOE are 912.16 and 927.84.]

We can repeat the sampling process an infinite number of times, obtaining an infinite number of x̄ values. As stated above, 95% of these fall within ±MOE of the population mean. Logically, then, if we built an infinite number of intervals x̄ ± MOE, 95 of every 100 such intervals would capture the population mean and 5% would not. As long as x̄ falls within the boundaries μ ± MOE, the interval x̄ ± MOE captures μ. In the diagram below several of these intervals are shown; the one built around x̄ = 929 does not capture μ, because 929 falls outside the boundary formed by μ ± MOE.

[Figure: several interval estimates built around different sample means (x̄ = 925, 915, 922, 918, 915, 919, 929, 913) relative to the boundaries 912.16 and 927.84 around μ = 920; only the interval around x̄ = 929 fails to capture μ.]
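A small simulation sketch (illustrative, not from the text) checks this claim directly: about 95% of the intervals x̄ ± MOE capture μ. It again assumes the light-bulb figures above (μ = 920, σ = 20, n = 25); the 50,000 repetitions and the seed are arbitrary.

```python
# Sketch: coverage of the interval x̄ ± MOE over many repeated samples.
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n = 920, 20, 25
moe = 1.96 * sigma / np.sqrt(n)               # 7.84

xbars = rng.normal(mu, sigma, size=(50_000, n)).mean(axis=1)
captured = (xbars - moe <= mu) & (mu <= xbars + moe)
print(captured.mean())                        # close to 0.95
```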
In the estimation process only one sample of size n is selected, and using the formula

L, U = x̄ ± MOE

we build only one interval. This interval may or may not capture the population mean. However, since we know that 95 percent of such intervals would capture the population mean, we can say that we are 95% confident that the interval we have built captures μ. This interval, hence, is called a 95% confidence interval.

Generally, the fraction (proportion, or percentage) of intervals that do not contain μ is denoted by α. Thus, a proportion 1 − α of the intervals will contain μ. The term 1 − α is called the confidence coefficient (when expressed as a percentage, it is called the confidence level) and α is called the error probability.

Example 3
A normal population has a standard deviation of σ = 10. A sample of size n = 25 is taken. The sample mean is x̄ = 30. Build a 95% confidence interval for the population mean μ.

A 95% confidence interval means that the confidence coefficient is 1 − α = 0.95. Thus α = 0.05, and we know z_{α/2} = z_0.025 = 1.96 and se(x̄) = σ/√n = 10/5 = 2. The lower and upper bounds of the confidence interval are then:

L, U = x̄ ± MOE
MOE = z_{α/2}·se(x̄) = 1.96 × 2 = 3.92
L = 30 − 3.92 = 26.08
U = 30 + 3.92 = 33.92

We are 95% confident that 26.08 ≤ μ ≤ 33.92.

3.2. The Margin of Error and Precision of a Confidence Interval

In the confidence interval for the population mean,

L, U = x̄ ± MOE

the two ends of the interval deviate from x̄ by the margin of error, where MOE = z_{α/2}·se(x̄) and se(x̄) = σ/√n. Therefore,

MOE = z_{α/2}·σ/√n

The width, hence the precision, of the interval is determined by the size of the MOE. The size of the margin of error, in turn, is determined by the three factors visible in the formula:

• α, the error probability (the proportion or percentage of all possible intervals not capturing the population mean);
• σ, the population standard deviation; and
• n, the sample size.

The following explains the impact of each of these factors on the precision of the confidence interval. The wider the interval, the less precise and, hence, the less meaningful it is. An interval may be so wide and imprecise as to make the information it provides about the population parameter meaningless.

3.2.1. The Impact of Changing the Error Probability α

The term z_{α/2} in the margin of error formula is the z score that bounds a tail area of α/2 under the z curve. The location of this z score thus depends on the choice of α, the error probability. If we want to lower the risk of missing the population parameter, a smaller α must be chosen. A smaller α increases z_{α/2}, thus widening the confidence interval. The following table shows this relationship between α and the interval width for a given σ = 10 and sample size n = 25.

Holding σ = 10 and n = 25, the interval width increases as α decreases.

    α     z_{α/2}    MOE = z_{α/2}·σ/√n    w = 2 × MOE
  0.20     1.28             2.6                5.1
  0.10     1.64             3.3                6.6
  0.05     1.96             3.9                7.8
  0.01     2.58             5.2               10.3

Thus, the width of the confidence interval varies inversely with the error probability. The interval narrows (becomes more precise) as α increases, but this precision is obtained by increasing the risk that the population parameter falls outside the interval. On the other hand, reducing the error probability widens the interval and makes it less precise.
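A short sketch (illustrative) reproduces the Example 3 interval and the table above in Python; scipy's norm.ppf returns the z score for a given left-tail area, playing the role of the z table.

```python
# Sketch: MOE for several error probabilities, and the Example 3 interval.
from math import sqrt
from scipy.stats import norm

sigma, n, xbar = 10, 25, 30
se = sigma / sqrt(n)

for alpha in (0.20, 0.10, 0.05, 0.01):
    z = norm.ppf(1 - alpha / 2)        # z_{alpha/2}
    moe = z * se
    print(f"alpha={alpha:4}  z={z:.2f}  MOE={moe:.2f}  width={2*moe:.2f}")

moe95 = norm.ppf(0.975) * se           # the 95% interval of Example 3
print(xbar - moe95, xbar + moe95)      # about (26.08, 33.92)
```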
3.2.2. The Impact of Differences in σ

The second factor affecting the margin of error is σ, the population standard deviation, the measure of dispersion of the population data. Note that there is a direct relationship between σ and MOE: holding n and α constant, as σ rises, MOE gets bigger and, consequently, the interval becomes wider and less precise.

Holding α = 0.05 and n = 25, the interval width increases directly and proportionately with σ.

    σ     se(x̄) = σ/√n    MOE = z_{α/2}·se(x̄)    Interval width = 2 × MOE
    5          1                  1.96                     3.92
   10          2                  3.92                     7.84
   20          4                  7.84                    15.68
   40          8                 15.68                    31.36

Thus, if the population standard deviation is doubled, the margin of error also doubles. However, unlike α, which can be set by the statistician, σ is not a controlled factor. It is a characteristic of the population: the more dispersed the population data, the wider the confidence interval.

3.2.3. The Impact of Changing the Sample Size n

The sample size is the most important factor affecting the MOE because it is controlled by the statistician and can be used to improve the precision of the interval without increasing the error probability. As the formula MOE = z_{α/2}·σ/√n shows, there is an inverse relationship between MOE and n. As n increases, MOE gets smaller, thus narrowing the interval. By increasing n the confidence interval becomes more precise without increasing the error probability. For a given population σ, and holding α constant, the larger the sample size, the more precise the interval.

Holding α = 0.05 and σ = 10, the interval width decreases as n increases.

     n     MOE = z_{α/2}·se(x̄)    Interval width = 2 × MOE
     5            8.77                    17.53
    10            6.20                    12.40
    25            3.92                     7.84
    50            2.77                     5.54
   100            1.96                     3.92
   500            0.88                     1.75
  1000            0.62                     1.24

3.3. Confidence Interval for μ when the Population Standard Deviation is Unknown

So far, to explain the theory of confidence intervals, we have used examples in which the population standard deviation was assumed to be known. This, obviously, is an unrealistic assumption, because the value of σ is obtained using the population mean μ, the very parameter we are attempting to estimate:

σ = √( Σ(x − μ)² / N )

In practical inferential statistical analyses the population standard deviation is also an unknown parameter which must be estimated using the sample data. The estimator of the population parameter σ is the sample statistic s, the sample standard deviation, computed from the sample data according to the formula

s = √( Σ(x − x̄)² / (n − 1) )

Since s is used in place of σ, in the MOE formula the standard error of x̄ changes from se(x̄) = σ/√n to:

se(x̄) = s/√n
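A minimal sketch (illustrative) of computing s and se(x̄) = s/√n from raw data; it borrows the potato-bag weights that appear later in Example 6, so the printed values can be compared with that example.

```python
# Sketch: sample standard deviation and standard error of the mean.
from math import sqrt
from statistics import mean, stdev   # stdev uses the n - 1 denominator

sample = [4.8, 4.9, 4.7, 5.0, 5.0, 5.0, 5.2, 4.6,
          5.0, 4.7, 5.1, 4.7, 4.7, 4.9, 4.5, 4.9]   # Example 6 data
xbar = mean(sample)
s = stdev(sample)
se = s / sqrt(len(sample))
print(round(xbar, 3), round(s, 3), round(se, 3))    # about 4.856, 0.193, 0.048
```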
3.3.1. The t-Distribution

When s is used in place of σ, a peculiar thing happens to the shape of the sampling distribution of x̄. The sampling distribution is still bell shaped, but the area under the curve for a given interval of x̄ values is not the same as when the known σ is used. To illustrate, consider the following example.

First, suppose the population mean is μ = 100 and the standard deviation is σ = 20. The fraction (proportion, or percentage) of x̄ values for samples of size n = 16 taken from this population that fall between, say, 90.2 and 109.8 is determined as follows:

P(90.2 < x̄ < 109.8)
se(x̄) = σ/√n = 20/√16 = 5
z = (x̄ − μ)/se(x̄) = (90.2 − 100)/5 = −1.96
P(−1.96 < z < 1.96) = 0.95

Now, instead of using σ, let the standard deviation 20 be as if determined from a sample. That is, let s = 20, so that se(x̄) = s/√n = 5. However, when we attempt to transform x̄ to z using the formula z = (x̄ − μ)/se(x̄), a problem arises: the new random variable obtained through this transformation no longer has a z distribution (with mean 0 and standard deviation 1).

This problem was observed by William S. Gosset (1876–1937), a British chemist/statistician, in a paper published in 1908. Gosset showed that, when the sample size is small, the standard normal z table does not provide the accurate area under the curve for the scores obtained from the conversion formula (x̄ − μ)/se(x̄). In the above example, if (x̄ − μ)/se(x̄) = 1.96, the area under the curve bounded by the two scores ±1.96 is no longer 0.95. In fact, that area is less than 0.95; to get 0.95, the boundaries must be extended further from zero in both directions. Gosset developed an alternative table to obtain the more accurate areas, or probability values, for the scores thus calculated. The table of probabilities he provided is now called the t table, and the random variable obtained from this transformation is said to have a t distribution, where

t = (x̄ − μ)/se(x̄)

The difference between the z and t distributions is shown in the following diagram. Like the z distribution, the t distribution is symmetric about its mean of 0. However, unlike z, which has a unique, unchanging shape due to its fixed standard deviation of 1, the t distribution takes different shapes depending on a parameter called the degrees of freedom. In estimations involving μ, the degrees of freedom is df = n − 1, the denominator used in computing the sample variance. The smaller the degrees of freedom, the larger the tail areas. As df increases, the tail area under the t curve approaches the tail area under the z curve,² and the distinction between z and t practically disappears.

² For any df > 2, the standard deviation of the t distribution is σ_t = √(df/(df − 2)). For example, if df = 9, then σ_t = 1.134. As df rises, the standard deviation approaches 1, the standard deviation of z; for df = 1000, it is practically 1 (√(1000/998) = 1.001). The fact that t has a larger standard deviation than z makes the tail areas under the t curve relatively larger. Thus, using a computer, it can be shown that while the tail area beyond the z score 1.96 is 0.025, the tail area beyond a t score of 1.96 (with df of, say, 9) is 0.0408. The larger standard deviation and tail areas reflect the fact that the t distribution applies to situations with greater inherent uncertainty: σ is unknown and is estimated by the random variable s. The t statistic, t = (x̄ − μ)/(s/√n), thus reflects the uncertainty in two random variables, x̄ and s, while z = (x̄ − μ)/(σ/√n) reflects only the uncertainty due to x̄. The greater uncertainty in t (which makes confidence intervals based on t wider than those based on z) is the price we pay for not knowing σ and having to estimate it from sample data.

[Figure: the z curve and the t curve with df = 9, both centered at 0. The tail area beyond 1.96 under the t (df = 9) curve is 0.0408, larger than the 0.025 tail area under the z curve. The t score that bounds a tail area of 0.025 under t (df = 9) is 2.262, which is further from the center 0 than 1.96.]
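A short sketch (illustrative) reproduces the z-versus-t comparison just described using scipy; sf gives a right-tail area and ppf the score for a given left-tail area.

```python
# Sketch: tail areas and critical scores under z and t(df = 9).
from scipy.stats import norm, t

print(norm.sf(1.96))          # tail area beyond 1.96 under z: about 0.025
print(t.sf(1.96, df=9))       # tail area beyond 1.96 under t(df=9): about 0.0408
print(t.ppf(0.975, df=9))     # t score bounding a 0.025 right tail: about 2.262
```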
The t distribution is thus a family of curves, each with a mean of 0 and a standard deviation determined by the degrees of freedom. The tail area for a given t score therefore depends on the degrees of freedom and, conversely, the t score corresponding to a given tail area is determined by the degrees of freedom of the t distribution.

Going back to the example, after the conversion, using df = n − 1 = 9, we find:

P(−1.96 < t < 1.96) = 0.9184³  and, conversely,  P(−2.262 < t < 2.262) = 0.95

As we can see, when the population standard deviation is known, the proportion of x̄ values that fall within 1.96 standard errors of μ is 0.95; but when s is used, the proportion falling within ±1.96·se(x̄) is only 0.9184. Conversely, 95% of x̄ values fall within ±2.262·se(x̄) of the population mean.

³ This probability is found using the T.DIST function in Excel. The T.DIST syntax is =T.DIST(x, deg_freedom, cumulative). Using =T.DIST(-1.96, 9, 1), we find the tail area 0.0408 under the t curve. Then P(−1.96 < t < 1.96) = 1 − 2 × 0.0408 = 0.9184.

Back to the Confidence Intervals for μ

To build a 95% confidence interval when the population standard deviation is estimated from the sample, the margin of error is

MOE = t_{α/2,(n−1)}·se(x̄)

For example, suppose you are asked to build a 95% confidence interval for the population mean and you have to estimate the standard deviation using a sample of size n = 10. In this case you need the t score with n − 1 = 9 degrees of freedom corresponding to a tail area of α/2 = 0.025, that is, t_{α/2,(n−1)} = t_{0.025,9}. The t score that bounds a tail area of 0.025 for a t distribution with 9 degrees of freedom is t_{0.025,9} = 2.262. The next section explains how to obtain this t score.

3.3.1.1. How to Find the t Score Corresponding to a Given Tail Area

3.3.1.1.1. Using the t-Table

In the absence of a computer, you have to use a t table. The following shows a portion of a typical t table; the full table is at the end of the chapter.

          t scores corresponding to different tail areas
  df      0.1      0.05     0.025     0.01     0.005
   1     3.078    6.314    12.706   31.821    63.657
   2     1.886    2.920     4.303    6.965     9.925
   3     1.638    2.353     3.182    4.541     5.841
   4     1.533    2.132     2.776    3.747     4.604
   5     1.476    2.015     2.571    3.365     4.032
   6     1.440    1.943     2.447    3.143     3.707
   7     1.415    1.895     2.365    2.998     3.499
   8     1.397    1.860     2.306    2.896     3.355
   9     1.383    1.833     2.262    2.821     3.250
  10     1.372    1.812     2.228    2.764     3.169
  11     1.363    1.796     2.201    2.718     3.106
  12     1.356    1.782     2.179    2.681     3.055
  13     1.350    1.771     2.160    2.650     3.012
  14     1.345    1.761     2.145    2.624     2.977
  15     1.341    1.753     2.131    2.602     2.947

To find the t score associated with a tail area of 0.025 for a t distribution with 9 degrees of freedom, look up the t score in the column for a right tail area of 0.025 in the row df = 9. In this case, t_{0.025,9} = 2.262. The t score for the corresponding left tail area is simply −2.262.

3.3.1.1.2. Using Excel

In Excel there are two functions for finding the t score associated with a given tail area and degrees of freedom. You may use either one.

• =T.INV(probability, deg_freedom)
• =T.INV.2T(probability, deg_freedom)

For example, to find t_{0.025,9} = 2.262, you can use the two Excel functions as follows:

• =T.INV(0.025,9), that is, =T.INV(α/2, df). This returns the t score for the left tail area, −2.262; you then ignore the negative sign.
• =T.INV.2T(0.05,9), that is, =T.INV.2T(α, df). The term "2T" in the function name implies "two tails," so you must double the tail area to get the t score for the right tail.
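Outside Excel, the same lookups can be sketched with scipy's t distribution; t.ppf is the inverse CDF, the analogue of =T.INV. The last line previews the values worked out in Example 4 below.

```python
# Sketch: t scores for given tail areas and degrees of freedom.
from scipy.stats import t

df = 9
print(t.ppf(0.025, df))        # left-tail score, about -2.262 (like =T.INV(0.025,9))
print(t.ppf(1 - 0.05/2, df))   # right-tail score for alpha = 0.05, about 2.262
print(t.ppf(0.975, 4), t.ppf(0.95, 14), t.ppf(0.975, 99))   # Example 4 values below
```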
Examples 4 Using Excel, find tα/2, df for each of the following error probabilities (α) and degrees of freedom df = n − 1. Chapter 5—Interval Estimates Page 11 of 21 (a) πΌ = 0.05, π = 5 π‘πΌ⁄2,ππ = π‘0.025,4 (b) α = 0.10, n = 15 π‘πΌ⁄2,ππ = π‘0.05,14 (c) α = 0.05, n = 100 π‘πΌ⁄2,ππ = π‘0.025,99 =T.INV(0.025,4) = -2.776 =T.INV.2T(0.05,4) = 2.776 =T.INV(0.05,14) = -1.761 =T.INV.2T(0.1,14) = 1.761 =T.INV(0.025,99) = -1.984 =T.INV.2T(0.05,99) = 1.984 Example 5 The Food and Drug Administration needs to estimate the average content of an additive in a given food product. A random sample of 20 portions of the product yields a sample average of π₯Μ = 8.9 and sample standard deviation of π = 0.5 units. Construct a 95% confidence interval for the average number of units of additive in any portion of this food product. First, find the standard error, se(π₯Μ ) = π ⁄√π = 0.5⁄√20 = 0.112. The appropriate π‘ score is, t0.025,19 = 2.093. The margin of error is then, πππΈ = π‘πΌ⁄2,ππ se(π₯Μ ) = 2.093 × 0.112 = 0.234 MOE = tα/2,(n − 1) se(xΜ ) = 2.093 × 0.112 = 0.234 The interval is then, πΏ, π = 8.9 ± 0.234 = (8.67,9.13) Example 6 A sample of 16 five-pound bags of potatoes have the following weights (in pounds). 4.8 4.9 4.7 5.0 5.0 5.0 5.2 4.6 5.0 4.7 5.1 4.7 4.7 4.9 4.5 4.9 Construct a 90% confidence interval for the population mean μ. In this example, you must first compute π₯Μ and π from the sample data. π₯Μ = ∑π₯ ⁄π = 77.7⁄16 = 4.856 π 2 = ∑(π₯ − π₯Μ )2 ⁄(π − 1) = 0.559⁄15 = 0.035 π = √0.035 = 0.193 Since 1 − πΌ = 0.90, πΌ⁄2 = 0.05, then π‘0.05,15 = 1.753 se(π₯Μ ) = π ⁄√π = 0.193⁄√16 = 0.048 πππΈ = π‘πΌ⁄2,ππ se(π₯Μ ) = 1.753 × 0.048 = 0.085 πΏ, π = π₯Μ ± πππΈ = 4.856 ± 0.085 = (4.77,4.94) 3.3.2. The Option of Using π in Place of π for Large Sample Sizes As the sample size increases, the t score converges to the z score for a given tail area. The following table shows this convergence. The t scores for larger degrees of freedom are obtained using Excel. Chapter 5—Interval Estimates Page 12 of 21 The t and z scores associated with a right tail area of 0.025 and different sample sizes (ππ = π − 1) n t score z score 10 2.262 1.96 25 2.064 1.96 50 2.010 1.96 100 1.984 1.96 500 1.965 1.96 1000 1.962 1.96 The practical impact of this convergence is that for large sample sizes you can use π§πΌ⁄2 in place of π‘πΌ⁄2,ππ with a negligible impact on the πππΈ. The following table compares the margin of errors for 95% confidence intervals for different sample sizes for z and t distributions. A sample standard deviation of s = 8 is used. Margin of error for 95% confidence intervals associated with different n’s: A comparison of t and z scores. (π = 8) π‘0.025,(π−1) π§0.025 πππΈ = π‘0.025,(π−1) 8⁄√π π πππΈ = π§0.025 8⁄√π 10 25 30 50 100 500 1000 2.262 2.064 2.045 2.010 1.984 1.965 1.962 1.96 1.96 1.96 1.96 1.96 1.96 1.96 5.72 3.30 2.99 2.27 1.59 0.70 0.50 4.96 3.14 2.86 2.22 1.57 0.70 0.50 Note that after π = 100, the difference between the two margin of error estimates practically disappears. Therefore, for large sample sizes (π > 100) you may use the following formula for the margin of error: πππΈ = π§πΌ⁄2 se(π₯Μ ) By convention, any sample of size greater than 100 is considered “large”. Therefore, use z whenever the Μ ), if the sample size sample size is greater than 101. You must use t in the MOE formula, π΄πΆπ¬ = ππΆ⁄π,π π π¬π(π is π ≤ πππ, which yields π π ≤ ππ. 3.4. 
3.4. Confidence Intervals for μ when the Population is not Normal

If the population distribution is unknown, or is known not to be normal, then to ensure that the sampling distribution of x̄ is approximately normal the sample size, per the Central Limit Theorem, must be 30 or more. The advantage afforded by the Central Limit Theorem makes it possible to construct confidence intervals for the population mean regardless of the distribution of the population.

Example 7
To determine the profitability of used car sales in the Indianapolis metropolitan area, a sample of 120 used car sales provided a sample mean profit of $320 and a sample standard deviation of $164. Build a 95% confidence interval for the mean profit for all used car sales in the Indianapolis metropolitan area.

First, compute the standard error:

se(x̄) = 164/√120 = 14.971

To compare confidence intervals using z and t, first use the t distribution to determine the MOE. Using Excel, the t score is t_{0.025,119} = 1.980. The margin of error is

MOE = 1.980 × 14.971 = 29.64  and  L, U = 320 ± 29.64 = (290.36, 349.64)

Alternatively, since n is large, you may use the z distribution:

MOE = 1.96 × 14.971 = 29.34  and  L, U = 320 ± 29.34 = (290.66, 349.34)

Note that using z alters the boundaries of the interval by only negligible amounts.

3.5. Determining the Minimum Sample Size to Obtain a Desired Margin of Error

Before embarking on building a confidence interval for the population mean, a researcher must decide on the appropriate sample size. What is the appropriate sample size? The answer depends on the desired precision of the confidence interval. A wide confidence interval provides imprecise and, therefore, inadequate information. You may narrow the interval by reducing the confidence coefficient 1 − α, but this increases the error probability, raising the chance that the interval will not capture the population mean. The better alternative is to increase the sample size. However, sampling requires the expenditure of resources, so the sample size must be chosen subject to the resource constraint: if the benefits of a more precise estimate outweigh the additional cost of a larger sample, the sample size should be increased.

The width of a confidence interval is determined by the margin of error

MOE = t_{α/2,(n−1)}·s/√n

As the formula shows, the margin of error is inversely related to the sample size. Solving for n, we have:

n = ( t_{α/2,(n−1)}·s / MOE )²

The problem with using this formula is that it requires the t distribution, for which the sample size must already be known in order to determine the degrees of freedom. It also involves s, the sample standard deviation, which implies that to find the sample size we must already know the sample size! Therefore, in the formula above we replace t_{α/2,(n−1)} and s with alternative quantities. The alternative to t_{α/2,(n−1)} is clearly z_{α/2}, and in place of s we use a planning value for the standard deviation. Denoting the planning value by σ̂ (sigma-hat), the formula for n becomes:

n = ( z_{α/2}·σ̂ / MOE )²

Ways of choosing the planning value σ̂ are listed after the sketch below.
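A compact sketch (illustrative) of this sample-size formula as a small function; the function name is ours, and the planning value and target MOE in the call are the used-car figures of Example 8 below.

```python
# Sketch: smallest n giving a margin of error no larger than a target MOE.
from math import ceil
from scipy.stats import norm

def min_sample_size(sigma_hat, moe, confidence=0.95):
    z = norm.ppf(1 - (1 - confidence) / 2)      # z_{alpha/2}
    return ceil((z * sigma_hat / moe) ** 2)     # round up to a whole sample

print(min_sample_size(sigma_hat=164, moe=10))   # about 1,034 (Example 8)
```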
A value for σ̂ can be determined by any of the following methods:

• Use a sample standard deviation from a previous sample.
• Conduct a pilot study to determine a preliminary estimate.
• In some cases, if the minimum and maximum values of the population data set are known or can be reasonably estimated, an acceptable value for σ̂ can be obtained by dividing the range (maximum − minimum) by 4.

Example 8
In the example above regarding the profitability of used car sales, the margin of error for a 95% confidence interval for the population average profit per used car sale, using a sample size of n = 120, was

MOE = 1.96(164/√120) = 29.34

Suppose we are interested in building a more precise 95% confidence interval with a margin of error of MOE = ±$10. What is the minimum sample size needed to achieve this margin of error? Use the sample standard deviation $164 as the planning value (σ̂ = 164).

n = ( z_{α/2}·σ̂ / MOE )² = ( 1.96 × 164 / 10 )² = 1,033.24

Since you cannot have a fractional sample, and because you are interested in the minimum sample size, you must round the resulting value up to a whole number. Thus, n = 1,034.

4. Confidence Interval For the Population Proportion π

To estimate the population proportion π, the sample statistic p̂ is used. The population proportion measures the relative frequency of occurrence of some characteristic in the population. For example, the proportion of voters voting for a candidate is the number of voters voting for the candidate relative to the total number of voters (π = x/N). If you take a sample of n voters, then the sample proportion is the number of voters favoring the candidate in the sample divided by the sample size: p̂ = x/n. The sample statistic p̂ is the estimator of the population parameter π.

Since the value of p̂ is obtained from a random sample, the sample statistic p̂ is a random variable, and since there are infinitely many possible samples, there are infinitely many possible values of p̂. If these sample proportions are normally distributed, then we can develop confidence intervals for the population proportion π in the same manner in which we built intervals for the population mean μ.

For large sample sizes the sampling distribution of p̂ is approximately normal. In Chapter 4 it was explained that the binomially distributed random variable x is approximately normal if nπ ≥ 5 and n(1 − π) ≥ 5. The following shows the distribution of x for n = 100 and π = 0.50.

[Figure: the binomial probability distribution of x for n = 100 and π = 0.50, which is approximately bell shaped.]

Note that x in a given sample represents the number of "successes" (say, the number of voters in a sample favoring a candidate). Since the sample size is large, x is approximately normal. Therefore the random variable p̂ = x/n is also approximately normally distributed (p̂ is a linear transformation of x, since x is multiplied by the constant 1/n). The mean (expected value) and standard error of p̂, as shown in Chapter 4, are:

E(p̂) = π
se(p̂) = √( π(1 − π)/n )

Example 9
Suppose that in the population of households 20% spend $200 or more per week on groceries. A sample of n = 500 households is taken. What is the probability that between 15% and 25% of the households in this sample spend $200 or more per week on groceries?

P(0.15 < p̂ < 0.25) = ______

Here the sample proportion is normally distributed with a mean of E(p̂) = π = 0.20 and a standard error of

se(p̂) = √( 0.20(1 − 0.20)/500 ) = 0.0179

Using z = (p̂ − π)/se(p̂) and computing z = ±2.79, we find P(−2.79 < z < 2.79) = 0.9947.
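A short sketch (illustrative) of the Example 9 probability using the normal approximation to the sampling distribution of p̂; it prints about 0.9948 because it does not round z to 2.79 first.

```python
# Sketch: P(0.15 < p-hat < 0.25) under the normal approximation.
from math import sqrt
from scipy.stats import norm

pi, n = 0.20, 500
se = sqrt(pi * (1 - pi) / n)                  # about 0.0179
prob = norm.cdf(0.25, pi, se) - norm.cdf(0.15, pi, se)
print(round(prob, 4))                         # about 0.9948
```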
Example 10
In the previous example, find the interval of p̂ values that contains the middle 95% of sample proportions.

To determine the interval, as the diagram below shows, you need to determine the margin of error

MOE = z_0.025·se(p̂)

[Figure: the normal curve of p̂ centered at π; 95% of all p̂ values fall within the interval π ± MOE.]

Since z_0.025 = 1.96 and se(p̂) = 0.0179, then MOE = 1.96(0.0179) = 0.035. Thus, 95% of p̂ values fall within the interval 0.20 ± 0.035, that is, in the interval (0.165, 0.235):

P(0.20 − 0.035 < p̂ < 0.20 + 0.035) = 0.95
P(0.165 < p̂ < 0.235) = 0.95

Recall from the discussion of the confidence interval for μ that we can use the margin of error to build an interval around each value of the sample statistic, x̄ ± MOE, and state, with a given level of confidence, that this interval captures the population parameter. We can do the same here. Since there are infinitely many p̂ values available, we can build infinitely many intervals p̂ ± MOE. Of these, 95% would capture the population parameter π and 5% would not. The fraction of intervals that would not capture the population parameter π is denoted by α.

To build confidence intervals for the population proportion π, the margin of error used in the confidence interval formula L, U = p̂ ± MOE is

MOE = z_{α/2}·se(p̂)

where

se(p̂) = √( p̂(1 − p̂)/n )

Note that this formula for se(p̂) differs from the one used in the example above, where se(p̂) = √(π(1 − π)/n). The reason we cannot use that formula in the margin of error for the confidence interval is that it requires knowledge of π, the unknown parameter for which we are developing an interval estimate.

Example 11
In a poll of 1,000 registered voters, 380 favored candidate A. Construct a 95 percent confidence interval for the proportion of all registered voters favoring candidate A.

Here you must first compute the proportion of voters in the sample who favor candidate A. Assigning "1" to those who favor the candidate, the sample proportion is

p̂ = Σx/n = 380/1000 = 0.38
se(p̂) = √( p̂(1 − p̂)/n ) = √( 0.38(1 − 0.38)/1000 ) = 0.0153

Since you are building a 95% confidence interval, the confidence coefficient is 1 − α = 0.95. Then α = 0.05 and z_{α/2} = z_0.025 = 1.96. The margin of error is

MOE = 1.96(0.0153) = 0.03

The boundaries of the 95% confidence interval are:

L = 0.38 − 0.03 = 0.35
U = 0.38 + 0.03 = 0.41

Note that this is a typical example of the poll results reported in the media. The news media report the result as: "The poll results show that 38% of potential voters prefer candidate A, with a margin of error of ±3 percentage points." The media, however, do not generally report that this is a 95% confidence interval. The New York Times, as an exception, usually provides an explanation in a box, stating, for example, that: "In theory, in 19 cases out of 20 the results based on such samples will differ by no more than three percentage points in either direction from what would have been obtained by seeking out all American adults." The phrase "in 19 cases out of 20" is what we understand as a 95% confidence level (19/20 = 0.95). That is, 19 of every 20 (95% of) intervals built around all possible p̂ values would contain the actual population proportion π.
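A brief sketch (illustrative) of the Example 11 interval: a 95% confidence interval for the proportion favoring candidate A.

```python
# Sketch: the Example 11 proportion interval.
from math import sqrt
from scipy.stats import norm

x, n = 380, 1000
p_hat = x / n
se = sqrt(p_hat * (1 - p_hat) / n)            # about 0.0153
moe = norm.ppf(0.975) * se                    # about 0.03
print(round(p_hat - moe, 2), round(p_hat + moe, 2))   # about (0.35, 0.41)
```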
Example 12
A medical researcher working for one of the pharmaceutical firms developing anti-ulcer drugs needs to estimate the percentage of ulcers that are caused by bacteria. In a random sample of 860 ulcer patients, 758 had ulcers caused by bacteria. Build a 95% confidence interval for the percentage of ulcers caused by bacteria.

p̂ = 758/860 = 0.881
se(p̂) = √( 0.881(1 − 0.881)/860 ) = 0.0110
MOE = z_0.025·se(p̂) = 1.96 × 0.0110 = 0.022
L = 0.881 − 0.022 = 0.859
U = 0.881 + 0.022 = 0.903

4.1. Minimum Sample Size Needed to Estimate π to Within a Desired Margin of Error

As with the confidence interval for μ, the precision of the confidence interval for π is determined by the margin of error. For a given confidence level, the confidence interval for π is

L, U = p̂ ± MOE

where

MOE = z_{α/2}·√( p̂(1 − p̂)/n )

Solving for n, we have

n = ( z_{α/2} / MOE )²·p̂(1 − p̂)

This formula would allow us to find the required sample size for a desired margin of error. However, there is one problem: to find n, you need a value for the sample proportion p̂ = Σx/n, which requires prior knowledge of n. We are dealing with a circular formula. To avoid this problem the formula for n is changed to

n = ( z_{α/2} / MOE )²·p̄(1 − p̄)

Here p̂ is replaced by p̄, a planning value for the sample proportion. The planning value may be obtained from past surveys. If no previous survey is available, use p̄ = 0.50 as the planning value; for a given confidence level, this results in the largest sample size for the desired MOE.

Example 13
What is the minimum sample size needed to estimate the proportion of registered voters who favor candidate A to within ±3 percentage points with a 95 percent level of confidence? Assume a planning value of 0.50 for the population proportion.

MOE = 0.03
z_{α/2} = z_0.025 = 1.96
p̄ = 0.50
n = (1.96/0.03)²·(0.5)(1 − 0.5) = 1,067.11

Rounded up, n = 1,068.

Summary of Formulas

Confidence Interval for μ:
  L, U = x̄ ± MOE
  a) When n > 100:  MOE = z_{α/2}·se(x̄)
  b) When n ≤ 100:  MOE = t_{α/2,(n−1)}·se(x̄)
  se(x̄) = s/√n,  where  s = √( Σ(x − x̄)²/(n − 1) )
  Minimum sample size for a given MOE:  n = ( z_{α/2}·σ̂ / MOE )²

Confidence Interval for π:
  L, U = p̂ ± MOE
  p̂ = Σx/n
  MOE = z_{α/2}·se(p̂),  where  se(p̂) = √( p̂(1 − p̂)/n )
  Minimum sample size for a given MOE:  n = ( z_{α/2} / MOE )²·p̄(1 − p̄)
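As a compact reference, a sketch (illustrative, with function names of our choosing) implementing the two summary formulas; the z/t switch at n = 100 follows the convention stated in section 3.3.2.

```python
# Sketch: the summary confidence-interval formulas as reusable functions.
from math import sqrt
from scipy.stats import norm, t

def ci_mean(xbar, s, n, confidence=0.95):
    alpha = 1 - confidence
    se = s / sqrt(n)
    crit = norm.ppf(1 - alpha/2) if n > 100 else t.ppf(1 - alpha/2, n - 1)
    moe = crit * se
    return xbar - moe, xbar + moe

def ci_proportion(x, n, confidence=0.95):
    alpha = 1 - confidence
    p_hat = x / n
    moe = norm.ppf(1 - alpha/2) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - moe, p_hat + moe

print(ci_mean(320, 164, 120))          # Example 7: about (290.7, 349.3)
print(ci_proportion(758, 860))         # Example 12: about (0.86, 0.90)
```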
The t Table: t Scores Corresponding to Selected Right Tail Areas

  df    0.10    0.05    0.025    0.01    0.005
   1   3.078   6.314  12.706  31.821  63.657
   2   1.886   2.920   4.303   6.965   9.925
   3   1.638   2.353   3.182   4.541   5.841
   4   1.533   2.132   2.776   3.747   4.604
   5   1.476   2.015   2.571   3.365   4.032
   6   1.440   1.943   2.447   3.143   3.707
   7   1.415   1.895   2.365   2.998   3.499
   8   1.397   1.860   2.306   2.896   3.355
   9   1.383   1.833   2.262   2.821   3.250
  10   1.372   1.812   2.228   2.764   3.169
  11   1.363   1.796   2.201   2.718   3.106
  12   1.356   1.782   2.179   2.681   3.055
  13   1.350   1.771   2.160   2.650   3.012
  14   1.345   1.761   2.145   2.624   2.977
  15   1.341   1.753   2.131   2.602   2.947
  16   1.337   1.746   2.120   2.583   2.921
  17   1.333   1.740   2.110   2.567   2.898
  18   1.330   1.734   2.101   2.552   2.878
  19   1.328   1.729   2.093   2.539   2.861
  20   1.325   1.725   2.086   2.528   2.845
  21   1.323   1.721   2.080   2.518   2.831
  22   1.321   1.717   2.074   2.508   2.819
  23   1.319   1.714   2.069   2.500   2.807
  24   1.318   1.711   2.064   2.492   2.797
  25   1.316   1.708   2.060   2.485   2.787
  26   1.315   1.706   2.056   2.479   2.779
  27   1.314   1.703   2.052   2.473   2.771
  28   1.313   1.701   2.048   2.467   2.763
  29   1.311   1.699   2.045   2.462   2.756
  30   1.310   1.697   2.042   2.457   2.750
  31   1.309   1.696   2.040   2.453   2.744
  32   1.309   1.694   2.037   2.449   2.738
  33   1.308   1.692   2.035   2.445   2.733
  34   1.307   1.691   2.032   2.441   2.728
  35   1.306   1.690   2.030   2.438   2.724
  36   1.306   1.688   2.028   2.434   2.719
  37   1.305   1.687   2.026   2.431   2.715
  38   1.304   1.686   2.024   2.429   2.712
  39   1.304   1.685   2.023   2.426   2.708
  40   1.303   1.684   2.021   2.423   2.704
  41   1.303   1.683   2.020   2.421   2.701
  42   1.302   1.682   2.018   2.418   2.698
  43   1.302   1.681   2.017   2.416   2.695
  44   1.301   1.680   2.015   2.414   2.692
  45   1.301   1.679   2.014   2.412   2.690
  46   1.300   1.679   2.013   2.410   2.687
  47   1.300   1.678   2.012   2.408   2.685
  48   1.299   1.677   2.011   2.407   2.682
  49   1.299   1.677   2.010   2.405   2.680
  50   1.299   1.676   2.009   2.403   2.678
  51   1.298   1.675   2.008   2.402   2.676
  52   1.298   1.675   2.007   2.400   2.674
  53   1.298   1.674   2.006   2.399   2.672
  54   1.297   1.674   2.005   2.397   2.670
  55   1.297   1.673   2.004   2.396   2.668
  56   1.297   1.673   2.003   2.395   2.667
  57   1.297   1.672   2.002   2.394   2.665
  58   1.296   1.672   2.002   2.392   2.663
  59   1.296   1.671   2.001   2.391   2.662
  60   1.296   1.671   2.000   2.390   2.660
  61   1.296   1.670   2.000   2.389   2.659
  62   1.295   1.670   1.999   2.388   2.657
  63   1.295   1.669   1.998   2.387   2.656
  64   1.295   1.669   1.998   2.386   2.655
  65   1.295   1.669   1.997   2.385   2.654
  66   1.295   1.668   1.997   2.384   2.652
  67   1.294   1.668   1.996   2.383   2.651
  68   1.294   1.668   1.995   2.382   2.650
  69   1.294   1.667   1.995   2.382   2.649
  70   1.294   1.667   1.994   2.381   2.648
  71   1.294   1.667   1.994   2.380   2.647
  72   1.293   1.666   1.993   2.379   2.646
  73   1.293   1.666   1.993   2.379   2.645
  74   1.293   1.666   1.993   2.378   2.644
  75   1.293   1.665   1.992   2.377   2.643
  76   1.293   1.665   1.992   2.376   2.642
  77   1.293   1.665   1.991   2.376   2.641
  78   1.292   1.665   1.991   2.375   2.640
  79   1.292   1.664   1.990   2.374   2.640
  80   1.292   1.664   1.990   2.374   2.639
  81   1.292   1.664   1.990   2.373   2.638
  82   1.292   1.664   1.989   2.373   2.637
  83   1.292   1.663   1.989   2.372   2.636
  84   1.292   1.663   1.989   2.372   2.636
  85   1.292   1.663   1.988   2.371   2.635
  86   1.291   1.663   1.988   2.370   2.634
  87   1.291   1.663   1.988   2.370   2.634
  88   1.291   1.662   1.987   2.369   2.633
  89   1.291   1.662   1.987   2.369   2.632
  90   1.291   1.662   1.987   2.368   2.632
  91   1.291   1.662   1.986   2.368   2.631
  92   1.291   1.662   1.986   2.368   2.630
  93   1.291   1.661   1.986   2.367   2.630
  94   1.291   1.661   1.986   2.367   2.629
  95   1.291   1.661   1.985   2.366   2.629
  96   1.290   1.661   1.985   2.366   2.628
  97   1.290   1.661   1.985   2.365   2.627
  98   1.290   1.661   1.984   2.365   2.627
  99   1.290   1.660   1.984   2.365   2.626
 100   1.290   1.660   1.984   2.364   2.626