Confidence Intervals

In statistics, we often prefer to construct an interval estimate of the unknown population parameter. An interval estimate proposes a range of values to account for the inherent uncertainty in any estimation process. The most common type of interval estimate is the confidence interval, which employs an interval formula that, a priori, contains the true parameter with high probability. Unfortunately, once a particular interval is calculated from sample data, we never know whether it is among those that contain the true parameter or those that do not.

From our knowledge of the Central Limit Theorem and the standard normal distribution (Z), we know that the following probability statement is approximately true for large enough n:

$$P\left(-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right) = 1 - \alpha,$$

where the “cutoff” point $z_{\alpha/2}$ is chosen so that an area of $\alpha/2$ is nestled in the upper tail.

[Figure: Standard Normal (Z) density over Z-values. The area in the tail above $z_{\alpha/2}$ is exactly $\alpha/2$, and the area in the tail below $-z_{\alpha/2}$ is exactly $\alpha/2$.]

Q: How would you compute the z-value needed here?
A: Take the given $\alpha$ and compute $z_{\alpha/2}$ in Excel with either (1) =NORMSINV(1-α/2) or (2) =NORMINV(1-α/2, 0, 1).

The preceding probability statement can be rearranged to state

$$P\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha.$$

This says the formula $\bar{X} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ will “work” (contain the true mean $\mu$) $100(1-\alpha)\%$ of the time in the long run. If we replace $\bar{X}$ with an observed sample mean $\bar{x}$, then we get a sample interval

$$\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}},$$

which is called a $100(1-\alpha)\%$ confidence interval for the mean. Observe that we still do not know if the computed interval is one of the instances where the formula “worked.”

Copyright John Semple 2007

If the underlying population we draw from is normally distributed, the confidence interval is said to be an exact $100(1-\alpha)\%$ confidence interval. Otherwise we call it an approximate $100(1-\alpha)\%$ confidence interval. The probability $\alpha$ is assumed to be small, typically .05 or less. This is the probability that the formula fails in the long run.
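The z cutoff and the interval formula above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original notes: the function names z_critical and z_interval are made up here, the inputs are hypothetical, and it assumes the standard library's statistics.NormalDist plays the role of Excel's NORMSINV.

```python
from math import sqrt
from statistics import NormalDist


def z_critical(alpha):
    """Return z_{alpha/2}: the cutoff with area alpha/2 in the upper tail."""
    # Plays the same role as =NORMSINV(1 - alpha/2) in Excel.
    return NormalDist(mu=0.0, sigma=1.0).inv_cdf(1.0 - alpha / 2.0)


def z_interval(xbar, sigma, n, alpha=0.05):
    """Return (lower, upper) for xbar +/- z_{alpha/2} * sigma / sqrt(n)."""
    half_width = z_critical(alpha) * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width


# Hypothetical inputs chosen only for illustration (not from the notes):
# sample mean 100, known sigma 15, sample size 36, 95% confidence.
lower, upper = z_interval(xbar=100.0, sigma=15.0, n=36, alpha=0.05)
print(f"z cutoff: {z_critical(0.05):.4f}")
print(f"95% CI:   ({lower:.2f}, {upper:.2f})")
```

Note that the interval is only usable in this form when sigma is treated as known.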
The confidence interval formula stated earlier is not very useful unless $\sigma$ (or equivalently $\sigma^2$) is known. Soon we’ll see how to replace $\sigma^2$ with a sample estimate, which creates only minor changes in the formula.

Example. Assuming $\sigma^2 = 4900$, construct a 95% confidence interval for the mean repair cost in the preceding car repair example. Recall that the sample mean was $\bar{x} = \$215.66$.

Solution.

Confidence Intervals for a Population Proportion (*Optional)

If you haven’t seen this formula in the real world, then you don’t watch TV. We are awash in TV polls spewing sample statistics regarding the percentage of people that believe in something, suffer from something, would vote for someone, etc. The next time you see a poll on TV, glance at the bottom of the screen. They usually state a margin of error (the “sampling error”).

Suppose you want to determine the percentage of Texans who own firearms. If you take a random sample of n people from the state, a certain sample percentage will own guns. How does this sample percentage compare to the true population percentage? Let $p$ denote the true population percentage (expressed as a proportion) and let $\hat{p}$ denote the sample percentage. Each individual is just a random draw from a population whose distribution is

    X                     Probability
    1 (Own gun)           p
    0 (Do not own gun)    1-p

You can check that the mean of this population is $p$, and the variance is $p(1-p)$. Applying the CLT (standard normal version) to this problem means

$$\frac{\hat{p} - p}{\sqrt{p(1-p)/n}} \sim Z.$$

This results in a $(1-\alpha)100\%$ confidence interval for $p$ given by

$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}.$$

This expression still involves the unknown parameter $p$ under the square root. The most common way to handle this is to replace $p$ with $\hat{p}$ under the root. This approximation is acceptable provided $n\hat{p} \ge 5$ and $n(1-\hat{p}) \ge 5$.

Confidence Interval for a Proportion
A $100(1-\alpha)\%$ confidence interval for the true population proportion $p$ is given by

$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},$$

where $\hat{p}$ is the sample percentage.
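As a rough illustration of the proportion formula, the following Python sketch computes the interval and checks the rule of thumb on $n\hat{p}$ and $n(1-\hat{p})$. The function name proportion_interval and the poll counts are hypothetical, introduced only for this example, and the z cutoff is taken from the standard library's statistics.NormalDist rather than Excel.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(successes, n, alpha=0.05):
    """Return (lower, upper) for p_hat +/- z_{alpha/2} * sqrt(p_hat*(1-p_hat)/n)."""
    p_hat = successes / n
    # Rule of thumb from the notes: the approximation is acceptable
    # when n*p_hat >= 5 and n*(1 - p_hat) >= 5.
    if n * p_hat < 5 or n * (1 - p_hat) < 5:
        raise ValueError("sample too small for the normal approximation")
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Hypothetical poll (not from the notes): 420 "yes" answers out of 1000.
lower, upper = proportion_interval(successes=420, n=1000, alpha=0.05)
print(f"95% CI for p: ({lower:.4f}, {upper:.4f})")
```

The half-width computed here is what a TV poll would report as its margin of error.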
Example (This problem was supplied by Chris Weldon, SMU MBA Class P40). A large North Texas company acquired a new business unit to deliver company mail to each of its 15 North Texas sites (3-5 deliveries per day) in the fall of 1997. Mail included U.S. mail, small manufactured parts, internal documents, packages, etc. The cost of this unit is about $1,000,000 a year, and so management felt it was necessary to monitor the performance of its deliveries (i.e., getting the right package to the right location). It was too costly to keep records of every item, and so they decided to randomly sample items (usually by auditors at each site's mail drop) to estimate the true delivery accuracy. In October of 1997, 8717 units were randomly checked, of which 69 were delivered in error. Calculate a 95% confidence interval for the true proportion of accurate deliveries.

Estimating an Unknown Population Variance

In the vast majority of real-life data situations, you will not know the population variance $\sigma^2$. The formulas for confidence intervals developed earlier do not apply “as is.” There are a number of ways to handle the problem, but most focus on using an estimate of the true population variance $\sigma^2$. Given a random sample $(X_1, X_2, \ldots, X_n)$ from a large population, the most common estimate of the population variance is the sample variance, denoted by $s^2$, and given by the formula

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$

This value is computed in Excel with the VAR(:) function. The argument (:) references the spreadsheet cells containing the sample data. The sample standard deviation is $s$ (the square root of the sample variance), and it is given by the Excel function STDEV(:). Alternatively, $s$ can be calculated by taking the square root of VAR(:).

Confidence Intervals with an Unknown Population Variance: The t Distribution

In our earlier confidence interval formulas, we assumed that the standard deviation $\sigma$ was known (see the hail dent analysis).
This allowed us to use z-values from a standard normal and compute confidence intervals. If we do not know $\sigma$, it is tempting to simply insert the sample estimate $s$ in its place. Fortunately, we can do this with only minor modifications. However, this requires an understanding of a new distribution called the t distribution.

If a random sample is drawn from a normal distribution, the quantity

$$\frac{\bar{X} - \mu}{s/\sqrt{n}}$$

follows a t distribution with n-1 degrees of freedom (df). Note that the denominator has an $s$ instead of a $\sigma$. The “degrees of freedom” is a parameter that you do not need to estimate since it depends directly on the sample size n. The t distribution is symmetric about 0 and looks a lot like the normal distribution (especially as the degrees of freedom increase). The general shape for various degrees of freedom is depicted below.

[Figure: t distribution vs. z distribution. Curves centered at 0 for t with 10 df, t with 20 df, and t with df > 30 (almost normal).]

To calculate probabilities for the t distribution, we use the TDIST function in Excel. The value p=TDIST(x, df, 1) is the area (=probability) above a nonnegative value x from a t distribution with df degrees of freedom. The value p=TDIST(x, df, 2) is the combined area above x and below -x.

[Figure: t density with the points -x and x marked. The area above x is p = TDIST(x, df, 1); the combined area below -x and above x is p = TDIST(x, df, 2).]

To convert a probability into a t-value, we use the TINV function. In Excel, TINV(p,df) is the value x such that p/2 is the area above x in a t distribution with df degrees of freedom.

[Figure: t density with x = TINV(p,df) marked; the area above x is p/2.]

One must remember that TINV splits the given probability (p) in half when calculating the x value.

Confidence Intervals using the t Distribution

If we continue to assume that the population we are sampling from is normal (or approximately normal), then we can construct a $100(1-\alpha)\%$ confidence interval (CI) for the mean using the formula

$$\bar{x} \pm t_{\alpha/2,\,df}\,\frac{s}{\sqrt{n}},$$

where df = n-1. Because TINV splits its probability argument in half, the cutoff $t_{\alpha/2,\,df}$ is obtained in Excel as TINV(α, df).

Example. (Brent Pope, SMU MBA Class 46P).
Airco has inspected 81 armature hubs and measured their roughness. The data are provided in Hubs.xls. Calculate a 95% confidence interval (CI) for the mean roughness. Calculate a 98% confidence interval (CI) for the mean roughness.

Example. (Tamir Ayad, SMU MBA Class P43P) A new production process is being tested to reduce the number of “large” particles (2 microns and higher) that contaminate silicon wafers. A random sample of n=81 wafers is drawn and the large contaminating particles counted. The following sample statistics were computed: $\bar{x} = 1.88$, $s^2 = 2.71$. Construct a 99% confidence interval for the mean of the new process.

Note: When there is sufficient data and you are using the t distribution, you should use the HISTOGRAM function in DATA ANALYSIS (under TOOLS) to see if the data appear to be roughly normally distributed. Double click on HISTOGRAM and fill out your options (select “Chart Output” to get the visual display or histogram).

FAQ: Selecting a Sample Size (*Optional)

Suppose you want to construct a confidence interval to estimate the mean of some population. Note that the width of the confidence interval $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ depends on quantities that do not change with the sample mean. In fact, the width only depends on $z_{\alpha/2}$, $\sigma$, and n. If a level of confidence is specified ($1-\alpha$) and $\sigma$ is known (or can be estimated), then the width of the confidence interval can be made narrower by selecting a larger value of n.

Example. Suppose in the KIA Motors example we want a 95% confidence interval whose width is $20. How big a sample should we take?

Solution. This means the half-width is $10. We need to solve

$$z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} = 10$$

for n. The general formula is

$$n = \left(\frac{z_{\alpha/2}\,\sigma}{D}\right)^2,$$

where
1) $z_{\alpha/2}$ is determined from an N(0,1) table after the level $1-\alpha$ is specified
2) $\sigma$ is the population standard deviation, which is known (or approximated)
3) D is the desired half-width of the confidence interval.
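The sample size formula above can be sketched in Python as follows. This is a hedged illustration rather than the notes' own method: the function name required_sample_size and the input values are hypothetical, the z cutoff comes from statistics.NormalDist instead of an N(0,1) table, and rounding up to a whole observation is an added assumption (so that the half-width does not exceed D).

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(sigma, half_width, alpha=0.05):
    """Return the smallest whole n with z_{alpha/2} * sigma / sqrt(n) <= half_width."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # n = (z * sigma / D)^2, rounded up because n must be a whole number.
    return ceil((z * sigma / half_width) ** 2)


# Hypothetical inputs (not from the notes): sigma = 50, desired half-width D = 10.
n = required_sample_size(sigma=50.0, half_width=10.0, alpha=0.05)
print(f"required sample size: {n}")
```

Halving the desired half-width D roughly quadruples the required n, since n grows with the square of 1/D.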
In practice, the standard deviation $\sigma$ can often be approximated by the formula

$$\sigma \approx \frac{\text{Range}}{4},$$

where the Range is the difference (in absolute value) between the biggest value and the smallest value in a random sample.

Assignment 3

Central Limit Theorem Problems
1. Book, 7.24 (Use CLT for n ≥ 30) NOTE: On part (d), simply determine if the sample size in part (c) is adequate. What would you recommend with respect to the sample size n=45?
2. Book, 7.25 (Use CLT for n ≥ 30)

Confidence Interval Problems
3. Book, 8.17 (Remember, if you use s for σ, use t!)
4. Book, 8.21
5. Book, 8.22

Bonus (5 pts) Book, 8.37 (this requires calculating a confidence interval for a proportion)