Chapter 8: Confidence Intervals

Inferential Statistics: Population data are usually difficult to gather. For instance, suppose that we want to know the average height of all Americans. It is unrealistic for a researcher to measure the height of each and every American; it’s virtually impossible. In situations such as this, it’s prudent to estimate the height of all Americans using a more manageable (smaller) sample.

The assignment of value(s) to a population parameter based on a value of the corresponding sample statistic is called estimation.

Point Estimate: ________

Interval Estimate: ________

Example 0.1. Suppose a marine biologist would like to estimate the mean birthweight of the Loggerhead sea turtles on the Treasure Coast. It’s practically impossible to find and weigh every Loggerhead sea turtle hatchling on the Treasure Coast, so the biologist must gather a random sample of Loggerhead hatchling birthweights and use that sample data to estimate the mean birthweight of all Loggerhead hatchlings. Imagine one such sample of 100 Loggerhead hatchlings yielded a mean birthweight of 20 grams. The sample mean x̄ = 20 is referred to as the point estimate. That being said, it’s unlikely that the true average birthweight of all Loggerhead sea turtles on the Treasure Coast is precisely 20 grams, so the researcher may instead construct an interval estimate and say that the average birthweight is likely somewhere between 18 grams and 22 grams.

A confidence interval is an interval that is constructed around a point estimate, and it is stated that we are XX% confident that the interval contains the true population parameter. Thus, a confidence interval provides a range of reasonable values in which we expect the population parameter (µ or p) to fall. There is no guarantee that any single confidence interval will contain the unknown population parameter; the XX% describes how often intervals constructed this way succeed.
We will consider confidence intervals of the form

    Point Estimate ± Margin of Error

That is, to calculate a confidence interval, we simply need to calculate the point estimate (x̄ or p̂) and the margin of error, and then add/subtract these values to determine the lower and upper bounds of the confidence interval.

Note: since a confidence interval is obtained by adding/subtracting the margin of error from the point estimate, the overall width of a confidence interval will always (in this course) be double the size of the margin of error.

Three Basic Scenarios for Confidence Intervals
1. Confidence interval for a population mean when σ is known
2. Confidence interval for a population mean when σ is unknown
3. Confidence interval for a population proportion

We’ll soon introduce a new statistical table to use for scenario 2. How do we determine which statistical table to use? Use the following flow chart. By the way, we’ll see three similar scenarios in Chapter 9.

    CI for µ or p?
      µ: Is σ known?
        Yes: Use z-table
        No:  Use t-table
      p: Use z-table

Important Note: A confidence interval is constructed with regard to a specific ‘confidence level.’ The confidence level indicates the percentage of intervals that should contain the true, unknown population mean when numerous samples are drawn from the same population and a confidence interval is constructed from each. It is not guaranteed that any one confidence interval will contain the true unknown population mean. For instance, if the confidence level is 95%, then we would expect about 95 out of every 100 confidence intervals to contain the true population mean. In other words, there is no guarantee that any of our individual solutions in this chapter are ‘correct’ and contain the true population mean.

1 Estimating a Population Mean When σ is Known

The first scenario for which we would like to calculate a confidence interval is when we wish to estimate a population mean when σ is known.
That is, we want to estimate the population mean using sample data when we happen to know the population standard deviation already (perhaps from previous research). Of course, for the Central Limit Theorem to apply to the sampling distribution of x̄ and guarantee approximate normality, the sample size must be 30 or larger. Alternatively, the sampling distribution of x̄ will be normal if the population distribution is known to be normal. See Chapter 7 for details.

The confidence interval for a population mean (µ) when σ is known is given by

    $\bar{x} \pm E$ where $E = z \cdot \sigma_{\bar{x}}$.

The value of z corresponds to the desired level of confidence and $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$.

How exactly does one determine what value of z to use in the expression above? To determine the z-value that corresponds to a XX% confidence level, sketch a standard normal curve with XX% of the area under the curve centered about the origin, then find the value of z that separates the shaded middle region from the unshaded right tail.

Example 1.1. Determine the z-value that corresponds to a 90% confidence level.

To determine the z-value that corresponds to a 90% confidence level, we first sketch a standard normal curve and shade the middle 90%, centered about the mean. Recall that the total area under the standard normal curve is 1, thus the area of the middle 90% is 0.90. If the middle shaded portion has area 0.90, it follows that the unshaded portion (the tails) of the curve has area 1 − 0.90 = 0.10, or 10%. Furthermore, due to the symmetric nature of the standard normal distribution, each unshaded tail must have an area of 0.05 (half of 0.10). Lastly, the desired z-value is the z-value that separates the shaded middle 90% from the unshaded 0.05 (5%) in the right tail. Sound familiar? We did this exact exercise in Chapter 6. Before we can solve, we need to note the total area to the left of the desired z-value, which is 0.05 + 0.90 = 0.95.
Once the area to the left is known, the z-value can finally be found by using the z-table (find 0.95 in the interior of the z-table) or by using the ‘invNorm’ feature of a graphing calculator. Either method will yield the solution z = 1.64. Thus, the z-value that is used for a 90% confidence interval is 1.64.

[Figure: standard normal curve with the middle 0.90 shaded, 0.05 in each tail, and the boundary z = ? marked on the right.]

The z-value that corresponds to any other confidence level can be found using the exact method outlined in Example 1.1 above.

Example 1.2. Determine the z-value that corresponds to each of the common confidence levels: 95%, 96%, 97%, 98%, and 99%.

    Confidence Level    z-value
    90%                 z = 1.64
    95%                 z = 1.96
    96%                 z = 2.05
    97%                 z = 2.17
    98%                 z = 2.33
    99%                 z = 2.58

Example 1.3. The nursing department needs to create an informational brochure for students interested in a nursing career. The average salary of a nurse on the Treasure Coast needs to be included in this brochure. Of course, it’s not practical to contact every nurse on the Treasure Coast, ask for his or her salary, and then compute the true population average salary. Instead, the nursing department contacts a random sample of 36 nurses, asks for the salary information of each, and then determines that the sample average of those 36 nurses is $56,000. Since the sample data did not include every nurse on the Treasure Coast, the population mean is most likely not $56,000 exactly, but close. Thus, it’s appropriate to construct a confidence interval around x̄ = 56,000, which will yield a range of likely values of the average salary of all nurses on the Treasure Coast. Suppose the standard deviation of all nursing salaries on the Treasure Coast is $6,000 and the nursing department would like to use a 99% confidence level for their estimate.
A. Determine the point estimate for the average salary of all nurses on the Treasure Coast.
B.
Determine the margin of error for a 99% confidence interval.
C. Construct a 99% confidence interval for the average salary of all nurses on the Treasure Coast.

Example 1.4. Reference Example 1.3 above and suppose the nursing department decided to instead construct a 95% confidence interval for the mean salary of nurses on the Treasure Coast, in lieu of the original 99% confidence interval. If all other aspects of Example 1.3 remain the same, explore how the confidence interval is affected by the lowering of the confidence level.
A. Determine the point estimate for the average salary of all nurses on the Treasure Coast.
B. Determine the margin of error for a 95% confidence interval.
C. Construct a 95% confidence interval for the average salary of all nurses on the Treasure Coast.

Example 1.4 demonstrates that the level of confidence is directly related to the margin of error. If two confidence intervals with different levels of confidence are constructed from the same sample data, the confidence interval with the lower level of confidence will have the smaller margin of error. In general, a lower margin of error is desired, but arbitrarily lowering the level of confidence is not recommended. There is a better way to decrease the margin of error.

Example 1.5. Once again, reference Example 1.3 above and suppose the nursing department decided that the original margin of error was too large, but they do not want to decrease the level of confidence (as in Example 1.4). Instead, they decide to increase their sample size from the original 36 to 100. Presume this new sample of 100 nurses on the Treasure Coast yielded the same average salary of $56,000. If all aspects of Example 1.3 remain unchanged except for sample size, explore how the confidence interval is affected by the increased sample size.
A. Determine the point estimate for the average salary of all nurses on the Treasure Coast.
B. Determine the margin of error for a 99% confidence interval.
C.
Construct a 99% confidence interval for the average salary of all nurses on the Treasure Coast.

Therefore, the best way to decrease the margin of error of a confidence interval is to increase the sample size. Decreasing the level of confidence will also decrease the margin of error, but in ‘real life’ one would never change the confidence level to achieve a smaller margin of error. Confidence levels are determined by discipline (e.g. medical research may use 99% but chemists may use 95%).

In Examples 1.3, 1.4, and 1.5, the interaction between sample size, confidence level, and the margin of error of a confidence interval was investigated. We noted that using a larger sample size or a smaller confidence level resulted in a relatively lower margin of error for a confidence interval. Taking this relationship one step further, we can actually find the sample size required to produce a desired margin of error, given a predetermined level of confidence, maximum margin of error, and population standard deviation.

The sample size required to produce a confidence interval for µ with a maximum margin of error E is given by

    $n = \dfrac{z^2 \sigma^2}{E^2}$

where z corresponds to the desired level of confidence and σ is the population standard deviation.

Note: If the initial calculation of n does not yield a whole number, then the value must be rounded up to the nearest whole number.

Example 1.6. Reference Example 1.3. Suppose the nursing department has yet to collect any data for their confidence interval of average nursing salaries on the Treasure Coast. Determine the sample size necessary to produce a confidence interval with a maximum margin of error of $1,000 if the desired confidence level is 99% and the population standard deviation is known to be $6,000.

2 Estimating a Population Mean When σ is Unknown

In the previous section, we calculated confidence intervals using the expression x̄ ± E, where the margin of error was calculated using the formula E = z · σx̄.
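As a recap, the known-σ calculations from the previous section can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the course materials: the names `Z_TABLE`, `z_interval`, and `required_sample_size` are made up for this example, and the z-values are the rounded table values listed in Example 1.2 rather than software-precision values.

```python
import math

# z-values copied from the table of common confidence levels (Example 1.2).
# These are rounded table values, assumed here for illustration.
Z_TABLE = {90: 1.64, 95: 1.96, 96: 2.05, 97: 2.17, 98: 2.33, 99: 2.58}


def z_interval(x_bar, sigma, n, level):
    """Return (margin_of_error, lower, upper) for a mean when sigma is known."""
    z = Z_TABLE[level]
    std_error = sigma / math.sqrt(n)  # sigma_xbar = sigma / sqrt(n)
    margin = z * std_error            # E = z * sigma_xbar
    return margin, x_bar - margin, x_bar + margin


def required_sample_size(sigma, max_margin, level):
    """Return the smallest whole n whose margin of error is at most max_margin."""
    z = Z_TABLE[level]
    # n = z^2 * sigma^2 / E^2, always rounded UP to a whole number
    return math.ceil((z * sigma / max_margin) ** 2)


# Nursing salary figures from Examples 1.3, 1.4, and 1.5.
for n, level in [(36, 99), (36, 95), (100, 99)]:
    margin, lower, upper = z_interval(56_000, 6_000, n, level)
    print(f"n={n}, {level}%: E={margin:.2f}, interval=({lower:.2f}, {upper:.2f})")

# Figures from Example 1.6.
print("n needed for E <= 1000 at 99%:", required_sample_size(6_000, 1_000, 99))
```

Running the loop makes the pattern from Examples 1.4 and 1.5 visible side by side: lowering the confidence level or raising the sample size both shrink the margin of error, while the point estimate stays at the center of every interval.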
A natural follow-up question is, ‘How does one calculate a confidence interval if the population standard deviation (σ) is not known?’ This very question was answered by William Gosset in 1908. Using the pseudonym Student, Gosset published the t-distribution (often referred to as ‘Student’s t-distribution’), which is actually a family of distributions that takes into account the sampling error of the standard deviation for relatively small sample sizes. Why is this necessary? For small sample sizes, the difference between the population and sample standard deviations is more pronounced, leading to greater variability in the sampling distributions for small n.

[Figure: density curves of N(0, 1), t(2), and t(8), plotted from −4 to 4.]

As shown in the graphic, the t-distribution
• is symmetric about its mean t = 0,
• is bell shaped, and
• has ‘heavier’ tails compared to N(0, 1).

Furthermore, the t-distribution has a single parameter: degrees of freedom. For the purposes of this course, degrees of freedom (df) can be calculated as df = n − 1.

As illustrated by the graphic above, the t-distribution is similar to, but noticeably different from, the z-distribution when n is small. However, as n → ∞, the t-distribution becomes practically indistinguishable from the z-distribution. In fact, the t-distribution table we will use in this class only includes calculations up to df = 75 (or n = 76), since the t- and z-distributions are basically the same at that point.

To construct a confidence interval for a population mean (µ) when the population standard deviation (σ) is not known, the t-distribution is used instead of the standard normal (z) distribution. Otherwise, the method is virtually the same.

The confidence interval for a population mean (µ) when σ is not known is given by

    $\bar{x} \pm E$ where $E = t \cdot s_{\bar{x}}$.

The value of t corresponds to the desired level of confidence and $s_{\bar{x}} = \dfrac{s}{\sqrt{n}}$.

Notice that the confidence interval is calculated using the same basic expression x̄ ± E. The only difference lies in the way E is calculated. If σ is known, E = z · σx̄ (see previous section). However, if σ is not known, E = t · sx̄.

Important Note: In practice (outside of this class), the population standard deviation will be unknown, just like the population mean is unknown. Therefore, the z-distribution is rarely used for confidence intervals for a mean apart from its introduction in a class like this. The t-distribution is usually required when calculating confidence intervals out in the ‘real world.’

So how does one determine the proper t-value to use when calculating the margin of error? Use the following method.

Method to determine the t-value to use for a confidence interval when σ is not known
1. Calculate the degrees of freedom. Recall, df = n − 1.
2. Locate the row equal to the result from Step 1.
3. Locate the column equal to the desired confidence level (90%, 95%, etc.).
4. The desired t-value is the intersection of the row from Step 2 and the column from Step 3.

Notice that the only confidence levels included on the t-table are 80%, 90%, 95%, 98%, 99%, and 99.9%. This is due to the fact that the t-distribution is actually a family of distributions: each value of df defines a separate distribution. Imagine that each df value defines its own table, similar to the z-table. To construct a manageable table of t-values for students to use, only select confidence levels were included.

Example 2.1. The American Automobile Association, better known as AAA or ‘Triple A,’ publishes the average daily gas price for the entire nation as well as individual states. To estimate the average gas price in Florida, AAA contacted a random sample of 50 gas stations from around the state and found that the average gas price was $2.46 per gallon with a standard deviation of $0.15.
A. Determine the point estimate for the average gas price per gallon in all of Florida.
B.
Determine the margin of error for a 99% confidence interval.
C. Construct a 99% confidence interval for the average gas price per gallon in Florida.

Example 2.2. The U.S. Bureau of Labor Statistics (BLS) releases a monthly report on the average hourly wage of American employees. A recent sample of 500 American employees produced a mean hourly wage of $28.18 with a standard deviation of $1.35.
A. Determine the point estimate for the average hourly wage of all American employees.
B. Determine the margin of error for a 98% confidence interval.
C. Construct a 98% confidence interval for the average hourly wage of all American employees.

Example 2.3. The college would like to estimate the average number of hours students work per week. Suppose 15 students are randomly selected and asked how many hours they work per week. The results are given below.

    10, 8, 0, 12, 15, 30, 25, 20, 0, 40, 10, 12, 35, 20, 15

Assume the hours students work per week are known to be normally distributed.
A. Determine the point estimate for the average number of hours worked per week by all students at the college.
B. Determine the margin of error for a 90% confidence interval.
C. Construct a 90% confidence interval for the average number of hours worked per week by all students at the college.

3 Estimating a Population Proportion

The final type of estimation we will consider is that of proportions. Recall from Chapter 7 that a proportion is simply a percent in decimal form.

Example 3.1. Suppose 80 out of 100 students in a large class pass a test. The percentage of students who passed the test is 80%, while the proportion of students who passed the test is 0.80. Since this proportion involves the entire class, we refer to 0.80 as the population proportion and write p = 0.80.

Example 3.2. Fifty students at the college were asked if they plan to transfer to a four-year degree program after earning their associate’s degree.
Thirty-five of the students responded affirmatively. Thus, the percentage of students that plan to transfer to a four-year degree program is (35/50) · 100 = 70%, while the corresponding proportion is 0.70. Since this proportion was calculated from sample data, it is referred to as a sample proportion and we write p̂ = 0.70.

Even though we are now working with proportions, we are still doing the same basic task that we have been focused on all chapter: using sample data to estimate a population parameter using the expression

    Point Estimate ± Margin of Error

Previously, we used a sample mean (x̄) to estimate a population mean (µ). In this last section, we will use a sample proportion (p̂) to estimate a population proportion (p).

The confidence interval for a population proportion (p) is given by

    $\hat{p} \pm E$ where $E = z \cdot s_{\hat{p}}$.

The value of z corresponds to the desired level of confidence and $s_{\hat{p}} = \sqrt{\dfrac{\hat{p}\hat{q}}{n}}$, where $\hat{q} = 1 - \hat{p}$.

As discussed in Chapter 7, the prerequisites for the application of the Central Limit Theorem to sampling distributions of p̂ are that ________ and ________. Thus, one should always ensure that both statements are true before constructing a confidence interval for a population proportion.

Example 3.3. Each month, the Bureau of Labor Statistics (BLS) estimates the national unemployment rate. The unemployment rate is defined as the percentage of the labor force that is not currently employed but could be. Suppose a recent BLS survey found that 54 out of a random sample of 1500 Americans were jobless.
A. Determine the point estimate for the proportion of all Americans who are unemployed.
B. Determine the margin of error for a 97% confidence interval.
C. Construct a 97% confidence interval for the proportion of unemployed Americans.

Example 3.4. Each autumn, the consulting firm PwC surveys 2000 Americans regarding their holiday spending plans. This year, they found that 54% of respondents planned to do the majority of their holiday shopping online.
A.
Determine the point estimate for the proportion of all Americans that plan to do the majority of their holiday shopping online.
B. Determine the margin of error for a 90% confidence interval.
C. Construct a 90% confidence interval for the proportion of all Americans that plan to do the majority of their holiday shopping online.

Example 3.5. A political campaign would like to estimate the proportion of likely voters that plan to support their candidate. A campaign staffer selected a random sample of 30 likely voters and asked them if they planned to vote for this candidate. The responses of these likely voters are given below.

    Yes  Yes  No   No   Yes  Yes  Yes  Yes  No   Yes
    No   No   No   Yes  Yes  Yes  Yes  No   No   No
    Yes  Yes  Yes  Yes  No   Yes  Yes  Yes  Yes  No

A. Determine the point estimate for the proportion of all ‘Yes’ voters.
B. Determine the margin of error for a 96% confidence interval.
C. Construct a 96% confidence interval for the proportion of all ‘Yes’ voters.
D. Do you think the campaign manager will be satisfied with the confidence interval? If not, how might the confidence interval be improved?

The sample size required to produce a confidence interval for the population proportion with a maximum margin of error E is given by

    $n = \dfrac{z^2 \hat{p}\hat{q}}{E^2}$

where z corresponds to the desired level of confidence and p̂ is the sample proportion.

Note: If the initial calculation of n does not yield a whole number, then the value must be rounded up to the nearest whole number.

The value of p̂ is obtained by collecting preliminary sample data. On the other hand, if no preliminary sample data are available, a most conservative (worst case) estimate of the sample size required can be obtained by setting p̂ = 0.50.

Example 3.6. Suppose the campaign staffer in Example 3.5 knew about the above result before collecting any data for the confidence interval.
A.
Determine the most conservative estimate for the sample size required for the 96% confidence interval in Example 3.5 if the maximum margin of error allowed is 2%.
B. Suppose the data collected in Example 3.5 serves as a preliminary estimate of p̂. Determine the sample size required for the 96% confidence interval in Example 3.5 if the maximum margin of error allowed is 2%.

As was found earlier in the chapter, the best way to decrease the margin of error and thus shrink the overall width of the confidence interval is to increase the sample size. Although lowering the level of confidence will also shrink the margin of error, this practice is not advisable since confidence levels are set by discipline and should always be adhered to.

Very Important Note about the Precision of Solutions

Since Chapter 6, we have seen that rounding errors can greatly affect the final solution of a probability calculation. For example, when calculating probability ‘by hand,’ we round z-scores to two decimal places and t-scores to four decimal places, and anytime one rounds at an intermediate step of a multistep calculation, the final solution is affected. We will experience this same issue with confidence intervals in Chapter 8 (and again with hypothesis tests in Chapter 9).

If you have a graphing calculator, it’s best to use its built-in functionality to calculate a confidence interval. This mitigates the error that occurs from rounding. On the other hand, a graphing calculator is not required for this course. Thus, if you are doing calculations ‘by hand’ using the z- and t-tables, be sure to realize that your solutions will be a little different from solutions found via a graphing calculator (or other software). If you’re working with large data values, this difference could be quite pronounced. However, this does not mean a solution reached ‘by hand’ is incorrect.
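The size of this rounding effect can be illustrated with a short Python sketch. It is an illustration only, under stated assumptions: it reuses the nursing salary figures from Example 1.3 (σ = 6,000, n = 36, 99% confidence), the helper names `z_for_level` and `margin_of_error` are invented for this example, and Python’s standard-library `statistics.NormalDist` stands in for the graphing calculator’s ‘invNorm’ feature.

```python
import math
from statistics import NormalDist


def z_for_level(level):
    """Unrounded z-value that leaves `level` of the area in the middle of N(0, 1)."""
    area_to_left = level + (1 - level) / 2  # e.g. 0.99 -> 0.995
    return NormalDist().inv_cdf(area_to_left)


def margin_of_error(z, sigma, n):
    """E = z * sigma / sqrt(n), the known-sigma margin of error."""
    return z * sigma / math.sqrt(n)


z_software = z_for_level(0.99)   # full precision, as software would use
z_by_hand = round(z_software, 2)  # two decimal places, as read from a z-table

e_software = margin_of_error(z_software, 6_000, 36)
e_by_hand = margin_of_error(z_by_hand, 6_000, 36)

print(f"z (software) = {z_software:.4f}, z (by hand) = {z_by_hand:.2f}")
print(f"E (software) = {e_software:.2f}, E (by hand) = {e_by_hand:.2f}")
print(f"difference   = {e_by_hand - e_software:.2f}")
```

Because the standard error here is 1,000, a difference of a few thousandths in z turns into a difference of a few dollars in the margin of error, which is the kind of gap the solution tolerance described in this note is meant to absorb.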
Due to the variation of ‘correct’ solutions to these types of questions, homework and quiz exercises are coded to accept a range of possible solutions (referred to as solution tolerance). This ensures that solutions reached via a graphing calculator or ‘by hand’ are counted as correct. As far as multiple-choice tests are concerned, students must understand that the correct solution will often be generated by software, which means it will be equal to a graphing calculator solution (where little rounding took place). Solutions reached ‘by hand’ may not be exactly the same, but will be close. It should be obvious which choice to select on a multiple-choice question.

Lastly, since homework and quiz exercises will often accept a range of correct solutions, this sometimes allows solutions that are actually incorrect to be scored as correct. This can potentially be confusing. For instance, it’s possible that a margin of error calculation is not correct, but is scored as correct by the homework software because it’s ‘close enough’ to the correct answer. However, when that same (incorrect) margin of error is used to calculate a confidence interval, the confidence interval solution might not be close enough to be scored as the correct answer. In reality, both solutions are incorrect and should receive no credit, but the first solution is graded as correct because it’s ‘close enough’ while the second solution is not. This does not happen often, but it happens occasionally and students should be aware of the possibility. There’s really no way around the issue as long as some students are doing calculations ‘by hand’ and others are using a graphing calculator.

If you are ever working on a homework or quiz question and you feel that your solution is correct, but it is not being graded as such, please let me know and I will investigate the issue. Occasionally, a homework or quiz question’s coding will need to be tweaked.