Point and Interval Estimates
Hope Sabao (PhD)
University of Lusaka
9th July, 2021

Estimation

Estimation is a procedure by which a numerical value or values are assigned to a population parameter based on the information collected from a sample.

Definition
Estimation: The assignment of value(s) to a population parameter based on a value of the corresponding sample statistic is called estimation.

In inferential statistics, µ is called the true population mean and p is called the true population proportion. There are many other population parameters, such as the median, mode, variance, and standard deviation.

Example
The central statistics office in Zambia may want to find the mean housing expenditure per month incurred by households. This is an illustration of estimating the true population mean µ.

If we could conduct a census (a survey that includes the entire population) each time we wanted to find the value of a population parameter, then the estimation procedures explained in this and subsequent chapters would not be needed. For example, if the central statistics office could contact every household in Zambia to find the mean housing expenditure incurred by households, the result of the survey (which would actually be a census) would give the value of µ, and the procedures learned in this chapter would not be needed.

However, it is usually too expensive, too time consuming, or virtually impossible to contact every member of a population to find the true value of a population parameter. Therefore, we usually take a sample from the population and calculate the value of the appropriate sample statistic. Then we assign a value or values to the corresponding population parameter based on the value of the sample statistic. This chapter (and subsequent chapters) explains how to assign values to population parameters based on the values of sample statistics.
To estimate the mean housing expenditure per month incurred by all households in Zambia, the central statistics office will take a sample of households, collect the information on the housing expenditure that each of these households incurs per month, and compute the value of the sample mean, x̄. Based on this value of x̄, the office will then assign values to the population mean, µ.

Definition
Estimate and Estimator: The value(s) assigned to a population parameter based on the value of a sample statistic is called an estimate. The sample statistic used to estimate a population parameter is called an estimator.

The estimation procedure involves the following steps.
1. Select a sample.
2. Collect the required information from the members of the sample.
3. Calculate the value of the sample statistic.
4. Assign value(s) to the corresponding population parameter.

Point and Interval Estimates

A Point Estimate

If we select a sample and compute the value of the sample statistic for this sample, then this value gives the point estimate of the corresponding population parameter.

Definition
Point Estimate: The value of a sample statistic that is used to estimate a population parameter is called a point estimate.

Thus, the value computed for the sample mean, x̄, from a sample is a point estimate of the corresponding population mean, µ. For the example mentioned earlier, suppose the central statistics office takes a sample of 10,000 households and determines that the mean housing expenditure per month, x̄, for this sample is $1970. Then, using x̄ as a point estimate of µ, the office can state that the mean housing expenditure per month, µ, for all households is about $1970. Thus,

Point estimate of a population parameter = Value of the corresponding sample statistic

Each sample selected from a population is expected to yield a different value of the sample statistic. Thus, the value assigned to a population mean, µ, based on a point estimate depends on which of the samples is drawn.
Consequently, the point estimate assigns a value to µ that almost always differs from the true value of the population mean.

An Interval Estimate

In the case of interval estimation, instead of assigning a single value to a population parameter, an interval is constructed around the point estimate, and then a probabilistic statement that this interval contains the corresponding population parameter is made.

Definition
Interval Estimation: In interval estimation, an interval is constructed around the point estimate, and it is stated that this interval is likely to contain the corresponding population parameter.

For the example about the mean housing expenditure, instead of saying that the mean housing expenditure per month for all households is $1970, we may obtain an interval by subtracting a number from $1970 and adding the same number to $1970. Then we state that this interval probably contains the population mean, µ. For purposes of illustration, suppose we subtract $340 from $1970 and add $340 to $1970. Consequently, we obtain the interval ($1970 − $340) to ($1970 + $340), or $1630 to $2310. Then we state that the interval $1630 to $2310 is likely to contain the population mean, µ; that is, the mean housing expenditure per month for all households in Zambia is likely to be between $1630 and $2310. This procedure is called interval estimation. The value $1630 is called the lower limit of the interval, and $2310 is called the upper limit of the interval. The number we add to and subtract from the point estimate is called the margin of error.

The question arises: What number should we subtract from and add to a point estimate to obtain an interval estimate? The answer to this question depends on two considerations:
1. The standard deviation σ_x̄ of the sample mean, x̄
2. The level of confidence to be attached to the interval

First, the larger the standard deviation of x̄, the greater is the number subtracted from and added to the point estimate.
Thus, if x̄ can vary over a wider range of values, the interval constructed around x̄ must be wider to include µ. Second, the quantity subtracted and added must be larger if we want to have a higher confidence in our interval.

We always attach a probabilistic statement to the interval estimation. This probabilistic statement is given by the confidence level. An interval constructed based on this confidence level is called a confidence interval.

Definition
Confidence Level and Confidence Interval: Each interval is constructed with regard to a given confidence level and is called a confidence interval. The confidence interval is given as

Point estimate ± Margin of error

The confidence level associated with a confidence interval states how much confidence we have that this interval contains the true population parameter. The confidence level is denoted by (1 − α)100%, where α is the Greek letter alpha. When expressed as a probability, it is called the confidence coefficient and is denoted by 1 − α.

Although any value of the confidence level can be chosen to construct a confidence interval, the more common values are 90%, 95%, and 99%. The corresponding confidence coefficients are .90, .95, and .99, respectively.

Estimation of a Population Mean: σ Known

We now look at how to construct a confidence interval for the population mean when the population standard deviation is known. Here, there are three possible cases, as follows:

Case I. If the following three conditions are fulfilled:
1. The population standard deviation σ is known
2. The sample size is small (i.e. n < 30)
3. The population from which the sample is selected is normally distributed,
then we use the normal distribution to make the confidence interval for µ.

Case II. If the following two conditions are fulfilled:
1. The population standard deviation σ is known
2. The sample size is large (n ≥ 30)
then, again, we use the normal distribution to make the confidence interval for µ.

Case III. If the following three conditions are fulfilled:
1. The population standard deviation σ is known
2. The sample size is small (i.e. n < 30)
3. The population from which the sample is selected is not normally distributed (or its distribution is unknown),
then we use a nonparametric method to make the confidence interval for µ. Such methods are beyond the scope of this course.

Confidence Interval for µ

The (1 − α)100% confidence interval for µ under Cases I and II is

\[ \bar{x} \pm z\,\sigma_{\bar{x}}, \qquad \text{where } \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \]

The value of z used here is obtained from the standard normal distribution table. The quantity zσ_x̄ in the confidence interval formula is called the margin of error and is denoted by E.

Definition
The margin of error for the estimate for µ, denoted by E, is the quantity that is subtracted from and added to the value of x̄ to obtain a confidence interval for µ. Thus,

\[ E = z\,\sigma_{\bar{x}} \]

The value of z in the confidence interval formula is obtained from the standard normal distribution table for the given confidence level. To illustrate, suppose we want to construct a 95% confidence interval for µ. A 95% confidence level means that the total area under the normal curve for x̄ between two points (at the same distance) on different sides of µ is 95%, or 0.95.

[Figure: normal curve for x̄ with an area of 0.95 between the two points z1 and z2.]

Note that we have denoted these two points by z1 and z2. To find the value of z for a 95% confidence level, we first find the areas to the left of these two points, z1 and z2. Then we find the z values for these two areas from the normal distribution table. Note that these two values of z will be the same but with opposite signs.
To find these values of z, we perform the following two steps:

Step 1
The first step is to find the areas to the left of z1 and z2, respectively. Note that the area between z1 and z2 is denoted by 1 − α. Hence, the total area in the two tails is α because the total area under the curve is 1.0. Therefore, the area in each tail is α/2. In our example, 1 − α = 0.95. Hence, the total area in both tails is α = 1 − 0.95 = 0.05. Consequently, the area in each tail is

\[ \frac{\alpha}{2} = \frac{0.05}{2} = 0.025 \]

Then, the area to the left of z1 is 0.0250 and the area to the left of z2 is 0.0250 + 0.95 = 0.9750.

Step 2
Now find the z values from the standard normal table such that the areas to the left of z1 and z2 are .0250 and .9750, respectively. These z values are −1.96 and 1.96, respectively.

Thus, for a confidence level of 95%, we will use z = 1.96 in the confidence interval formula. The following table lists the z values for the confidence levels used in this chapter. Note that we always use the positive value of z in the formula.

Confidence level    Area to the left of z    z value
90%                 .9500                    1.65
95%                 .9750                    1.96
99%                 .9950                    2.58

Example
A publishing company has just published a new college textbook. Before the company decides the price at which to sell this textbook, it wants to know the average price of all such textbooks in the market. The research department at the company took a sample of 25 comparable textbooks and collected information on their prices. This information produced a mean price of $145 for this sample. It is known that the standard deviation of the prices of all such textbooks is $35 and the population of such prices is normal.
(a) What is the point estimate of the mean price of all such college textbooks?
(b) Construct a 90% confidence interval for the mean price of all such college textbooks.

Solution
Here, σ is known. Although n < 30, the population is normally distributed. Hence, we can use the normal distribution.
From the given information, n = 25, x̄ = $145, and σ = $35. The standard deviation of x̄ (in dollars) is

\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{35}{\sqrt{25}} = 7.00 \]

(a) The point estimate of the mean price of all such college textbooks is $145; that is,

Point estimate of µ = x̄ = $145

(b) The confidence level is 90%, or .90. First we find the z value for a 90% confidence level. Here, the area in each tail of the normal distribution curve is α/2 = (1 − 0.90)/2 = 0.05. In the normal distribution table, look for the areas .0500 and .9500 and find the corresponding values of z. These values are z = −1.65 and z = 1.65. Next, we substitute all the values in the confidence interval formula for µ. The 90% confidence interval for µ (in dollars) is

\[ \bar{x} \pm z\,\sigma_{\bar{x}} = 145 \pm 1.65(7.00) = 145 \pm 11.55 = (145 - 11.55) \text{ to } (145 + 11.55) = 133.45 \text{ to } 156.55 \]

Thus, we are 90% confident that the mean price of all such college textbooks is between $133.45 and $156.55.

Determining the Sample Size for the Estimation of Mean

One reason we usually conduct a sample survey and not a census is that we almost always have limited resources at our disposal. In light of this, if a smaller sample can serve our purpose, then we will be wasting our resources by taking a larger sample. For instance, suppose we want to estimate the mean life of a certain auto battery. If a sample of 40 batteries can give us the confidence interval we are looking for, then we will be wasting money and time if we take a sample of a much larger size, say, 500 batteries. In such cases, if we know the confidence level and the width of the confidence interval that we want, then we can find the (approximate) size of the sample that will produce the required result.

From earlier discussion, we learned that E = z · σ_x̄ is called the margin of error of estimate for µ.
As we know, the standard deviation of the sample mean is equal to σ/√n. Therefore, we can write the margin of error of estimate for µ as

\[ E = z \cdot \frac{\sigma}{\sqrt{n}} \]

Suppose we predetermine the size of the margin of error, E, and want to find the size of the sample that will yield this margin of error. Solving the above expression for n (multiply both sides by √n, divide by E, and square) gives the formula for the required sample size.

Determining the Sample Size for the Estimation of µ
Given the confidence level and the standard deviation of the population, the sample size that will produce a predetermined margin of error E of the confidence interval estimate of µ is

\[ n = \frac{z^{2}\sigma^{2}}{E^{2}} \]

Example
An alumni association wants to estimate the mean debt of this year's college graduates. It is known that the population standard deviation of the debts of this year's college graduates is $11,800. How large a sample should be selected so that the estimate with a 99% confidence level is within $800 of the population mean?

Solution
The alumni association wants the 99% confidence interval for the mean debt of this year's college graduates to be

x̄ ± $800

Hence, the maximum size of the margin of error of estimate is to be $800; that is, E = $800. The value of z for a 99% confidence level is 2.58. The value of σ is given to be $11,800. Therefore, substituting all values in the formula and simplifying, we obtain

\[ n = \frac{z^{2}\sigma^{2}}{E^{2}} = \frac{(2.58)^{2}(11{,}800)^{2}}{(800)^{2}} = 1448.18 \approx 1449 \]

Thus, the required sample size is 1449. If the alumni association takes a sample of 1449 of this year's college graduates, computes the mean debt for this sample, and then makes a 99% confidence interval around this sample mean, the margin of error of estimate will be approximately $800. Note that we have rounded the final answer for the sample size up to the next higher integer, because rounding down would give a margin of error slightly larger than E. This is always the case when determining the sample size.
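The sample-size calculation above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `required_sample_size` and `margin_of_error` are made up for this example, and z is passed in as the table value (2.58) rather than computed.

```python
import math


def margin_of_error(z: float, sigma: float, n: int) -> float:
    """E = z * sigma / sqrt(n): margin of error when sigma is known."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return z * sigma / math.sqrt(n)


def required_sample_size(z: float, sigma: float, margin: float) -> int:
    """n = z^2 * sigma^2 / E^2, rounded up to the next integer."""
    if z <= 0 or sigma <= 0 or margin <= 0:
        raise ValueError("z, sigma and margin must all be positive")
    return math.ceil((z * sigma / margin) ** 2)


# Alumni debt example: z = 2.58 (table value for 99%), sigma = 11,800, E = 800.
n = required_sample_size(z=2.58, sigma=11_800, margin=800)
print(n)  # 1449

# Check: with n = 1449 the margin of error is at most 800,
# while one observation fewer would push it just above 800.
print(margin_of_error(2.58, 11_800, n) <= 800)      # True
print(margin_of_error(2.58, 11_800, n - 1) <= 800)  # False
```

The check at the end shows why the result is always rounded up: 1448 observations would leave the margin of error slightly above the $800 target.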
Estimation of a Population Mean: σ Not Known

This section explains how to construct a confidence interval for the population mean µ when the population standard deviation σ is not known. Here, again, there are three possible cases:

Case I. If the following three conditions are fulfilled:
1. The population standard deviation σ is not known
2. The sample size is small (i.e. n < 30)
3. The population from which the sample is selected is normally distributed,
then we use the t distribution to make the confidence interval for µ.

Case II. If the following two conditions are fulfilled:
1. The population standard deviation σ is not known
2. The sample size is large (n ≥ 30)
then, again, we use the t distribution to make the confidence interval for µ.

Case III. If the following three conditions are fulfilled:
1. The population standard deviation σ is not known
2. The sample size is small (i.e. n < 30)
3. The population from which the sample is selected is not normally distributed (or its distribution is unknown),
then we use a nonparametric method to make the confidence interval for µ. Such procedures are beyond the scope of this course.

The t-Distribution

• The t distribution was developed by W. S. Gosset in 1908 and published under the pseudonym Student. As a result, the t distribution is also called Student's t distribution.
• The t distribution is similar to the normal distribution in some respects. Like the normal distribution curve, the t distribution curve is symmetric (bell shaped) about the mean and never meets the horizontal axis.
• The total area under a t distribution curve is 1.0, or 100%. However, the t distribution curve is flatter than the standard normal distribution curve.
• In other words, the t distribution curve has a lower height and a wider spread (or, we can say, a larger standard deviation) than the standard normal distribution. However, as the sample size increases, the t distribution approaches the standard normal distribution.
The units of a t distribution are denoted by t. The shape of a particular t distribution curve depends on the number of degrees of freedom (df). The number of degrees of freedom for a t distribution is equal to the sample size minus one, that is,

df = n − 1

Definition
The t Distribution: The t distribution is a specific type of bell-shaped distribution with a lower height and a wider spread than the standard normal distribution. As the sample size becomes larger, the t distribution approaches the standard normal distribution. The t distribution has only one parameter, called the degrees of freedom (df). The mean of the t distribution is equal to 0, and its standard deviation (for df > 2) is

\[ \sqrt{\frac{df}{df - 2}} \]

The following figure shows the standard normal distribution and the t distribution for 9 degrees of freedom. The standard deviation of the standard normal distribution is 1.0, and the standard deviation of the t distribution is

\[ \sqrt{\frac{9}{9 - 2}} = 1.134 \]

[Figure: the standard normal distribution curve and the t distribution curve for 9 degrees of freedom.]

Meaning of Degrees of Freedom

• The number of degrees of freedom for a t distribution for the purpose of this chapter is n − 1.
• The number of degrees of freedom is defined as the number of observations that can be chosen freely.
• As an example, suppose we know that the mean of four values is 20. Consequently, the sum of these four values is 20(4) = 80.
• Now, how many values out of four can we choose freely so that the sum of these four values is 80?
• The answer is that we can freely choose 4 − 1 = 3 values. Suppose we choose 27, 8, and 19 as the three values. Given these three values and the information that the mean of the four values is 20, the fourth value is 80 − 27 − 8 − 19 = 26. Thus, once we have chosen three values, the fourth value is automatically determined. Consequently, the number of degrees of freedom for this example is

df = 4 − 1 = 3

We subtract 1 from n because we lose 1 degree of freedom to calculate the mean.

Example
Find the value of t for 16 degrees of freedom and .05 area in the right tail of a t distribution curve.
Solution
• In the t distribution table, we locate 16 in the column of degrees of freedom (labeled df) and .05 in the row of Area in the right tail under the t distribution curve at the top of the table.
• The entry at the intersection of the row of 16 and the column of .05, which is 1.746, gives the required value of t.

[Table: Determining t for 16 df and .05 Area in the Right Tail]
[Figure: The value of t for 16 df and .05 area in the right tail.]
[Figure: The value of t for 16 df and .05 area in the left tail.]

Because the t distribution is symmetric about 0, the value of t for 16 df and .05 area in the left tail is −1.746.

Confidence interval for µ using the t-distribution

The (1 − α)100% confidence interval for µ is

\[ \bar{x} \pm t\,s_{\bar{x}}, \qquad \text{where } s_{\bar{x}} = \frac{s}{\sqrt{n}} \]

The value of t is obtained from the t distribution table for n − 1 degrees of freedom and the given confidence level. Here ts_x̄ is the margin of error of the estimate; that is,

\[ E = t\,s_{\bar{x}} \]

Example
Sixty-four randomly selected adults who buy books for general reading were asked how much they usually spend on books per year. The sample produced a mean of $1450 and a standard deviation of $300 for such annual expenses. Determine a 99% confidence interval for the corresponding population mean.

Solution
From the given information, n = 64, x̄ = $1450, and s = $300, and the confidence level is 99%, or 0.99.

Here σ is not known but the sample size is large (n ≥ 30). Hence, we will use the t distribution to make a confidence interval for µ. First we calculate the standard deviation of x̄, the number of degrees of freedom, and the area in each tail of the t distribution.

\[ s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{300}{\sqrt{64}} = 37.50 \]

df = n − 1 = 64 − 1 = 63

Area in each tail = (1 − 0.99)/2 = 0.005

From the t distribution table, t = 2.656 for 63 degrees of freedom and .005 area in the right tail.
The 99% confidence interval for µ (in dollars) is

\[ \bar{x} \pm t\,s_{\bar{x}} = 1450 \pm 2.656(37.50) = 1450 \pm 99.60 = 1350.40 \text{ to } 1549.60 \]

Thus, we can state with 99% confidence that, based on this sample, the mean annual expenditure on books by all adults who buy books for general reading is between $1350.40 and $1549.60.
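Both interval formulas in these notes have the same shape, point estimate ± (critical value) × (standard deviation of x̄), so one small Python helper can reproduce the two worked examples. The sketch below is illustrative only: `mean_confidence_interval` and `z_critical` are names invented here, the t value is typed in from the table because the Python standard library has no t distribution, and `statistics.NormalDist` (Python 3.8+) stands in for the normal table.

```python
import math
from statistics import NormalDist


def z_critical(confidence: float) -> float:
    """Positive z with area 1 - alpha/2 to its left, where alpha = 1 - confidence."""
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


def mean_confidence_interval(xbar: float, sd: float, n: int, critical: float):
    """Return (lower, upper, margin) for xbar ± critical * sd / sqrt(n).

    Pass sd = sigma with a z value, or sd = s with a t value (df = n - 1).
    """
    if n < 1 or sd < 0:
        raise ValueError("n must be at least 1 and sd must be non-negative")
    standard_error = sd / math.sqrt(n)
    margin = critical * standard_error
    return xbar - margin, xbar + margin, margin


# sigma not known: book-spending example, t = 2.656 from the table (df = 63).
lower, upper, margin = mean_confidence_interval(1450, 300, 64, critical=2.656)
print(f"t interval: {lower:.2f} to {upper:.2f}")  # t interval: 1350.40 to 1549.60

# sigma known: textbook-price example, z = 1.65 from the table.
lower, upper, margin = mean_confidence_interval(145, 35, 25, critical=1.65)
print(f"z interval: {lower:.2f} to {upper:.2f}")  # z interval: 133.45 to 156.55

# Unrounded z values for the common confidence levels
# (about 1.645, 1.960 and 2.576).
for level in (0.90, 0.95, 0.99):
    print(level, f"{z_critical(level):.3f}")
```

The unrounded z values differ slightly from the two-decimal table values used in the worked examples (1.65 and 2.58), so results computed with `z_critical` can differ from the hand calculations in the last digits.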