Chapter 9--Estimation.Doc
STATISTICS 301: APPLIED STATISTICS, Statistics for Engineers and Scientists, Walpole, Myers, Myers, and Ye, Prentice Hall

Goal: In this section we will investigate the concept of “Estimation,” in which our goal is to use sample information (assumed to be a random sample from the population of interest) to arrive at a reasonable guess of a population parameter. Estimation is done in two ways: point estimation (a single value) and interval estimation (an interval or range of likely values).

INTERVAL ESTIMATION (aka CONFIDENCE INTERVALS)

The advantage of point estimation and point estimates is their simplicity: a single number. However, this simplicity has a price. Consider the following.

In a follow-up to the Dean’s request about the proportion of MU undergrads who plan to attend graduate school, he checks with another faculty member who also collects data. This faculty member reports to the Dean that 78.45% of the students he has asked plan to attend graduate school. Whom does the Dean believe, since the results differ: me (recall my estimate was p̂ = 60.2%) or the other faculty member? In other words, what DON’T point estimates tell you about the estimate and sample?

This is the downside of point estimates: they give no sense of how large the sample is or how variable the estimate is. We know that the spread of every sampling distribution depends on the sample size: smaller samples yield sampling distributions with larger spread, and larger samples yield less variable ones.

What would the SE of my estimate of p, the proportion of MU students going to grad school, be?

SE(0.602) = √[(0.602)(1 - 0.602)/123] = 0.044

What would the SE of the other faculty member's estimate of p be? What do you need to know? Suppose n = 30.

SE(0.7845) = √[(0.7845)(1 - 0.7845)/30] = 0.075

The SE of our estimate is only a little more than ½ of the other!
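The two standard errors above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `se_proportion` is made up for this example, and n = 30 for the second sample is the value supposed in the notes, not a reported sample size.

```python
import math


def se_proportion(p_hat: float, n: int) -> float:
    """Estimated standard error of a sample proportion: sqrt(p_hat * (1 - p_hat) / n)."""
    return math.sqrt(p_hat * (1.0 - p_hat) / n)


# My estimate: p_hat = 0.602 from n = 123 students.
se_mine = se_proportion(0.602, 123)
# Other faculty member: p_hat = 0.7845; n = 30 is the value supposed in the notes.
se_other = se_proportion(0.7845, 30)

print(f"SE(0.602)  = {se_mine:.3f}")
print(f"SE(0.7845) = {se_other:.3f}")
print(f"ratio      = {se_mine / se_other:.2f}")
```

The same point estimate reported with a smaller n would carry a visibly larger SE, which is exactly the information a bare point estimate hides.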
Hence, if we are only given the estimate, the accuracy of a point estimate is not evident!

D:\687318929.doc 2/5/2016 1

Just as we can obtain point estimates for every population parameter we have discussed thus far, we can also obtain Confidence Intervals for these parameters. However, we will only give the Confidence Interval (CI), or interval estimate, for the population mean and proportion (and later for the difference between two means and two proportions).

Defn: Most interval estimates for parameters are of the form:

Point Estimate of Parameter ± Multiplier * SE(Point Estimate)
PE ± Multiplier * SE(PE)
or [ PE - M * SE(PE), PE + M * SE(PE) ]

where the Multiplier is an upper percentile point from the sampling distribution of the point estimator used.

Hence to form an interval estimate or confidence interval for a parameter we need:
1. a point estimate of the parameter,
2. the distribution of the point estimator,
3. and an estimate of the Standard Error of the point estimate.

CONFIDENCE INTERVAL FOR POPULATION MEAN (μ_popln or μ)

Using our basic confidence interval form for the population mean we know that:
1. Our point estimate of μ is x̄.
2. The SE of X̄ is σ/√n (≈ s/√n if σ is unknown, which is the “usual” case!).
3. Lastly, the distribution of X̄ depends on several things:
   a. If the population is Normal and σ is known, then the Multiplier is a z value.
   b. If n is large, then X̄ is approximately Normal and our Multiplier is a z value.
   c. And if n is small, σ is unknown, and the population is Normal, the Multiplier is a t value with n - 1 degrees of freedom.

Thm: If X1, X2, …, Xn are a random sample from a population with mean = μ and variance = σ², then a (1 - α) 100% confidence interval for μ is:

i. $\bar{x} \pm z(\alpha/2)\,\dfrac{\sigma}{\sqrt{n}}$ if the population is Normally distributed and σ is known,
ii. $\bar{x} \pm z(\alpha/2)\,\dfrac{s}{\sqrt{n}}$ if n is large (n > 30),
iii. $\bar{x} \pm t(\alpha/2;\,n-1)\,\dfrac{s}{\sqrt{n}}$ if n is small, σ is unknown, & the population is Normal.

EXAMPLE #1
Recall our Milky Way candy data in which we found that the average weight of the 40 candy bars was 59.97 grams with a standard deviation of 1.92 grams. Find a 95% confidence interval for the mean weight of all Milky Way candy bars.

Parameter: μ = mean weight of all Milky Way candy bars
Point Estimate: x̄ = 59.97
Standard Error of our Point Estimate: σ/√n, but since σ is unknown we use s/√n = 1.92/√40 ≈ 0.3036
α value: Since 95% = (1 - α) 100%, α = 0.05
Multiplier: Since n is large, our multiplier is z(α/2) = z(0.025) = 1.96
Our 95% confidence interval for the mean weight of all Milky Way candy bars is 59.97 ± 1.96(0.3036) = 59.97 ± 0.595, or (59.4, 60.6) grams.

Example #2: Exercise 9.6 from WMMY 8th page 286
A random sample of 50 college students yields a sample average height of 174.5 cm and a standard deviation of 6.9 cm. Obtain a 98% CI for the mean height of college students.

CONFIDENCE INTERVAL INTERPRETATION

We just found that a 95% CI for μ = mean weight of all Milky Way candy bars was (59.4, 60.6). Now some True/False questions.

T F a. The probability that μ is in the CI is 95%.
T F b. The probability that X̄ is in the CI is 95%.
T F c. The probability that μ is in the CI is either 0% or 100%.
T F d. 95% of all such CI’s contain μ.
T F e. We can conclude that μ is closer to the center of the CI than the ends.
T F f. 95% of all candy bars weigh between 59.4 and 60.6 gms.

Before we answer these T/F, here are some more questions:

Is μ a constant or does it vary?
Is σ² known or unknown?
Based on our RS of n = 40, what is the distribution of X̄ and what does it look like?
If we took a different sample of Milky Way candy bars, would we get the same 95% CI? Would μ change? Would x̄ change? Would s change? Would z(α/2) change?

1. For each CI, is μ in the interval? So what’s the probability μ is in any ONE CI?
2. What % of ALL CI’s contain μ?
3. What would the population distribution look like?
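The questions above are about what happens under repeated sampling, which is easy to explore by simulation. The sketch below is illustrative only: the helper `z_interval`, the seed, the number of repetitions, and the "true" population (Normal with μ = 60 and σ = 1.92) are assumptions chosen for the demonstration, not facts about real Milky Way bars. It first recomputes the candy-bar interval from the class summary statistics, then draws many samples of n = 40 and counts how often the large-sample interval captures μ.

```python
import random
import statistics
from statistics import NormalDist


def z_interval(xbar, s, n, conf=0.95):
    """Large-sample CI for a mean: xbar +/- z(alpha/2) * s / sqrt(n)."""
    alpha = 1.0 - conf
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # upper alpha/2 point of N(0, 1)
    margin = z * s / n ** 0.5
    return xbar - margin, xbar + margin


# Candy-bar summary statistics from Example #1.
lo, hi = z_interval(59.97, 1.92, 40, conf=0.95)
print(f"95% CI from the class data: ({lo:.1f}, {hi:.1f})")

# Repeated sampling from a hypothetical population (assumed values).
MU, SIGMA, N, REPS = 60.0, 1.92, 40, 2000
rng = random.Random(301)  # fixed seed so the run is repeatable
hits = 0
for _ in range(REPS):
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    ci_lo, ci_hi = z_interval(statistics.fmean(sample), statistics.stdev(sample), N)
    hits += ci_lo <= MU <= ci_hi  # mu is fixed; the interval is what varies
coverage = hits / REPS
print(f"fraction of intervals containing mu: {coverage:.3f}")
```

In each repetition μ stays put while x̄, s, and therefore the interval move; any single interval either contains μ or it does not, and the confidence level describes the long-run fraction of intervals that do.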
We just found that a 95% CI for μ = mean weight of all Milky Way candy bars was (59.4, 60.6) gms.

True/False Answers:

T F a. The probability that μ is in the CI is 95%.
T F b. The probability that x̄ is in the CI is 95%.
T F c. The probability that μ is in the CI is either 0% or 100%.
T F d. 95% of all such CI’s contain μ.
T F e. We can conclude that μ is closer to the center of the CI than the ends.
T F f. 95% of all candy bars weigh between 59.4 and 60.6 gms.

What is(are) the population(s)? What is(are) the parameter(s)?
1995:
2006:

CONFIDENCE INTERVAL FOR POPULATION PROPORTION, DIFFERENCE BETWEEN TWO MEANS, & DIFFERENCE BETWEEN TWO PROPORTIONS

We present the CI forms for the above three cases in the following theorems, then present several examples.

Thm: If X1, X2, …, Xn are a random sample from a population with proportion p, then a (1 - α) 100% confidence interval for p is

$$\hat{p} \pm z(\alpha/2)\,\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

if np > 5 and n(1-p) > 5 OR n p̂ > 5 and n(1- p̂) > 5.

Thm: Let x̄1 and s1 and x̄2 and s2 be the sample average and sample standard deviation, respectively, of two independent random samples of sizes n1 and n2, respectively, from two populations with means μ1 and μ2. Then a (1 - α) 100% confidence interval for (μ1 - μ2) is

$$(\bar{x}_1-\bar{x}_2) \pm t(\alpha/2,\,df)\,\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}, \quad\text{where}\quad df=\frac{\left(\dfrac{s_1^2}{n_1}+\dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1-1}+\dfrac{\left(s_2^2/n_2\right)^2}{n_2-1}}.$$

Thm: If p̂1 and p̂2 are sample proportions from two independent random samples of size n1 and n2 from two populations with proportions p1 and p2, then a (1 - α) 100% confidence interval for (p1 - p2) is

$$\hat{p}_1-\hat{p}_2 \pm z(\alpha/2)\,\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

if n1 p̂1 > 5, n1(1- p̂1) > 5, n2 p̂2 > 5, and n2(1- p̂2) > 5.

EXAMPLE #1 Underweight Milky Way Candy Bars
Let p be the proportion of “vending-sized” Milky Way candy bars that are below the stated Net Weight of 58.1 grams.
Candy Wgt
62.2  59.7  60.7  58.6  57.1  60.2  62.1  60.3
59.6  61.6  59.1  61.5  61.5  60.5  59.6  60.7
60.4  64.5  61.3  59.9  64.6  61.3  58.3  60.0
59.7  57.4  57.2  58.6  61.6  59.2  57.1  58.2
62.4  56.0  58.4  61.9  59.5  59.7  58.7  57.7

We find that 6 of the 40 candy bars weighed less than 58.1 grams. Our point estimate of the proportion of underweight Milky Ways is p̂ = 6/40 = 15.0%. Let’s also obtain a 95% CI for p.

Checking to ensure our sample size is large enough:
1. n p̂ = 40(0.15) = 6 > 5 AND
2. n(1 - p̂) = 40(1 - 0.15) = 34 > 5!

So our 95% CI for p is

$$\hat{p} \pm z(\alpha/2)\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.150 \pm z(0.025)\sqrt{\frac{0.150(1-0.150)}{40}} = 0.150 \pm 1.96(0.0565) = 0.150 \pm 0.1107 = (0.0393,\ 0.2607)$$

We can then conclude, with a high degree of confidence (95%!), that between 4% and 26% of Milky Way candy bars are underweight. Do you believe MW’s claim that no candy bar weighing less than 58.1 gm leaves the assembly line? Why?

Example #2: Phone Battery Data

Lithium Ion Batteries:
 9.75  10.17  11.77  11.77  11.87  11.90  12.12  12.15  12.18  12.24  12.51
12.9   13.15  13.16  13.61  13.63  13.63  13.66  13.75  13.81  13.86  14.15
14.2   14.25  14.42  14.57  14.84  14.92  14.93  14.95  15.63  15.78  15.9
16.06  16.25  16.42  16.43  16.46  16.82  17.04  17.08  17.58  17.65  17.85
17.9

Nickel Metal Hydride:
10.08  11.98  12.19  12.36  12.37  12.4   12.45  12.46  12.54  12.59  12.68
12.85  12.88  13.06  13.07  13.18  13.18  13.35  13.38  13.47  13.48  13.5
13.51  13.52  13.67  13.83  13.85  13.86  13.9   14.02  14.05  14.07  14.15
14.19  14.49  14.53  14.59  14.61  14.81  14.85  14.99  15.01  15.04  15.07
15.10  15.22  15.28  15.3   15.38  15.53  15.54  15.59  15.72

n_LI = 45, x̄_LI = 14.348, s_LI = 2.0693, se(x̄_LI) = 2.0693/√45 = 0.3085
n_NIMH = 53, x̄_NIMH = 13.826, s_NIMH = 1.1819, se(x̄_NIMH) = 1.1819/√53 = 0.1623

A (1 - α) 100% CI for μ1 - μ2 is

$$(\bar{x}_1-\bar{x}_2) \pm t(\alpha/2,\,df)\,\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}, \quad\text{where}\quad df=\frac{\left(\dfrac{s_1^2}{n_1}+\dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1-1}+\dfrac{\left(s_2^2/n_2\right)^2}{n_2-1}}.$$

For our data,

$$se(\bar{x}_1-\bar{x}_2) = \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}} = \sqrt{\frac{2.0693^2}{45}+\frac{1.1819^2}{53}} = 0.349$$

and

$$df = \frac{\left(\dfrac{2.0693^2}{45}+\dfrac{1.1819^2}{53}\right)^2}{\dfrac{\left(2.0693^2/45\right)^2}{45-1}+\dfrac{\left(1.1819^2/53\right)^2}{53-1}} = 67.38,$$

so call it 68.

Obtain a 90% CI for μ_NIMH - μ_LI:
(13.826 - 14.348) ± t(0.05, df)*0.349
(13.826 - 14.348) ± t(0.05, 68)*0.349
-0.522 ± 1.668*0.349
-0.522 ± 0.582
[ -1.104, 0.060 ]

Interpretation?

Example #3: Hair Color & Pain Threshold Data

Light Blonde:   62 60 71 55 48
Dark Brunette:  32 39 51 30 35

n_LB = 5, x̄_LB = 59.2, s_LB = 8.5264, se(x̄_LB) = 8.5264/√5 = 3.8131
n_DB = 5, x̄_DB = 37.4, s_DB = 8.3247, se(x̄_DB) = 8.3247/√5 = 3.7229

Using the same (1 - α) 100% CI for μ1 - μ2,

$$se(\bar{x}_1-\bar{x}_2) = \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}} = \sqrt{\frac{8.5264^2}{5}+\frac{8.3247^2}{5}} = 5.3292$$

and

$$df = \frac{\left(\dfrac{8.5264^2}{5}+\dfrac{8.3247^2}{5}\right)^2}{\dfrac{\left(8.5264^2/5\right)^2}{5-1}+\dfrac{\left(8.3247^2/5\right)^2}{5-1}} = 7.9954,$$

call it 8.

Obtain a 99% CI for μ_LB - μ_DB:
(59.2 - 37.4) ± t(0.005, df)*5.3292
(59.2 - 37.4) ± t(0.005, 8)*5.3292
21.8 ± 3.355*5.3292
21.8 ± 17.8795
[ 3.92, 39.68 ]

Interpretation?

Example #4: Example 9.6 from WMMY 8th page 289
Compare the gas mileage of two car types (compact and sub-compact). We have two independent RS’s with summary information:
n_SC = 75, x̄_SC = 42, s_SC = 8
n_C = 50, x̄_C = 36, s_C = 6
Obtain a 96% CI for μ_C - μ_SC.

Example #5: Exercise 9.65 from WMMY 8th page 305
Compare the proportion of females and males with a certain minor blood disorder. We have independent RS’s of size 1,000 and found 275 females with the disorder and 250 males with the disorder. Obtain a 95% confidence interval for the difference in proportions.

Confidence Interval for Difference of Two Means
Non-Independent Samples: What’s the Effect?
(9.44 WMMY 8th) A taxi company is trying to decide whether to purchase Brand A or Brand B tires for its fleet of taxis. One tire of each brand is assigned at random to the rear wheels of each of 8 taxis, and the following distances, in km, were recorded until a tire had only 1/8” of tread remaining.

Taxi         Brand A      Brand B
1             34,400       36,700
2             45,500       46,800
3             36,700       37,700
4             32,000       31,100
5             48,400       47,800
6             32,800       36,400
7             38,100       38,900
8             30,100       31,500
n                  8            8
Average       37,250     38,362.5
ST Dev     6546.7549    6181.0627

Are these two samples independent?

Now let’s calculate se(x̄_A - x̄_B) assuming independent samples. Recall that

$$se(\bar{x}_1-\bar{x}_2) = \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}.$$

The problem is that since the samples are NOT independent, our se(x̄_A - x̄_B) could either over-estimate or under-estimate the true standard error!

Confidence Intervals for Difference of Two Means
Paired Data Case

Thm: Assuming a sample of “n” paired observations (x1i, y2i), a (1 - α) 100% CI for μ1 - μ2 is

$$\bar{d} \pm t(\alpha/2,\,n-1)\,se(\bar{d}),$$

where di = (x1i - y2i) and se(d̄) = s_d/√n.

Taxi         Brand A      Brand B    Difference
1             34,400       36,700        -2,300
2             45,500       46,800        -1,300
3             36,700       37,700        -1,000
4             32,000       31,100           900
5             48,400       47,800           600
6             32,800       36,400        -3,600
7             38,100       38,900          -800
8             30,100       31,500        -1,400
n                  8            8             8
Average       37,250     38,362.5       -1112.5
ST Dev     6546.7549    6181.0627     1454.4881

Hence, for our data, a 95% CI for the difference in mileage for the two Brands of tires (μ_A - μ_B) is:
-1112.5 ± t(0.025, 7)*(1454.4881/√8) = -1112.5 ± 2.365*514.24 ≈ -1112.5 ± 1216, or approximately [ -2328.5, 103.5 ] km.

NOTES AND COMMENTS ON CONFIDENCE INTERVALS

1. While n > 30 will work well in most instances, larger sample sizes are needed if the population is known to be severely skewed in either direction. If the population is symmetric or approximately so, then CI’s for the mean (μ) based on samples of size 30 are adequate.

2. INTERPRETATION OF CI’S: A 95% CI would be interpreted as follows: We are 95% confident that the parameter of interest falls somewhere within the stated interval. Notice that we do NOT say, “The probability is 95% that the parameter of interest falls somewhere within the stated interval,” since this is not true. Hence avoid using the term “probability” in the interpretation of CI’s.

3. The Degree of Confidence of the CI is a statement about how sure or confident we are in our CI. The higher the degree of confidence, the more certain we are of our statement; the lower the degree of confidence, the less sure we are. While higher confidence is in general better, the sacrifice is a wider CI and hence more possible values for the parameter. One usually uses 90%, 95%, or 99%. What would a 100% confidence interval be? How informative is it?

4. A CI is a statement about a population parameter’s value. It does NOT say anything about what percent or proportion of the population falls in the interval. Hence for a 95% CI, you can NOT conclude “95% of the population falls within the CI.” Rather, it is an interval in which the population parameter is likely to lie.

5. The Margin of Error of a confidence interval is the half-width of the confidence interval. So for our candy bar example, since the 95% confidence interval was [ 59.97 ± 0.595 ], the Margin of Error would be 0.595. The Margin of Error gives some evidence of how large an “error” is involved with our estimate, or how far our estimate may be from the true parameter.

6. If the degree of confidence is not stated, it’s assumed to be 95%. So if a Margin of Error is given with no indication of the degree of confidence, assume it is 95%.

USING SAS TO OBTAIN CONFIDENCE INTERVALS

Recall we found that a 95% CI for μ = mean weight of all Milky Way candy bars was (59.4, 60.6).
SAS CI for a Single Mean

    OPTIONS LS=110 PS=60 PAGENO=1 NODATE FORMDLIM='+';
    TITLE 'CI.SAS';
    TITLE2 'EXAMPLE OF ONE AND TWO SAMPLE CI OF MEANS IN SAS';
    TITLE3 'MILKY WAY WGT DATA FROM CLASS';
    /* Read the 40 candy bar weights; @@ allows several values per data line */
    DATA MWDATA;
      INPUT MW_WGT @@;
      DATALINES;
    62.2 59.7 60.7 58.6 57.1 60.2 62.1 60.3
    59.6 61.6 59.1 61.5 61.5 60.5 59.6 60.7
    60.4 64.5 61.3 59.9 64.6 61.3 58.3 60.0
    59.7 57.4 57.2 58.6 61.6 59.2 57.1 58.2
    62.4 56.0 58.4 61.9 59.5 59.7 58.7 57.7
    ;
    /* ALPHA=0.05 requests 95% confidence limits for the mean */
    PROC TTEST DATA=MWDATA ALPHA=0.05;
      VAR MW_WGT;
    RUN;

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
CI.SAS                                                                                                       1
EXAMPLE OF ONE AND TWO SAMPLE CI OF MEANS IN SAS
MILKY WAY WGT DATA FROM CLASS

The TTEST Procedure

Statistics
                  Lower CL            Upper CL   Lower CL             Upper CL
Variable     N    Mean       Mean     Mean       Std Dev    Std Dev   Std Dev    Std Err   Minimum   Maximum
MW_WGT      40    59.351     59.965   60.579     1.573      1.9203    2.4657     0.3036    56        64.6

T-Tests
Variable    DF    t Value    Pr > |t|
MW_WGT      39    197.50     <.0001

SAS CI of the Difference of Two Means: Independent Samples

    /* Import the battery data from the spreadsheet */
    PROC IMPORT DATAFILE='C:\MyDocs\Class\1 Winter 2007\STA 301\Data Sets\Battery.xls' OUT=BATTERY;
    RUN;
    /* ALPHA=0.10 requests 90% confidence limits; CLASS names the grouping variable */
    PROC TTEST DATA=BATTERY ALPHA=0.10;
      TITLE3 'BATTERY DATA';
      CLASS BATTERYTYPE;
      VAR TIME;
    RUN;

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
CI.SAS                                                                                                       2
EXAMPLE OF ONE AND TWO SAMPLE CI OF MEANS IN SAS
BATTERY DATA

The TTEST Procedure

Statistics
                                        Lower CL            Upper CL   Lower CL             Upper CL
Variable   BatteryType           N      Mean       Mean     Mean       Std Dev    Std Dev   Std Dev    Std Err
Time       LithiumIon            53     13.554     13.826   14.098     1.0199     1.1819    1.4119     0.1623
Time       NickelMetalHydride    45     13.83      14.348   14.867     1.765      2.0693    2.5149     0.3085
Time       Diff (1-2)                   -1.078     -0.522   0.0328     1.4757     1.649     1.8731     0.3343

In Example #2 we obtained a 90% CI for μ_NIMH - μ_LI by hand, using the unequal-variance standard error and the Satterthwaite df. The Diff (1-2) row above reports [ -1.078, 0.0328 ]. Why the difference?

T-Tests
Variable   Method           Variances    DF      t Value    Pr > |t|
Time       Pooled           Equal        96      -1.56      0.1214
Time       Satterthwaite    Unequal      67.4    -1.50      0.1387

Equality of Variances
Variable   Method      Num DF    Den DF    F Value    Pr > F
Time       Folded F    44        52        3.07       0.0001

SAS CI of the Difference of Two Means: Paired Data

    /* One taxi per data line: Brand A distance, then Brand B distance */
    DATA PAIRED;
      TITLE3 'PAIRED TIRE DATA';
      INPUT BRANDA BRANDB;
      DIFF = BRANDA-BRANDB;   /* paired difference, A minus B */
      DATALINES;
    34400 36700
    45500 46800
    36700 37700
    32000 31100
    48400 47800
    32800 36400
    38100 38900
    30100 31500
    ;
    /* One-sample analysis of the differences (default 95% limits) */
    PROC TTEST DATA=PAIRED;
      VAR DIFF;
    RUN;

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
CI.SAS                                                                                                       3
EXAMPLE OF ONE AND TWO SAMPLE CI OF MEANS IN SAS
PAIRED TIRE DATA

The TTEST Procedure

Statistics
                  Lower CL            Upper CL   Lower CL             Upper CL
Variable     N    Mean       Mean     Mean       Std Dev    Std Dev   Std Dev    Std Err   Minimum   Maximum
DIFF         8    -2328      -1113    103.48     961.67     1454.5    2960.3     514.24    -3600     900

T-Tests
Variable    DF    t Value    Pr > |t|
DIFF        7     -2.16      0.0673

Approximate 95% Margin of Error for proportions is 1/√n … So MoE is
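The 1/√n rule of thumb can be compared with the margin of error from the proportion interval, 1.96·√(p̂(1 - p̂)/n). The sketch below is an illustration under assumed inputs: the function names and the sample size n = 1000 are hypothetical, and the only thing it checks is that the 95% margin from the interval formula never exceeds 1/√n and comes closest to it at p̂ = 0.5.

```python
import math
from statistics import NormalDist

Z_95 = NormalDist().inv_cdf(0.975)  # upper 2.5% point of N(0, 1), about 1.96


def moe_proportion(p_hat, n):
    """95% margin of error for a proportion: z(0.025) * sqrt(p_hat * (1 - p_hat) / n)."""
    return Z_95 * math.sqrt(p_hat * (1.0 - p_hat) / n)


def moe_rule_of_thumb(n):
    """Approximate 95% margin of error for a proportion: 1 / sqrt(n)."""
    return 1.0 / math.sqrt(n)


n = 1000  # hypothetical sample size, chosen only for illustration
for p_hat in (0.1, 0.3, 0.5):
    print(f"p_hat={p_hat:.1f}  formula={moe_proportion(p_hat, n):.4f}  "
          f"rule of thumb={moe_rule_of_thumb(n):.4f}")
```

Because p̂(1 - p̂) is at most 1/4, the formula's margin is at most 0.98/√n, so 1/√n is a slightly conservative shortcut that needs no knowledge of p̂.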