Chapter 12
Significance Tests in Practice
AP Statistics
Hamilton/Mann

Introduction
• In Chapter 11, we made the unrealistic assumption that we knew the population standard deviation σ.
• In this chapter, we drop that assumption.
• Just as with confidence intervals, when we do not know σ, we must use a t distribution to carry out a significance test.
• Remember the use and abuse of tests from 11.3.
• Also remember that we could be making a Type I or Type II error whenever we use a significance test.
• Think before you calculate!

CHAPTER 12 SECTION 1
Tests about a Population Mean
HW: 12.2, 12.3, 12.4, 12.5, 12.6, 12.7, 12.9, 12.10, 12.11

Guinness Brewery
• William S. Gosset ran experiments and analyzed data for the Guinness brewery in Dublin, Ireland. He was trying to determine the best varieties of barley and hops for brewing. He ran into the problem of not knowing the population standard deviation σ. He observed that replacing σ by s and calling the result roughly Normal wasn’t accurate enough. After much work, Gosset developed what we now call the t distribution. Guinness allowed Gosset to publish his discoveries, but not under his own name. He used the name “Student,” and, as a result, the distribution is sometimes called Student’s t distribution.
• The t statistic has the same interpretation as any standardized statistic: it says how far x̄ is from its mean μ in standard deviation units.
• We are now going to learn to find P-values for a significance test about μ using the t table.

Determining P-values
• Suppose we carry out a one-sided significance test of H0: μ = μ0 against Ha: μ > μ0 based on a sample of size 20 and obtain t = 1.81.
• Since there were 20 observations, we have df = 19, so look along the row with df = 19. Our t statistic falls between 1.729 and 2.093, which correspond to upper-tail probabilities of 0.05 and 0.025.
• Because the test is one-sided with the alternative pointing to the right, the P-value is the area to the right of t = 1.81, which is the upper-tail probability. We can conclude that our P-value is between 0.025 and 0.05.

Determining P-values
• Suppose we carry out a two-sided significance test of H0: μ = μ0 against Ha: μ ≠ μ0 based on a sample of size 37 and obtain t = -3.17.
• Since there were 37 observations, we have df = 36. Since there is no row for df = 36 in the table, we use the next smaller value, df = 30, which gives a conservative answer. The value 3.17 falls between 3.030 and 3.385, which correspond to upper-tail probabilities of 0.0025 and 0.001.
• Since it is a two-sided test, we have to find the probability of a value less than -3.17 or greater than 3.17. By symmetry, the lower-tail probability equals the upper-tail probability, so we double both bounds. We can conclude that our P-value is between 0.002 and 0.005.

The One-Sample t-test
• In significance tests, as in confidence intervals, we allow for unknown σ by using the standard error s/√n and replacing z by t.
• To test H0: μ = μ0 from an SRS of size n, compute the one-sample t statistic t = (x̄ - μ0)/(s/√n), which has the t distribution with n - 1 degrees of freedom when H0 is true.

Testing
• Now we can do a realistic analysis of data produced to test a claim about an unknown population mean.
• Again, we need to follow the steps in our Inference Toolbox.
1. Hypotheses – What is the population of interest and what are the hypotheses we are testing?
2. Conditions – SRS, Normality and Independence
3. Calculations – Find the test statistic and P-value
4. Interpretation – Connection, Conclusion, Context

Sweet Cola
• Diet colas use artificial sweeteners to avoid sugar. These sweeteners gradually lose their sweetness over time. Manufacturers therefore test new colas for loss of sweetness before marketing them. Trained tasters (wouldn’t you love this job?) sip the cola along with drinks of standard sweetness and score the cola on a “sweetness scale” of 1 to 10. The cola is then stored for a month at high temperature to imitate the effect of storing it at room temperature for four months. Each taster then scores the cola again after storage.
Our data are the differences (score before storage minus score after storage) in the tasters’ scores. The bigger these differences, the bigger the loss of sweetness.

Sweet Cola
• Here are the sweetness losses for a new cola, as measured by 10 different tasters.
  2.0  0.4  0.7  2.0  -0.4  2.2  -1.3  1.2  1.1  2.3
• Notice that most are positive, indicating a loss of sweetness.
• Is there good evidence that the cola lost sweetness in storage?
• Step 1 – Hypotheses – We are interested in μ, the mean difference between the before-storage and after-storage sweetness scores in the population of tasters. We test H0: μ = 0 against Ha: μ > 0.

Sweet Cola
• Step 2 – Conditions
  – Since we do not know the standard deviation of sweetness loss in the population of tasters, we must use a one-sample t test.
  – SRS – We must be willing to treat our 10 tasters as an SRS from the population of tasters if we want to draw conclusions about tasters in general. The tasters all have the same training, so even though we don’t have an actual SRS, we are willing to act as if we did.
  – Normality – The sample is too small to check Normality effectively. A stemplot and a boxplot show some left skewness, but no gaps or outliers.
  – Independence – It is reasonable that different tasters would have results independent of each other.

Sweet Cola
• Step 3 – Calculations
  – The sample mean is x̄ = 1.02 and the sample standard deviation is s = 1.196, so t = (1.02 - 0)/(1.196/√10) = 2.70.
  – Since our t-value is 2.70 and we have 10 observations (df = 9), the table gives a P-value between 0.01 and 0.02. Using the calculator, we get a more exact value of 0.0123.
• Step 4 – Interpretation
  – A P-value between 0.01 and 0.02 is quite small and gives good evidence against H0. Therefore, we reject H0 and conclude that the cola lost sweetness during storage.

T-Procedures
• Because t-procedures are so common, all statistical software packages will do the calculations for you.
• The next two slides have the calculations as performed by four different statistical software packages:
1. DataDesk
2. Fathom
3. Minitab
4. CrunchIt!
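The Sweet Cola test can also be checked in a few lines of Python. The sketch below is an illustration added for this purpose, not output from any of the packages listed. It assumes only the Python standard library; the helper names `t_density` and `upper_tail` are made up for this example, and the tail area is approximated by numerical integration (Simpson's rule) rather than read from a table.

```python
import math
import statistics

# Sweetness losses (before minus after) for the 10 tasters, from the slide.
losses = [2.0, 0.4, 0.7, 2.0, -0.4, 2.2, -1.3, 1.2, 1.1, 2.3]

def t_density(x, df):
    """Density of the t distribution with df degrees of freedom (hypothetical helper)."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) for t >= 0 using Simpson's rule on [0, t].

    steps must be even. By symmetry the area to the right of 0 is 0.5,
    so the upper tail is 0.5 minus the area between 0 and t.
    """
    h = t / steps
    total = t_density(0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 0.5 - total * h / 3

n = len(losses)
x_bar = statistics.mean(losses)      # sample mean of the differences
s = statistics.stdev(losses)         # sample standard deviation (divides by n - 1)
t_stat = (x_bar - 0) / (s / math.sqrt(n))   # H0: mu = 0
p_value = upper_tail(t_stat, n - 1)         # one-sided, Ha: mu > 0

print(f"x_bar = {x_bar:.2f}, s = {s:.3f}, t = {t_stat:.2f}, P = {p_value:.4f}")
```

The same two helpers work for any one-sided t test with a non-negative t statistic; for a two-sided test, double the value returned by `upper_tail(abs(t_stat), df)`.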
Two-Tailed Example
• An investor with a stock portfolio worth several hundred thousand dollars sued his broker because lack of diversification in his portfolio led to poor performance. The table below gives the monthly rates of return (in percent) for the 39 months the account was managed by the broker.
   -8.36    1.63   -2.27   -2.93   -2.70   -2.93   -9.14   -2.64
    6.82   -2.35   -3.58    6.13    7.00  -15.25   -8.66   -1.03
   -9.16   -1.25   -1.22  -10.27   -5.11   -0.80   -1.44    1.28
   -0.65    4.34   12.22   -7.21   -0.09    7.34    5.04   -7.24
   -2.14   -1.01   -1.41   12.03   -2.56    4.33    2.35
• An arbitration panel compared these returns with the average of the Standard & Poor’s 500 stock index for the same period. Consider the 39 monthly returns as a random sample from the monthly returns the broker would generate if he managed the account forever. Are these returns compatible with a population mean equal to the S&P 500 average?

Two-Tailed Example
• Step 1 – Hypotheses
  – H0: μ = 0.95 against Ha: μ ≠ 0.95, where μ is the mean return (in percent) for all possible months that the broker could manage this account and 0.95% is the S&P 500 average for the period.
• Step 2 – Conditions
  – Since we don’t know σ, we must use t-procedures.
  – SRS – We were told to treat the data as a random sample.
  – Normality – Since we have 39 observations, the Central Limit Theorem says that the sampling distribution of x̄ is approximately Normal. A boxplot and a histogram show no outliers and no strong skewness.
  – Independence – This is a matter of judgment. Would these 39 months represent independent observations?

Two-Tailed Example
• Step 3 – Calculations
  – With x̄ = -1.10 and s = 5.99, t = (-1.10 - 0.95)/(5.99/√39) = -2.14.
  – Since our t-value is -2.14 and we have 39 observations, we use the row with df = 30 and get an upper-tail probability between 0.02 and 0.025. Since it is a two-tailed test, our P-value is between 0.04 and 0.05.
• Step 4 – Interpretation
  – The mean monthly return for this client’s account differs significantly from the S&P 500 average for the same period (t = -2.14, P < 0.05).
• Software outputs are on the next two slides!

Estimating Mean Stock Return – C.I.
• The mean monthly return on the client’s portfolio was x̄ = -1.10% and the standard deviation was s = 5.99%. Using t* = 2.042 (df = 30), our resulting 95% confidence interval is x̄ ± t*(s/√n) = -1.10 ± 2.042(5.99/√39) = -1.10 ± 1.96, or (-3.06%, 0.86%).
• Because the S&P 500 return, 0.95%, falls outside of our interval, we know that μ differs significantly from 0.95% at the α = 0.05 level. Since the S&P 500 showed a mean gain of 0.95% during this time period, we can say with 95% confidence that the underperformance of this portfolio is between 0.09% and 4.01% per month. This estimate helps to determine the compensation owed to the investor.

Paired t Tests
• In the taste test example, the same 10 tasters rated sweetness before and after storage. Since the data were paired by taster, we performed a one-sample t test on the differences.
• That is, we used a paired t test.
• We are now going to look at another example of a paired t test.

Floral Scents and Learning
• We hear that listening to Mozart improves students’ performance on tests. Perhaps pleasant odors have a similar effect. To test this idea, 21 subjects worked a paper-and-pencil maze while wearing a mask. The mask was either unscented or carried a floral scent. The response variable is their average time on three trials. Each subject worked the maze with both masks, in a random order. The randomization is important because subjects tend to improve their times as they work a maze repeatedly. The table below gives the subjects’ average times with both masks.
Floral Scents and Learning
  Subject  Unscented  Scented  Difference
     1       30.60     37.97     -7.37
     2       48.43     51.57     -3.14
     3       60.77     56.67      4.10
     4       36.07     40.47     -4.40
     5       68.47     49.00     19.47
     6       32.43     43.23    -10.80
     7       43.70     44.57     -0.87
     8       37.10     28.40      8.70
     9       31.17     28.23      2.94
    10       51.23     68.47    -17.24
    11       65.40     51.10     14.30
    12       58.93     83.50    -24.57
    13       54.47     38.30     16.17
    14       43.53     51.37     -7.84
    15       37.93     29.33      8.60
    16       43.50     54.27    -10.77
    17       87.70     62.73     24.97
    18       53.53     58.00     -4.47
    19       64.30     52.40     11.90
    20       47.37     53.63     -6.26
    21       53.67     47.00      6.67
• To analyze these data, subtract the scented times from the unscented times. A positive difference therefore indicates that the subject did better (was faster) wearing the scented mask.

Floral Scents and Learning
• Step 1 – Hypotheses
  – H0: μ = 0 against Ha: μ > 0, where μ is the mean difference (unscented minus scented) in the population from which the subjects were drawn.
• Step 2 – Conditions
  – Since we don’t know σ, we must use t procedures.
  – SRS – The data come from a randomized matched pairs design, which means we can attribute any difference to the treatment. We can only generalize the results to the population if the sample is an SRS.
  – Normality – Since the sample (n = 21) is not large, we must look at a stemplot and histogram of the differences to see if the distribution is reasonably Normal. It has no outliers or gaps and appears Normal.
  – Independence – It seems reasonable that one subject’s difference in average completion time with the two masks would be independent of another subject’s.

Floral Scents and Learning
• Step 3 – Calculations
  – The differences have mean x̄ = 0.9567 and standard deviation s = 12.548, so t = 0.9567/(12.548/√21) = 0.349 with df = 20.
  – For this t value, our P-value is greater than 0.25.
• Step 4 – Interpretation
  – Since this P-value is large, we fail to reject H0. There is not enough evidence to conclude that scented masks improve performance.
• The next three slides contain statistical software printouts for the significance test.
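The paired t test above can be reproduced with a short Python sketch. It is offered as an illustration only and is not one of the software printouts: it assumes the Python standard library, the helper names `t_density` and `upper_tail` are hypothetical, and the tail area is approximated by Simpson's rule. The key step is that the two columns are reduced to one column of differences before a one-sample t test is run.

```python
import math
import statistics

# Average maze times from the table, in subject order 1..21.
unscented = [30.60, 48.43, 60.77, 36.07, 68.47, 32.43, 43.70, 37.10, 31.17,
             51.23, 65.40, 58.93, 54.47, 43.53, 37.93, 43.50, 87.70, 53.53,
             64.30, 47.37, 53.67]
scented = [37.97, 51.57, 56.67, 40.47, 49.00, 43.23, 44.57, 28.40, 28.23,
           68.47, 51.10, 83.50, 38.30, 51.37, 29.33, 54.27, 62.73, 58.00,
           52.40, 53.63, 47.00]

def t_density(x, df):
    """Density of the t distribution with df degrees of freedom (hypothetical helper)."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) for t >= 0 using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_density(0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 0.5 - total * h / 3

# Pair the data by subject: difference = unscented - scented.
diffs = [u - s for u, s in zip(unscented, scented)]

n = len(diffs)
d_bar = statistics.mean(diffs)
s_d = statistics.stdev(diffs)
t_stat = d_bar / (s_d / math.sqrt(n))    # H0: mu = 0
p_value = upper_tail(t_stat, n - 1)      # one-sided, Ha: mu > 0

print(f"mean diff = {d_bar:.4f}, s = {s_d:.3f}, t = {t_stat:.3f}, P = {p_value:.3f}")
```

Treating the two columns as independent samples would ignore the pairing; working with the per-subject differences is what makes this a paired t test.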
One Sample t Test: Robustness and Power
• Recall from Section 10.2 that t procedures are robust against non-Normality of the population except when outliers or strong skewness are present.
• As the sample size increases, the Central Limit Theorem ensures that the distribution of the sample mean becomes more nearly Normal and that the t distribution becomes more accurate for calculating P-values.
• Review the guidelines in the box “Using the t Procedures” on p. 655.

One Sample t Test: Robustness and Power
• The power of a statistical test measures its ability to detect deviations from the null hypothesis.
• In practice, we carry out the test in the hope of showing that the null hypothesis is false, so higher power is important.
• The power of the one-sample t test against a specific alternative value of the population mean μ is the probability that the test will reject the null hypothesis when the mean has this alternative value.

CHAPTER 12 SECTION 2
Tests about a Population Proportion
HW: 12.23, 12.24, 12.26, 12.29, 12.30, 12.31, 12.32

Tests about a Population Proportion
• When the three important conditions are met, the sampling distribution of p̂ is approximately Normal with mean p and standard deviation √(p(1 - p)/n).
  – SRS
  – Normality – np and n(1 - p) are both at least 10
  – Independence – the population is at least 10 times as large as the sample
• For confidence intervals, since we were trying to estimate p, we replaced p with p̂ in the standard deviation formula, which gave us the standard error √(p̂(1 - p̂)/n).

Tests about a Population Proportion
• Now, we are performing a significance test. In a significance test, our null hypothesis specifies a value for p, which we call p0.
• So we use p0 to find the standard deviation, since we carry out the test assuming that H0 is true.
• This means that our test statistic is z = (p̂ - p0)/√(p0(1 - p0)/n).

Significance Test for a Proportion
• To test H0: p = p0, compute the z statistic above and find the P-value from the standard Normal distribution.

Work Stress
• According to the National Institute for Occupational Safety and Health, job stress poses a major threat to the health of workers.
A national survey of restaurant employees found that 75% said that work stress had a negative impact on their personal lives. A random sample of 100 employees from a large chain finds that 68 answer “Yes” when asked, “Does work stress have a negative impact on your personal life?” Is this good reason to think that the proportion of all employees in this chain who would say “Yes” differs from the national proportion of 0.75?

Work Stress
• Step 1 – Hypotheses
  – H0: p = 0.75 against Ha: p ≠ 0.75, where p is the proportion of all employees in this chain who would say “Yes.”
• Step 2 – Conditions
  – We should use a one-proportion z-test.
  – SRS – We are told it was a random sample.
  – Normality – The expected numbers of “Yes” and “No” responses are np0 = 100(0.75) = 75 and n(1 - p0) = 100(0.25) = 25, which are both at least 10.
  – Independence – We must assume that this large chain has at least 10(100) = 1000 employees.

Work Stress
• Step 3 – Calculations
  – With p̂ = 68/100 = 0.68, z = (0.68 - 0.75)/√((0.75)(0.25)/100) = -1.62.
  – Since the alternative is two-sided, we must find the probability that z is less than -1.62 or greater than 1.62: P = 2(0.0526) = 0.1052.
• Step 4 – Interpretation
  – Since our P-value of 0.1052 is fairly large, we fail to reject the null hypothesis. There is not enough evidence to conclude that the proportion of workers at the large restaurant chain who suffer from work stress differs from the national survey result of 0.75.

Work Stress
• For the work stress example, we arbitrarily chose a response of “Yes” to be a success and “No” to be a failure.
• What would happen if we reversed these?
• Let’s repeat the significance test with “No” being a success. The national comparison value for the significance test will now be 0.25, the proportion in the national sample who said “No.”

Work Stress, Again
• Step 1 – Hypotheses
  – H0: p = 0.25 against Ha: p ≠ 0.25, where p is now the proportion of all employees in this chain who would say “No.”
• Step 2 – Conditions
  – We should use a one-proportion z-test.
  – SRS – We are told it was a random sample.
  – Normality – The expected numbers of “No” and “Yes” responses are 25 and 75, which are both at least 10.
  – Independence – We must assume that this large chain has at least 1000 employees.
Work Stress, Again
• Step 3 – Calculations
  – With p̂ = 32/100 = 0.32, z = (0.32 - 0.25)/√((0.25)(0.75)/100) = 1.62.
  – Since the alternative is two-sided, we must find the probability that z is less than -1.62 or greater than 1.62: P = 2(0.0526) = 0.1052.
• Step 4 – Interpretation
  – Since our P-value of 0.1052 is fairly large, we fail to reject the null hypothesis. There is not enough evidence to conclude that the proportion of workers at the large restaurant chain who would say “No” differs from the national survey result of 0.25.

Work Stress, Again
• When we interchanged “Yes” and “No,” we simply changed the sign of the test statistic z. Our P-value remained the same.
• These results are true in general. Our conclusion does not depend on an arbitrary choice of success and failure.

Significance Test Results
• The results of a significance test will often have limited use.
• We would never expect the experiences of a sample to be exactly the same as those of the overall population.
• If our sample is sufficiently large, we will have enough power to detect even a very small difference.
• On the other hand, if our sample is very small, we may be unable to detect differences that could be very important.
• This is why we prefer to include a confidence interval as part of our analysis.

Confidence Intervals Provide More Info
• A confidence interval allows us to see what other values of p are compatible with the sample results.
• This is why we will calculate a confidence interval.

Estimating Work Stress
• The restaurant worker survey found that 68 out of 100 employees agreed that work stress had a negative impact on their personal lives. We want to create a 95% confidence interval.
• Step 1 – Population – We want to estimate p, the proportion of the restaurant chain’s employees who believe that work stress has a negative impact on their personal lives.
• Step 2 – Conditions
  – We should use a one-proportion z-interval.
  – SRS – We are told it was a random sample.
  – Normality – We have 68 successes and 32 failures in our sample, both at least 10.
    Therefore the sampling distribution of p̂ is approximately Normal.
  – Independence – We must assume that this large chain has at least 1000 employees.

Estimating Work Stress
• Note: The Normality checks for a confidence interval and a significance test are different. For the confidence interval we check that np̂ ≥ 10 and n(1 - p̂) ≥ 10, while for a significance test we check that np0 ≥ 10 and n(1 - p0) ≥ 10, since the test assumes that the null hypothesis is true.

Estimating Work Stress
• Step 3 – Calculations
  – p̂ ± z*√(p̂(1 - p̂)/n) = 0.68 ± 1.96√((0.68)(0.32)/100) = 0.68 ± 0.09, or (0.59, 0.77).
• Step 4 – Interpretation
  – We are 95% confident that between 59% and 77% of the restaurant chain’s employees feel that work-related stress is damaging their personal lives.

Estimating Work Stress
• The confidence interval gives us much more information than the significance test.
• The confidence interval tells us which values of p are consistent with the sample results.
• Notice that we use the standard error (based on p̂) to create a confidence interval, while we use the hypothesized value p0 to calculate the z statistic.
• As a result, we do not have the exact relationship between a confidence interval and a two-tailed significance test that we had for means. The results are still very close, but they do not match exactly.
• Our confidence interval (0.59, 0.77) gives an approximate range of p0’s that would not be rejected by a test at the α = 0.05 significance level.
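The work stress test and interval can be checked with a minimal Python sketch, added here as an illustration. It assumes only the standard library (`statistics.NormalDist` supplies the standard Normal distribution), and the function name `one_prop_z` is made up for this example. Because the code does not round z to -1.62 before finding the tail area, its P-value comes out near 0.106 rather than exactly the 0.1052 read from the table.

```python
import math
from statistics import NormalDist

n = 100          # sample size
yes = 68         # employees who answered "Yes"
p0 = 0.75        # national proportion under H0
p_hat = yes / n

def one_prop_z(p_hat, p0, n):
    """Two-sided one-proportion z test (hypothetical helper).

    The standard deviation uses p0, because the test assumes H0 is true.
    """
    sd0 = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / sd0
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value

# "Yes" as success, then "No" as success: z flips sign, P-value is unchanged.
z_yes, p_yes = one_prop_z(p_hat, p0, n)
z_no, p_no = one_prop_z(1 - p_hat, 1 - p0, n)

# 95% confidence interval: the standard error uses p_hat, not p0.
z_star = NormalDist().inv_cdf(0.975)
se = math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - z_star * se, p_hat + z_star * se)

print(f"z (Yes) = {z_yes:.2f}, z (No) = {z_no:.2f}, P = {p_yes:.4f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The two different standard deviations in the code (`sd0` from p0 for the test, `se` from p̂ for the interval) are the reason the interval and the two-sided test agree only approximately for proportions.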