Statistics 512 Notes 6: Hypothesis Testing Continued

Quick Review on Hypothesis Testing

Goal: Decide between two hypotheses about a parameter of interest $\theta$,
$H_0: \theta \in \Theta_0$ vs. $H_1: \theta \in \Theta_1$, where $\Theta_0 \cap \Theta_1 = \emptyset$.

Null vs. alternative hypothesis: The alternative hypothesis is the hypothesis for which we are trying to find strong evidence. The null hypothesis is the default hypothesis, which we retain unless there is strong evidence for the alternative.

Test statistic and critical region: A test is defined by a test statistic $W(X_1, \ldots, X_n)$ and a critical region $C$. The critical region is the set of values of the test statistic for which we reject the null hypothesis.

Errors in hypothesis testing: A Type I error is rejecting the null hypothesis when it is true; a Type II error is failing to reject the null hypothesis when it is false.

Size of test, power of test:
Power function of test: $\gamma_C(\theta) = P_\theta(W(X_1, \ldots, X_n) \in C)$ = probability of rejecting the null hypothesis when the true parameter is $\theta$.
Size of test = $\max_{\theta \in \Theta_0} \gamma_C(\theta)$.
Power at an alternative $\theta_1 \in \Theta_1$ = $\gamma_C(\theta_1)$.

Neyman-Pearson paradigm: Choose the size of the test to be reasonably small to protect against Type I error, typically 0.05 or 0.01. Among tests that have the prescribed size, choose the most powerful test.

P-value: For a test statistic $W(X_1, \ldots, X_n)$, consider a family of critical regions $\{C_\lambda : \lambda \in \Lambda\}$, each with a different size. For the observed value $W_{obs}$ of the test statistic, consider the subset of critical regions for which we would reject the null hypothesis, $\{C_\lambda : W_{obs} \in C_\lambda\}$. The p-value is the minimum size of the tests in this subset,

$$\text{p-value} = \min_{\{C_\lambda : W_{obs} \in C_\lambda\}} \text{Size(test with critical region } C_\lambda).$$

The p-value is a measure of how much evidence there is against the null hypothesis; it is the smallest significance level at which we would still reject the null hypothesis.

Consider the family of critical regions $C_i = \{\sum_{j=1}^{10} X_j \ge i\}$ for the motivating example. Since the graphologist made 6 correct identifications, we reject the null hypothesis for critical regions $C_i$, $i \le 6$. The minimum size of the critical regions $C_i$, $i \le 6$, is attained at $i = 6$ and equals 0.377.
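The size of $C_6$ quoted above can be checked directly from the binomial distribution. The sketch below is only an illustration: it assumes 10 trials (taken from the upper limit of the sum) and a success probability of 0.5 under the null hypothesis, and the function and variable names are made up for this example.

```python
from math import comb


def binomial_upper_tail(n: int, k: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k, n + 1))


# Assumed setup: 10 trials, success probability 0.5 under the null hypothesis.
n_trials = 10
p_null = 0.5

# Size of each critical region C_i = {sum of X_j >= i}, for i = 0, ..., 10.
sizes = {i: binomial_upper_tail(n_trials, i, p_null) for i in range(n_trials + 1)}

observed_correct = 6
size_c6 = sizes[observed_correct]

for i, size in sizes.items():
    marker = "  <- observed count lies in this region" if i <= observed_correct else ""
    print(f"C_{i}: size = {size:.3f}{marker}")
```

The printed table lists the size of every region in the family, so the value for $C_6$ can be compared with the sizes of the other regions that contain the observed count.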
The p-value is thus 0.377.

Scale of evidence

p-value       Evidence
< 0.01        very strong evidence against the null hypothesis
0.01-0.05     strong evidence against the null hypothesis
0.05-0.10     weak evidence against the null hypothesis
> 0.10        little or no evidence against the null hypothesis

Large sample binomial hypothesis tests: For large samples, we can use the Central Limit Theorem to construct a test whose size approaches a prescribed value as the sample becomes large.

The Sports Illustrated Jinx: Many athletes believe that there is a Sports Illustrated jinx: appearing on the cover of Sports Illustrated tends to lead to a subsequent decline in performance. Gluckson and Leone (1984) put the Sports Illustrated jinx to the test. Let $p$ denote the probability that the performance level of a cover subject declines. If, under normal circumstances, performance is as likely to decline as not, then the hypotheses that Gluckson and Leone set out to test can be written

$H_0: p = \frac{1}{2}$ (SI cover has no effect)
$H_1: p > \frac{1}{2}$ (SI jinx exists)

Included in the study were some 271 subjects appearing on SI covers during the years 1954 through 1983. Let $Y$ denote the number of subjects whose performance subsequently declined. We use $Y$ as our test statistic. We would like a test of size approximately 0.05. Consider critical regions of the form $C = \{Y : Y \ge y^*\}$. To choose $y^*$ so that the test has size 0.05, we need to solve

$$P_{p=0.5}(Y \ge y^*) = \sum_{y=y^*}^{271} \binom{271}{y} \left(\frac{1}{2}\right)^{y} \left(\frac{1}{2}\right)^{271-y} = 0.05.$$

As written, this equation is difficult to solve. The task is greatly simplified by the Central Limit Theorem, which says that for a binomial random variable $Y$,

$$\frac{Y - np}{\sqrt{np(1-p)}} \xrightarrow{D} N(0,1).$$

Thus,

$$P_{p=0.5}(Y \ge y^*) \approx P\left(Z \ge \frac{y^* - 271(0.5)}{\sqrt{271(0.5)(1-0.5)}}\right)$$

for a standard normal random variable $Z$. Since $P(Z \ge 1.64) \approx 0.05$, it follows that $y^* = 135.5 + 1.64\sqrt{271(0.5)(0.5)}$. Specifically, $y^* = 149$. The observed number of declines was 114. Since $114 < 149$, we do not reject the null hypothesis.
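The cutoff and the decision can be reproduced numerically. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; the variable names are illustrative, and it uses the unrounded 0.95 quantile (about 1.645) rather than 1.64, so the cutoff differs from the hand calculation only in the first decimal place.

```python
from math import sqrt
from statistics import NormalDist

n = 271          # cover subjects in the study
p0 = 0.5         # value of p under the null hypothesis
observed = 114   # observed number of declines

mean0 = n * p0                    # mean of Y under H0
sd0 = sqrt(n * p0 * (1 - p0))     # standard deviation of Y under H0

# Upper 5% point of the standard normal distribution.
z_crit = NormalDist().inv_cdf(0.95)

# Normal-approximation cutoff for the critical region {Y >= y*}.
y_star = mean0 + z_crit * sd0

# Standardized observed count and the resulting decision.
z_obs = (observed - mean0) / sd0
reject = z_obs >= z_crit

print(f"cutoff y* = {y_star:.1f}")
print(f"z_obs = {z_obs:.2f}")
print(f"reject H0: {reject}")
```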
There is no strong evidence of a Sports Illustrated jinx.

p-value: Consider the test with critical region

$$C_\lambda = \left\{Y : \frac{Y - 271(0.5)}{\sqrt{271(0.5)(1-0.5)}} \ge \lambda\right\}.$$

The approximate size of the test is $P(Z \ge \lambda) = 1 - \Phi(\lambda)$. We have

$$\frac{Y_{obs} - 271(0.5)}{\sqrt{271(0.5)(0.5)}} = \frac{114 - 271(0.5)}{\sqrt{271(0.5)(0.5)}} = -2.61.$$

Thus, we reject $H_0$ for all tests with critical region $C_\lambda$ with $\lambda \le -2.61$. The minimum size among the tests for which we reject $H_0$ is attained by $C_{-2.61}$, with size $1 - \Phi(-2.61) = 0.995$. This is the p-value: p-value = 0.995. There is no evidence against the null hypothesis, that is, no evidence of a Sports Illustrated jinx.

Choosing the sample size

In the Neyman-Pearson paradigm, we choose the size of the test to be small to protect against Type I errors; typically we set the size to 0.05. This constrains the power of the test. To achieve both a small size and a high power, we can choose the sample size.

Example: Suppose we want to test $H_0: p = \frac{1}{2}$ vs. $H_1: p > \frac{1}{2}$ from an iid Bernoulli sample. Suppose we want the size to be 0.05 and the power to be 0.8 for the alternative $p = 0.6$. How large a sample do we need if we use the large sample binomial test?

Let $Y$ be the number of successes. Using the large sample binomial test, the test statistic is

$$W = \frac{Y - n(0.5)}{\sqrt{n(0.5)(1-0.5)}}.$$

For large $n$, $W$ has approximately a standard normal distribution when $p = 0.5$. Thus, a test of size 0.05 has critical region $C = \{W : W \ge 1.645\}$. The power of this test when $p = 0.6$ is

$$P_{p=0.6}\left(\frac{Y - n(0.5)}{\sqrt{n(0.5)(0.5)}} \ge 1.645\right) = P_{p=0.6}\left(\frac{Y - n(0.6)}{\sqrt{n(0.5)(0.5)}} \ge 1.645 - \frac{n(0.1)}{\sqrt{n(0.5)(0.5)}}\right) \approx P\left(Z \ge 1.645 - \frac{n(0.1)}{\sqrt{n(0.5)(0.5)}}\right).$$

We have $\Phi^{-1}(0.8) = 0.842$, where $\Phi$ is the standard normal CDF, so $P(Z \ge a) \ge 0.8$ exactly when $a \le -0.842$. Thus, we want to choose the sample size $n$ so that

$$1.645 - \frac{n(0.1)}{\sqrt{n(0.5)(0.5)}} \le -0.842.$$

The smallest sample size $n$ that achieves this is found by setting

$$1.645 - \frac{n(0.1)}{\sqrt{n(0.5)(0.5)}} = -0.842$$

and solving for $n$. Since $\frac{n(0.1)}{\sqrt{n(0.5)(0.5)}} = 0.2\sqrt{n}$, this gives $\sqrt{n} = 2.487/0.2 = 12.435$, resulting in $n = 154.6$. Thus, the smallest sample size needed is $n = 155$.

Testing a normal mean

Suppose $X_1, \ldots, X_n$ are iid $N(\mu, \sigma^2)$ with the variance $\sigma^2$ known. We want to test $H_0: \mu = \mu_0$ vs. $H_1: \mu > \mu_0$. Consider the test statistic $z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$ and critical region $C = \{z : z \ge c\}$.
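What do we need to choose $c$ to be so that the size of the test is 0.05? We have

$$P_{\mu_0}\left(\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \ge c\right) = P(Z \ge c),$$

where $Z$ is a standard normal random variable. Thus, we want to choose $c$ to be the 0.95 quantile of the standard normal distribution, which equals 1.645.

Suppose we wanted to test $H_0: \mu \le \mu_0$ vs. $H_1: \mu > \mu_0$. The size of the test with test statistic $z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$ and critical region $C = \{z : z \ge c\}$ is

$$\max_{\mu \le \mu_0} P_\mu\left(\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \ge c\right).$$

We have

$$P_\mu\left(\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \ge c\right) = P_\mu\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \ge c + \frac{\mu_0 - \mu}{\sigma/\sqrt{n}}\right) = P\left(Z \ge c + \frac{\mu_0 - \mu}{\sigma/\sqrt{n}}\right) = 1 - \Phi\left(c + \frac{\mu_0 - \mu}{\sigma/\sqrt{n}}\right).$$

Because $1 - \Phi\left(c + \frac{\mu_0 - \mu}{\sigma/\sqrt{n}}\right)$ is an increasing function of $\mu$, the maximum over $\mu \le \mu_0$ is attained at $\mu = \mu_0$, and the size of the test is

$$P_{\mu_0}\left(\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \ge c\right).$$

Thus a test of size 0.05 for testing $H_0: \mu \le \mu_0$ vs. $H_1: \mu > \mu_0$ is the same as the test of size 0.05 for testing $H_0: \mu = \mu_0$ vs. $H_1: \mu > \mu_0$: the critical region is $C = \{z : z \ge c\}$ with $c = 1.645$, where $z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$.

Two sided tests: Suppose we want to test $H_0: \mu = \mu_0$ vs. $H_1: \mu \ne \mu_0$. Using the test statistic $z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$ still seems reasonable, but now it makes sense to reject for both very large and very small values of $z$. We can use a critical region of the form $C = \{z : |z| \ge c\}$. A test of size 0.05 has critical region $C = \{z : |z| \ge 1.96\}$ because

$$P_{\mu_0}\left(\left|\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}\right| \ge c\right) = P(|Z| \ge c),$$

and $P(|Z| \ge 1.96) = 0.05$.

Duality between tests and confidence intervals

Suppose we want to test $H_0: \mu = \mu_0$ vs. $H_1: \mu \ne \mu_0$ and use the rejection region $C = \{z : |z| \ge 1.96\}$. Then the set of $\mu_0$ for which $H_0: \mu = \mu_0$ is not rejected is

$$\left\{\mu_0 : \left|\frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}\right| < 1.96\right\} = \left\{\mu_0 : -1.96\frac{\sigma}{\sqrt{n}} < \bar{X} - \mu_0 < 1.96\frac{\sigma}{\sqrt{n}}\right\} = \left\{\mu_0 : \bar{X} - 1.96\frac{\sigma}{\sqrt{n}} < \mu_0 < \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right\},$$

which is the 95% confidence interval for $\mu$ that we have used.

In general, there is a duality between tests and confidence intervals. Suppose we have a family of tests of size $\alpha$ of $H_0: \theta = \theta_0$ vs. $H_1: \theta \ne \theta_0$, one for each $\theta_0$. Then $\{\theta_0 : \text{the test of } H_0: \theta = \theta_0 \text{ vs. } H_1: \theta \ne \theta_0 \text{ is not rejected}\}$ is a $(1-\alpha)$ confidence interval for $\theta$.
Proof:

Conversely, suppose we have a $(1-\alpha)$ confidence interval for $\theta$. Then a test of size $\alpha$ of $H_0: \theta = \theta_0$ vs. $H_1: \theta \ne \theta_0$ is to reject the null hypothesis if and only if $\theta_0$ does not belong to the confidence region.
Proof: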
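One way to sketch both directions of the duality is given below. It assumes that each test has size exactly $\alpha$ at its null value $\theta_0$ and that the confidence region has coverage exactly $1-\alpha$ for every $\theta$. The symbols $A(\theta_0)$ (the set of samples for which $H_0: \theta = \theta_0$ is not rejected) and $S(X)$ (the confidence region) are notation introduced here for illustration, not taken from the notes.

```latex
% Assumed notation (introduced for this sketch):
%   A(\theta_0) = set of samples X for which H_0: \theta = \theta_0 is not rejected
%   S(X)        = confidence region; in the first line S(X) = \{\theta_0 : X \in A(\theta_0)\}
\begin{align*}
\text{Tests} \Rightarrow \text{interval:}\quad
P_{\theta_0}\bigl(\theta_0 \in S(X)\bigr)
  &= P_{\theta_0}\bigl(X \in A(\theta_0)\bigr)
   = 1 - P_{\theta_0}(\text{reject } H_0: \theta = \theta_0)
   = 1 - \alpha . \\
\text{Interval} \Rightarrow \text{test:}\quad
P_{\theta_0}(\text{reject } H_0: \theta = \theta_0)
  &= P_{\theta_0}\bigl(\theta_0 \notin S(X)\bigr)
   = 1 - P_{\theta_0}\bigl(\theta_0 \in S(X)\bigr)
   = 1 - (1 - \alpha) = \alpha .
\end{align*}
```

In both lines the key observation is that the events "$\theta_0$ lies in the region" and "the test of $H_0: \theta = \theta_0$ does not reject" are the same event, so their probabilities under $\theta_0$ agree.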