Module 7: Hypothesis Testing I
Statistics (OA3102)
Professor Ron Fricker
Naval Postgraduate School
Monterey, California
Reading assignment: WM&S chapter 10.1-10.5
Revision: 2-12

Goals for this Module
• Introduction to testing hypotheses
– Steps in conducting a hypothesis test
– Types of errors
– Lots of terminology
• Large-sample tests for means and proportions
– Aka z-tests
– Type II error probability calculations
– Sample size calculations

A Simple Example
• You hypothesize a coin is fair
– To test, take a coin and start flipping it
– If it is fair, you expect that about half of the flips will be heads and half tails
– After a large number of flips, if you see either a large fraction of heads or a large fraction of tails, you tend to disbelieve your hypothesis
• This is an informal hypothesis test!
• How do we decide when the fraction is too large?

Is the Coin Fair?
• Back to the in-class exercise from Module 1: flip a coin 10 times and count the number of heads
• Assuming the coin is fair (and flips are independent), the number of heads has a binomial distribution with n=10 and p=0.5
• So, if our assumption is true, we know the distribution of the number of heads
• And, if our assumption is not true, what do we expect to see? Either too few or too many heads

Setting Up a Test for the Coin
• Idea: make a rule, based on the number of heads observed, from which we will conclude either that our assumption (p=0.5) is true or false
• Here's one rule: conclude p=0.5 if 3 ≤ x ≤ 7
– Otherwise, conclude p≠0.5
• If p=0.5, what's our chance of making a mistake? About 11%
• If p=0.8, what's our chance of detecting the biased coin? About 68%
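The two error rates quoted for the coin rule can be checked directly from the binomial distribution; a quick sketch, assuming SciPy is available (SciPy is my choice here, not something the slides use):

```python
# Sanity-check the coin-test error rates: the rule rejects "fair" (p = 0.5)
# when the number of heads x <= 2 or x >= 8 out of n = 10 flips.
from scipy.stats import binom

n = 10

def reject_prob(p):
    """Probability the rule rejects p = 0.5 when heads ~ binomial(n, p)."""
    return binom.cdf(2, n, p) + binom.sf(7, n, p)  # P(X <= 2) + P(X >= 8)

alpha = reject_prob(0.5)   # Type I error probability, ~0.109
power = reject_prob(0.8)   # chance of detecting a p = 0.8 coin, ~0.678
print(round(alpha, 3), round(power, 3))
```

This reproduces the "about 11%" and "about 68%" figures on the slide.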
CIs vs. Hypothesis Testing
• Previously we would have answered this question with a confidence interval:
– Say we observed y/n = 0.7: we're 95% confident that the interval [0.5, 0.9] covers the true p
• Now we look to answer the question using a hypothesis test:
– If the true probability of heads is p = 0.5 (i.e., the coin is fair), how unlikely would it be to see 7 heads out of ten flips?
• If we see an outcome inconsistent with our hypothesis (the coin is fair), then we reject it

Why Do Hypothesis Tests?
• Confidence intervals provide more information than hypothesis tests
• But often we need to test a particular theory
– E.g., "The mod decreases the mean down-time."
• Sometimes confidence intervals are hard or impossible to construct
– When the theory you're testing has several dimensions
• E.g., regression slope and intercept
– When "intervals" don't make sense
• E.g., are two categorical variables independent?

Elements of a Statistical Test
• Null hypothesis
– Denoted H0
– We believe the null unless/until it is proven false
• Alternative hypothesis
– Denoted Ha
– It's what we would like to prove
• Test statistic
– The test statistic summarizes the empirical evidence from the data
• Rejection region
– If the test statistic falls in the rejection region, we say "reject the null hypothesis" or "we've proven the alternative"
– Otherwise, we "fail to reject the null"

Errors in Hypothesis Testing

conclusion        H0 true         Ha true
don't reject H0   no error        Type II error
reject H0         Type I error    no error

Probability of a Type I Error
• α = Pr(Type I error) = Pr(H0 is rejected when it is true)
• Also called the significance level or the level of the test
• The experimenter chooses α prior to the test
– Conventions: α = 0.10, 0.05, and 0.01
– Can increase or decrease depending on the particular problem
• The choice of α defines the rejection region (or the size of "p-value") that results in rejection of the null
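The meaning of the significance level can be illustrated by simulation: when the null is true, a level-0.05 test should reject about 5% of the time. A minimal sketch, assuming NumPy is available (the standard-normal setup here is my illustration, not from the slides):

```python
# Simulate the Type I error rate of a lower-tailed level-0.05 z-test:
# sample repeatedly with H0 true and count how often the test rejects.
import numpy as np

rng = np.random.default_rng(1)
mu0, sigma, n, reps = 0.0, 1.0, 25, 100_000
crit = -1.645  # lower 0.05 quantile of N(0, 1)

samples = rng.normal(mu0, sigma, size=(reps, n))
z = (samples.mean(axis=1) - mu0) / (sigma / np.sqrt(n))
rate = np.mean(z < crit)
print(round(rate, 3))  # close to alpha = 0.05
```

The empirical rejection rate converges to the chosen α as the number of replications grows.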
Probability of a Type II Error
• β = Pr(Type II error) = Pr(not rejecting H0 when it is false)
• It is a function of the actual alternative distribution and the sample size
• 1−β is called the power of the test
– It's Pr(rejecting H0 when it is false)
• The ideal is both small α and small β (i.e., high power), but for a fixed sample size they trade off
– By convention, we control α by choice and β with sample size (bigger sample, more power)

Choosing the Null and Alternative Based on the Severity of Error
• In hypothesis testing, we get to control the Type I error
– So, if one error is more severe than the other, set the test up so that it's the Type I error
• E.g., in the US, we consider sending an innocent person to jail more serious than letting a guilty person go free
– Hence, the burden of proof of guilt is placed on the prosecution at trial ("innocent until proven guilty")
– I.e., the null hypothesis is that a person is innocent, and the trial process controls the chance of sending an innocent person to jail

Example 10.1
• Consider a political poll of n=15 people
– Want to test H0: p = 0.5 vs. Ha: p < 0.5, where p is the proportion of the population favoring a candidate
– The test statistic is Y, the number of people in the sample favoring the candidate
• If the RR = {y ≤ 2}, find α
• Solution:

Example 10.2
• Continuing with the previous problem: if p=0.3, what is the probability of a Type II error (β)?
– I.e., what's the probability of concluding that the candidate will win?
• Solution:

Example 10.3
• Still continuing with the previous problem: if p=0.1, what is the probability of a Type II error (β)?
• Solution:

Example 10.4
• Still continuing the problem, if RR = {y ≤ 5}:
– What is the level of the test (α)?
– If p = 0.3, what is β?
• Solution:

Terminology
• Null and alternative hypotheses
– We will believe the null until it is proven false
• Acceptance vs. rejection region
– The null is proven false if the test statistic falls in the rejection region
• Type I vs. Type II error
– Type I: rejecting the null hypothesis when it is really true
– Type II: accepting the null hypothesis when it is really false
• Significance level, or level of the test (α)
– The probability of a Type I error
• Pr(Type II error) = β, and 1−β is called the power
– It's a function of the actual alternative distribution

Another Example
• As a program manager, you want to decrease the mean down-time μ for a type of equipment
– The manufacturer suggests a modification
– The goal is to see if the mod actually does decrease μ
• Implement the modification on a sample of equipment (n=25) and measure the down-time (say, per month)
– Currently, equipment down-time is 75 mins/month
– The standard deviation is σ=9 mins

Intuition Behind the Test
1. We will assume the status quo is true...
a. In this case, that there is no change in the mean down-time
2. ...unless there is sufficient evidence to contradict it
a. In this case, meaning we observe a much smaller sample mean than we would expect to see assuming μ=75
(Figure: sampling distribution of the sample mean, centered at μ=75 with standard error σ/√n = 9/5)

Steps in Hypothesis Testing
1. Identify the parameter of interest
2. State null and alternative hypotheses
3. Determine form of test statistic
4. Calculate rejection region
5. Calculate test statistic
6. Determine test outcome by comparing test statistic to rejection region
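The binomial calculations behind Examples 10.1-10.4 (the solutions themselves were worked in class) can be sketched numerically, assuming SciPy is available:

```python
# Error probabilities for the political-poll examples:
# n = 15 voters, H0: p = 0.5 vs. Ha: p < 0.5, test statistic Y = # favoring.
from scipy.stats import binom

n = 15

# Example 10.1: level of the test for RR = {y <= 2}
alpha1 = binom.cdf(2, n, 0.5)        # ~0.0037

# Examples 10.2/10.3: beta = P(Y > 2) under the alternative value of p
beta_p03 = 1 - binom.cdf(2, n, 0.3)  # ~0.873 at p = 0.3
beta_p01 = 1 - binom.cdf(2, n, 0.1)  # ~0.184 at p = 0.1

# Example 10.4: the wider region RR = {y <= 5}
alpha4 = binom.cdf(5, n, 0.5)        # ~0.151
beta4_p03 = 1 - binom.cdf(5, n, 0.3) # ~0.278

print(alpha1, beta_p03, beta_p01, alpha4, beta4_p03)
```

Note the trade-off the slides describe: enlarging the rejection region from {y ≤ 2} to {y ≤ 5} raises α but lowers β at p = 0.3.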
1. Identify the Parameter of Interest
• In hypothesis tests, we are testing a parameter of a distribution
– E.g., μ or σ for a normal distribution
– E.g., p for a binomial distribution
– E.g., α and β for a gamma distribution
• So, the first step is to identify the parameter of interest
– Often we'll be testing the mean μ of a normal distribution, since the CLT often applies to the sample mean
• E.g., in the equipment down-time example, we're interested in testing the mean down-time
– For the coin and election examples, we're testing p

2. State Null and Alternative Hypotheses
• H0: the null hypothesis is a specific theory about the population that we wish to disprove
– We will believe the null until it is disproved
– Example: mean down-time is equal to 75
• In notation, H0: μ = 75
• Ha: the alternative hypothesis is what we want to prove
– What we will believe if the null is rejected
– Example: mean down-time is less than 75
• In notation, Ha: μ < 75

Null and Alternative Hypotheses are Fundamentally Different
• The null hypothesis is what you have assumed
– Generally, it's the status quo or the less desirable test outcome
– Failing to prove the alternative does not mean the null is true, only that you don't have enough evidence to reject it
• The alternative is proven based on empirical evidence
– It's the desired test outcome and/or the outcome upon which the burden of proof rests
– The significance level (α) is set so that the chance of incorrectly "proving" the alternative is tolerably low
• Having proved the alternative is a much stronger outcome than failing to reject the null
– Thus, structure the test so the alternative is what needs proving

In the Example
• In the example, we want to test whether the mod decreases mean down-time
– So, the null hypothesis is the status quo, and the alternative carries the burden of proof to show a decrease
– We write this out as H0: μ = 75 vs. Ha: μ < 75
• The other possibilities are:
H0: μ = 75 vs. Ha: μ > 75, and H0: μ = 75 vs. Ha: μ ≠ 75
– What would you be testing with these hypotheses?

Expressing the Null as an Equality
• We will express the null hypothesis as an equality and the alternative as an inequality
– E.g., H0: μ = 75 versus Ha: μ < 75
• In reality, the hypotheses divide the real line into two separate regions
– E.g., H0: μ ≥ 75 versus Ha: μ < 75
• However, the most powerful test occurs when the null hypothesis is at the boundary of its region
– Hence, we write the null as an equality

3. Determine the Test Statistic (and its Sampling Distribution)
• The test statistic is (a function of) the sample statistic corresponding to the population parameter you are testing
– Population and sample statistic examples:
• Population mean ↔ sample mean
• Difference of two population means ↔ difference of two sample means
– It is sometimes "a function of" the sample statistic because we may rescale the sample statistic
• We use the sampling distribution to determine whether the observed statistic is "unusual"

In the Example
• In the example, we are testing the mean μ, so the obvious choice for the sample statistic is the sample mean X̄
• In this case, it's easier to make the test statistic the rescaled sample statistic
Z = (X̄ − μ0) / (σ/√n),
where μ0 is the null hypothesis mean
• Why? Because, assuming the null hypothesis is true, we know the sampling distribution of Z: Z ~ N(0,1)
– This is exactly true if X has a normal distribution, and approximately true via the CLT for large sample sizes

4. Calculate the Rejection Region
• The rejection region depends on the alternative hypothesis
• Set the significance level α so that Pr(test statistic falls in the rejection region | null hypothesis
is true) = α
• This means you will have to determine the appropriate quantile or quantiles of the sampling distribution

The Way to Think About It (for a "two-sided" test)
• Rejection region: unlikely under the null (i.e., probability α)
– If the test statistic falls in this region, reject the null
• Acceptance region: likely under the null hypothesis
– If the test statistic falls in this region, fail to reject the null

In the Example
• In our example, the test statistic Z ~ N(0,1)
• In addition, we know Ha: μ < 75, so this is a "one-tailed" or "one-sided" test
• So, we need to find zα such that Pr(Z < zα) = α
• Choosing α=0.05:
– From Table 4 we get Pr(Z < −1.645) = 0.05
– Using Excel: =NORMSINV(0.05)
– And in R: qnorm(0.05)

Picturing the Rejection Region
• For the rescaled test statistic Z ~ N(0,1), the rejection region is the lower tail below −1.645
– The probability of falling in that region, assuming the null is true, is α=0.05
• The equivalent test without rescaling uses X̄ ~ N(75, 81/25): reject when X̄ < 75 − 1.645 × 9/5 = 72
– The probability of falling in that region, assuming the null is true, is still α=0.05

5. Calculate the Test Statistic
• Now plug the necessary quantities into the formula and calculate the test statistic
• E.g., in the example, imagine we tested 25 pieces of equipment for a month, measuring the total down-time for each
• Calculating the sample mean gives X̄ = 68.6
• Thus, the test statistic is
z = (X̄ − μ0)/(σ/√n) = (68.6 − 75)/(9/√25) = −3.8
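The rejection-region numbers above (−1.645 on the z scale, 72 on the raw down-time scale) can be reproduced with a few lines; a sketch, assuming SciPy is available:

```python
# Rejection region for the down-time test, in both scales:
# rescaled (Z ~ N(0,1)) and raw (X-bar ~ N(75, 81/25) under H0).
from scipy.stats import norm

alpha, mu0, sigma, n = 0.05, 75.0, 9.0, 25

z_crit = norm.ppf(alpha)                   # same as Excel NORMSINV(0.05) / R qnorm(0.05)
xbar_crit = mu0 + z_crit * sigma / n**0.5  # cutoff in minutes per month

print(round(z_crit, 3), round(xbar_crit, 1))
# Reject H0: mu = 75 in favor of Ha: mu < 75 when z < -1.645,
# equivalently when the sample mean falls below about 72 minutes.
```

Working on either scale gives the same test; rescaling just lets us reuse the standard normal table.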
6. Determine the Outcome
• Now compare the test statistic with the rejection region
– If it falls within the rejection region, you have "rejected the null hypothesis"
• Equivalently, "proven the alternative"
– If it falls in the acceptance region, you have "failed to reject the null hypothesis"
• Equivalently, "failed to prove the alternative"

In the Example
• We observed z = −3.8, which falls in the rejection region
– It would be very unusual to see this if the null were true
• Thus, conclude that the mod is effective at reducing mean down-time for the equipment

One Way to Think About It
• The test statistic tells us our observation was 3.8 standard errors away from the hypothesized mean
– The calculation: z = (X̄ − μ0)/(σ/√n) = (68.6 − 75)/(9/√25) = −3.8
• How likely is this?
– If the null were true, it would be very unlikely: Pr(Z ≤ −3.8 | H0 is true) = Φ(−3.8) = 0.000072
– Another way to think about it: there is about one chance in 14,000 (i.e., 1/0.000072) of seeing something like this or more extreme
– Pretty convincing evidence that the mod decreases mean down-time

Logic of Hypothesis Testing
It's proof by contradiction:
– Suppose I am a general/flag officer [null hypothesis]
– People would salute me on Tuesdays [what I expect if the null hypothesis is true]
– No one ever salutes me [what I see is inconsistent with what I expect]
– I must not be a general/flag officer [my initial assumption must be wrong]

Now, for the Example...
– Assume μ is 75 [null hypothesis]
– The sample mean (n=25) is normally distributed with mean 75 and standard error
σ/√n = 9/√25 = 1.8 [what you expect to see if the null is true]
– Observe a sample mean of 68.6, which is 3.8 s.e.s below the assumed mean [not very likely to happen if the null is true (p = 0.00007)]
– The real mean μ must be less than 75 [the null must be false]

Large-Sample or z-Tests
• The statistic θ̂ is (approximately) normally distributed
• The rescaled test statistic is Z = (θ̂ − θ0)/σ_θ̂
• The null hypothesis is H0: θ = θ0
• Three possible alternative hypotheses and tests:

Alternative Hypothesis    Rejection Region for Level α Test
Ha: θ > θ0                z ≥ zα (upper-tailed test)
Ha: θ < θ0                z ≤ −zα (lower-tailed test)
Ha: θ ≠ θ0                z ≥ zα/2 or z ≤ −zα/2 (two-tailed test)

Picturing z-Tests
(Figure: upper-tailed, lower-tailed, and two-tailed rejection regions)
* Figure from Probability and Statistics for Engineering and the Sciences, 7th ed., Duxbury Press, 2008.

Example 10.5
• A VP claims mean contacts/week is less than 15
– Data collected on a random sample of n=36 people
– Given ȳ = 17 and s² = 9, is there evidence to refute the claim at a significance level of α=0.05?
• Specify the hypotheses to be tested:

Example 10.5 (continued)
• Specify the test statistic and the rejection region

Example 10.5 (continued)
• Conduct the test and state the conclusion

Large-Sample Tests for a Population Proportion (p)
• If we have a large sample, then via the CLT, p̂ = y/n has an approximately normal distribution
• For the null hypothesis H0: p = p0 there are three possible alternative hypotheses:

Alternative Hypothesis    Rejection Region for Level α Test
Ha: p > p0                z ≥ zα (upper-tailed test)
Ha: p < p0                z ≤ −zα (lower-tailed test)
Ha: p ≠ p0                z ≥ zα/2 or z ≤ −zα/2 (two-tailed test)

where the test statistic is z = (p̂ − p0) / √(p0(1 − p0)/n)
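The z-test recipe can be wrapped in a small helper; a sketch assuming SciPy, applied to Example 10.5 under one standard setup (H0: μ = 15 vs. Ha: μ > 15, putting the burden of proof on refuting the VP's claim — the slides leave the hypotheses for class):

```python
# One-sample large-sample z-test, applied to Example 10.5:
# n = 36, ybar = 17, s^2 = 9, alpha = 0.05.
from scipy.stats import norm

def z_test_upper(ybar, mu0, s, n, alpha):
    """Upper-tailed z-test; returns the statistic and the reject decision."""
    z = (ybar - mu0) / (s / n**0.5)
    return z, z > norm.ppf(1 - alpha)

z, reject = z_test_upper(ybar=17, mu0=15, s=9**0.5, n=36, alpha=0.05)
print(round(z, 2), reject)  # z = 4.0, well past 1.645, so reject H0
```

With z = 4.0 far in the upper tail, the sample refutes the claim that the mean is below 15 at the 0.05 level.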
What's a "Large Sample"?
• Remember that the y in p̂ = y/n is a sum of indicator variables
– If you sum enough of them, the CLT "kicks in" and you can assume p̂ has an approximately normal distribution
• So use the same rule we used back in the CLT lecture:
n > 9 × max(p0, q0) / min(p0, q0)

Example 10.6
• A machine must be repaired if it produces more than 10% defectives
– A random sample of n=100 items has 15 defectives
– Is there evidence to support the claim that the machine needs repairing at a significance level of α=0.01?
• Specify the hypotheses to be tested:

Example 10.6 (continued)
• Specify the test statistic and the rejection region

Example 10.6 (continued)
• Conduct the test and state the conclusion

Example 10.7
• A reaction time study on men and women was conducted
– Data on independent random samples of 50 men and 50 women were collected, giving
ȳmen = 3.6, s²men = 0.18, ȳwomen = 3.8, and s²women = 0.14
– Is there evidence to suggest a difference in the true mean reaction times between men and women at the α=0.01 level?
• Specify the hypotheses to be tested:

Example 10.7 (continued)
• Specify the test statistic and the rejection region

Example 10.7 (continued)
• Conduct the test and state the conclusion

Calculating Type II Error Probabilities
• The probability of a Type II error (β) is the probability a test fails to reject the null when the alternative hypothesis is true
– Note that it depends on a particular alternative value
– We can write it mathematically as β = Pr(fail to reject H0 | Ha is true with θ = θa)
• To determine β, first figure out the rejection region (a function of the null hypothesis), then calculate the probability of falling in the acceptance region when θ = θa

Calculating Type II Error Probabilities, continued
• For example, consider the test H0: θ = θ0 versus Ha: θ > θ0
– Then the rejection region is of the form RR = {θ̂ : θ̂ ≥ k} for some value of k
• So, the probability of a Type II error is
β = Pr(θ̂ not in RR when Ha is true with θ = θa)
= Pr(θ̂ < k | θ = θa)
= Pr( (θ̂ − θa)/σ_θ̂ < (k − θa)/σ_θ̂ )

Example 10.8
• Returning to Example 10.5, find β if μa=16
– Remember that μ0=15, α=0.05, n=36, ȳ = 17, and s²=9
• Pictorially, we have:

Example 10.8 (continued)
• Now solving:

Example 10.8 (continued)
• An alternative way to solve:

Alternatively, the Power of a Test
• Remember, the power of a test is the probability the test will reject the null for a particular alternative hypothesis
– It's just 1−β
• Why is this important?
– Prior to doing a test, a natural problem is making sure you have sufficient power to prove "interesting" alternatives
– Sometimes, after a test results in a null result, you might want to know the probability of rejecting at the observed level

Sample Size Calculations
• The sample size n for which a level α test also has Type II error probability β at the alternative value μa is
n = (zα + zβ)² σ² / (μa − μ0)²   for a one-tailed test
n = (zα/2 + zβ)² σ² / (μa − μ0)²   for a two-tailed test (approximate solution)
– Here zα and zβ are the upper-α and upper-β quantiles of the standard normal distribution

Example 10.9
• Returning to Exercise 10.5, assuming σ²=9, what sample size n is required to test H0: μ = 15 vs. Ha: μ = 16 with α=β=0.05?
• Solution:

Another Example
• Consider a test of H0: μ = 30,000 vs. Ha: μ ≠ 30,000, where
– The desired significance level is α=0.01
– The population has a normal distribution with σ = 1,500
• Find the required sample size so that at μa=31,000, the probability of a Type II error is 0.1:
n = (zα/2 + zβ)² σ² / (μa − μ0)²

Relationship Between Hypothesis Tests and Confidence Intervals
• The large-sample hypothesis test (z-test) is based on the statistic
Z = (θ̂ − θ0)/σ_θ̂
where the acceptance region is
{ −zα/2 ≤ (θ̂ − θ0)/σ_θ̂ ≤ zα/2 }
which we can write as
{ θ̂ − zα/2 σ_θ̂ ≤ θ0 ≤ θ̂ + zα/2 σ_θ̂ }

Relationship Between Hypothesis Tests and Confidence Intervals, continued
• Now, remember the general form for a two-sided large-sample confidence interval:
100(1−α)% CI = ( θ̂ − zα/2 σ_θ̂ , θ̂ + zα/2 σ_θ̂ )
• Note the similarity to the acceptance region above
• So, if θ0 ∈ ( θ̂ − zα/2 σ_θ̂ , θ̂ + zα/2 σ_θ̂ ), we would fail to reject the hypothesis test
– We can interpret the 100(1−α)% CI as the set of all values of θ0 for which H0: θ = θ0 is "acceptable" at level α

What We Covered in this Module
• Introduced hypothesis testing
– Steps in conducting a hypothesis test
– Types of errors
– Lots of terminology
• Large-sample tests for means and proportions
– Aka
z-tests
– Type II error probability calculations
– Sample size calculations

Homework
• WM&S chapter 10
– Required: 2, 3, 20, 21, 27, 34, 38, 41, 42
– Extra credit: none
• Useful hints: always first write out the null and alternative hypotheses. I also recommend drawing the null distribution and then highlighting the rejection region. This can be particularly helpful when calculating β...
• Exercises 21 and 27: we didn't do these types of problems in class, but just use what you learned with two-sample confidence intervals...
– The relevant hypotheses are H0: μ1 − μ2 = 0 vs. Ha: μ1 − μ2 ≠ 0, and H0: p1 − p2 = 0 vs. Ha: p1 − p2 ≠ 0
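The sample-size formulas from the module can be checked numerically; a sketch assuming SciPy, applied to Example 10.9 and the 30,000-hour example (the rounded-up answers are my computations from the stated inputs, not values quoted on the slides):

```python
# Required sample size for a z-test with given alpha and beta:
# one-tailed n = ((z_a + z_b) * sigma / delta)^2; two-tailed uses z_{a/2}.
from math import ceil
from scipy.stats import norm

def n_one_tailed(alpha, beta, sigma, delta):
    z_a, z_b = norm.ppf(1 - alpha), norm.ppf(1 - beta)
    return ceil(((z_a + z_b) * sigma / delta) ** 2)

def n_two_tailed(alpha, beta, sigma, delta):
    z_a2, z_b = norm.ppf(1 - alpha / 2), norm.ppf(1 - beta)
    return ceil(((z_a2 + z_b) * sigma / delta) ** 2)

# Example 10.9: H0: mu = 15 vs. Ha: mu = 16, sigma = 3, alpha = beta = 0.05
print(n_one_tailed(0.05, 0.05, 3, 1))       # 98

# Two-sided example: sigma = 1500, delta = 1000, alpha = 0.01, beta = 0.1
print(n_two_tailed(0.01, 0.1, 1500, 1000))  # 34
```

Rounding up with `ceil` guarantees the achieved β is no larger than the target.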