Theory of Hypothesis Testing

Inference is divided into two broad categories:
• Estimation
• Testing

Chapter 7 was devoted to point estimation. Interval estimation will be discussed in Chapter 9. Testing is the subject of Chapter 8.

Definition 14  A hypothesis is a statement about population parameters.

There are usually two hypotheses, called the null and alternative hypotheses. The null hypothesis is denoted H0, and the alternative is denoted H1. Usually, H1 is the negation, or complement, of H0. Often the two hypotheses have the form

H0 : θ ∈ Θ0
H1 : θ ∈ Θ0ᶜ,

where Θ0 ⊂ Θ and Θ0ᶜ is the complement of Θ0.

A test of the hypotheses is carried out as follows. We are to observe a value of a random vector X having distribution f(·|θ). Before observing the data, we formulate a decision rule of the form: reject H0 if the observed value x of X is in some set R; don't reject H0 if x ∈ Rᶜ.

R is called the rejection region or critical region. We could also phrase the decision rule in terms of a sufficient statistic. Any statistic whose values are used to define the rejection region is called a test statistic.

Suppose that Θ0 contains only a single element. Then H0 : θ ∈ Θ0 is called a simple hypothesis. Otherwise it is called a composite hypothesis.

We have two hypotheses, H0 and H1. After observing the data we either take action a0 or a1:

a0 = "accept, or fail to reject, H0"
a1 = "accept H1"

Consequences of actions

True state of nature    Action a0        Action a1
θ ∈ Θ0                  Correct          Type I Error
θ ∈ Θ0ᶜ                 Type II Error    Correct

Example 29  Suppose X1, . . . , Xn is a random sample from N(θ, 1), so that Θ = (−∞, ∞). We're interested in testing the hypotheses

H0 : θ ≤ 10        Θ0 = (−∞, 10]
H1 : θ > 10.       Θ0ᶜ = (10, ∞)

For example:
• Xi might be a measure of product quality when a new process is used.
• The average quality measure using the old process is 10.
• H0 says that the new process is no better than the old.
• H1 says the new process is better than the old.

A sufficient statistic in this model is

X̄ = (1/n)(X1 + · · · + Xn).

Also, we know that X̄ is both the MLE and the UMVUE of θ.

A sensible test would have the following form: take action a1 if x̄ ≥ cn, and take action a0 if x̄ < cn, where x̄ is the observed value of X̄ and cn is some constant larger than 10.

Type I error: conclude the new process is better when it isn't.
Type II error: conclude the new process is no better than the old when in fact it is better.

A general approach to testing

When θ is the true parameter value, let

β(θ) = Pθ(rejecting H0) = Pθ(X ∈ R),

where R is the rejection region of the test. We have

β(θ) = Pθ(Type I error)          if θ ∈ Θ0,
β(θ) = 1 − Pθ(Type II error)     if θ ∈ Θ0ᶜ.

Note that for θ ∈ Θ0ᶜ, β(θ) = Pθ(making the correct decision). The function β restricted to Θ0ᶜ is called the power function of the test, and β(θ) is called the power at θ.

Define

α = sup{β(θ) : θ ∈ Θ0}.

α is called the size of the test. Casella and Berger say that the test is of level η if α ≤ η.

Notation  A test can be characterized by a test function φ:

φ(x) = 1 if x ∈ R,   φ(x) = 0 if x ∈ Rᶜ.

Note that β(θ) = Eθ[φ(X)].

Example 29 (continued)  Suppose we use the test function

φ(x) = 1 if x̄ ≥ 10 + 1.645/√n,   φ(x) = 0 otherwise.

Then

β(θ) = Pθ(X̄ ≥ 10 + 1.645/√n)
     = Pθ( (X̄ − θ)/(1/√n) ≥ √n(10 − θ) + 1.645 )
     = P(Z ≥ √n(10 − θ) + 1.645),

where Z ∼ N(0, 1).

Remarks
• β(10) = 0.05 for each n.
• β(θ) increases monotonically, from 0 (as θ → −∞) to 1 (as θ → ∞).
• So, the size of the test is 0.05, no matter the value of n.
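To see these numbers concretely, here is a minimal Python sketch (using numpy and scipy.stats.norm; the helper name power is just an illustrative choice) that evaluates β(θ) = P(Z ≥ √n(10 − θ) + 1.645) for the sample sizes shown in the power-curve figure that follows.

# Minimal sketch of the power function in Example 29:
# beta(theta) = P(Z >= sqrt(n) * (10 - theta) + 1.645), with Z ~ N(0, 1).
import numpy as np
from scipy.stats import norm

def power(theta, n):
    """Power of the test that rejects H0 when xbar >= 10 + 1.645 / sqrt(n)."""
    return norm.sf(np.sqrt(n) * (10.0 - theta) + 1.645)

for n in (5, 10, 20, 40):
    # beta(10) is about 0.05 for every n (the size of the test),
    # while the power at theta = 11 increases with n.
    print(n, round(power(10.0, n), 4), round(power(11.0, n), 4))

These values reproduce the curves plotted below: every curve passes through (10, 0.05), and larger n gives a steeper rise toward 1.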
[Figure: Power curves for Example 29, plotting β(θ) against θ for θ between 8 and 12. The numbers beside the curves indicate the sample size, n = 5, 10, 20, 40.]

Goal in constructing a test

Make α as small as possible while making the power as large as possible. Clearly we can construct a test with α = 0 by taking R = empty set! Also, we can make the power equal to 1 for all θ ∈ Θ0ᶜ by using a test with R = sample space. However, neither of these tests attains the goal of making α small and power large.

Our main approach will be to consider tests with level of significance set at some desired value (such as 0.05), and to try to select from these tests one that maximizes power.

By taking α to be fairly small, this approach implicitly says that a Type I error is more serious than a Type II error. Choosing α to be small makes a Type I error unlikely, but may mean that our test has low power, and hence a high probability of a Type II error.

Example 29 (continued)  The way we set up the hypotheses, a Type I error means concluding that the new process is better when it really isn't. Taking α small may reflect our reluctance to change to the new process unless there's very convincing evidence that it's really better.

Type II error: missing out on a better process.
Type I error: making a (perhaps costly) switch to a new process that is no better than the old one.

If the latter error is more serious, it would be advisable to choose α small and to live with the resulting Type II error probability.

Testing a simple null against a simple alternative

Definition 15  A test φ is said to be a most powerful test of size α for testing

H0 : θ = θ0
H1 : θ = θ1

if Eθ0[φ(X)] = α and, for any other test φ∗ of size α, Eθ1[φ(X)] ≥ Eθ1[φ∗(X)].

We observe a value of X, whose distribution is either f(x|θ0) or f(x|θ1). Suppose we observe X to be x. Consider the likelihood ratio

L(x|θ0, θ1) = f(x|θ1) / f(x|θ0).

It would be sensible to reject H0 in favor of H1 only when L(x|θ0, θ1) is sufficiently large.

Neyman-Pearson Lemma  Let X be a random vector with pdf or pmf f(x|θ), where θ ∈ Θ = {θ0, θ1}. Define the hypotheses

H0 : θ = θ0
H1 : θ = θ1.

a) Any test φ of the form

φ(x) = 1 if f(x|θ1) > k f(x|θ0),   φ(x) = 0 if f(x|θ1) < k f(x|θ0),

with Eθ0[φ(X)] = α, is a most powerful level α test of H0 vs. H1.

b) If there exists a test of the form in a) with k > 0, then any other most powerful level α test, call it φ∗, has size α and is such that φ(x) = φ∗(x) for almost all x in {x : f(x|θ1) ≠ k f(x|θ0)}.
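As a concrete check of part a), consider the normal model of Example 29 with the simple hypotheses θ0 = 10 and θ1 = 11 (the particular values θ1 = 11, n = 20, and α = 0.05 below are chosen only for illustration). Because f(x|θ1)/f(x|θ0) = exp{n(θ1 − θ0)(x̄ − (θ0 + θ1)/2)} is increasing in x̄, rejecting when the likelihood ratio exceeds k is the same as rejecting when x̄ exceeds some cutoff c. The minimal Python sketch below finds c so that the size is α, recovers the corresponding k, and evaluates the power at θ1.

# Sketch: Neyman-Pearson most powerful test for N(theta, 1) data,
# H0: theta = 10 vs. H1: theta = 11, at level alpha = 0.05.
import numpy as np
from scipy.stats import norm

theta0, theta1, n, alpha = 10.0, 11.0, 20, 0.05

# Rejecting when the likelihood ratio exceeds k is equivalent to rejecting
# when xbar exceeds c; choose c so that P_{theta0}(Xbar >= c) = alpha.
c = theta0 + norm.isf(alpha) / np.sqrt(n)

# The corresponding k in the lemma, from
# f(x|theta1)/f(x|theta0) = exp{ n (theta1 - theta0) (xbar - (theta0 + theta1)/2) }.
k = np.exp(n * (theta1 - theta0) * (c - (theta0 + theta1) / 2.0))

# Power of the most powerful test at theta1.
power = norm.sf(np.sqrt(n) * (c - theta1))

print(f"cutoff c = {c:.4f}, k = {k:.4g}, power at theta1 = {power:.4f}")

Note that c agrees with the cutoff 10 + 1.645/√n of the test used earlier in Example 29 (with 1.645 as the rounded normal quantile), so that test is most powerful for this pair of simple hypotheses.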