8 The Likelihood Ratio Test

8.1 The likelihood ratio

We often want to test in situations where the adopted probability model involves several unknown parameters, so we may denote an element of the parameter space by $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$. Some of these parameters may be nuisance parameters (e.g. testing hypotheses about the unknown mean of a normal distribution with unknown variance, where the variance is regarded as a nuisance parameter).

We use the likelihood ratio $\lambda(x)$, defined as
\[
\lambda(x) = \frac{\sup\{L(\theta; x) : \theta \in \Theta_0\}}{\sup\{L(\theta; x) : \theta \in \Theta\}}, \qquad x \in \mathbb{R}^n.
\]
The informal argument for this is as follows. For a realisation $x$, determine its best chance of occurrence under $H_0$ and also its best chance overall. The ratio of these two chances can never exceed unity but, if small, would constitute evidence for rejection of the null hypothesis.

A likelihood ratio test for testing $H_0 : \theta \in \Theta_0$ against $H_1 : \theta \in \Theta_1$ is a test with critical region of the form
\[
C_1 = \{x : \lambda(x) \le k\},
\]
where $k$ is a real number between 0 and 1. Clearly the test will be at significance level $\alpha$ if $k$ can be chosen to satisfy
\[
\sup\{P(\lambda(X) \le k; \theta) : \theta \in \Theta_0\} = \alpha.
\]
If $H_0$ is a simple hypothesis with $\Theta_0 = \{\theta_0\}$, we have the simpler form
\[
P(\lambda(X) \le k; \theta_0) = \alpha.
\]
To determine $k$, we must look at the c.d.f. of the random variable $\lambda(X)$, where the random sample $X$ has joint p.d.f. $f_X(x; \theta_0)$.

Example Exponential distribution

Test $H_0 : \theta = \theta_0$ against $H_1 : \theta > \theta_0$. Here $\Theta_0 = \{\theta_0\}$, $\Theta_1 = (\theta_0, \infty)$ and $\Theta = [\theta_0, \infty)$. The likelihood function is
\[
L(\theta; x) = \prod_{i=1}^n f(x_i; \theta) = \theta^n e^{-\theta\sum_i x_i}.
\]
The numerator of the likelihood ratio is
\[
L(\theta_0; x) = \theta_0^n e^{-n\theta_0\bar{x}}.
\]
We need to find the supremum as $\theta$ ranges over the interval $[\theta_0, \infty)$. Now
\[
l(\theta; x) = n\log\theta - n\theta\bar{x}
\quad\text{so that}\quad
\frac{\partial l(\theta; x)}{\partial\theta} = \frac{n}{\theta} - n\bar{x},
\]
which is zero only when $\theta = 1/\bar{x}$. Since $L(\theta; x)$ is an increasing function for $\theta < 1/\bar{x}$ and decreasing for $\theta > 1/\bar{x}$,
\[
\sup\{L(\theta; x) : \theta \in \Theta\} =
\begin{cases}
\bar{x}^{-n} e^{-n}, & \text{if } 1/\bar{x} \ge \theta_0, \\
\theta_0^n e^{-n\theta_0\bar{x}}, & \text{if } 1/\bar{x} < \theta_0.
\end{cases}
\]

[Figure: $L(\theta; x)$ plotted against $\theta$, showing the supremum over $\Theta = [\theta_0, \infty)$ in the two cases $1/\bar{x} \ge \theta_0$ and $1/\bar{x} < \theta_0$.]

Hence
\[
\lambda(x) =
\begin{cases}
\theta_0^n \bar{x}^n e^n e^{-n\theta_0\bar{x}}, & 1/\bar{x} \ge \theta_0, \\
1, & 1/\bar{x} < \theta_0.
\end{cases}
\]
Since
\[
\frac{d}{d\bar{x}}\left(\bar{x}^n e^{-n\theta_0\bar{x}}\right) = n\bar{x}^{n-1} e^{-n\theta_0\bar{x}}(1 - \theta_0\bar{x})
\]
is positive for values of $\bar{x}$ between 0 and $1/\theta_0$, where $\theta_0 > 0$, it follows that $\lambda(x)$ is a non-decreasing function of $\bar{x}$. Therefore the critical region of the likelihood ratio test is of the form
\[
C_1 = \left\{x : \sum_{i=1}^n x_i \le c\right\}.
\]
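To make the cut-off concrete, recall the standard fact that under $H_0$ the sum of $n$ independent $\mathrm{Exp}(\theta_0)$ observations has a $\mathrm{Gamma}(n, \theta_0)$ distribution, so $c$ is its $\alpha$-quantile. The following sketch is not part of the notes; the function name and the use of SciPy are illustrative assumptions.

```python
# A minimal sketch (not from the notes): under H0 the sum of n independent
# Exp(theta0) observations has a Gamma(n, rate = theta0) distribution, so
# the cut-off c is the alpha-quantile of that distribution.
from scipy import stats

def exponential_lrt_cutoff(n, theta0, alpha=0.05):
    """Return c with P(sum of X_i <= c; theta0) = alpha."""
    # SciPy parameterises the gamma by shape a and scale = 1/rate.
    return stats.gamma.ppf(alpha, a=n, scale=1.0 / theta0)

# e.g. n = 20 observations, H0: theta = 2, level 0.05:
c = exponential_lrt_cutoff(20, 2.0)  # reject H0 when sum(x) <= c
```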
Example The one-sample t-test

The null hypothesis is $H_0 : \theta = \theta_0$ for the mean of a normal distribution with unknown variance $\sigma^2$. We have
\[
\Theta = \{(\theta, \sigma^2) : \theta \in \mathbb{R},\ \sigma^2 \in \mathbb{R}^+\}
\quad\text{and}\quad
\Theta_0 = \{(\theta, \sigma^2) : \theta = \theta_0,\ \sigma^2 \in \mathbb{R}^+\}.
\]
The density is
\[
f(x; \theta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\theta)^2}{2\sigma^2}\right), \qquad x \in \mathbb{R},
\]
so the likelihood function is
\[
L(\theta, \sigma^2; x) = (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \theta)^2\right).
\]
Since
\[
l(\theta_0, \sigma^2; x) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \theta_0)^2
\]
and
\[
\frac{\partial l}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (x_i - \theta_0)^2,
\]
which is zero when
\[
\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \theta_0)^2,
\]
we conclude that
\[
\sup L(\theta_0, \sigma^2; x) = \left(\frac{2\pi}{n}\sum_{i=1}^n (x_i - \theta_0)^2\right)^{-n/2} e^{-n/2}.
\]
For the denominator, we already know from previous examples that the m.l.e. of $\theta$ is $\bar{x}$, so
\[
\sup L(\theta, \sigma^2; x) = \left(\frac{2\pi}{n}\sum_{i=1}^n (x_i - \bar{x})^2\right)^{-n/2} e^{-n/2}
\]
and
\[
\lambda(x) = \left(\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{\sum_{i=1}^n (x_i - \theta_0)^2}\right)^{n/2}.
\]
This may be written in a more convenient form. Note that
\[
\sum_{i=1}^n (x_i - \theta_0)^2 = \sum_{i=1}^n \left((x_i - \bar{x}) + (\bar{x} - \theta_0)\right)^2 = \sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \theta_0)^2,
\]
so that
\[
\lambda(x) = \left(1 + \frac{n(\bar{x} - \theta_0)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)^{-n/2}.
\]
The critical region is $C_1 = \{x : \lambda(x) \le k\}$, so it follows that $H_0$ is to be rejected when the value of
\[
\frac{|\bar{x} - \theta_0|}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}}
\]
exceeds some constant. Now we have already seen that
\[
\frac{\bar{X} - \theta}{S/\sqrt{n}} \sim t(n-1), \quad\text{where}\quad S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2.
\]
Therefore it makes sense to write the critical region in the form
\[
C_1 = \left\{x : \frac{|\bar{x} - \theta_0|}{s/\sqrt{n}} \ge c\right\},
\]
which is the standard form of the two-sided t-test for a single sample.

8.2 The likelihood ratio statistic

Since $-2\log\lambda(x)$ is a decreasing function of $\lambda(x)$, it follows that the critical region of the likelihood ratio test can also be expressed in the form
\[
C_1 = \{x : -2\log\lambda(x) \ge c\}.
\]
Writing
\[
\Lambda(x) = -2\log\lambda(x) = 2\left(l(\hat\theta; x) - l(\theta_0; x)\right),
\]
the critical region may be written as $C_1 = \{x : \Lambda(x) \ge c\}$, and $\Lambda(X)$ is called the likelihood ratio statistic.

We have been using the idea that values of $\theta$ close to $\hat\theta$ are well supported by the data, so, if $\theta_0$ is a possible value of $\theta$, it turns out that, for large samples,
\[
\Lambda(X) \xrightarrow{D} \chi^2_p, \quad\text{where } p = \dim(\theta).
\]
Let us see why.

8.2.1 The asymptotic distribution of the likelihood ratio statistic

Write
\[
l(\theta_0) = l(\hat\theta) + (\theta_0 - \hat\theta)l'(\hat\theta) + \tfrac{1}{2}(\theta_0 - \hat\theta)^2 l''(\hat\theta) + \cdots
\]
and, remembering that $l'(\hat\theta) = 0$, we have
\[
\Lambda = 2\left(l(\hat\theta) - l(\theta_0)\right) \approx -(\hat\theta - \theta_0)^2 l''(\hat\theta) = (\hat\theta - \theta_0)^2 J(\hat\theta) = (\hat\theta - \theta_0)^2 I(\theta_0)\,\frac{J(\hat\theta)}{I(\theta_0)},
\]
where $J(\theta) = -l''(\theta)$ is the observed information and $I(\theta)$ is the Fisher information. But
\[
(\hat\theta - \theta_0)\,I(\theta_0)^{1/2} \xrightarrow{D} N(0, 1)
\quad\text{and}\quad
\frac{J(\hat\theta)}{I(\theta_0)} \xrightarrow{P} 1,
\]
so $(\hat\theta - \theta_0)^2 I(\theta_0) \xrightarrow{D} \chi^2_1$ and Slutsky's theorem gives
\[
\Lambda \xrightarrow{D} \chi^2_1,
\]
provided $\theta_0$ is the true value of $\theta$.

Example Poisson distribution

Let $X = (X_1, \ldots, X_n)$ be a random sample from a Poisson distribution with parameter $\theta$, and test $H_0 : \theta = \theta_0$ against $H_1 : \theta \ne \theta_0$ at significance level 0.05. The p.m.f. is
\[
p(x; \theta) = \frac{e^{-\theta}\theta^x}{x!}, \qquad x = 0, 1, \ldots,
\]
so that
\[
l(\theta; x) = -n\theta + \sum_{i=1}^n x_i\log\theta - \sum_{i=1}^n \log x_i!
\]
and
\[
\frac{\partial l(\theta; x)}{\partial\theta} = -n + \frac{1}{\theta}\sum_{i=1}^n x_i,
\]
giving $\hat\theta = \bar{x}$. Therefore
\[
\Lambda = 2n\left(\theta_0 - \bar{x} + \bar{x}\log\frac{\bar{x}}{\theta_0}\right).
\]
The distribution of $\Lambda$ under $H_0$ is approximately $\chi^2_1$ and $\chi^2_1(0.95) = 3.84$, so the critical region of the test is
\[
C_1 = \left\{x : 2n\left(\theta_0 - \bar{x} + \bar{x}\log\frac{\bar{x}}{\theta_0}\right) \ge 3.84\right\}.
\]
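The statistic is easy to compute directly. Here is a short sketch, not part of the notes: the function name and the simulated data are illustrative, and it assumes $\bar{x} > 0$ so that the logarithm is defined.

```python
import numpy as np
from scipy import stats

def poisson_lrt(x, theta0, alpha=0.05):
    """Likelihood ratio test of H0: theta = theta0 for a Poisson sample.

    Assumes the sample mean is positive so that the log is defined.
    """
    x = np.asarray(x)
    n, xbar = x.size, x.mean()
    lam = 2 * n * (theta0 - xbar + xbar * np.log(xbar / theta0))
    # chi2_1(0.95) = 3.84; reject H0 when the statistic exceeds it
    return lam, lam >= stats.chi2.ppf(1 - alpha, df=1)

# Illustration on simulated data with true theta = 2.3:
rng = np.random.default_rng(1)
lam, reject = poisson_lrt(rng.poisson(2.3, size=50), theta0=2.0)
```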
8.3 Testing goodness-of-fit for discrete distributions

The data below were collected by the ecologist E.C. Pielou, who was interested in the pattern of healthy and diseased trees. The subject of her research was Armillaria root rot in a plantation of Douglas firs. She recorded the lengths of 109 runs of diseased trees, given below.

Run length      1  2  3  4  5  6
Number of runs 71 28  5  2  2  1

On biological grounds, Pielou proposed a geometric distribution as a probability model. Is this plausible? Let's try to answer this by first looking at the general case.

Suppose we have $k$ groups with $n_i$ in the $i$th group, where $\sum_i n_i = n$:

Group    1    2    3    4   ...   k
Number  n_1  n_2  n_3  n_4  ...  n_k

Suppose further that we have a probability model such that $\pi_i(\theta)$, $i = 1, 2, \ldots, k$, is the probability of being in the $i$th group. Clearly $\sum_i \pi_i(\theta) = 1$. The likelihood is
\[
L(\theta) = \frac{n!}{\prod_{i=1}^k n_i!}\prod_{i=1}^k \pi_i(\theta)^{n_i}
\]
and the log-likelihood is
\[
l(\theta) = \sum_{i=1}^k n_i\log\pi_i(\theta) + \log n! - \sum_{i=1}^k \log n_i!.
\]
Suppose $\hat\theta$ maximises $l(\theta)$, being the solution of $l'(\theta) = 0$.

The general alternative is to take the $\pi_i$ as unrestricted by the model and subject only to $\sum_i \pi_i = 1$. Thus we maximise
\[
l(\pi) = \sum_{i=1}^k n_i\log\pi_i + \log n! - \sum_{i=1}^k \log n_i!
\quad\text{subject to}\quad
g(\pi) = \sum_i \pi_i = 1.
\]
Using Lagrange multiplier $\gamma$ we obtain the set of $k$ equations
\[
\frac{\partial l}{\partial\pi_i} - \gamma\frac{\partial g}{\partial\pi_i} = 0, \qquad 1 \le i \le k,
\]
or
\[
\frac{n_i}{\pi_i} - \gamma = 0, \qquad 1 \le i \le k.
\]
Writing this as $n_i - \gamma\pi_i = 0$, $1 \le i \le k$, and summing over $i$, we find $\gamma = n$ and
\[
\hat\pi_i = \frac{n_i}{n}.
\]
The likelihood ratio statistic is
\[
\Lambda = 2\left(\sum_{i=1}^k n_i\log\frac{n_i}{n} - \sum_{i=1}^k n_i\log\pi_i(\hat\theta)\right) = 2\sum_{i=1}^k n_i\log\frac{n_i}{n\pi_i(\hat\theta)}.
\]

General statement of asymptotic result for the likelihood ratio statistic

Testing $H_0 : \theta \in \Theta_0 \subset \Theta$ against $H_1 : \theta \in \Theta$, the likelihood ratio statistic satisfies
\[
\Lambda = 2\left(\sup_{\theta\in\Theta} l(\theta) - \sup_{\theta\in\Theta_0} l(\theta)\right) \xrightarrow{D} \chi^2_p,
\quad\text{where}\quad
p = \dim\Theta - \dim\Theta_0.
\]
In the general case above, where
\[
\Lambda = 2\sum_{i=1}^k n_i\log\frac{n_i}{n\pi_i(\hat\theta)},
\]
the restriction $\sum_{i=1}^k \pi_i = 1$ means that $\dim\Theta = k - 1$. Clearly $\dim\Theta_0 = 1$, so $p = k - 2$ and
\[
\Lambda \xrightarrow{D} \chi^2_{k-2}.
\]

Example Pielou's data

These are

Run length      1  2  3  4  5  6
Number of runs 71 28  5  2  2  1

and Pielou proposed a geometric model with p.m.f.
\[
p(x) = (1 - \theta)^{x-1}\theta, \qquad x = 1, 2, \ldots,
\]
where $x$ is the observed run length. Thus, if $x_j$, $1 \le j \le n$, are the observed run lengths, the log-likelihood for Pielou's model is
\[
l(\theta) = \sum_{j=1}^n (x_j - 1)\log(1 - \theta) + n\log\theta
\]
and, maximising,
\[
\frac{\partial l(\theta)}{\partial\theta} = -\frac{\sum_{j=1}^n (x_j - 1)}{1 - \theta} + \frac{n}{\theta} = 0,
\]
which gives
\[
\hat\theta = \frac{1}{\bar{x}}.
\]
By the invariance property of m.l.e.'s,
\[
\pi_i(\hat\theta) = (1 - \hat\theta)^{i-1}\hat\theta.
\]
The data give $\bar{x} = 1.523$. We can therefore use the expression for $\pi_i(\hat\theta)$ to calculate
\[
\Lambda = 2\sum_{i=1}^k n_i\log\frac{n_i}{n\pi_i(\hat\theta)} = 3.547.
\]
There are six groups, so $p = 6 - 1 - 1 = 4$. The approximate distribution of $\Lambda$ is therefore $\chi^2_4$ and $P(\Lambda \ge 3.547) = 0.471$. There is no evidence against Pielou's conjecture that a geometric distribution is an appropriate model.

Example Two-way contingency table

Data are obtained by cross-classifying a fixed number of individuals according to two criteria. They are therefore displayed as $n_{ij}$ in a table with $r$ rows and $c$ columns, with row totals $n_{i.}$, column totals $n_{.j}$ and overall total $n$:

n_11  ...  n_1c | n_1.
 .          .   |  .
n_r1  ...  n_rc | n_r.
----------------+-----
n_.1  ...  n_.c |  n

The aim is to investigate the independence of the two classifications. Suppose the $k$th individual goes into cell $(X_k, Y_k)$, $k = 1, 2, \ldots, n$, and that individuals are independent. Let
\[
P((X_k, Y_k) = (i, j)) = \theta_{ij}, \qquad i = 1, 2, \ldots, r;\ j = 1, 2, \ldots, c,
\]
where $\sum_{ij}\theta_{ij} = 1$. The null hypothesis of independence of classifiers can be written
\[
H_0 : \theta_{ij} = \phi_i\rho_j.
\]
This is on Problem Sheet 4, so here are a few hints. The likelihood function is
\[
L(\theta) = \frac{n!}{\prod_{i,j} n_{ij}!}\prod_{i,j}\theta_{ij}^{n_{ij}},
\]
so the log-likelihood is
\[
l(\theta) = \sum_{i,j} n_{ij}\log\theta_{ij} + \log n! - \sum_{i,j}\log n_{ij}!.
\]
Under $H_0$, put $\theta_{ij} = \phi_i\rho_j$ and maximise with respect to $\phi_i$ and $\rho_j$ subject to $\sum_i\phi_i = 1$ and $\sum_j\rho_j = 1$. You will obtain
\[
\hat\phi_i = \frac{n_{i.}}{n}, \qquad \hat\rho_j = \frac{n_{.j}}{n}.
\]
Under $H_1$, maximise with respect to $\theta_{ij}$ subject to $\sum_{ij}\theta_{ij} = 1$. You will obtain $\hat\theta_{ij} = n_{ij}/n$ and, finally,
\[
\Lambda = 2\sum_{i=1}^r\sum_{j=1}^c n_{ij}\log\frac{n\,n_{ij}}{n_{i.}n_{.j}}.
\]

Example An historic data set - crime and drinking

These are Pearson's 1909 data on crime and drinking.

Crime     Drinker  Abstainer
Arson          50         43
Rape           88         62
Violence      155        110
Stealing      379        300
Coining        18         14
Fraud          63        144

Is crime related to drink? For these data, $\Lambda = 50.52$. Under $H_0$, $\Lambda \sim \chi^2_p$ approximately, where $p = \dim\Theta - \dim\Theta_0$. In the notation used earlier, there are apparently 6 values of $\phi_i$ to estimate, but in fact there are only 5 because $\sum_i\phi_i = 1$. Similarly there is only $2 - 1 = 1$ value of $\rho_j$. Thus $\dim\Theta_0 = 6$. Because $\sum_{ij}\theta_{ij} = 1$, $\dim\Theta = 12 - 1 = 11$, so $p = 11 - 6 = 5$. Testing against a $\chi^2$-distribution with 5 degrees of freedom, note that the 0.9999 quantile is 25.75, so we can reject at the 0.0001 level of significance. There is overwhelming evidence that crime and drink are related.

Degrees of freedom

It is clear from the above that, when testing contingency tables, the number of degrees of freedom of the resulting $\chi^2$-distribution is given, in general, by
\[
p = rc - 1 - (r - 1) - (c - 1) = rc - r - c + 1 = (r - 1)(c - 1).
\]
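The Pielou calculation is easy to reproduce numerically. The sketch below is not part of the notes; it treats the last cell as "run length $\ge 6$" so that the six cell probabilities sum to one. The notes do not state this grouping explicitly, so it is an assumption, but it does reproduce $\Lambda \approx 3.55$.

```python
import numpy as np
from scipy import stats

counts = np.array([71, 28, 5, 2, 2, 1])          # runs of length 1, ..., 6
n = counts.sum()                                 # 109
xbar = (counts * np.arange(1, 7)).sum() / n      # 1.523
theta = 1 / xbar                                 # m.l.e. of theta

# Fitted geometric cell probabilities; the last cell is treated as
# "run length >= 6" so that the six probabilities sum to one (an
# assumption about the grouping; it reproduces Lambda = 3.55).
pi = (1 - theta) ** np.arange(6) * theta
pi[-1] = (1 - theta) ** 5

lam = 2 * (counts * np.log(counts / (n * pi))).sum()   # about 3.55
pvalue = stats.chi2.sf(lam, df=6 - 1 - 1)              # about 0.47
```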
8.4 Pearson's statistic

For testing independence in contingency tables, let $O_{ij}$ be the observed number in cell $(i, j)$, $i = 1, 2, \ldots, r$; $j = 1, 2, \ldots, c$, and let $E_{ij}$ be the expected number in cell $(i, j)$. Pearson's statistic is
\[
P = \sum_{i,j}\frac{(O_{ij} - E_{ij})^2}{E_{ij}} \sim \chi^2_{(r-1)(c-1)} \quad\text{approximately}.
\]
The expected number $E_{ij}$ in cell $(i, j)$ is calculated under the null hypothesis of independence. If $n_{i.}$ is the total for the $i$th row and the overall total is $n$, then the probability of an observation being in the $i$th row is estimated by
\[
\hat{P}(i\text{th row}) = \frac{n_{i.}}{n}.
\]
Similarly
\[
\hat{P}(j\text{th column}) = \frac{n_{.j}}{n},
\]
and
\[
E_{ij} = n \times \hat{P}(i\text{th row}) \times \hat{P}(j\text{th column}) = \frac{n_{i.}n_{.j}}{n}.
\]

Example Crime and drinking

These are the data on crime and drinking with the row and column totals.

Crime     Drinker  Abstainer  Total
Arson          50         43     93
Rape           88         62    150
Violence      155        110    265
Stealing      379        300    679
Coining        18         14     32
Fraud          63        144    207
Total         753        673   1426

The $E_{ij}$ are easily calculated:
\[
E_{11} = \frac{93 \times 753}{1426} = 49.11,
\]
and so on. Pearson's statistic turns out to be $P = 49.73$, which is tested against a $\chi^2$-distribution with $(6-1)\times(2-1) = 5$ degrees of freedom, and the conclusion is, of course, the same as before.

8.4.1 Pearson's statistic and the likelihood ratio statistic

We have
\[
P = \sum_{i,j}\frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \sum_{i,j}\frac{\left(n_{ij} - \frac{n_{i.}n_{.j}}{n}\right)^2}{\frac{n_{i.}n_{.j}}{n}}.
\]
Consider the Taylor expansion of $x\log(x/a)$ about $x = a$:
\[
x\log\frac{x}{a} = (x - a) + \frac{(x - a)^2}{2a} - \frac{(x - a)^3}{6a^2} + \cdots
\]
Now put $x = n_{ij}$ and $a = \frac{n_{i.}n_{.j}}{n}$, so that
\[
\sum_{i,j} n_{ij}\log\frac{n\,n_{ij}}{n_{i.}n_{.j}} = \sum_{i,j}\left(n_{ij} - \frac{n_{i.}n_{.j}}{n}\right) + \frac{1}{2}\sum_{i,j}\frac{\left(n_{ij} - \frac{n_{i.}n_{.j}}{n}\right)^2}{\frac{n_{i.}n_{.j}}{n}} + \cdots
\]
But $\sum_{i,j} n_{ij} = n$ and $\sum_{i,j}\frac{n_{i.}n_{.j}}{n} = \frac{1}{n}\sum_i n_{i.}\sum_j n_{.j} = n$, so the first sum on the right-hand side is $n - n = 0$ and
\[
\Lambda = 2\sum_{i,j} n_{ij}\log\frac{n\,n_{ij}}{n_{i.}n_{.j}} = \sum_{i,j}\frac{(O_{ij} - E_{ij})^2}{E_{ij}} + \cdots \approx P.
\]
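A quick numerical check of this approximation on the crime and drinking table is sketched below. This is not part of the notes; the array layout (rows are crimes, columns are drinker/abstainer) is an assumption, but the two statistics match the values quoted above.

```python
import numpy as np
from scipy import stats

# Pearson's 1909 crime/drinking counts: rows are the six crimes,
# columns are (drinker, abstainer).
O = np.array([[ 50,  43],
              [ 88,  62],
              [155, 110],
              [379, 300],
              [ 18,  14],
              [ 63, 144]])
n = O.sum()
E = np.outer(O.sum(axis=1), O.sum(axis=0)) / n   # E_ij = n_i. n_.j / n

P = ((O - E) ** 2 / E).sum()                     # about 49.73
Lam = 2 * (O * np.log(O / E)).sum()              # about 50.52

df = (O.shape[0] - 1) * (O.shape[1] - 1)         # 5 degrees of freedom
p_values = stats.chi2.sf([P, Lam], df)           # both far below 0.0001
```

The closeness of $P = 49.73$ and $\Lambda = 50.52$ illustrates the Taylor-expansion argument: the two statistics agree to first order and lead to the same conclusion here.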