ML and REML Variance Component Estimation
Copyright 2012 Dan Nettleton (Iowa State University), Statistics 611

Suppose $y = X\beta + \varepsilon$, where $\varepsilon \sim N(0, \Sigma)$ for some positive definite, symmetric matrix $\Sigma$. Furthermore, suppose each element of $\Sigma$ is a known function of an unknown vector of parameters $\theta \in \Theta$.

For example, consider
$$\Sigma = \sum_{j=1}^m \sigma_j^2 Z_j Z_j' + \sigma_e^2 I,$$
where $Z_1, \ldots, Z_m$ are known matrices,
$$\theta = [\theta_1, \ldots, \theta_m, \theta_{m+1}]' = [\sigma_1^2, \ldots, \sigma_m^2, \sigma_e^2]',$$
and $\Theta = \{\theta : \theta_j > 0,\ j = 1, \ldots, m+1\}$.

The likelihood function is
$$L(\beta, \theta \mid y) = (2\pi)^{-n/2} |\Sigma|^{-1/2} e^{-\frac{1}{2}(y - X\beta)'\Sigma^{-1}(y - X\beta)},$$
and the log-likelihood function is
$$\ell(\beta, \theta \mid y) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(y - X\beta)'\Sigma^{-1}(y - X\beta).$$

Based on previous results, $(y - X\beta)'\Sigma^{-1}(y - X\beta)$ is minimized over $\beta \in \mathbb{R}^p$, for any fixed $\theta$, by
$$\hat\beta(\theta) \equiv (X'\Sigma^{-1}X)^- X'\Sigma^{-1} y.$$

Thus, the profile log-likelihood for $\theta$ is
$$\ell^*(\theta \mid y) = \sup_{\beta \in \mathbb{R}^p} \ell(\beta, \theta \mid y) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(y - X\hat\beta(\theta))'\Sigma^{-1}(y - X\hat\beta(\theta)).$$

In the general case, numerical methods are used to find the maximizer of $\ell^*(\theta \mid y)$ over $\theta \in \Theta$. Let
$$\hat\theta_{MLE} = \arg\max\{\ell^*(\theta \mid y) : \theta \in \Theta\}.$$

Let $\hat\Sigma_{MLE}$ be the matrix $\Sigma$ with $\hat\theta_{MLE}$ in place of $\theta$. The MLE of an estimable $C\beta$ is then given by
$$C\hat\beta_{MLE} = C(X'\hat\Sigma_{MLE}^{-1}X)^- X'\hat\Sigma_{MLE}^{-1} y.$$

We can write $\Sigma$ as $\sigma^2 V$, where $\sigma^2 > 0$, $V$ is PD and symmetric, $\sigma^2$ is a known function of $\theta$, and each entry of $V$ is a known function of $\theta$; e.g.,
$$\Sigma = \sum_{j=1}^m \sigma_j^2 Z_j Z_j' + \sigma_e^2 I = \sigma_e^2\left(\sum_{j=1}^m \frac{\sigma_j^2}{\sigma_e^2} Z_j Z_j' + I\right).$$

If we let $\hat\sigma^2_{MLE}$ and $\hat V_{MLE}$ denote $\sigma^2$ and $V$ with $\hat\theta_{MLE}$ in place of $\theta$, then $\hat\Sigma_{MLE} = \hat\sigma^2_{MLE}\hat V_{MLE}$ and $\hat\Sigma_{MLE}^{-1} = \frac{1}{\hat\sigma^2_{MLE}}\hat V_{MLE}^{-1}$. It follows that
$$C\hat\beta_{MLE} = C(X'\hat\Sigma_{MLE}^{-1}X)^- X'\hat\Sigma_{MLE}^{-1} y = C\left(X'\tfrac{1}{\hat\sigma^2_{MLE}}\hat V_{MLE}^{-1}X\right)^- X'\tfrac{1}{\hat\sigma^2_{MLE}}\hat V_{MLE}^{-1} y = C(X'\hat V_{MLE}^{-1}X)^- X'\hat V_{MLE}^{-1} y.$$

Note that $C\hat\beta_{MLE} = C(X'\hat V_{MLE}^{-1}X)^- X'\hat V_{MLE}^{-1} y$ is the GLS estimator under the Aitken model with $\hat V_{MLE}$ in place of $V$.

We have seen by example that the MLE of the variance component vector $\theta$ can be biased. For example, in the case $\Sigma = \sigma^2 I$, where $\theta = [\sigma^2]$, the MLE of $\sigma^2$ is
$$\frac{(y - X\hat\beta)'(y - X\hat\beta)}{n}, \quad \text{with expectation } \frac{n-r}{n}\sigma^2.$$

This MLE for $\sigma^2$ is often criticized for "failing to account for the loss of degrees of freedom needed to estimate $\beta$":
$$E\left[\frac{(y - X\hat\beta)'(y - X\hat\beta)}{n}\right] = \frac{n-r}{n}\sigma^2 < \sigma^2 = E\left[\frac{(y - X\beta)'(y - X\beta)}{n}\right].$$

A familiar special case: suppose $y_1, \ldots, y_n \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma^2)$. Then
$$E\left[\frac{\sum_{i=1}^n (y_i - \mu)^2}{n}\right] = \sigma^2, \quad \text{but} \quad E\left[\frac{\sum_{i=1}^n (y_i - \bar y)^2}{n}\right] = \frac{n-1}{n}\sigma^2.$$
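The bias in this special case is easy to confirm by simulation. Below is a minimal sketch (not part of the original slides) in Python with NumPy; the sample size, parameter values, and variable names are all illustrative choices.

```python
import numpy as np

# Simulation check of the ML variance estimator's bias in the
# iid N(mu, sigma^2) special case: E[ML estimator] = ((n-1)/n) * sigma^2.
rng = np.random.default_rng(0)
n, mu, sigma2, reps = 10, 5.0, 4.0, 200_000

y = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
ybar = y.mean(axis=1, keepdims=True)
ss = ((y - ybar) ** 2).sum(axis=1)   # sum of squared deviations from ybar

print(np.mean(ss / n))        # ML estimator: approx (n-1)/n * 4 = 3.6
print(np.mean(ss / (n - 1)))  # divide by n-1 instead: approx 4.0, unbiased
```

Dividing by $n - 1$ rather than $n$ is exactly the correction that REML delivers in this special case, as developed next.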
REML is an approach that produces unbiased estimators for these special cases and produces less biased estimators than ML estimators in general. Depending on whom you ask, REML stands for REsidual Maximum Likelihood or REstricted Maximum Likelihood.

The REML method:

1. Find $n - \mathrm{rank}(X) = n - r$ linearly independent vectors $a_1, \ldots, a_{n-r}$ such that $a_i'X = 0'$ for all $i = 1, \ldots, n - r$.
2. Find the maximum likelihood estimate of $\theta$ using $w_1 \equiv a_1'y, \ldots, w_{n-r} \equiv a_{n-r}'y$ as data.

Let
$$A = [a_1, \ldots, a_{n-r}] \quad \text{and} \quad w = \begin{bmatrix} w_1 \\ \vdots \\ w_{n-r} \end{bmatrix} = \begin{bmatrix} a_1'y \\ \vdots \\ a_{n-r}'y \end{bmatrix} = A'y.$$

If $a'X = 0'$, then $a'y$ is known as an error contrast. Thus, $w_1, \ldots, w_{n-r}$ comprise a set of $n - r$ error contrasts. Because $(I - P_X)X = X - P_XX = X - X = 0$, the elements of $(I - P_X)y = y - P_Xy = y - \hat y$ are each error contrasts.

Because $\mathrm{rank}(I - P_X) = n - \mathrm{rank}(X) = n - r$, there exists a set of $n - r$ linearly independent rows of $I - P_X$ that can be used in step 1 of the REML method to get $a_1, \ldots, a_{n-r}$. If we do use a subset of rows of $I - P_X$ to get $a_1, \ldots, a_{n-r}$, the error contrasts $w_1 = a_1'y, \ldots, w_{n-r} = a_{n-r}'y$ will be a subset of the elements of $(I - P_X)y = y - \hat y$, the residual vector. This is why it makes sense to call the procedure Residual Maximum Likelihood.

Note that
$$w = A'y = A'(X\beta + \varepsilon) = A'X\beta + A'\varepsilon = 0 + A'\varepsilon = A'\varepsilon.$$
Thus, $w = A'\varepsilon \sim N(A'0, A'\Sigma A) = N(0, A'\Sigma A)$, and the distribution of $w$ depends on $\theta$ but not $\beta$.

The log-likelihood function in this case is
$$\ell_w(\theta \mid w) = -\frac{n-r}{2}\log(2\pi) - \frac{1}{2}\log|A'\Sigma A| - \frac{1}{2}w'(A'\Sigma A)^{-1}w.$$
An MLE for $\theta$, say $\hat\theta$, can be found in the general case using numerical methods; this is the REML estimate of $\theta$. Let
$$\hat\theta = \arg\max\{\ell_w(\theta \mid w) : \theta \in \Theta\}.$$

Once a REML estimate of $\theta$ (and thus of $\Sigma$) has been obtained, the BLUE of an estimable $C\beta$ that we would use if $\Sigma$ were known can be approximated by
$$C\hat\beta_{REML} = C(X'\hat\Sigma^{-1}X)^- X'\hat\Sigma^{-1}y,$$
where $\hat\Sigma$ is $\Sigma$ with $\hat\theta$ (the REML estimate of $\theta$) in place of $\theta$.

Exercise: Suppose $A$ and $B$ are each $n \times (n-r)$ matrices satisfying $\mathrm{rank}(A) = \mathrm{rank}(B) = n - r$ and $A'X = B'X = 0$. Let $w = A'y$ and $v = B'y$. Prove that $\hat\theta$ maximizes $\ell_w(\theta \mid w)$ over $\theta \in \Theta$ iff $\hat\theta$ maximizes $\ell_v(\theta \mid v)$ over $\theta \in \Theta$. (A numerical illustration follows; the proof comes after it.)
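Before the proof, here is a quick numerical illustration of the claim (a sketch that is not part of the original slides): for two different error-contrast matrices $A$ and $B$, the two log-likelihoods differ by a constant free of $\theta$, so they have the same maximizer. The covariance structure $\Sigma(\theta) = \theta_1 I + \theta_2 ZZ'$ and all names are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(1)
n, p = 8, 3
X = rng.normal(size=(n, p))
Z = rng.normal(size=(n, 2))
y = X @ np.ones(p) + rng.normal(size=n)

A = null_space(X.T)                     # one basis for N(X'): A'X = 0
M = rng.normal(size=(A.shape[1],) * 2)  # random (n-r) x (n-r), nonsingular a.s.
B = A @ M                               # a second valid error-contrast matrix

def l_contrast(theta, C):
    """Log-likelihood of C'y ~ N(0, C' Sigma(theta) C), Sigma = th0*I + th1*ZZ'."""
    Sigma = theta[0] * np.eye(n) + theta[1] * Z @ Z.T
    S = C.T @ Sigma @ C
    u = C.T @ y
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (len(u) * np.log(2 * np.pi) + logdet + u @ np.linalg.solve(S, u))

for theta in [(1.0, 0.5), (2.0, 0.1), (0.3, 3.0)]:
    # Difference is -log|det M| for every theta, so the maximizers coincide.
    print(l_contrast(theta, B) - l_contrast(theta, A))
```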
Proof: $A'X = 0 \Rightarrow X'A = 0 \Rightarrow$ each column of $A \in \mathcal{N}(X')$, and $\dim(\mathcal{N}(X')) = n - \mathrm{rank}(X) = n - r$. $\therefore$ the $n - r$ linearly independent columns of $A$ form a basis for $\mathcal{N}(X')$. Likewise, $B'X = 0 \Rightarrow X'B = 0 \Rightarrow$ the columns of $B \in \mathcal{N}(X')$. $\therefore \exists$ an $(n-r) \times (n-r)$ matrix $M \ni AM = B$.

Note that $M$ is nonsingular because
$$n - r = \mathrm{rank}(B) = \mathrm{rank}(AM) \le \mathrm{rank}(M) \le n - r \implies \mathrm{rank}(M) = n - r.$$

$\therefore$
$$v'(B'\Sigma B)^{-1}v = y'B(M'A'\Sigma AM)^{-1}B'y = y'AMM^{-1}(A'\Sigma A)^{-1}(M')^{-1}M'A'y = y'A(A'\Sigma A)^{-1}A'y = w'(A'\Sigma A)^{-1}w.$$

Also,
$$|B'\Sigma B| = |M'A'\Sigma AM| = |M'|\,|A'\Sigma A|\,|M| = |M|^2|A'\Sigma A|.$$

Now note that
$$\begin{aligned} \ell_v(\theta \mid v) &= -\frac{n-r}{2}\log(2\pi) - \frac{1}{2}\log|B'\Sigma B| - \frac{1}{2}v'(B'\Sigma B)^{-1}v \\ &= -\frac{n-r}{2}\log(2\pi) - \frac{1}{2}\log(|M|^2|A'\Sigma A|) - \frac{1}{2}w'(A'\Sigma A)^{-1}w \\ &= -\frac{1}{2}\log(|M|^2) - \frac{n-r}{2}\log(2\pi) - \frac{1}{2}\log|A'\Sigma A| - \frac{1}{2}w'(A'\Sigma A)^{-1}w \\ &= -\log|M| + \ell_w(\theta \mid w). \end{aligned}$$
Because $M$ is free of $\theta$, the result follows. $\Box$

Exercise: Consider the special case $y = X\beta + \varepsilon$, $\varepsilon \sim N(0, \sigma^2 I)$. Prove that the REML estimator of $\sigma^2$ is
$$\hat\sigma^2 = \frac{y'(I - P_X)y}{n - r}.$$

REML Theorem: Suppose $y = X\beta + \varepsilon$, where $\varepsilon \sim N(0, \Sigma)$, with $\Sigma$ a PD, symmetric matrix whose entries are known functions of an unknown parameter vector $\theta \in \Theta$. Furthermore, suppose $\mathrm{rank}(X) = r$ and $\tilde X$ is an $n \times r$ matrix consisting of any set of $r$ linearly independent columns of $X$. Let $A$ be any $n \times (n-r)$ matrix satisfying $\mathrm{rank}(A) = n - r$ and $A'X = 0$, and let $w = A'y \sim N(0, A'\Sigma A)$. Then $\hat\theta$ maximizes $\ell_w(\theta \mid w)$ over $\theta \in \Theta$ $\iff$ $\hat\theta$ maximizes, over $\theta \in \Theta$,
$$g(\theta) = -\frac{1}{2}\log|\Sigma| - \frac{1}{2}\log|\tilde X'\Sigma^{-1}\tilde X| - \frac{1}{2}(y - X\hat\beta(\theta))'\Sigma^{-1}(y - X\hat\beta(\theta)),$$
where $\hat\beta(\theta) = (X'\Sigma^{-1}X)^- X'\Sigma^{-1}y$.

We have previously shown:

1. $\exists$ an $n \times (n-r)$ matrix $Q \ni QQ' = I - P_X$, $Q'X = 0$, and $Q'Q = I_{n-r}$; and
2. any $n \times (n-r)$ matrix $A$ of rank $n - r$ satisfying $A'X = 0$ leads to the same REML estimator $\hat\theta$.

Thus, without loss of generality, we may assume the matrix $A$ in the REML Theorem is $Q$.

We will now prove three lemmas and then use those lemmas to prove the REML Theorem.

Lemma R1: Let $G = \Sigma^{-1}\tilde X(\tilde X'\Sigma^{-1}\tilde X)^{-1}$ and suppose $A$ is an $n \times (n-r)$ matrix satisfying $A'A = I_{n-r}$ and $AA' = I - P_X$. Then $|[A, G]|^2 = |\tilde X'\tilde X|^{-1}$.

Proof of Lemma R1:
$$\begin{aligned} |[A, G]|^2 &= |[A, G]|\,|[A, G]| = |[A, G]'|\,|[A, G]| = \begin{vmatrix} A'A & A'G \\ G'A & G'G \end{vmatrix} \\ &= |I|\,|G'G - G'AA'G| \quad \text{(by our result on the determinant of a partitioned matrix)} \\ &= |G'G - G'(I - P_X)G| = |G'P_XG| = |G'P_{\tilde X}G| \quad \text{(since } \mathcal{C}(\tilde X) = \mathcal{C}(X)\text{)} \\ &= |(\tilde X'\Sigma^{-1}\tilde X)^{-1}\tilde X'\Sigma^{-1}\tilde X(\tilde X'\tilde X)^{-1}\tilde X'\Sigma^{-1}\tilde X(\tilde X'\Sigma^{-1}\tilde X)^{-1}| \\ &= |[(\tilde X'\Sigma^{-1}\tilde X)^{-1}(\tilde X'\Sigma^{-1}\tilde X)](\tilde X'\tilde X)^{-1}[(\tilde X'\Sigma^{-1}\tilde X)(\tilde X'\Sigma^{-1}\tilde X)^{-1}]| \\ &= |(\tilde X'\tilde X)^{-1}| = |\tilde X'\tilde X|^{-1}. \ \Box \end{aligned}$$

Lemma R2: $|A'\Sigma A| = |\tilde X'\Sigma^{-1}\tilde X|\,|\Sigma|\,|\tilde X'\tilde X|^{-1}$, where $A$, $\tilde X$, and $\Sigma$ are as previously defined.

Proof of Lemma R2: Again define $G = \Sigma^{-1}\tilde X(\tilde X'\Sigma^{-1}\tilde X)^{-1}$.

First show that (1) $A'\Sigma G = 0$, and (2) $G'\Sigma G = (\tilde X'\Sigma^{-1}\tilde X)^{-1}$.

Next show that
$$|[A, G]|^2|\Sigma| = |A'\Sigma A|\,|\tilde X'\Sigma^{-1}\tilde X|^{-1}.$$

Now use Lemma R1 to complete the proof of Lemma R2.
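Lemmas R1 and R2 are easy to sanity-check numerically before continuing. The sketch below (again not part of the original slides, with illustrative dimensions and names) builds an orthonormal $A$ with $A'A = I$ and $AA' = I - P_X$ and verifies both determinant identities.

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(2)
n, r = 7, 3
Xt = rng.normal(size=(n, r))           # X~ : r linearly independent columns
L = rng.normal(size=(n, n))
Sigma = L @ L.T + n * np.eye(n)        # an arbitrary PD, symmetric Sigma

A = null_space(Xt.T)                   # orthonormal: A'A = I, AA' = I - P_X
Si = np.linalg.inv(Sigma)
G = Si @ Xt @ np.linalg.inv(Xt.T @ Si @ Xt)

# Lemma R1: |[A, G]|^2 = |X~'X~|^{-1}
lhs1 = np.linalg.det(np.hstack([A, G])) ** 2
print(np.isclose(lhs1, 1 / np.linalg.det(Xt.T @ Xt)))   # True

# Lemma R2: |A' Sigma A| = |X~' Sigma^{-1} X~| |Sigma| |X~'X~|^{-1}
lhs2 = np.linalg.det(A.T @ Sigma @ A)
rhs2 = (np.linalg.det(Xt.T @ Si @ Xt) * np.linalg.det(Sigma)
        / np.linalg.det(Xt.T @ Xt))
print(np.isclose(lhs2, rhs2))                            # True
```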
Lemma R3:
$$A(A'\Sigma A)^{-1}A' = \Sigma^{-1} - \Sigma^{-1}\tilde X(\tilde X'\Sigma^{-1}\tilde X)^{-1}\tilde X'\Sigma^{-1},$$
where $A$, $\Sigma$, and $\tilde X$ are as defined previously.

Proof of Lemma R3: Using $A'\Sigma G = 0$ and $G'\Sigma G = (\tilde X'\Sigma^{-1}\tilde X)^{-1}$ from the proof of Lemma R2,
$$[A, G]^{-1}\Sigma^{-1}([A, G]')^{-1} = ([A, G]'\Sigma[A, G])^{-1} = \begin{bmatrix} A'\Sigma A & A'\Sigma G \\ G'\Sigma A & G'\Sigma G \end{bmatrix}^{-1} = \begin{bmatrix} A'\Sigma A & 0 \\ 0 & (\tilde X'\Sigma^{-1}\tilde X)^{-1} \end{bmatrix}^{-1} = \begin{bmatrix} (A'\Sigma A)^{-1} & 0 \\ 0 & \tilde X'\Sigma^{-1}\tilde X \end{bmatrix}.$$

Now multiplying on the left by $[A, G]$ and on the right by $[A, G]'$ yields
$$\Sigma^{-1} = [A, G]\begin{bmatrix} (A'\Sigma A)^{-1} & 0 \\ 0 & \tilde X'\Sigma^{-1}\tilde X \end{bmatrix}\begin{bmatrix} A' \\ G' \end{bmatrix} = A(A'\Sigma A)^{-1}A' + G\tilde X'\Sigma^{-1}\tilde XG'. \quad (3)$$

Now
$$G\tilde X'\Sigma^{-1}\tilde XG' = \Sigma^{-1}\tilde X(\tilde X'\Sigma^{-1}\tilde X)^{-1}\tilde X'\Sigma^{-1}\tilde X(\tilde X'\Sigma^{-1}\tilde X)^{-1}\tilde X'\Sigma^{-1} = \Sigma^{-1}\tilde X(\tilde X'\Sigma^{-1}\tilde X)^{-1}\tilde X'\Sigma^{-1}. \quad (4)$$

Combining (3) and (4) yields
$$A(A'\Sigma A)^{-1}A' = \Sigma^{-1} - \Sigma^{-1}\tilde X(\tilde X'\Sigma^{-1}\tilde X)^{-1}\tilde X'\Sigma^{-1}. \ \Box$$

Now we use Lemmas R2 and R3 to prove the REML Theorem:
$$\begin{aligned} \ell_w(\theta \mid w) &= -\frac{n-r}{2}\log(2\pi) - \frac{1}{2}\log|A'\Sigma A| - \frac{1}{2}w'(A'\Sigma A)^{-1}w \\ &= \text{constant}_1 - \frac{1}{2}\log(|\tilde X'\Sigma^{-1}\tilde X|\,|\Sigma|\,|\tilde X'\tilde X|^{-1}) - \frac{1}{2}y'A(A'\Sigma A)^{-1}A'y \quad \text{(Lemma R2)} \\ &= \text{constant}_2 - \frac{1}{2}\log|\Sigma| - \frac{1}{2}\log|\tilde X'\Sigma^{-1}\tilde X| - \frac{1}{2}y'(\Sigma^{-1} - \Sigma^{-1}\tilde X(\tilde X'\Sigma^{-1}\tilde X)^{-1}\tilde X'\Sigma^{-1})y \quad \text{(Lemma R3)} \\ &= \text{constant}_2 - \frac{1}{2}\log|\Sigma| - \frac{1}{2}\log|\tilde X'\Sigma^{-1}\tilde X| - \frac{1}{2}y'\Sigma^{-1/2}(I - \Sigma^{-1/2}\tilde X(\tilde X'\Sigma^{-1}\tilde X)^{-1}\tilde X'\Sigma^{-1/2})\Sigma^{-1/2}y \\ &= \text{constant}_2 - \frac{1}{2}\log|\Sigma| - \frac{1}{2}\log|\tilde X'\Sigma^{-1}\tilde X| - \frac{1}{2}y'\Sigma^{-1/2}(I - P_{\Sigma^{-1/2}\tilde X})\Sigma^{-1/2}y. \end{aligned}$$

Now $\mathcal{C}(\Sigma^{-1/2}\tilde X) = \mathcal{C}(\Sigma^{-1/2}X)$, so $P_{\Sigma^{-1/2}\tilde X} = P_{\Sigma^{-1/2}X}$. Therefore
$$\begin{aligned} y'\Sigma^{-1/2}(I - P_{\Sigma^{-1/2}\tilde X})\Sigma^{-1/2}y &= y'\Sigma^{-1/2}(I - P_{\Sigma^{-1/2}X})\Sigma^{-1/2}y = \|(I - P_{\Sigma^{-1/2}X})\Sigma^{-1/2}y\|^2 \\ &= \|\Sigma^{-1/2}y - \Sigma^{-1/2}X(X'\Sigma^{-1}X)^-X'\Sigma^{-1}y\|^2 = \|\Sigma^{-1/2}y - \Sigma^{-1/2}X\hat\beta(\theta)\|^2 \\ &= \|\Sigma^{-1/2}(y - X\hat\beta(\theta))\|^2 = (y - X\hat\beta(\theta))'\Sigma^{-1/2}\Sigma^{-1/2}(y - X\hat\beta(\theta)) \\ &= (y - X\hat\beta(\theta))'\Sigma^{-1}(y - X\hat\beta(\theta)). \end{aligned}$$

$\therefore$ we have
$$\ell_w(\theta \mid w) = \text{constant}_2 - \frac{1}{2}\log|\Sigma| - \frac{1}{2}\log|\tilde X'\Sigma^{-1}\tilde X| - \frac{1}{2}(y - X\hat\beta(\theta))'\Sigma^{-1}(y - X\hat\beta(\theta)) = \text{constant}_2 + g(\theta).$$
$\therefore \hat\theta$ maximizes $\ell_w(\theta \mid w)$ over $\theta \in \Theta$ $\iff$ $\hat\theta$ maximizes $g(\theta)$ over $\theta \in \Theta$. $\Box$
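Putting everything together, the theorem says REML estimates can be computed either from an explicit set of error contrasts or, equivalently, by maximizing $g(\theta)$, which never requires constructing $A$. The following minimal sketch (not part of the original slides) checks this equivalence for a hypothetical one-random-effect model $\Sigma(\theta) = \theta_1 ZZ' + \theta_2 I$; function and variable names are illustrative.

```python
import numpy as np
from scipy.linalg import null_space
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, q = 30, 5
X = np.column_stack([np.ones(n), rng.normal(size=n)])        # full column rank
Z = (rng.integers(0, q, size=n)[:, None] == np.arange(q)).astype(float)
Sig_true = 2.0 * Z @ Z.T + 1.0 * np.eye(n)
y = X @ np.array([1.0, -0.5]) + np.linalg.cholesky(Sig_true) @ rng.normal(size=n)

A = null_space(X.T)          # error contrasts; X~ = X since X has full column rank

def Sigma(th):
    return th[0] * Z @ Z.T + th[1] * np.eye(n)

def neg_lw(th):
    """-l_w(theta | w) up to an additive constant, with w = A'y."""
    S = A.T @ Sigma(th) @ A
    w = A.T @ y
    _, ld = np.linalg.slogdet(S)
    return 0.5 * (ld + w @ np.linalg.solve(S, w))

def neg_g(th):
    """-g(theta) from the REML Theorem; no error contrasts needed."""
    S = Sigma(th)
    Si = np.linalg.inv(S)
    XSiX = X.T @ Si @ X
    resid = y - X @ np.linalg.solve(XSiX, X.T @ Si @ y)      # y - X beta_hat(theta)
    _, ld1 = np.linalg.slogdet(S)
    _, ld2 = np.linalg.slogdet(XSiX)
    return 0.5 * (ld1 + ld2 + resid @ Si @ resid)

th0, bnds = np.array([1.0, 1.0]), [(1e-6, None)] * 2
print(minimize(neg_lw, th0, bounds=bnds).x)   # REML via error contrasts
print(minimize(neg_g, th0, bounds=bnds).x)    # same estimate, via g(theta)
```

Because the two objectives differ only by a constant free of $\theta$, the two printed estimates agree up to optimizer tolerance.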