ML and REML Variance Component Estimation
Copyright © 2012 Dan Nettleton (Iowa State University), Statistics 611

Suppose y = Xβ + ε, where ε ∼ N(0, Σ) for some positive definite, symmetric matrix Σ. Furthermore, suppose each element of Σ is a known function of an unknown vector of parameters θ ∈ Θ.

For example, consider
\[
\Sigma = \sum_{j=1}^m \sigma_j^2 Z_j Z_j' + \sigma_e^2 I,
\]
where Z_1, ..., Z_m are known matrices, θ = [θ_1, ..., θ_m, θ_{m+1}]' = [σ_1², ..., σ_m², σ_e²]', and Θ = {θ : θ_j > 0, j = 1, ..., m + 1}.

The likelihood function is
\[
L(\beta, \theta \mid y) = (2\pi)^{-n/2}\,|\Sigma|^{-1/2}\,e^{-\frac{1}{2}(y - X\beta)'\Sigma^{-1}(y - X\beta)},
\]
and the log likelihood function is
\[
\ell(\beta, \theta \mid y) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(y - X\beta)'\Sigma^{-1}(y - X\beta).
\]

Based on previous results, (y − Xβ)'Σ⁻¹(y − Xβ) is minimized over β ∈ R^p, for any fixed θ, by β̂(θ) ≡ (X'Σ⁻¹X)⁻X'Σ⁻¹y.

Thus, the profile log likelihood for θ is
\[
\ell^*(\theta \mid y) = \sup_{\beta \in \mathbb{R}^p} \ell(\beta, \theta \mid y)
= -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}\bigl(y - X\hat\beta(\theta)\bigr)'\Sigma^{-1}\bigl(y - X\hat\beta(\theta)\bigr).
\]

In the general case, numerical methods are used to find the maximizer of ℓ*(θ | y) over θ ∈ Θ. Let θ̂_MLE = arg max{ℓ*(θ | y) : θ ∈ Θ}.

Let Σ̂_MLE be the matrix Σ with θ̂_MLE in place of θ. The MLE of an estimable Cβ is then given by
\[
C\hat\beta_{\mathrm{MLE}} = C(X'\hat\Sigma_{\mathrm{MLE}}^{-1}X)^- X'\hat\Sigma_{\mathrm{MLE}}^{-1}y.
\]

We can write Σ as σ²V, where σ² > 0, V is positive definite and symmetric, σ² is a known function of θ, and each entry of V is a known function of θ. For example,
\[
\Sigma = \sum_{j=1}^m \sigma_j^2 Z_j Z_j' + \sigma_e^2 I
= \sigma_e^2\left(\sum_{j=1}^m \frac{\sigma_j^2}{\sigma_e^2}\, Z_j Z_j' + I\right).
\]

If we let σ̂²_MLE and V̂_MLE denote σ² and V with θ̂_MLE in place of θ, then Σ̂_MLE = σ̂²_MLE V̂_MLE and Σ̂⁻¹_MLE = (1/σ̂²_MLE) V̂⁻¹_MLE. It follows that
\[
C\hat\beta_{\mathrm{MLE}}
= C(X'\hat\Sigma_{\mathrm{MLE}}^{-1}X)^- X'\hat\Sigma_{\mathrm{MLE}}^{-1}y
= C\Bigl(X'\tfrac{1}{\hat\sigma^2_{\mathrm{MLE}}}\hat V_{\mathrm{MLE}}^{-1}X\Bigr)^- X'\tfrac{1}{\hat\sigma^2_{\mathrm{MLE}}}\hat V_{\mathrm{MLE}}^{-1}y
= C(X'\hat V_{\mathrm{MLE}}^{-1}X)^- X'\hat V_{\mathrm{MLE}}^{-1}y.
\]

Note that Cβ̂_MLE = C(X'V̂⁻¹_MLE X)⁻X'V̂⁻¹_MLE y is the GLS estimator under the Aitken model with V̂_MLE in place of V.
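The ML computation just described (build Σ from θ, profile out β via its GLS value, maximize ℓ*(θ | y) numerically, then plug θ̂_MLE back in) can be sketched numerically. The sketch below is illustrative only and not part of the original notes: it assumes numpy and scipy are available, the function names (sigma_matrix, profile_loglik, ml_estimate) are invented for this example, and a generic box-constrained optimizer stands in for whatever numerical method one prefers.

```python
import numpy as np
from scipy.optimize import minimize

def sigma_matrix(theta, Z_list, n):
    """Build Sigma = sum_j theta[j] * Z_j Z_j' + theta[-1] * I, with theta[j] = sigma_j^2
    for j = 1, ..., m and theta[-1] = sigma_e^2."""
    Sigma = theta[-1] * np.eye(n)
    for sig2_j, Z_j in zip(theta[:-1], Z_list):
        Sigma = Sigma + sig2_j * Z_j @ Z_j.T
    return Sigma

def profile_loglik(theta, y, X, Z_list):
    """Profile log likelihood l*(theta | y): beta is profiled out via beta_hat(theta)."""
    n = len(y)
    Sigma = sigma_matrix(theta, Z_list, n)
    Sigma_inv = np.linalg.inv(Sigma)
    # beta_hat(theta) = (X' Sigma^{-1} X)^- X' Sigma^{-1} y; pinv supplies a generalized inverse
    beta_hat = np.linalg.pinv(X.T @ Sigma_inv @ X) @ X.T @ Sigma_inv @ y
    resid = y - X @ beta_hat
    logdet = np.linalg.slogdet(Sigma)[1]
    return -0.5 * (n * np.log(2 * np.pi) + logdet + resid @ Sigma_inv @ resid)

def ml_estimate(theta0, y, X, Z_list):
    """Numerically maximize l*(theta | y) over theta with positive components; a generic
    box-constrained optimizer stands in for whatever numerical method one prefers."""
    objective = lambda th: -profile_loglik(th, y, X, Z_list)
    bounds = [(1e-8, None)] * len(theta0)
    return minimize(objective, theta0, method="L-BFGS-B", bounds=bounds).x
```

Given θ̂_MLE from ml_estimate(), the plug-in estimator of an estimable Cβ is obtained by one more GLS step with Σ̂_MLE (or, equivalently, V̂_MLE) in place of Σ.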
We have seen by example that the MLE of the variance component vector θ can be biased. For example, for the case of Σ = σ²I, where θ = [σ²], the MLE of σ² is (y − Xβ̂)'(y − Xβ̂)/n, which has expectation (n − r)σ²/n. This MLE for σ² is often criticized for "failing to account for the loss of degrees of freedom needed to estimate β":
\[
E\!\left[\frac{(y - X\hat\beta)'(y - X\hat\beta)}{n}\right] = \frac{n-r}{n}\,\sigma^2
< \sigma^2
= E\!\left[\frac{(y - X\beta)'(y - X\beta)}{n}\right].
\]

A familiar special case: suppose y_1, ..., y_n are i.i.d. N(μ, σ²). Then
\[
E\!\left[\frac{\sum_{i=1}^n (y_i - \mu)^2}{n}\right] = \sigma^2,
\quad\text{but}\quad
E\!\left[\frac{\sum_{i=1}^n (y_i - \bar y)^2}{n}\right] = \frac{n-1}{n}\,\sigma^2.
\]

REML is an approach that produces unbiased estimators for these special cases and produces less biased estimators than ML estimators in general. Depending on whom you ask, REML stands for REsidual Maximum Likelihood or REstricted Maximum Likelihood.

The REML method:
1. Find n − rank(X) = n − r linearly independent vectors a_1, ..., a_{n−r} such that a_i'X = 0' for all i = 1, ..., n − r.
2. Find the maximum likelihood estimate of θ using w_1 ≡ a_1'y, ..., w_{n−r} ≡ a_{n−r}'y as data.

Let
\[
A = [a_1, \ldots, a_{n-r}]
\quad\text{and}\quad
w = \begin{bmatrix} w_1 \\ \vdots \\ w_{n-r} \end{bmatrix}
  = \begin{bmatrix} a_1'y \\ \vdots \\ a_{n-r}'y \end{bmatrix}
  = A'y.
\]

If a'X = 0', then a'y is known as an error contrast. Thus, w_1, ..., w_{n−r} comprise a set of n − r error contrasts. Because (I − P_X)X = X − P_X X = X − X = 0, the elements of (I − P_X)y = y − P_X y = y − ŷ are each error contrasts.

Because rank(I − P_X) = n − rank(X) = n − r, there exists a set of n − r linearly independent rows of I − P_X that can be used in step 1 of the REML method to get a_1, ..., a_{n−r}. If we do use a subset of rows of I − P_X to get a_1, ..., a_{n−r}, the error contrasts w_1 = a_1'y, ..., w_{n−r} = a_{n−r}'y will be a subset of the elements of (I − P_X)y = y − ŷ, the residual vector. This is why it makes sense to call the procedure Residual Maximum Likelihood.

Note that w = A'y = A'(Xβ + ε) = A'Xβ + A'ε = 0 + A'ε = A'ε. Thus, w = A'ε ∼ N(A'0, A'ΣA) = N(0, A'ΣA), and the distribution of w depends on θ but not β.

The log likelihood function in this case is
\[
\ell_w(\theta \mid w) = -\frac{n-r}{2}\log(2\pi) - \frac{1}{2}\log|A'\Sigma A| - \frac{1}{2}\,w'(A'\Sigma A)^{-1}w.
\]
In the general case, numerical methods are used to find the maximizer of ℓ_w(θ | w); this maximizer, say θ̂, is the REML estimate of θ. Let θ̂ = arg max{ℓ_w(θ | w) : θ ∈ Θ}.

Once a REML estimate of θ (and thus Σ) has been obtained, the BLUE of an estimable Cβ that we would use if Σ were known can be approximated by
\[
C\hat\beta_{\mathrm{REML}} = C(X'\hat\Sigma^{-1}X)^- X'\hat\Sigma^{-1}y,
\]
where Σ̂ is Σ with θ̂ (the REML estimate of θ) in place of θ.
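As a concrete illustration (not part of the original notes), one convenient choice of A takes the eigenvectors of I − P_X with eigenvalue 1, which gives A'X = 0, A'A = I, and rank(A) = n − r; the REML log likelihood above can then be evaluated directly. The sketch below assumes numpy, reuses sigma_matrix() from the earlier sketch, and uses invented names (error_contrast_matrix, reml_loglik).

```python
import numpy as np

def error_contrast_matrix(X):
    """Return an n x (n - r) matrix A with A'X = 0, A'A = I, and AA' = I - P_X, built
    from the eigenvectors of I - P_X that have eigenvalue 1 (one convenient choice of A)."""
    n = X.shape[0]
    P_X = X @ np.linalg.pinv(X)                  # orthogonal projector onto C(X)
    eigval, eigvec = np.linalg.eigh(np.eye(n) - P_X)
    return eigvec[:, eigval > 0.5]               # the n - r columns spanning N(X')

def reml_loglik(theta, y, X, Z_list):
    """REML log likelihood l_w(theta | w) for w = A'y ~ N(0, A' Sigma A); it involves
    theta but not beta. Reuses sigma_matrix() from the earlier sketch."""
    n = X.shape[0]
    A = error_contrast_matrix(X)
    w = A.T @ y
    ASA = A.T @ sigma_matrix(theta, Z_list, n) @ A
    logdet = np.linalg.slogdet(ASA)[1]
    k = A.shape[1]                               # k = n - r error contrasts
    return -0.5 * (k * np.log(2 * np.pi) + logdet + w @ np.linalg.solve(ASA, w))
```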
Exercise: Suppose A and B are each n × (n − r) matrices satisfying rank(A) = rank(B) = n − r and A'X = B'X = 0. Let w = A'y and v = B'y. Prove that θ̂ maximizes ℓ_w(θ | w) over θ ∈ Θ iff θ̂ maximizes ℓ_v(θ | v) over θ ∈ Θ.

Proof: A'X = 0 ⟹ X'A = 0 ⟹ each column of A ∈ N(X'). Because dim(N(X')) = n − rank(X) = n − r, the n − r linearly independent columns of A form a basis for N(X'). Likewise, B'X = 0 ⟹ X'B = 0 ⟹ the columns of B ∈ N(X'). Therefore, there exists an (n − r) × (n − r) matrix M such that AM = B.

Note that M is nonsingular because
\[
n - r = \operatorname{rank}(B) = \operatorname{rank}(AM) \le \operatorname{rank}(M) \le n - r
\;\Longrightarrow\; \operatorname{rank}(M) = n - r.
\]
Therefore,
\[
v'(B'\Sigma B)^{-1}v
= y'B(M'A'\Sigma A M)^{-1}B'y
= y'AMM^{-1}(A'\Sigma A)^{-1}(M')^{-1}M'A'y
= y'A(A'\Sigma A)^{-1}A'y
= w'(A'\Sigma A)^{-1}w.
\]
Also,
\[
|B'\Sigma B| = |M'A'\Sigma A M| = |M'|\,|A'\Sigma A|\,|M| = |M|^2\,|A'\Sigma A|.
\]
Now note that
\[
\ell_v(\theta \mid v)
= -\frac{n-r}{2}\log(2\pi) - \frac{1}{2}\log|B'\Sigma B| - \frac{1}{2}v'(B'\Sigma B)^{-1}v
\]
\[
= -\frac{n-r}{2}\log(2\pi) - \frac{1}{2}\log\bigl(|M|^2|A'\Sigma A|\bigr) - \frac{1}{2}w'(A'\Sigma A)^{-1}w
\]
\[
= -\log|M| - \frac{n-r}{2}\log(2\pi) - \frac{1}{2}\log|A'\Sigma A| - \frac{1}{2}w'(A'\Sigma A)^{-1}w
= -\log|M| + \ell_w(\theta \mid w).
\]
Because M is free of θ, the result follows.

Exercise: Consider the special case y = Xβ + ε, ε ∼ N(0, σ²I). Prove that the REML estimator of σ² is
\[
\hat\sigma^2 = \frac{y'(I - P_X)y}{n - r}.
\]

Proof: I − P_X is symmetric. Thus, by the Spectral Decomposition Theorem,
\[
I - P_X = \sum_{j=1}^n \lambda_j q_j q_j',
\]
where λ_1, ..., λ_n are eigenvalues of I − P_X and q_1, ..., q_n are orthonormal eigenvectors. Because I − P_X is idempotent and of rank n − r, n − r of the λ_j values equal 1 and the other r equal 0. Define Q to be the matrix whose columns are the eigenvectors corresponding to the n − r nonzero eigenvalues.

Then Q is n × (n − r), QQ' = I − P_X, and Q'Q = I. Thus, rank(Q) = n − r and QQ'X = (I − P_X)X = X − P_X X = 0. Multiplying on the left by Q' gives Q'QQ'X = Q'0 ⟹ IQ'X = 0 ⟹ Q'X = 0.

Therefore, the REML estimator of σ² can be obtained by finding the maximizer of the likelihood based on the data
\[
w \equiv Q'y \sim N(Q'X\beta,\; Q'\sigma^2 I\,Q) = N(0, \sigma^2 I).
\]
We have
\[
\ell_w(\sigma^2 \mid w) = -\frac{n-r}{2}\log(2\pi) - \frac{n-r}{2}\log(\sigma^2) - \frac{w'w}{2\sigma^2},
\]
so
\[
\frac{\partial \ell_w(\sigma^2 \mid w)}{\partial \sigma^2}
= -\frac{n-r}{2\sigma^2} + \frac{w'w}{2\sigma^4},
\qquad
\frac{\partial \ell_w(\sigma^2 \mid w)}{\partial \sigma^2}\bigg|_{\sigma^2 = \hat\sigma^2} = 0
\iff \hat\sigma^2 = \frac{w'w}{n-r}.
\]
Therefore, the MLE of σ² based on w, and thus the REML estimator of σ² based on y, is
\[
\hat\sigma^2 = \frac{w'w}{n-r}
= \frac{(Q'y)'(Q'y)}{n-r}
= \frac{y'QQ'y}{n-r}
= \frac{y'(I - P_X)y}{n-r}.
\]
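A quick numerical check of this result (illustrative only, not from the original notes): with A = Q built from the eigenvectors of I − P_X via the error_contrast_matrix() sketch above, w'w/(n − r) matches the residual-based formula y'(I − P_X)y/(n − r). Assumes numpy; the simulated design and coefficients are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=1.5, size=n)

Q = error_contrast_matrix(X)                    # QQ' = I - P_X, Q'Q = I, Q'X = 0
w = Q.T @ y
r = np.linalg.matrix_rank(X)

sigma2_from_w = (w @ w) / (n - r)               # MLE of sigma^2 based on w = Q'y
P_X = X @ np.linalg.pinv(X)
sigma2_from_resid = y @ (np.eye(n) - P_X) @ y / (n - r)

print(np.isclose(sigma2_from_w, sigma2_from_resid))   # True: w'w = y'QQ'y = y'(I - P_X)y
```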
REML Theorem: Suppose y = Xβ + ε, where ε ∼ N(0, Σ), with Σ a positive definite, symmetric matrix whose entries are known functions of an unknown parameter vector θ ∈ Θ. Furthermore, suppose rank(X) = r and X̃ is an n × r matrix whose columns are any set of r linearly independent columns of X. Let A be any n × (n − r) matrix satisfying rank(A) = n − r and A'X = 0, and let w = A'y ∼ N(0, A'ΣA). Then θ̂ maximizes ℓ_w(θ | w) over θ ∈ Θ ⟺ θ̂ maximizes, over θ ∈ Θ,
\[
g(\theta) = -\frac{1}{2}\log|\Sigma| - \frac{1}{2}\log|\tilde X'\Sigma^{-1}\tilde X| - \frac{1}{2}\bigl(y - X\hat\beta(\theta)\bigr)'\Sigma^{-1}\bigl(y - X\hat\beta(\theta)\bigr),
\]
where β̂(θ) = (X'Σ⁻¹X)⁻X'Σ⁻¹y.

We have previously shown:
1. There exists an n × (n − r) matrix Q such that QQ' = I − P_X, Q'X = 0, and Q'Q = I, and
2. Any n × (n − r) matrix A of rank n − r satisfying A'X = 0 leads to the same REML estimator θ̂.
Thus, without loss of generality, we may assume the matrix A in the REML Theorem is Q.

We will now prove three lemmas and then use those lemmas to prove the REML Theorem.

Lemma R1: Let G = Σ⁻¹X̃(X̃'Σ⁻¹X̃)⁻¹ and suppose A is an n × (n − r) matrix satisfying A'A = I (the (n − r) × (n − r) identity) and AA' = I − P_X. Then |[A, G]|² = |X̃'X̃|⁻¹.

Proof of Lemma R1:
\[
|[A, G]|^2 = |[A, G]|\,|[A, G]| = |[A, G]'|\,|[A, G]| = \bigl|[A, G]'[A, G]\bigr|
= \begin{vmatrix} A'A & A'G \\ G'A & G'G \end{vmatrix}
= \begin{vmatrix} I & A'G \\ G'A & G'G \end{vmatrix}
= |I|\,|G'G - G'AA'G|
\]
(by our result on the determinant of a partitioned matrix)
\[
= |G'G - G'(I - P_X)G|
= |G'P_X G|
= |G'P_{\tilde X} G|
= \bigl|(\tilde X'\Sigma^{-1}\tilde X)^{-1}\tilde X'\Sigma^{-1}\,\tilde X(\tilde X'\tilde X)^{-1}\tilde X'\,\Sigma^{-1}\tilde X(\tilde X'\Sigma^{-1}\tilde X)^{-1}\bigr|
\]
\[
= \bigl|\bigl[(\tilde X'\Sigma^{-1}\tilde X)^{-1}(\tilde X'\Sigma^{-1}\tilde X)\bigr](\tilde X'\tilde X)^{-1}\bigl[(\tilde X'\Sigma^{-1}\tilde X)(\tilde X'\Sigma^{-1}\tilde X)^{-1}\bigr]\bigr|
= |(\tilde X'\tilde X)^{-1}| = |\tilde X'\tilde X|^{-1}.
\]

Lemma R2: |A'ΣA| = |X̃'Σ⁻¹X̃||Σ||X̃'X̃|⁻¹, where A, X̃, and Σ are as previously defined.

Proof of Lemma R2: Again define G = Σ⁻¹X̃(X̃'Σ⁻¹X̃)⁻¹. First show that (1) A'ΣG = 0, and (2) G'ΣG = (X̃'Σ⁻¹X̃)⁻¹:
\[
A'\Sigma G = A'\Sigma\Sigma^{-1}\tilde X(\tilde X'\Sigma^{-1}\tilde X)^{-1} = A'\tilde X(\tilde X'\Sigma^{-1}\tilde X)^{-1} = 0
\quad\text{because } A'X = 0 \Rightarrow A'\tilde X = 0, \tag{1}
\]
\[
G'\Sigma G = (\tilde X'\Sigma^{-1}\tilde X)^{-1}\tilde X'\Sigma^{-1}\Sigma\Sigma^{-1}\tilde X(\tilde X'\Sigma^{-1}\tilde X)^{-1}
= (\tilde X'\Sigma^{-1}\tilde X)^{-1}\tilde X'\Sigma^{-1}\tilde X(\tilde X'\Sigma^{-1}\tilde X)^{-1}
= (\tilde X'\Sigma^{-1}\tilde X)^{-1}. \tag{2}
\]
Next show that |[A, G]|²|Σ| = |A'ΣA||X̃'Σ⁻¹X̃|⁻¹:
\[
|[A, G]|^2|\Sigma| = |[A, G]'|\,|\Sigma|\,|[A, G]|
= \bigl|[A, G]'\Sigma[A, G]\bigr|
= \begin{vmatrix} A'\Sigma A & A'\Sigma G \\ G'\Sigma A & G'\Sigma G \end{vmatrix}
= \begin{vmatrix} A'\Sigma A & 0 \\ 0 & (\tilde X'\Sigma^{-1}\tilde X)^{-1} \end{vmatrix}
\quad\text{by (1) and (2)}
\]
\[
= |A'\Sigma A|\,|\tilde X'\Sigma^{-1}\tilde X|^{-1}.
\]
Now use Lemma R1 to complete the proof of Lemma R2. We have
\[
|[A, G]|^2|\Sigma| = |A'\Sigma A|\,|\tilde X'\Sigma^{-1}\tilde X|^{-1}
\;\Longrightarrow\;
|A'\Sigma A| = |\tilde X'\Sigma^{-1}\tilde X|\,|\Sigma|\,|[A, G]|^2
= |\tilde X'\Sigma^{-1}\tilde X|\,|\Sigma|\,|\tilde X'\tilde X|^{-1}
\quad\text{by Lemma R1.}
\]

Lemma R3: A(A'ΣA)⁻¹A' = Σ⁻¹ − Σ⁻¹X̃(X̃'Σ⁻¹X̃)⁻¹X̃'Σ⁻¹, where A, Σ, and X̃ are as defined previously.

Proof of Lemma R3: The matrix [A, G] is nonsingular because |[A, G]|² = |X̃'X̃|⁻¹ ≠ 0 by Lemma R1. Thus,
\[
[A, G]^{-1}\Sigma^{-1}([A, G]')^{-1}
= \bigl([A, G]'\Sigma[A, G]\bigr)^{-1}
= \begin{bmatrix} A'\Sigma A & A'\Sigma G \\ G'\Sigma A & G'\Sigma G \end{bmatrix}^{-1}
= \begin{bmatrix} A'\Sigma A & 0 \\ 0 & (\tilde X'\Sigma^{-1}\tilde X)^{-1} \end{bmatrix}^{-1}
= \begin{bmatrix} (A'\Sigma A)^{-1} & 0 \\ 0 & \tilde X'\Sigma^{-1}\tilde X \end{bmatrix}.
\]
Now multiplying on the left by [A, G] and on the right by [A, G]' yields
\[
\Sigma^{-1}
= [A, G]\begin{bmatrix} (A'\Sigma A)^{-1} & 0 \\ 0 & \tilde X'\Sigma^{-1}\tilde X \end{bmatrix}\begin{bmatrix} A' \\ G' \end{bmatrix}
= A(A'\Sigma A)^{-1}A' + G\tilde X'\Sigma^{-1}\tilde X G'. \tag{3}
\]
Now
\[
G\tilde X'\Sigma^{-1}\tilde X G'
= \Sigma^{-1}\tilde X(\tilde X'\Sigma^{-1}\tilde X)^{-1}\,\tilde X'\Sigma^{-1}\tilde X\,(\tilde X'\Sigma^{-1}\tilde X)^{-1}\tilde X'\Sigma^{-1}
= \Sigma^{-1}\tilde X(\tilde X'\Sigma^{-1}\tilde X)^{-1}\tilde X'\Sigma^{-1}. \tag{4}
\]
Combining (3) and (4) yields
\[
A(A'\Sigma A)^{-1}A' = \Sigma^{-1} - \Sigma^{-1}\tilde X(\tilde X'\Sigma^{-1}\tilde X)^{-1}\tilde X'\Sigma^{-1}.
\]
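A numerical sanity check of Lemmas R2 and R3 (illustrative only, not part of the original notes): for a random positive definite Σ and a full-column-rank X̃, with A built from the eigenvectors of I − P_X as in the earlier error_contrast_matrix() sketch (so A'A = I and AA' = I − P_X), both identities hold to machine precision. Assumes numpy; the dimensions and random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 12, 3
X_tilde = rng.normal(size=(n, r))               # full-column-rank design, so X_tilde = X here
B = rng.normal(size=(n, n))
Sigma = B @ B.T + n * np.eye(n)                 # an arbitrary PD, symmetric Sigma
Sigma_inv = np.linalg.inv(Sigma)
A = error_contrast_matrix(X_tilde)              # A'X = 0, A'A = I, AA' = I - P_X

# Lemma R2 on the log scale: log|A' Sigma A| = log|X~' Sigma^{-1} X~| + log|Sigma| - log|X~' X~|
lhs_R2 = np.linalg.slogdet(A.T @ Sigma @ A)[1]
rhs_R2 = (np.linalg.slogdet(X_tilde.T @ Sigma_inv @ X_tilde)[1]
          + np.linalg.slogdet(Sigma)[1]
          - np.linalg.slogdet(X_tilde.T @ X_tilde)[1])
print(np.isclose(lhs_R2, rhs_R2))               # True

# Lemma R3: A (A' Sigma A)^{-1} A' = Sigma^{-1} - Sigma^{-1} X~ (X~' Sigma^{-1} X~)^{-1} X~' Sigma^{-1}
lhs_R3 = A @ np.linalg.inv(A.T @ Sigma @ A) @ A.T
rhs_R3 = (Sigma_inv
          - Sigma_inv @ X_tilde @ np.linalg.inv(X_tilde.T @ Sigma_inv @ X_tilde) @ X_tilde.T @ Sigma_inv)
print(np.allclose(lhs_R3, rhs_R3))              # True
```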
Now use Lemmas R2 and R3 to prove the REML Theorem.

Proof of the REML Theorem:
\[
\ell_w(\theta \mid w)
= -\frac{n-r}{2}\log(2\pi) - \frac{1}{2}\log|A'\Sigma A| - \frac{1}{2}w'(A'\Sigma A)^{-1}w
\]
\[
= \text{constant}_1 - \frac{1}{2}\log\bigl(|\tilde X'\Sigma^{-1}\tilde X|\,|\Sigma|\,|\tilde X'\tilde X|^{-1}\bigr)
- \frac{1}{2}y'A(A'\Sigma A)^{-1}A'y
\quad\text{by Lemma R2}
\]
\[
= \text{constant}_2 - \frac{1}{2}\log|\Sigma| - \frac{1}{2}\log|\tilde X'\Sigma^{-1}\tilde X|
- \frac{1}{2}y'\bigl(\Sigma^{-1} - \Sigma^{-1}\tilde X(\tilde X'\Sigma^{-1}\tilde X)^{-1}\tilde X'\Sigma^{-1}\bigr)y
\quad\text{by Lemma R3}
\]
\[
= \text{constant}_2 - \frac{1}{2}\log|\Sigma| - \frac{1}{2}\log|\tilde X'\Sigma^{-1}\tilde X|
- \frac{1}{2}y'\Sigma^{-1/2}\bigl(I - \Sigma^{-1/2}\tilde X(\tilde X'\Sigma^{-1}\tilde X)^{-1}\tilde X'\Sigma^{-1/2}\bigr)\Sigma^{-1/2}y
\]
\[
= \text{constant}_2 - \frac{1}{2}\log|\Sigma| - \frac{1}{2}\log|\tilde X'\Sigma^{-1}\tilde X|
- \frac{1}{2}y'\Sigma^{-1/2}\bigl(I - P_{\Sigma^{-1/2}\tilde X}\bigr)\Sigma^{-1/2}y,
\]
where constant₁ and constant₂ are free of θ.

Now C(Σ^{-1/2}X̃) = C(Σ^{-1/2}X), so P_{Σ^{-1/2}X̃} = P_{Σ^{-1/2}X}. Therefore,
\[
y'\Sigma^{-1/2}\bigl(I - P_{\Sigma^{-1/2}\tilde X}\bigr)\Sigma^{-1/2}y
= y'\Sigma^{-1/2}\bigl(I - P_{\Sigma^{-1/2}X}\bigr)\Sigma^{-1/2}y
= \bigl\|\bigl(I - P_{\Sigma^{-1/2}X}\bigr)\Sigma^{-1/2}y\bigr\|^2
\]
\[
= \bigl\|\Sigma^{-1/2}y - \Sigma^{-1/2}X(X'\Sigma^{-1}X)^- X'\Sigma^{-1}y\bigr\|^2
= \bigl\|\Sigma^{-1/2}y - \Sigma^{-1/2}X\hat\beta(\theta)\bigr\|^2
= \bigl\|\Sigma^{-1/2}\bigl(y - X\hat\beta(\theta)\bigr)\bigr\|^2
\]
\[
= \bigl(y - X\hat\beta(\theta)\bigr)'\Sigma^{-1/2}\Sigma^{-1/2}\bigl(y - X\hat\beta(\theta)\bigr)
= \bigl(y - X\hat\beta(\theta)\bigr)'\Sigma^{-1}\bigl(y - X\hat\beta(\theta)\bigr).
\]

Therefore, we have
\[
\ell_w(\theta \mid w)
= \text{constant}_2 - \frac{1}{2}\log|\Sigma| - \frac{1}{2}\log|\tilde X'\Sigma^{-1}\tilde X|
- \frac{1}{2}\bigl(y - X\hat\beta(\theta)\bigr)'\Sigma^{-1}\bigl(y - X\hat\beta(\theta)\bigr)
= \text{constant}_2 + g(\theta).
\]
Therefore, θ̂ maximizes ℓ_w(θ | w) over θ ∈ Θ ⟺ θ̂ maximizes g(θ) over θ ∈ Θ.
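A small numerical illustration of this conclusion (not part of the original notes): because ℓ_w(θ | w) = constant₂ + g(θ), the two functions should differ by the same amount at every θ. The sketch below checks this for the simplest model Σ = σ²I over a few values of σ², reusing the reml_loglik() and error_contrast_matrix() helpers (and, through them, sigma_matrix()) from the earlier sketches. Assumes numpy; the simulated data and grid of σ² values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)

def g(theta):
    """g(theta) from the REML Theorem for Sigma = sigma^2 I, taking X_tilde = X
    (X has full column rank here)."""
    sigma2 = theta[0]
    Sigma_inv = np.eye(n) / sigma2
    beta_hat = np.linalg.pinv(X.T @ Sigma_inv @ X) @ X.T @ Sigma_inv @ y
    resid = y - X @ beta_hat
    return -0.5 * (n * np.log(sigma2)                        # log|Sigma| = n log(sigma^2)
                   + np.linalg.slogdet(X.T @ Sigma_inv @ X)[1]
                   + resid @ Sigma_inv @ resid)

# l_w(theta | w) and g(theta) should differ by the same constant at every theta.
diffs = [reml_loglik([s2], y, X, []) - g([s2]) for s2 in (0.5, 1.0, 2.0, 4.0)]
print(np.allclose(diffs, diffs[0]))             # True: the difference does not involve theta
```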