ML and REML Variance Component Estimation

© 2012 Dan Nettleton (Iowa State University), Statistics 611
Suppose
y = Xβ + ε,
where ε ∼ N(0, Σ) for some positive definite, symmetric matrix Σ.
Furthermore, suppose each element of Σ is a known function of an
unknown vector of parameters θ ∈ Θ.
For example, consider
Σ = ∑_{j=1}^m σ_j² Z_jZ_j′ + σ_e² I,

where Z_1, . . . , Z_m are known matrices,

θ = [θ_1, . . . , θ_m, θ_{m+1}]′ = [σ_1², . . . , σ_m², σ_e²]′,

and Θ = {θ : θ_j > 0, j = 1, . . . , m + 1}.
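To make the setup concrete, here is a minimal Python sketch (ours, not from the notes) that assembles Σ(θ) for this variance component structure; the helper name make_sigma and the one-factor toy design are illustrative assumptions.

    import numpy as np

    def make_sigma(theta, Z_list, n):
        # theta = [sigma_1^2, ..., sigma_m^2, sigma_e^2]; Z_list holds Z_1, ..., Z_m.
        *sig2, sig2_e = theta
        Sigma = sig2_e * np.eye(n)
        for s2, Z in zip(sig2, Z_list):
            Sigma += s2 * (Z @ Z.T)
        return Sigma

    # Hypothetical one-factor design: 3 groups of 2 observations each (n = 6).
    Z1 = np.kron(np.eye(3), np.ones((2, 1)))    # n x 3 incidence matrix Z_1
    Sigma = make_sigma([2.0, 1.0], [Z1], n=6)   # sigma_1^2 = 2, sigma_e^2 = 1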
The likelihood function is

L(β, θ|y) = (2π)^{−n/2} |Σ|^{−1/2} exp{−½ (y − Xβ)′Σ⁻¹(y − Xβ)},

and the log likelihood function is

l(β, θ|y) = −(n/2) log(2π) − ½ log |Σ| − ½ (y − Xβ)′Σ⁻¹(y − Xβ).
Based on previous results,

(y − Xβ)′Σ⁻¹(y − Xβ)

is minimized over β ∈ ℝᵖ, for any fixed θ, by

β̂(θ) ≡ (X′Σ⁻¹X)⁻X′Σ⁻¹y.
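In code, β̂(θ) is one line; the sketch below (ours) uses the Moore–Penrose pseudoinverse as the generalized inverse, which is one valid choice.

    import numpy as np

    def beta_hat(X, y, Sigma):
        # beta_hat(theta) = (X' Sigma^{-1} X)^- X' Sigma^{-1} y
        Si = np.linalg.inv(Sigma)               # Sigma is PD, hence invertible
        return np.linalg.pinv(X.T @ Si @ X) @ (X.T @ Si @ y)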
Thus, the profile log likelihood for θ is

l*(θ|y) = sup_{β∈ℝᵖ} l(β, θ|y)
        = −(n/2) log(2π) − ½ log |Σ| − ½ (y − Xβ̂(θ))′Σ⁻¹(y − Xβ̂(θ)).
In the general case, numerical methods are used to find the maximizer
of l*(θ|y) over θ ∈ Θ.
Let

θ̂_MLE = arg max{l*(θ|y) : θ ∈ Θ}.
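A sketch of this numerical maximization (reusing make_sigma and beta_hat from the sketches above; the log parametrization and Nelder–Mead are our choices, one convenient way to keep θ inside Θ):

    import numpy as np
    from scipy.optimize import minimize

    def profile_loglik(theta, X, y, Z_list):
        n = len(y)
        Sigma = make_sigma(theta, Z_list, n)
        resid = y - X @ beta_hat(X, y, Sigma)
        return -0.5 * (n * np.log(2 * np.pi) + np.linalg.slogdet(Sigma)[1]
                       + resid @ np.linalg.solve(Sigma, resid))

    def ml_estimate(X, y, Z_list, theta0):
        # Maximize l*(theta|y) over Theta by minimizing its negative in log(theta).
        obj = lambda log_t: -profile_loglik(np.exp(log_t), X, y, Z_list)
        fit = minimize(obj, np.log(theta0), method="Nelder-Mead")
        return np.exp(fit.x)                    # theta_hat_MLE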
Let Σ̂_MLE be the matrix Σ with θ̂_MLE in place of θ.
The MLE of an estimable Cβ is then given by

Cβ̂_MLE = C(X′Σ̂⁻¹_MLE X)⁻X′Σ̂⁻¹_MLE y.
We can write Σ as σ²V, where σ² > 0, V is PD and symmetric, σ² is a
known function of θ, and each entry of V is a known function of θ, e.g.,

Σ = ∑_{j=1}^m σ_j² Z_jZ_j′ + σ_e² I = σ_e² ( ∑_{j=1}^m (σ_j²/σ_e²) Z_jZ_j′ + I ).
If we let σ̂²_MLE and V̂_MLE denote σ² and V with θ̂_MLE in place of θ, then

Σ̂_MLE = σ̂²_MLE V̂_MLE and Σ̂⁻¹_MLE = (1/σ̂²_MLE) V̂⁻¹_MLE.

It follows that

Cβ̂_MLE = C(X′Σ̂⁻¹_MLE X)⁻X′Σ̂⁻¹_MLE y
        = C(X′ (1/σ̂²_MLE) V̂⁻¹_MLE X)⁻ X′ (1/σ̂²_MLE) V̂⁻¹_MLE y
        = C(X′V̂⁻¹_MLE X)⁻X′V̂⁻¹_MLE y.
Note that

Cβ̂_MLE = C(X′V̂⁻¹_MLE X)⁻X′V̂⁻¹_MLE y

is the GLS estimator under the Aitken model with V̂_MLE in place of V.
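A quick numerical check of this scale invariance (a sketch continuing the earlier ones; the toy data below are simulated purely for illustration): rescaling Σ̂ by any positive constant, which is exactly what replacing Σ̂ by V̂ does, leaves the estimate unchanged.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(6), rng.normal(size=6)])   # toy design matrix
    y = rng.normal(size=6)                                  # toy response
    Sigma_hat = make_sigma([2.0, 1.0], [Z1], n=6)           # from the first sketch
    b1 = beta_hat(X, y, Sigma_hat)
    b2 = beta_hat(X, y, Sigma_hat / 3.7)                    # any positive rescaling
    assert np.allclose(b1, b2)                              # same GLS estimate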
We have seen by example that the MLE of the variance component
vector θ can be biased.
For example, for the case of Σ = σ²I, where θ = [σ²], the MLE of σ² is

(y − Xβ̂)′(y − Xβ̂)/n,

with expectation ((n − r)/n) σ².
This MLE for σ² is often criticized for “failing to account for the loss of
degrees of freedom needed to estimate β”:

E[(y − Xβ̂)′(y − Xβ̂)/n] = ((n − r)/n) σ² < σ² = E[(y − Xβ)′(y − Xβ)/n].
A familiar special case: Suppose y_1, . . . , y_n are i.i.d. N(µ, σ²). Then

E[∑_{i=1}^n (y_i − µ)²/n] = σ², but E[∑_{i=1}^n (y_i − ȳ)²/n] = ((n − 1)/n) σ².
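A short simulation (illustrative numbers, ours) makes the bias visible: across replications the MLE averages to about (n − 1)σ²/n rather than σ².

    import numpy as np

    rng = np.random.default_rng(1)
    n, mu, sigma2, reps = 10, 5.0, 4.0, 200_000
    samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
    # MLE of sigma^2 in each replication: sum of (y_i - ybar)^2, divided by n
    mle = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) / n
    print(mle.mean())            # close to (n - 1)/n * sigma2 = 3.6, not 4.0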
REML is an approach that produces unbiased estimators for these
special cases and produces less biased estimators than ML
estimators in general.
Depending on whom you ask, REML stands for REsidual
Maximum Likelihood or REstricted Maximum Likelihood.
The REML method:

1. Find n − rank(X) = n − r linearly independent vectors a_1, . . . , a_{n−r}
   such that a_i′X = 0′ for all i = 1, . . . , n − r. (One convenient
   construction is sketched below.)
2. Find the maximum likelihood estimate of θ using
   w_1 ≡ a_1′y, . . . , w_{n−r} ≡ a_{n−r}′y as data.
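One convenient way to carry out step 1 (a sketch, not the only valid choice): take an orthonormal basis of the column space of I − P_X from its SVD. The resulting A satisfies A′X = 0 and rank(A) = n − r, and in addition A′A = I and AA′ = I − P_X, which will be convenient later.

    import numpy as np

    def error_contrast_matrix(X):
        n = X.shape[0]
        P = X @ np.linalg.pinv(X)            # P_X, orthogonal projector onto C(X)
        U = np.linalg.svd(np.eye(n) - P)[0]  # I - P_X is symmetric and idempotent
        r = np.linalg.matrix_rank(X)
        return U[:, : n - r]                 # columns: orthonormal basis of N(X')

    # Step 2 then maximizes l_w(theta|w) using w = error_contrast_matrix(X).T @ y.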

With

A = [a_1, . . . , a_{n−r}],

we have

w = [w_1, . . . , w_{n−r}]′ = [a_1′y, . . . , a_{n−r}′y]′ = A′y.
If a′X = 0′, then a′y is known as an error contrast.
Thus, w_1, . . . , w_{n−r} comprise a set of n − r error contrasts.
Because (I − P_X)X = X − P_X X = X − X = 0, the elements of
(I − P_X)y = y − P_X y = y − ŷ are each error contrasts.
Because rank(I − P_X) = n − rank(X) = n − r, there exists a set of
n − r linearly independent rows of I − P_X that can be used in step
1 of the REML method to get a_1, . . . , a_{n−r}.
If we do use a subset of rows of I − P_X to get a_1, . . . , a_{n−r}, the error
contrasts w_1 = a_1′y, . . . , w_{n−r} = a_{n−r}′y will be a subset of the
elements of (I − P_X)y = y − ŷ, the residual vector.
This is why it makes sense to call the procedure Residual
Maximum Likelihood.
Note that

w = A′y = A′(Xβ + ε) = A′Xβ + A′ε = 0 + A′ε = A′ε.

Thus, w = A′ε ∼ N(A′0, A′ΣA) = N(0, A′ΣA), and the distribution
of w depends on θ but not β.
The log likelihood function in this case is

l_w(θ|w) = −((n − r)/2) log(2π) − ½ log |A′ΣA| − ½ w′(A′ΣA)⁻¹w.

An MLE for θ, say θ̂, can be found in the general case using numerical
methods; this θ̂ is the REML estimate of θ.
Let

θ̂ = arg max{l_w(θ|w) : θ ∈ Θ}.
Once a REML estimate of θ (and thus Σ) has been obtained, the
BLUE of an estimable Cβ if Σ were known can be approximated by

Cβ̂_REML = C(X′Σ̂⁻¹X)⁻X′Σ̂⁻¹y,

where Σ̂ is Σ with θ̂ (the REML estimate of θ) in place of θ.
Suppose A and B are each n × (n − r) matrices satisfying

rank(A) = rank(B) = n − r and A′X = B′X = 0.

Let w = A′y and v = B′y. Prove that

θ̂ maximizes l_w(θ|w) over θ ∈ Θ iff θ̂ maximizes l_v(θ|v) over θ ∈ Θ.
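Before the proof, a numerical sanity check (a sketch reusing make_sigma, error_contrast_matrix, and the toy X, y, Z1 from earlier sketches): for two valid contrast matrices, the two log likelihoods differ by a constant in θ, so their maximizers coincide.

    import numpy as np

    def loglik_contrasts(theta, A, y, Z_list):
        # l_w(theta|w) for w = A'y under Sigma = Sigma(theta).
        S = A.T @ make_sigma(theta, Z_list, len(y)) @ A
        w = A.T @ y
        return -0.5 * (len(w) * np.log(2 * np.pi) + np.linalg.slogdet(S)[1]
                       + w @ np.linalg.solve(S, w))

    A = error_contrast_matrix(X)
    M = np.random.default_rng(2).normal(size=(A.shape[1], A.shape[1]))
    B = A @ M                                # another valid contrast matrix
    gaps = [loglik_contrasts(t, A, y, [Z1]) - loglik_contrasts(t, B, y, [Z1])
            for t in ([1.0, 1.0], [2.0, 0.5], [0.3, 4.0])]
    print(np.ptp(gaps))                      # ~0: the gap is constant in theta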
Proof:

A′X = 0 =⇒ X′A = 0 =⇒ each column of A ∈ N(X′).
dim(N(X′)) = n − rank(X) = n − r.
∴ the n − r LI columns of A form a basis for N(X′).
B′X = 0 =⇒ X′B = 0 =⇒ columns of B ∈ N(X′).
∴ ∃ an (n − r) × (n − r) matrix M ∋ AM = B.
Note that M, an (n − r) × (n − r) matrix, is nonsingular ∵

n − r = rank(B) = rank(AM) ≤ rank(M) ≤ n − r =⇒ rank(M) = n − r.

∴ v′(B′ΣB)⁻¹v = y′B(M′A′ΣAM)⁻¹B′y
              = y′AMM⁻¹(A′ΣA)⁻¹(M′)⁻¹M′A′y
              = y′A(A′ΣA)⁻¹A′y = w′(A′ΣA)⁻¹w.
Also,

|B′ΣB| = |M′A′ΣAM| = |M′||A′ΣA||M| = |M|²|A′ΣA|.
Now note that

l_v(θ|v) = −((n − r)/2) log(2π) − ½ log |B′ΣB| − ½ v′(B′ΣB)⁻¹v
         = −((n − r)/2) log(2π) − ½ log(|M|²|A′ΣA|) − ½ w′(A′ΣA)⁻¹w
         = −log |M| − ((n − r)/2) log(2π) − ½ log |A′ΣA| − ½ w′(A′ΣA)⁻¹w
         = −log |M| + l_w(θ|w).

Because M is free of θ, the result follows.
Consider the special case

y = Xβ + ε, ε ∼ N(0, σ²I).

Prove that the REML estimator of σ² is

σ̂² = y′(I − P_X)y / (n − r).
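A numerical check of this claim (a sketch reusing the toy X, y and the helpers above): maximizing l_w for Σ = σ²I recovers y′(I − P_X)y/(n − r).

    import numpy as np
    from scipy.optimize import minimize_scalar

    n, r = X.shape[0], np.linalg.matrix_rank(X)
    A = error_contrast_matrix(X)
    fit = minimize_scalar(lambda ls2: -loglik_contrasts([np.exp(ls2)], A, y, []),
                          bounds=(-5.0, 5.0), method="bounded")
    P = X @ np.linalg.pinv(X)
    print(np.exp(fit.x), y @ (np.eye(n) - P) @ y / (n - r))   # the two agree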
REML Theorem:

Suppose y = Xβ + ε, where ε ∼ N(0, Σ), Σ a PD, symmetric matrix
whose entries are known functions of an unknown parameter vector
θ ∈ Θ. Furthermore, suppose rank(X) = r and X̃ is an n × r matrix
consisting of any set of r LI columns of X.
Let A be any n × (n − r) matrix satisfying rank(A) = n − r and A′X = 0,
and let w = A′y ∼ N(0, A′ΣA).

Then θ̂ maximizes l_w(θ|w) over θ ∈ Θ ⇐⇒ θ̂ maximizes, over θ ∈ Θ,

g(θ) = −½ log |Σ| − ½ log |X̃′Σ⁻¹X̃| − ½ (y − Xβ̂(θ))′Σ⁻¹(y − Xβ̂(θ)),

where β̂(θ) = (X′Σ⁻¹X)⁻X′Σ⁻¹y.
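A numerical check of the theorem on the running toy example (a sketch; that X has full column rank, so we may take X̃ = X): g(θ) and l_w(θ|w) differ by a constant over θ.

    import numpy as np

    def g_fn(theta, X, Xt, y, Z_list):
        # g(theta) with X_tilde = Xt.
        S = make_sigma(theta, Z_list, len(y))
        Si = np.linalg.inv(S)
        resid = y - X @ beta_hat(X, y, S)
        return -0.5 * (np.linalg.slogdet(S)[1]
                       + np.linalg.slogdet(Xt.T @ Si @ Xt)[1]
                       + resid @ Si @ resid)

    A = error_contrast_matrix(X)
    gaps = [loglik_contrasts(t, A, y, [Z1]) - g_fn(t, X, X, y, [Z1])
            for t in ([1.0, 1.0], [2.0, 0.5], [0.3, 4.0])]
    print(np.ptp(gaps))                      # ~0: constant gap, same maximizer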
We have previously shown:

1. ∃ an n × (n − r) matrix Q ∋ QQ′ = I − P_X, Q′X = 0, and Q′Q = I, and
2. any n × (n − r) matrix A of rank n − r satisfying A′X = 0 leads to
   the same REML estimator θ̂.

Thus, without loss of generality, we may assume the matrix A in the
REML Theorem is Q.
We will now prove 3 lemmas and then use those lemmas to prove the
REML Theorem.
Lemma R1:

Let G = Σ⁻¹X̃(X̃′Σ⁻¹X̃)⁻¹ and suppose A is an n × (n − r) matrix
satisfying A′A = I and AA′ = I − P_X. Then

|[A, G]|² = |X̃′X̃|⁻¹.
Proof of Lemma R1:

|[A, G]|² = |[A, G]′||[A, G]| = |[A, G]′[A, G]|

          = | A′A  A′G |   | I    A′G |
            | G′A  G′G | = | G′A  G′G |

          = |I||G′G − G′AA′G|

(by our result on the determinant of a partitioned matrix)
= |G′G − G′(I − P_X)G| = |G′P_X G| = |G′P_X̃ G|
= |(X̃′Σ⁻¹X̃)⁻¹X̃′Σ⁻¹X̃(X̃′X̃)⁻¹X̃′Σ⁻¹X̃(X̃′Σ⁻¹X̃)⁻¹|
= |[(X̃′Σ⁻¹X̃)⁻¹(X̃′Σ⁻¹X̃)](X̃′X̃)⁻¹[(X̃′Σ⁻¹X̃)(X̃′Σ⁻¹X̃)⁻¹]|
= |(X̃′X̃)⁻¹| = |X̃′X̃|⁻¹.
Lemma R2:

|A′ΣA| = |X̃′Σ⁻¹X̃||Σ||X̃′X̃|⁻¹,

where A, X̃, and Σ are as previously defined.
Proof of Lemma R2:

Again define G = Σ⁻¹X̃(X̃′Σ⁻¹X̃)⁻¹. Show that

(1) A′ΣG = 0, and
(2) G′ΣG = (X̃′Σ⁻¹X̃)⁻¹.
Next show that

|[A, G]|²|Σ| = |A′ΣA||X̃′Σ⁻¹X̃|⁻¹.
Now use Lemma R1 to complete the proof of Lemma R2.
Lemma R3:

A(A′ΣA)⁻¹A′ = Σ⁻¹ − Σ⁻¹X̃(X̃′Σ⁻¹X̃)⁻¹X̃′Σ⁻¹,

where A, Σ, and X̃ are as defined previously.
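Lemmas R2 and R3 are easy to spot-check numerically (a sketch on the running example; the SVD-built A satisfies the required A′A = I and AA′ = I − P_X, and we take X̃ = X since the toy X has full column rank):

    import numpy as np

    S = make_sigma([2.0, 1.0], [Z1], n=6)
    Si = np.linalg.inv(S)
    A = error_contrast_matrix(X)
    Xt = X                                   # X_tilde = X (X full column rank)
    XtSiXt = Xt.T @ Si @ Xt
    # Lemma R2: |A' S A| = |Xt' S^{-1} Xt| |S| |Xt' Xt|^{-1}
    assert np.isclose(np.linalg.det(A.T @ S @ A),
                      np.linalg.det(XtSiXt) * np.linalg.det(S)
                      / np.linalg.det(Xt.T @ Xt))
    # Lemma R3: A (A' S A)^{-1} A' = S^{-1} - S^{-1} Xt (Xt' S^{-1} Xt)^{-1} Xt' S^{-1}
    assert np.allclose(A @ np.linalg.inv(A.T @ S @ A) @ A.T,
                       Si - Si @ Xt @ np.linalg.inv(XtSiXt) @ Xt.T @ Si)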
Proof of Lemma R3:

[A, G]⁻¹Σ⁻¹([A, G]′)⁻¹ = ([A, G]′Σ[A, G])⁻¹

  = | A′ΣA  A′ΣG |⁻¹   | A′ΣA  0    |⁻¹
    | G′ΣA  G′ΣG |   = | 0     G′ΣG |

  = | A′ΣA  0             |⁻¹   | (A′ΣA)⁻¹  0        |
    | 0     (X̃′Σ⁻¹X̃)⁻¹ |   = | 0         X̃′Σ⁻¹X̃ |.
Now multiplying on the left by [A, G] and on the right by [A, G]′ yields

Σ⁻¹ = [A, G] | (A′ΣA)⁻¹  0        | | A′ |
             | 0         X̃′Σ⁻¹X̃ | | G′ |

    = A(A′ΣA)⁻¹A′ + GX̃′Σ⁻¹X̃G′.                                  (3)
Now

GX̃′Σ⁻¹X̃G′ = Σ⁻¹X̃(X̃′Σ⁻¹X̃)⁻¹X̃′Σ⁻¹X̃(X̃′Σ⁻¹X̃)⁻¹X̃′Σ⁻¹
           = Σ⁻¹X̃(X̃′Σ⁻¹X̃)⁻¹X̃′Σ⁻¹.                              (4)

Combining (3) and (4) yields

A(A′ΣA)⁻¹A′ = Σ⁻¹ − Σ⁻¹X̃(X̃′Σ⁻¹X̃)⁻¹X̃′Σ⁻¹.
Now use Lemmas R2 and R3 to prove the REML Theorem.
l_w(θ|w) = −((n − r)/2) log(2π) − ½ log |A′ΣA| − ½ w′(A′ΣA)⁻¹w

         = constant₁ − ½ log(|X̃′Σ⁻¹X̃||Σ||X̃′X̃|⁻¹) − ½ y′A(A′ΣA)⁻¹A′y

         = constant₂ − ½ log |Σ| − ½ log |X̃′Σ⁻¹X̃|
           − ½ y′(Σ⁻¹ − Σ⁻¹X̃(X̃′Σ⁻¹X̃)⁻¹X̃′Σ⁻¹)y
         = constant₂ − ½ log |Σ| − ½ log |X̃′Σ⁻¹X̃|
           − ½ y′Σ^{−1/2}(I − Σ^{−1/2}X̃(X̃′Σ⁻¹X̃)⁻¹X̃′Σ^{−1/2})Σ^{−1/2}y

         = constant₂ − ½ log |Σ| − ½ log |X̃′Σ⁻¹X̃|
           − ½ y′Σ^{−1/2}(I − P_{Σ^{−1/2}X̃})Σ^{−1/2}y.
Now C(Σ^{−1/2}X̃) = C(Σ^{−1/2}X), so P_{Σ^{−1/2}X̃} = P_{Σ^{−1/2}X}.

∴ y′Σ^{−1/2}(I − P_{Σ^{−1/2}X̃})Σ^{−1/2}y = y′Σ^{−1/2}(I − P_{Σ^{−1/2}X})Σ^{−1/2}y
                                          = ‖(I − P_{Σ^{−1/2}X})Σ^{−1/2}y‖²
= ‖Σ^{−1/2}y − Σ^{−1/2}X(X′Σ⁻¹X)⁻X′Σ⁻¹y‖²
= ‖Σ^{−1/2}y − Σ^{−1/2}Xβ̂(θ)‖²
= ‖Σ^{−1/2}(y − Xβ̂(θ))‖²
= (y − Xβ̂(θ))′Σ^{−1/2}Σ^{−1/2}(y − Xβ̂(θ))
= (y − Xβ̂(θ))′Σ⁻¹(y − Xβ̂(θ)).
∴ we have

l_w(θ|w) = constant₂ − ½ log |Σ| − ½ log |X̃′Σ⁻¹X̃|
           − ½ (y − Xβ̂(θ))′Σ⁻¹(y − Xβ̂(θ))
         = constant₂ + g(θ).

∴ θ̂ maximizes l_w(θ|w) over θ ∈ Θ ⇐⇒ θ̂ maximizes g(θ) over θ ∈ Θ.