ML and REML Variance Component Estimation
Copyright © 2012 Dan Nettleton (Iowa State University), Statistics 611
Suppose
y = Xβ + ε,
where ε ∼ N(0, Σ) for some positive definite, symmetric matrix Σ.
Furthermore, suppose each element of Σ is a known function of an
unknown vector of parameters θ ∈ Θ.
For example, consider
\[
\Sigma = \sum_{j=1}^{m} \sigma_j^2 Z_j Z_j' + \sigma_e^2 I,
\]
where Z_1, . . . , Z_m are known matrices,
\[
\theta = [\theta_1, \ldots, \theta_m, \theta_{m+1}]' = [\sigma_1^2, \ldots, \sigma_m^2, \sigma_e^2]',
\]
and Θ = {θ : θ_j > 0, j = 1, . . . , m + 1}.
The likelihood function is
\[
L(\beta, \theta \mid y) = (2\pi)^{-n/2}|\Sigma|^{-1/2} e^{-\frac{1}{2}(y - X\beta)'\Sigma^{-1}(y - X\beta)}
\]
and the log likelihood function is
\[
l(\beta, \theta \mid y) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(y - X\beta)'\Sigma^{-1}(y - X\beta).
\]
Based on previous results,
\[
(y - X\beta)'\Sigma^{-1}(y - X\beta)
\]
is minimized over β ∈ R^p, for any fixed θ, by
\[
\hat\beta(\theta) \equiv (X'\Sigma^{-1}X)^{-}X'\Sigma^{-1}y.
\]
Thus, the profile log likelihood for θ is
\[
l^*(\theta \mid y) = \sup_{\beta \in R^p} l(\beta, \theta \mid y)
= -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(y - X\hat\beta(\theta))'\Sigma^{-1}(y - X\hat\beta(\theta)).
\]
In the general case, numerical methods are used to find the maximizer of l*(θ | y) over θ ∈ Θ. Let
\[
\hat\theta_{MLE} = \arg\max\{l^*(\theta \mid y) : \theta \in \Theta\}.
\]
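As an aside (not from the original slides), here is a minimal numerical sketch of this maximization for a hypothetical one-way random effects model, where Σ(θ) = σ_u²ZZ′ + σ_e²I. The data, the model, and the log-parametrization of θ are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical one-way random effects example: y_ij = mu + u_i + e_ij,
# so Sigma(theta) = sigma_u^2 Z Z' + sigma_e^2 I, theta = (sigma_u^2, sigma_e^2).
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(5), 4)             # 5 groups, 4 observations each
n = len(groups)
X = np.ones((n, 1))                             # intercept-only fixed effects
Z = (groups[:, None] == np.arange(5)).astype(float)
y = 2.0 + Z @ rng.normal(0.0, 1.0, 5) + rng.normal(0.0, 0.5, n)

def profile_neg_loglik(log_theta):
    # Log-parametrize so the optimizer stays inside Theta automatically.
    su2, se2 = np.exp(log_theta)
    Sigma = su2 * Z @ Z.T + se2 * np.eye(n)
    Si = np.linalg.inv(Sigma)
    beta_hat = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)   # beta_hat(theta)
    r = y - X @ beta_hat
    logdet = np.linalg.slogdet(Sigma)[1]
    return 0.5 * (n * np.log(2 * np.pi) + logdet + r @ Si @ r)

fit = minimize(profile_neg_loglik, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
print("ML estimates (sigma_u^2, sigma_e^2):", np.exp(fit.x))
```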
Let Σ̂_MLE be the matrix Σ with θ̂_MLE in place of θ. The MLE of an estimable Cβ is then given by
\[
C\hat\beta_{MLE} = C(X'\hat\Sigma_{MLE}^{-1}X)^{-}X'\hat\Sigma_{MLE}^{-1}y.
\]
We can write Σ as σ²V, where σ² > 0, V is PD and symmetric, σ² is a known function of θ, and each entry of V is a known function of θ, e.g.,
\[
\Sigma = \sum_{j=1}^{m} \sigma_j^2 Z_j Z_j' + \sigma_e^2 I
= \sigma_e^2\left(\sum_{j=1}^{m} \frac{\sigma_j^2}{\sigma_e^2} Z_j Z_j' + I\right).
\]
If we let σ̂²_MLE and V̂_MLE denote σ² and V with θ̂_MLE in place of θ, then
\[
\hat\Sigma_{MLE} = \hat\sigma^2_{MLE}\hat{V}_{MLE}
\quad \text{and} \quad
\hat\Sigma_{MLE}^{-1} = \frac{1}{\hat\sigma^2_{MLE}}\hat{V}_{MLE}^{-1}.
\]
It follows that
\[
\begin{aligned}
C\hat\beta_{MLE} &= C(X'\hat\Sigma_{MLE}^{-1}X)^{-}X'\hat\Sigma_{MLE}^{-1}y \\
&= C\left(X'\frac{1}{\hat\sigma^2_{MLE}}\hat{V}_{MLE}^{-1}X\right)^{-}X'\frac{1}{\hat\sigma^2_{MLE}}\hat{V}_{MLE}^{-1}y \\
&= C(X'\hat{V}_{MLE}^{-1}X)^{-}X'\hat{V}_{MLE}^{-1}y.
\end{aligned}
\]
Note that
\[
C\hat\beta_{MLE} = C(X'\hat{V}_{MLE}^{-1}X)^{-}X'\hat{V}_{MLE}^{-1}y
\]
is the GLS estimator under the Aitken model with V̂_MLE in place of V.
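A sketch of this plug-in step (the helper name and the use of the Moore-Penrose pseudoinverse as the generalized inverse are illustrative choices; any generalized inverse gives the same estimate of an estimable Cβ):

```python
import numpy as np

def gls_estimate(C, X, V_hat, y):
    """C(X' V^-1 X)^- X' V^-1 y with V replaced by its estimate V_hat."""
    Vi = np.linalg.inv(V_hat)
    return C @ np.linalg.pinv(X.T @ Vi @ X) @ X.T @ Vi @ y
```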
We have seen by example that the MLE of the variance
component vector θ can be biased.
For example, for the case of Σ = σ²I, where θ = [σ²], the MLE of σ² is
\[
\frac{(y - X\hat\beta)'(y - X\hat\beta)}{n},
\quad \text{with expectation} \quad \frac{n-r}{n}\sigma^2.
\]
This MLE for σ² is often criticized for "failing to account for the loss of degrees of freedom needed to estimate β."
\[
E\left[\frac{(y - X\hat\beta)'(y - X\hat\beta)}{n}\right] = \frac{n-r}{n}\sigma^2
< \sigma^2 = E\left[\frac{(y - X\beta)'(y - X\beta)}{n}\right].
\]
A familiar special case: Suppose
\[
y_1, \ldots, y_n \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma^2).
\]
Then
\[
E\left[\frac{\sum_{i=1}^{n}(y_i - \mu)^2}{n}\right] = \sigma^2,
\quad \text{but} \quad
E\left[\frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n}\right] = \frac{n-1}{n}\sigma^2.
\]
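A quick Monte Carlo check of these two expectations (the values of µ, σ², and n are illustrative):

```python
import numpy as np

# Centering at the known mu gives an unbiased estimator of sigma^2;
# centering at ybar shrinks the expectation by the factor (n - 1)/n.
rng = np.random.default_rng(1)
mu, sigma2, n, reps = 3.0, 4.0, 10, 200_000
y = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
known_mu = ((y - mu) ** 2).mean(axis=1)
centered = ((y - y.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)
print(known_mu.mean())   # approximately sigma^2 = 4.0
print(centered.mean())   # approximately (n - 1)/n * sigma^2 = 3.6
```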
REML is an approach that produces unbiased estimators for these
special cases and produces less biased estimators than ML
estimators in general.
Depending on whom you ask, REML stands for REsidual
Maximum Likelihood or REstricted Maximum Likelihood.
The REML method:
Find n − rank(X) = n − r linearly independent vectors a_1, . . . , a_{n−r} such that a_i′X = 0′ for all i = 1, . . . , n − r.
Find the maximum likelihood estimate of θ using w_1 ≡ a_1′y, . . . , w_{n−r} ≡ a_{n−r}′y as data.
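One way to carry out step 1 in practice is sketched below; scipy.linalg.null_space returns an orthonormal basis of N(X′), so its columns supply n − r linearly independent vectors with a_i′X = 0′. The helper name is illustrative.

```python
import numpy as np
from scipy.linalg import null_space

def error_contrasts(X, y):
    """Return A = [a_1, ..., a_{n-r}] with A'X = 0, and the data w = A'y."""
    A = null_space(X.T)      # n x (n - r), orthonormal columns spanning N(X')
    return A, A.T @ y
```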
Writing A = [a_1, . . . , a_{n−r}], we have
\[
w = \begin{bmatrix} w_1 \\ \vdots \\ w_{n-r} \end{bmatrix}
= \begin{bmatrix} a_1'y \\ \vdots \\ a_{n-r}'y \end{bmatrix}
= A'y.
\]
If a′X = 0′, then a′y is known as an error contrast.
Thus, w_1, . . . , w_{n−r} comprise a set of n − r error contrasts.
Because (I − P_X)X = X − P_X X = X − X = 0, the elements of (I − P_X)y = y − P_X y = y − ŷ are each error contrasts.
Because rank(I − P_X) = n − rank(X) = n − r, there exists a set of n − r linearly independent rows of I − P_X that can be used in step 1 of the REML method to get a_1, . . . , a_{n−r}.
If we do use a subset of rows of I − P_X to get a_1, . . . , a_{n−r}, the error contrasts w_1 = a_1′y, . . . , w_{n−r} = a_{n−r}′y will be a subset of the elements of (I − P_X)y = y − ŷ, the residual vector.
This is why it makes sense to call the procedure Residual Maximum Likelihood.
Note that
\[
w = A'y = A'(X\beta + \varepsilon) = A'X\beta + A'\varepsilon = 0 + A'\varepsilon = A'\varepsilon.
\]
Thus, w = A′ε ∼ N(A′0, A′ΣA) = N(0, A′ΣA), and the distribution of w depends on θ but not β.
The log likelihood function in this case is
\[
l_w(\theta \mid w) = -\frac{n-r}{2}\log(2\pi) - \frac{1}{2}\log|A'\Sigma A| - \frac{1}{2}w'(A'\Sigma A)^{-1}w.
\]
An MLE for θ, say θ̂, can be found in the general case using numerical methods; this θ̂ is the REML estimate of θ. Let
\[
\hat\theta = \arg\max\{l_w(\theta \mid w) : \theta \in \Theta\}.
\]
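Continuing the hypothetical one-way random effects example from the ML sketch earlier, a minimal REML version looks like this (again, the data and the log-parametrization are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.linalg import null_space

# Same hypothetical data as in the earlier ML sketch.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(5), 4)
n = len(groups)
X = np.ones((n, 1))
Z = (groups[:, None] == np.arange(5)).astype(float)
y = 2.0 + Z @ rng.normal(0.0, 1.0, 5) + rng.normal(0.0, 0.5, n)

A = null_space(X.T)                          # A'X = 0, A'A = I
w = A.T @ y                                  # n - r error contrasts

def reml_neg_loglik(log_theta):
    su2, se2 = np.exp(log_theta)
    Sigma = su2 * Z @ Z.T + se2 * np.eye(n)
    W = A.T @ Sigma @ A                      # Var(w) = A' Sigma A
    logdet = np.linalg.slogdet(W)[1]
    return 0.5 * (A.shape[1] * np.log(2 * np.pi) + logdet
                  + w @ np.linalg.solve(W, w))

fit = minimize(reml_neg_loglik, np.log([1.0, 1.0]), method="Nelder-Mead")
print("REML estimates (sigma_u^2, sigma_e^2):", np.exp(fit.x))
```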
Once a REML estimate of θ (and thus Σ) has been obtained, the BLUE of an estimable Cβ (i.e., the GLS estimator we would use if Σ were known) can be approximated by
\[
C\hat\beta_{REML} = C(X'\hat\Sigma^{-1}X)^{-}X'\hat\Sigma^{-1}y,
\]
where Σ̂ is Σ with θ̂ (the REML estimate of θ) in place of θ.
Suppose A and B are each n × (n − r) matrices satisfying
rank(A) = rank(B) = n − r and A′X = B′X = 0.
Let w = A′y and v = B′y.
Prove that θ̂ maximizes l_w(θ | w) over θ ∈ Θ iff θ̂ maximizes l_v(θ | v) over θ ∈ Θ.
Proof:
A′X = 0 =⇒ X′A = 0 =⇒ each column of A ∈ N(X′).
dim(N(X′)) = n − rank(X) = n − r.
∴ the n − r LI columns of A form a basis for N(X′).
B′X = 0 =⇒ X′B = 0 =⇒ columns of B ∈ N(X′).
∴ ∃ an (n − r) × (n − r) matrix M ∋ AM = B.
Note that M, an (n − r) × (n − r) matrix, is nonsingular ∵
\[
n - r = \operatorname{rank}(B) = \operatorname{rank}(AM) \le \operatorname{rank}(M) \le n - r
\implies \operatorname{rank}(M) = n - r.
\]
∴
\[
v'(B'\Sigma B)^{-1}v = y'B(M'A'\Sigma AM)^{-1}B'y
= y'AMM^{-1}(A'\Sigma A)^{-1}(M')^{-1}M'A'y
= y'A(A'\Sigma A)^{-1}A'y = w'(A'\Sigma A)^{-1}w.
\]
Also
\[
|B'\Sigma B| = |M'A'\Sigma AM| = |M'||A'\Sigma A||M| = |M|^2|A'\Sigma A|.
\]
Now note that
\[
\begin{aligned}
l_v(\theta \mid v) &= -\frac{n-r}{2}\log(2\pi) - \frac{1}{2}\log|B'\Sigma B| - \frac{1}{2}v'(B'\Sigma B)^{-1}v \\
&= -\frac{n-r}{2}\log(2\pi) - \frac{1}{2}\log(|M|^2|A'\Sigma A|) - \frac{1}{2}w'(A'\Sigma A)^{-1}w \\
&= -\frac{1}{2}\log(|M|^2) - \frac{n-r}{2}\log(2\pi) - \frac{1}{2}\log(|A'\Sigma A|) - \frac{1}{2}w'(A'\Sigma A)^{-1}w \\
&= -\log|M| + l_w(\theta \mid w).
\end{aligned}
\]
Because M is free of θ, the result follows.
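A numeric spot check of this result (X, y, and M below are arbitrary illustrative choices): since l_v − l_w = −log |M|, the difference of the two log likelihoods should be the same constant at every θ.

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(2)
n = 8
X = np.column_stack([np.ones(n), np.arange(float(n))])   # rank r = 2
y = rng.normal(size=n)
A = null_space(X.T)                        # n x (n - r), A'X = 0
M = rng.normal(size=(A.shape[1],) * 2)     # nonsingular with probability 1
B = A @ M                                  # another valid contrast matrix

def neg2_loglik(A_, Sigma):
    W = A_.T @ Sigma @ A_
    w = A_.T @ y
    return np.linalg.slogdet(W)[1] + w @ np.linalg.solve(W, w)

for s2 in (0.5, 1.0, 2.0):                 # vary theta in Sigma = sigma^2 I
    Sigma = s2 * np.eye(n)
    # Prints 2*log|det M| each time, free of theta.
    print(neg2_loglik(B, Sigma) - neg2_loglik(A, Sigma))
```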
Consider the special case
y = Xβ + ε,
ε ∼ N(0, σ²I).
Prove that the REML estimator of σ² is
\[
\hat\sigma^2 = \frac{y'(I - P_X)y}{n - r}.
\]
Proof:
I − P_X is symmetric. Thus, by the Spectral Decomposition Theorem,
\[
I - P_X = \sum_{j=1}^{n} \lambda_j q_j q_j',
\]
where λ_1, . . . , λ_n are eigenvalues of I − P_X and q_1, . . . , q_n are orthonormal eigenvectors.
Because I − P_X is idempotent and of rank n − r, n − r of the λ_j values equal 1 and the other r equal 0.
Define Q to be the matrix whose columns are the eigenvectors corresponding to the n − r nonzero eigenvalues.
Then Q is n × (n − r), QQ′ = I − P_X, and Q′Q = I. Thus,
rank(Q) = n − r
and
\[
QQ'X = (I - P_X)X = X - P_X X = 0.
\]
Multiplying on the left by Q′ gives
\[
Q'QQ'X = Q'0 \implies IQ'X = 0 \implies Q'X = 0.
\]
∴ the REML estimator of σ² can be obtained by finding the maximizer of the likelihood based on the data
\[
w \equiv Q'y \sim N(Q'X\beta,\; Q'\sigma^2 IQ) = N(0, \sigma^2 I).
\]
\[
l_w(\sigma^2 \mid w) = -\frac{n-r}{2}\log(2\pi) - \frac{n-r}{2}\log(\sigma^2) - \frac{1}{2}w'w/\sigma^2
\]
\[
\frac{\partial l_w(\sigma^2 \mid w)}{\partial \sigma^2} = -\frac{n-r}{2\sigma^2} + \frac{w'w}{2\sigma^4}
\]
\[
\left.\frac{\partial l_w(\sigma^2 \mid w)}{\partial \sigma^2}\right|_{\sigma^2 = \hat\sigma^2} = 0
\iff \hat\sigma^2 = \frac{w'w}{n-r}.
\]
∴ the MLE of σ² based on w, which is the REML estimator of σ² based on y, is
\[
\hat\sigma^2 = \frac{w'w}{n-r} = \frac{(Q'y)'(Q'y)}{n-r} = \frac{y'QQ'y}{n-r} = \frac{y'(I - P_X)y}{n-r}.
\]
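A quick check that direct numerical maximization of l_w reproduces this closed form (the data and the bounded scalar search are illustrative):

```python
import numpy as np
from scipy.linalg import null_space
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
n = 12
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 1.5, size=n)

P = X @ np.linalg.pinv(X.T @ X) @ X.T           # P_X
r = np.linalg.matrix_rank(X)
closed_form = y @ (np.eye(n) - P) @ y / (n - r)

A = null_space(X.T)                             # A'A = I, AA' = I - P_X
w = A.T @ y

def neg_lw(s2):
    # -l_w(sigma^2 | w) up to constants; here A' Sigma A = sigma^2 I.
    return 0.5 * ((n - r) * np.log(s2) + w @ w / s2)

numeric = minimize_scalar(neg_lw, bounds=(1e-6, 50.0), method="bounded").x
print(closed_form, numeric)                     # essentially equal
```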
REML Theorem:
Suppose y = Xβ + ε, where ε ∼ N(0, Σ), Σ a PD, symmetric matrix
whose entries are known functions of an unknown parameter vector
θ ∈ Θ. Furthermore, suppose rank(X) = r and X̃ is any n × r matrix
consisting of any set of r LI columns of X.
Let A be any n × (n − r) matrix satisfying rank(A) = n − r and A′X = 0. Let
\[
w = A'y \sim N(0, A'\Sigma A).
\]
Then θ̂ maximizes l_w(θ | w) over θ ∈ Θ ⇐⇒ θ̂ maximizes, over θ ∈ Θ,
\[
g(\theta) = -\frac{1}{2}\log|\Sigma| - \frac{1}{2}\log|\tilde{X}'\Sigma^{-1}\tilde{X}|
- \frac{1}{2}(y - X\hat\beta(\theta))'\Sigma^{-1}(y - X\hat\beta(\theta)),
\]
where β̂(θ) = (X′Σ⁻¹X)⁻X′Σ⁻¹y.
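Before proving the theorem, here is a numeric sanity check (illustrative data; X below has full column rank, so X̃ = X): the difference l_w(θ | w) − g(θ) should be the same constant at every θ.

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(4)
n = 10
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # full column rank
y = rng.normal(size=n)
A = null_space(X.T)
w = A.T @ y

def lw(Sigma):                        # l_w up to its additive constant
    W = A.T @ Sigma @ A
    return -0.5 * (np.linalg.slogdet(W)[1] + w @ np.linalg.solve(W, w))

def g(Sigma):
    Si = np.linalg.inv(Sigma)
    beta = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)   # beta_hat(theta)
    resid = y - X @ beta
    return -0.5 * (np.linalg.slogdet(Sigma)[1]
                   + np.linalg.slogdet(X.T @ Si @ X)[1]
                   + resid @ Si @ resid)

for s2 in (0.5, 1.0, 3.0):            # vary theta in Sigma(theta)
    Sigma = s2 * (np.eye(n) + 0.2 * np.ones((n, n)))     # some PD Sigma
    print(lw(Sigma) - g(Sigma))       # same value each time
```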
We have previously shown:
1. ∃ an n × (n − r) matrix Q ∋ QQ′ = I − P_X, Q′X = 0, and Q′Q = I, and
2. any n × (n − r) matrix A of rank n − r satisfying A′X = 0 leads to the same REML estimator θ̂.
Thus, without loss of generality, we may assume the matrix A in the REML Theorem is Q.
We will now prove 3 lemmas and then use those lemmas to prove the
REML Theorem.
Lemma R1:
Let G = Σ⁻¹X̃(X̃′Σ⁻¹X̃)⁻¹ and suppose A is an n × (n − r) matrix satisfying A′A = I and AA′ = I − P_X. Then
\[
|[A, G]|^2 = |\tilde{X}'\tilde{X}|^{-1}.
\]
Proof of Lemma R1:
\[
\begin{aligned}
|[A, G]|^2 &= |[A, G]||[A, G]| = |[A, G]'||[A, G]| \\
&= \begin{vmatrix} A'A & A'G \\ G'A & G'G \end{vmatrix}
= \begin{vmatrix} I & A'G \\ G'A & G'G \end{vmatrix} \\
&= |I||G'G - G'AA'G|
\end{aligned}
\]
(by our result on the determinant of a partitioned matrix)
\[
\begin{aligned}
&= |G'G - G'(I - P_X)G| = |G'P_X G| = |G'P_{\tilde{X}} G| \\
&= |(\tilde{X}'\Sigma^{-1}\tilde{X})^{-1}\tilde{X}'\Sigma^{-1}\tilde{X}(\tilde{X}'\tilde{X})^{-1}\tilde{X}'\Sigma^{-1}\tilde{X}(\tilde{X}'\Sigma^{-1}\tilde{X})^{-1}| \\
&= |[(\tilde{X}'\Sigma^{-1}\tilde{X})^{-1}(\tilde{X}'\Sigma^{-1}\tilde{X})](\tilde{X}'\tilde{X})^{-1}[(\tilde{X}'\Sigma^{-1}\tilde{X})(\tilde{X}'\Sigma^{-1}\tilde{X})^{-1}]| \\
&= |(\tilde{X}'\tilde{X})^{-1}| = |\tilde{X}'\tilde{X}|^{-1}.
\end{aligned}
\]
Lemma R2:
\[
|A'\Sigma A| = |\tilde{X}'\Sigma^{-1}\tilde{X}||\Sigma||\tilde{X}'\tilde{X}|^{-1},
\]
where A, X̃, and Σ are as previously defined.
Proof of Lemma R2:
Again define G = Σ⁻¹X̃(X̃′Σ⁻¹X̃)⁻¹. Show that
(1) A′ΣG = 0, and
(2) G′ΣG = (X̃′Σ⁻¹X̃)⁻¹.
\[
A'\Sigma G = A'\Sigma\Sigma^{-1}\tilde{X}(\tilde{X}'\Sigma^{-1}\tilde{X})^{-1}
= A'\tilde{X}(\tilde{X}'\Sigma^{-1}\tilde{X})^{-1} = 0
\quad \because\ A'X = 0 \implies A'\tilde{X} = 0. \tag{1}
\]
\[
G'\Sigma G = (\tilde{X}'\Sigma^{-1}\tilde{X})^{-1}\tilde{X}'\Sigma^{-1}\Sigma\Sigma^{-1}\tilde{X}(\tilde{X}'\Sigma^{-1}\tilde{X})^{-1}
= (\tilde{X}'\Sigma^{-1}\tilde{X})^{-1}\tilde{X}'\Sigma^{-1}\tilde{X}(\tilde{X}'\Sigma^{-1}\tilde{X})^{-1}
= (\tilde{X}'\Sigma^{-1}\tilde{X})^{-1}. \tag{2}
\]
Next show that
\[
|[A, G]|^2|\Sigma| = |A'\Sigma A||\tilde{X}'\Sigma^{-1}\tilde{X}|^{-1}.
\]
\[
\begin{aligned}
|[A, G]|^2|\Sigma| &= |[A, G]'||\Sigma||[A, G]|
= \left|\begin{bmatrix} A' \\ G' \end{bmatrix}\Sigma[A, G]\right| \\
&= \begin{vmatrix} A'\Sigma A & A'\Sigma G \\ G'\Sigma A & G'\Sigma G \end{vmatrix}
= \begin{vmatrix} A'\Sigma A & 0 \\ 0 & (\tilde{X}'\Sigma^{-1}\tilde{X})^{-1} \end{vmatrix}
\quad \text{by (1) and (2)} \\
&= |A'\Sigma A||\tilde{X}'\Sigma^{-1}\tilde{X}|^{-1}.
\end{aligned}
\]
Now use Lemma R1 to complete the proof of Lemma R2.
We have
\[
|[A, G]|^2|\Sigma| = |A'\Sigma A||\tilde{X}'\Sigma^{-1}\tilde{X}|^{-1}
\implies |A'\Sigma A| = |\tilde{X}'\Sigma^{-1}\tilde{X}||\Sigma||[A, G]|^2
= |\tilde{X}'\Sigma^{-1}\tilde{X}||\Sigma||\tilde{X}'\tilde{X}|^{-1}
\]
by Lemma R1.
Lemma R3:
\[
A(A'\Sigma A)^{-1}A' = \Sigma^{-1} - \Sigma^{-1}\tilde{X}(\tilde{X}'\Sigma^{-1}\tilde{X})^{-1}\tilde{X}'\Sigma^{-1},
\]
where A, Σ, and X̃ are as defined previously.
Proof of Lemma R3:
\[
\begin{aligned}
[A, G]^{-1}\Sigma^{-1}([A, G]')^{-1} &= ([A, G]'\Sigma[A, G])^{-1}
= \begin{bmatrix} A'\Sigma A & A'\Sigma G \\ G'\Sigma A & G'\Sigma G \end{bmatrix}^{-1}
= \begin{bmatrix} A'\Sigma A & 0 \\ 0 & G'\Sigma G \end{bmatrix}^{-1} \\
&= \begin{bmatrix} A'\Sigma A & 0 \\ 0 & (\tilde{X}'\Sigma^{-1}\tilde{X})^{-1} \end{bmatrix}^{-1}
= \begin{bmatrix} (A'\Sigma A)^{-1} & 0 \\ 0 & \tilde{X}'\Sigma^{-1}\tilde{X} \end{bmatrix}.
\end{aligned}
\]
Now multiplying on the left by [A, G] and on the right by [A, G]′ yields
\[
\Sigma^{-1} = [A, G]\begin{bmatrix} (A'\Sigma A)^{-1} & 0 \\ 0 & \tilde{X}'\Sigma^{-1}\tilde{X} \end{bmatrix}\begin{bmatrix} A' \\ G' \end{bmatrix}
= A(A'\Sigma A)^{-1}A' + G\tilde{X}'\Sigma^{-1}\tilde{X}G'. \tag{3}
\]
Now
\[
G\tilde{X}'\Sigma^{-1}\tilde{X}G'
= \Sigma^{-1}\tilde{X}(\tilde{X}'\Sigma^{-1}\tilde{X})^{-1}\tilde{X}'\Sigma^{-1}\tilde{X}(\tilde{X}'\Sigma^{-1}\tilde{X})^{-1}\tilde{X}'\Sigma^{-1}
= \Sigma^{-1}\tilde{X}(\tilde{X}'\Sigma^{-1}\tilde{X})^{-1}\tilde{X}'\Sigma^{-1}. \tag{4}
\]
Combining (3) and (4) yields
\[
A(A'\Sigma A)^{-1}A' = \Sigma^{-1} - \Sigma^{-1}\tilde{X}(\tilde{X}'\Sigma^{-1}\tilde{X})^{-1}\tilde{X}'\Sigma^{-1}.
\]
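A direct numeric check of the Lemma R3 identity (sizes are illustrative; null_space provides an A with A′A = I and AA′ = I − P_X, as the lemma requires):

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(5)
n, r = 7, 2
X = rng.normal(size=(n, r))             # full column rank w.p. 1, so X~ = X
H = rng.normal(size=(n, n))
Sigma = H @ H.T + n * np.eye(n)         # PD, symmetric
Si = np.linalg.inv(Sigma)

A = null_space(X.T)                     # A'A = I, AA' = I - P_X
lhs = A @ np.linalg.inv(A.T @ Sigma @ A) @ A.T
rhs = Si - Si @ X @ np.linalg.inv(X.T @ Si @ X) @ X.T @ Si
print(np.allclose(lhs, rhs))            # True
```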
Now use Lemmas R2 and R3 to prove the REML Theorem.
\[
\begin{aligned}
l_w(\theta \mid w) &= -\frac{n-r}{2}\log(2\pi) - \frac{1}{2}\log|A'\Sigma A| - \frac{1}{2}w'(A'\Sigma A)^{-1}w \\
&= \text{constant}_1 - \frac{1}{2}\log(|\tilde{X}'\Sigma^{-1}\tilde{X}||\Sigma||\tilde{X}'\tilde{X}|^{-1})
- \frac{1}{2}y'A(A'\Sigma A)^{-1}A'y \\
&= \text{constant}_2 - \frac{1}{2}\log|\Sigma| - \frac{1}{2}\log|\tilde{X}'\Sigma^{-1}\tilde{X}| \\
&\qquad - \frac{1}{2}y'(\Sigma^{-1} - \Sigma^{-1}\tilde{X}(\tilde{X}'\Sigma^{-1}\tilde{X})^{-1}\tilde{X}'\Sigma^{-1})y
\end{aligned}
\]
\[
\begin{aligned}
&= \text{constant}_2 - \frac{1}{2}\log|\Sigma| - \frac{1}{2}\log|\tilde{X}'\Sigma^{-1}\tilde{X}| \\
&\qquad - \frac{1}{2}y'\Sigma^{-1/2}(I - \Sigma^{-1/2}\tilde{X}(\tilde{X}'\Sigma^{-1}\tilde{X})^{-1}\tilde{X}'\Sigma^{-1/2})\Sigma^{-1/2}y \\
&= \text{constant}_2 - \frac{1}{2}\log|\Sigma| - \frac{1}{2}\log|\tilde{X}'\Sigma^{-1}\tilde{X}|
- \frac{1}{2}y'\Sigma^{-1/2}(I - P_{\Sigma^{-1/2}\tilde{X}})\Sigma^{-1/2}y.
\end{aligned}
\]
Now
\[
\begin{aligned}
\mathcal{C}(\Sigma^{-1/2}\tilde{X}) &= \mathcal{C}(\Sigma^{-1/2}X)
\ \therefore\ P_{\Sigma^{-1/2}\tilde{X}} = P_{\Sigma^{-1/2}X} \\
\therefore\ y'\Sigma^{-1/2}(I - P_{\Sigma^{-1/2}\tilde{X}})\Sigma^{-1/2}y
&= y'\Sigma^{-1/2}(I - P_{\Sigma^{-1/2}X})\Sigma^{-1/2}y
= \|(I - P_{\Sigma^{-1/2}X})\Sigma^{-1/2}y\|^2
\end{aligned}
\]
\[
\begin{aligned}
&= \|\Sigma^{-1/2}y - \Sigma^{-1/2}X(X'\Sigma^{-1}X)^{-}X'\Sigma^{-1}y\|^2 \\
&= \|\Sigma^{-1/2}y - \Sigma^{-1/2}X\hat\beta(\theta)\|^2
= \|\Sigma^{-1/2}(y - X\hat\beta(\theta))\|^2 \\
&= (y - X\hat\beta(\theta))'\Sigma^{-1/2}\Sigma^{-1/2}(y - X\hat\beta(\theta))
= (y - X\hat\beta(\theta))'\Sigma^{-1}(y - X\hat\beta(\theta)).
\end{aligned}
\]
∴ we have
\[
\begin{aligned}
l_w(\theta \mid w) &= \text{constant}_2 - \frac{1}{2}\log|\Sigma| - \frac{1}{2}\log|\tilde{X}'\Sigma^{-1}\tilde{X}|
- \frac{1}{2}(y - X\hat\beta(\theta))'\Sigma^{-1}(y - X\hat\beta(\theta)) \\
&= \text{constant}_2 + g(\theta).
\end{aligned}
\]
∴ θ̂ maximizes l_w(θ | w) over θ ∈ Θ ⇐⇒ θ̂ maximizes g(θ) over θ ∈ Θ.