An Efficient Estimation Method for Longitudinal Surveys with Monotone Missing Data
Jae-Kwang Kim¹
Iowa State University
June 28, 2012
¹ Joint work with Dr. Ming Zhou (when he was a PhD student at ISU)
Reference
Zhou, M. and Kim, J. K. (2012). “An Efficient Method of Estimation for Longitudinal Surveys with Monotone Missing Data,” Biometrika, accepted for publication.
Outline
1. Basic Setup
2. Propensity Score Method
3. GLS Approach with Propensity Score Method
4. Application to Longitudinal Missing
5. Simulation Study
6. Conclusion
Basic Setup
X, Y: random variables from some distribution
θ: parameter of interest, defined through
E {U(θ; X, Y)} = 0.
Examples
1. θ = E(Y): U(θ; X, Y) = Y − θ
2. θ = F_Y^{-1}(1/2): U(θ) = F_Y(θ) − 1/2
3. θ is a regression coefficient: U(θ) = X′(Y − Xθ).
Basic Setup
Estimator of θ: Solve
$$\hat{U}_n(\theta) \equiv \sum_{i=1}^{n} U(\theta; x_i, y_i) = 0$$
to get θ̂n, where the (xi, yi) are IID realizations of (X, Y).
Under some conditions, θ̂n converges in probability to θ and is asymptotically normal as n → ∞.
What if some of the yi are missing?
Basic Setup
x1 , · · · , xn are fully observed.
Some of yi are not observed.
Let
$$r_i = \begin{cases} 1 & \text{if } y_i \text{ is observed} \\ 0 & \text{if } y_i \text{ is missing.} \end{cases}$$
Note that ri is also a random variable (whose probability
distribution is generally unknown).
Basic Setup
Complete-Case (CC) method
Solve
$$\sum_{i=1}^{n} r_i U(\theta; x_i, y_i) = 0.$$
Biased unless Pr(r = 1 | X, Y) does not depend on (X, Y), i.e., unless the respondents form a simple random subsample of the original sample.
Basic Setup
Weighted Complete-Case (WCC) method
Solve
$$\hat{U}_W(\theta) \equiv \sum_{i=1}^{n} r_i w_i U(\theta; x_i, y_i) = 0$$
for some weights wi. The weights are often called propensity scores (or propensity weights).
The choice of
$$w_i = \frac{1}{\Pr(r_i = 1 \mid x_i, y_i)}$$
will make the resulting estimator consistent.
Requires some assumption about Pr(ri = 1 | xi, yi).
Basic Setup
Justification for using wi = 1/Pr (ri = 1 | xi , yi )
Note that
$$E\left\{\hat{U}_W(\theta) \mid x_1, \ldots, x_n, y_1, \ldots, y_n\right\} = \hat{U}_n(\theta)$$
where the expectation is taken with respect to r.
Thus, the probability limit of the solution to ÛW (θ) = 0 is equal
to the probability limit of the solution to Ûn (θ) = 0.
No distributional assumptions made about (X, Y).
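A small simulation sketch (all data-generating choices are hypothetical) illustrating the bias of the CC estimator and the consistency of the WCC estimator when the response probabilities are known:

```python
# Sketch: complete-case vs. weighted complete-case estimation of theta = E(Y).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(1, 1, n)
y = 2 * (x - 1) + rng.normal(0, 1, n)        # E(Y) = 0 in this model
pi = 1 / (1 + np.exp(-(0.5 + x)))            # response depends on x only
r = rng.binomial(1, pi)

theta_cc = y[r == 1].mean()                  # biased: respondents have larger x
theta_wcc = np.sum(r * y / pi) / np.sum(r / pi)   # consistent IPW estimator
print(theta_cc, theta_wcc)                   # CC is biased upward; WCC is near 0
```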
Propensity score method
Idea
For simplicity, assume that Pr(ri = 1 | xi, yi) takes a parametric form:
$$\Pr(r_i = 1 \mid x_i, y_i) = \pi(x_i, y_i; \phi^*)$$
for some unknown φ*. The functional form of π(·) is known. For example,
$$\pi(x, y; \phi^*) = \frac{\exp(\phi_0^* + \phi_1^* x + \phi_2^* y)}{1 + \exp(\phi_0^* + \phi_1^* x + \phi_2^* y)}.$$
Propensity score approach to missing data: obtain θ̂PS which solves
$$\hat{U}_{PS}(\theta) \equiv \sum_{i=1}^{n} \frac{r_i}{\pi(x_i, y_i; \hat\phi)} U(\theta; x_i, y_i) = 0$$
for some φ̂ which converges to φ* in probability.
Propensity score method
Issues
Identifiability: Model parameters may not be fully identifiable from the observed sample.
May assume
$$\Pr(r_i = 1 \mid x_i, y_i) = \Pr(r_i = 1 \mid x_i).$$
This condition is often called missing at random (MAR).
For longitudinal data with a monotone missing pattern, the MAR condition means
$$\Pr(r_{i,t} = 1 \mid x_i, y_{i1}, \ldots, y_{iT}) = \Pr(r_{i,t} = 1 \mid x_i, y_{i1}, \ldots, y_{i,t-1}).$$
That is, the response probability at time t may depend on the values of y observed up to time t − 1.
Propensity score method
Issues
Estimation of φ∗
Maximum likelihood method: Solve
$$S(\phi) \equiv \sum_{i=1}^{n} \{r_i - \pi_i(\phi)\} q_i(\phi) = 0$$
where qi(φ) = ∂ logit{πi(φ)}/∂φ.
Maximum likelihood method does not always lead to efficient
estimation (see Example 1 next).
Inference using θ̂PS : Note that θ̂PS = θ̂PS (φ̂). We need to
incorporate the sampling variability of φ̂ in making inference
about θ using θ̂PS .
Propensity score method
Example 1
Response model:
$$\pi_i(\phi^*) = \frac{\exp(\phi_0^* + \phi_1^* x_i)}{1 + \exp(\phi_0^* + \phi_1^* x_i)}$$
Parameter of interest: θ = E(Y).
PS estimator of θ: Solve jointly
$$U(\theta, \phi) = \sum_{i=1}^{n} r_i \{\pi_i(\phi)\}^{-1} (y_i - \theta) = 0,$$
$$S(\phi) = \sum_{i=1}^{n} \{r_i - \pi_i(\phi)\} (1, x_i) = (0, 0).$$
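A sketch of this two-step computation on hypothetical data (the logistic ML fit is written out with Newton–Raphson so the snippet is self-contained):

```python
# Sketch: PS estimator of theta = E(Y) under a logistic response model in x.
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(1, 1, n)
y = 2 * (x - 1) + rng.normal(0, 1, n)
r = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + x))))

# Step 1: ML for phi via Newton-Raphson on S(phi) = sum_i (r_i - pi_i)(1, x_i)'
X = np.column_stack([np.ones(n), x])
phi = np.zeros(2)
for _ in range(25):
    pi = 1 / (1 + np.exp(-X @ phi))
    score = X.T @ (r - pi)
    info = (X * (pi * (1 - pi))[:, None]).T @ X   # Fisher information
    phi += np.linalg.solve(info, score)

# Step 2: solve U(theta, phi_hat) = 0, an inverse-probability-weighted mean
pi_hat = 1 / (1 + np.exp(-X @ phi))
theta_ps = np.sum(r * y / pi_hat) / np.sum(r / pi_hat)
```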
Propensity score method
Example 1 (Cont’d)
Taylor linearization:
$$\hat\theta_{PS}(\hat\phi) \cong \hat\theta_{PS}(\phi^*) - E\left(\frac{\partial U}{\partial \phi}\right)\left\{E\left(\frac{\partial S}{\partial \phi}\right)\right\}^{-1} S(\phi^*) = \hat\theta_{PS}(\phi^*) - \mathrm{Cov}(U, S)\{V(S)\}^{-1} S(\phi^*),$$
by the property of zero-mean functions (i.e., if E(U) = 0, then E(∂U/∂φ) = −Cov(U, S)).
So, we have
$$V\{\hat\theta_{PS}(\hat\phi)\} \cong V\{\hat\theta_{PS}(\phi^*) \mid S^{\perp}\} \le V\{\hat\theta_{PS}(\phi^*)\},$$
where
$$V\{\hat\theta \mid S^{\perp}\} = V(\hat\theta) - \mathrm{Cov}(\hat\theta, S)\{V(S)\}^{-1}\mathrm{Cov}(S, \hat\theta).$$
GLS approach with propensity score method
Motivation
The propensity score method is used to reduce the bias, rather
than to reduce the variance.
In the previous example, the PS estimator for θx = E(X) is
$$\hat\theta_{x,PS} = \frac{\sum_{i=1}^{n} r_i \hat\pi_i^{-1} x_i}{\sum_{i=1}^{n} r_i \hat\pi_i^{-1}},$$
where π̂i = πi(φ̂).
Note that θ̂x,PS is not necessarily equal to x̄n = n^{-1} Σ_{i=1}^{n} xi.
How can we incorporate the extra information in x̄n?
GLS approach with propensity score method
GLS (or GMM) approach
Let θ = (θx , θy ). We have three estimators for two parameters.
Find θ that minimizes
$$Q_{PS}(\theta) = \begin{pmatrix} \bar{x}_n - \theta_x \\ \hat\theta_{x,PS} - \theta_x \\ \hat\theta_{y,PS} - \theta_y \end{pmatrix}' \hat{V}^{-1} \begin{pmatrix} \bar{x}_n - \theta_x \\ \hat\theta_{x,PS} - \theta_x \\ \hat\theta_{y,PS} - \theta_y \end{pmatrix},$$
where θ̂PS = θ̂PS(φ̂).
Computation of V̂ is somewhat cumbersome.
GLS approach with propensity score method
Alternative GLS (or GMM) approach
Find (θ, φ) that minimizes
$$Q^*(\theta, \phi) = \begin{pmatrix} \bar{x}_n - \theta_x \\ \hat\theta_{x,PS}(\phi) - \theta_x \\ \hat\theta_{y,PS}(\phi) - \theta_y \\ S(\phi) \end{pmatrix}' \hat{V}^{-1} \begin{pmatrix} \bar{x}_n - \theta_x \\ \hat\theta_{x,PS}(\phi) - \theta_x \\ \hat\theta_{y,PS}(\phi) - \theta_y \\ S(\phi) \end{pmatrix}.$$
Computation of V̂ is easier since we can treat φ as if it were known.
Let Q*(θ, φ) be the above objective function. It can be shown that Q*(θ, φ̂) = QPS(θ), and so minimizing Q*(θ, φ̂) is equivalent to minimizing QPS(θ).
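A rough numerical sketch of this joint criterion for Example 1 (hypothetical data; we use the simpler Horvitz–Thompson form of the PS moments and a plug-in weight matrix, which is one of several reasonable choices):

```python
# Sketch: minimize a joint GMM criterion Q*(theta, phi) for Example 1.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 500
x = rng.normal(1, 1, n)
y = 2 * (x - 1) + rng.normal(0, 1, n)
r = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + x))))

def moment_matrix(params):
    # per-unit moments: (x - tx, HT moment for tx, HT moment for ty, score)
    tx, ty, a, b = params
    pi = np.clip(1 / (1 + np.exp(-(a + b * x))), 1e-6, 1 - 1e-6)
    return np.column_stack([x - tx,
                            r * x / pi - tx,
                            r * y / pi - ty,
                            r - pi,
                            (r - pi) * x])

def Q(params):
    g = moment_matrix(params)
    gbar = g.mean(axis=0)
    V = np.cov(g, rowvar=False) / n        # plug-in variance of the moment means
    return gbar @ np.linalg.solve(V, gbar)

start = [x.mean(), y[r == 1].mean(), 0.0, 0.0]
fit = minimize(Q, start, method="Nelder-Mead", options={"maxiter": 5000})
theta_x_hat, theta_y_hat = fit.x[:2]
```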
GLS approach with propensity score method
Justification for the equivalence
May write
$$Q^*(\theta, \phi) = \begin{pmatrix} \hat{U}_{PS}(\theta, \phi) \\ S(\phi) \end{pmatrix}' \begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix}^{-1} \begin{pmatrix} \hat{U}_{PS}(\theta, \phi) \\ S(\phi) \end{pmatrix} = Q_1(\theta \mid \phi) + Q_2(\phi),$$
where
$$Q_1(\theta \mid \phi) = \left(\hat{U}_{PS} - V_{12} V_{22}^{-1} S\right)' \left\{V\left(U_{PS} \mid S^{\perp}\right)\right\}^{-1} \left(\hat{U}_{PS} - V_{12} V_{22}^{-1} S\right),$$
$$Q_2(\phi) = S(\phi)' \left\{\hat{V}(S)\right\}^{-1} S(\phi).$$
For the MLE φ̂, we have Q2(φ̂) = 0 and Q1(θ | φ̂) = QPS(θ).
GLS approach with propensity score method
Back to Example 1
Response model:
$$\pi_i(\phi^*) = \frac{\exp(\phi_0^* + \phi_1^* x_i)}{1 + \exp(\phi_0^* + \phi_1^* x_i)}$$
Three direct PS estimators of (1, θx, θy):
$$(\hat\theta_{1,PS}, \hat\theta_{x,PS}, \hat\theta_{y,PS}) = n^{-1} \sum_{i=1}^{n} r_i \hat\pi_i^{-1} (1, x_i, y_i).$$
x̄n = n^{-1} Σ_{i=1}^{n} xi is available.
What is the optimal estimator of θy?
GLS approach with propensity score method
Example 1 (Cont’d)
Minimize
$$\begin{pmatrix} \bar{x}_n - \theta_x \\ \hat\theta_{1,PS}(\phi) - 1 \\ \hat\theta_{x,PS}(\phi) - \theta_x \\ \hat\theta_{y,PS}(\phi) - \theta_y \\ S(\phi) \end{pmatrix}' \hat{V}^{-1} \begin{pmatrix} \bar{x}_n - \theta_x \\ \hat\theta_{1,PS}(\phi) - 1 \\ \hat\theta_{x,PS}(\phi) - \theta_x \\ \hat\theta_{y,PS}(\phi) - \theta_y \\ S(\phi) \end{pmatrix}$$
with respect to (θx, θy, φ), where
$$S(\phi) = \sum_{i=1}^{n} \left\{\frac{r_i}{\pi_i(\phi)} - 1\right\} h_i(\phi) = 0$$
with hi(φ) = πi(φ)(1, xi)′.
GLS approach with propensity score method
Example 1 (Cont’d)
Equivalently, minimize
$$\begin{pmatrix} \hat\theta_{y,PS}(\phi) - \theta_y \\ \hat\theta_{1,PS}(\phi) - 1 \\ \hat\theta_{x,PS}(\phi) - \bar{x}_n \\ S(\phi) \end{pmatrix}' \hat{V}^{-1} \begin{pmatrix} \hat\theta_{y,PS}(\phi) - \theta_y \\ \hat\theta_{1,PS}(\phi) - 1 \\ \hat\theta_{x,PS}(\phi) - \bar{x}_n \\ S(\phi) \end{pmatrix}$$
with respect to (θy, φ), since the optimal estimator of θx is x̄n.
GLS approach with propensity score method
Example 1 (Cont’d)
The solution can be written as
$$\hat\theta_{y,opt} = \hat\theta_{y,PS} + \left(1 - \hat\theta_{1,PS}\right)\hat{B}_0 + \left(\bar{x}_n - \hat\theta_{x,PS}\right)\hat{B}_1 + \left\{0 - S(\hat\phi)\right\}' \hat{C},$$
where
$$\begin{pmatrix} \hat{B}_0 \\ \hat{B}_1 \\ \hat{C} \end{pmatrix} = \left\{\sum_{i=1}^{n} r_i b_i \begin{pmatrix} 1 \\ x_i \\ h_i \end{pmatrix} \begin{pmatrix} 1 \\ x_i \\ h_i \end{pmatrix}'\right\}^{-1} \sum_{i=1}^{n} r_i b_i \begin{pmatrix} 1 \\ x_i \\ h_i \end{pmatrix} y_i$$
and bi = π̂i^{-2}(1 − π̂i).
Note that the last term {0 − S(φ̂)}′Ĉ, which equals zero at the MLE, does not contribute to the point estimate. But it is used for variance estimation.
GLS approach with propensity score method
Example 1 (Cont’d)
That is, for variance estimation, we simply express
$$\hat\theta_{y,opt} = n^{-1} \sum_{i=1}^{n} \hat\eta_i,$$
where
$$\hat\eta_i = \hat{B}_0 + x_i \hat{B}_1 + h_i'\hat{C} + \frac{r_i}{\hat\pi_i}\left(y_i - \hat{B}_0 - x_i \hat{B}_1 - h_i'\hat{C}\right),$$
and apply the standard variance formula to the η̂i.
This idea can be extended to the survey sampling setup.
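A sketch putting the last two slides together for Example 1 (hypothetical data; for brevity the true response probabilities stand in for the fitted π̂i, so the S(φ̂)′Ĉ term is only approximately zero here):

```python
# Sketch: optimal estimator theta_y_opt and its linearized variance estimate.
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(1, 1, n)
y = 2 * (x - 1) + rng.normal(0, 1, n)
pi_hat = 1 / (1 + np.exp(-(0.5 + x)))        # stand-in for fitted propensities
r = rng.binomial(1, pi_hat)

b = (1 - pi_hat) / pi_hat**2                 # b_i = pi_i^{-2} (1 - pi_i)
h = pi_hat[:, None] * np.column_stack([np.ones(n), x])  # h_i = pi_i (1, x_i)'
Z = np.column_stack([np.ones(n), x, h])      # z_i = (1, x_i, h_i')'
W = (r * b)[:, None] * Z
coef = np.linalg.solve(W.T @ Z, W.T @ (r * y))   # (B0_hat, B1_hat, C_hat')'

theta_1 = np.mean(r / pi_hat)
theta_x = np.mean(r * x / pi_hat)
theta_y = np.mean(r * y / pi_hat)
theta_y_opt = theta_y + (1 - theta_1) * coef[0] + (x.mean() - theta_x) * coef[1]

eta = Z @ coef + r / pi_hat * (y - Z @ coef)     # linearized representation
var_hat = np.sum((eta - eta.mean()) ** 2) / (n * (n - 1))
```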
GLS approach with propensity score method
Example 1 (Cont’d)
The optimal estimator is linear in y. That is, we can write
$$\hat\theta_{y,opt} = n^{-1} \sum_{i=1}^{n} g_i \frac{r_i}{\hat\pi_i} y_i = \sum_{r_i = 1} w_i y_i,$$
where the gi satisfy
$$\sum_{i=1}^{n} g_i \frac{r_i}{\hat\pi_i}\left(1, x_i, h_i'\right) = \sum_{i=1}^{n} \left(1, x_i, h_i'\right).$$
Thus, θ̂y,opt is doubly robust under Eζ(y | x) = β0 + β1x, in the sense that it is consistent when either the response model or the superpopulation model holds.
Application to longitudinal missing
Basic Setup
Xi is always observed and remains unchanged for t = 0, 1, . . . , T.
Yit is the response for subject i at time t.
rit: the response indicator for subject i at time t.
Assuming no missingness in the baseline year, Yi0 can be absorbed into Xi.
Monotone missing pattern:
$$r_{it} = 0 \implies r_{i,t+1} = 0, \quad \forall t = 1, \ldots, T - 1.$$
Li,t = (Xi′, Yi1, . . . , Yi,t)′: measurements up to time t.
Parameter of interest: µt = E{Yit}.
Application to longitudinal missing
Missing mechanism
Missing completely at random (MCAR):
$$P(r_{it} = 1 \mid r_{i,t-1} = 1, L_{i,T}) = P(r_{it} = 1 \mid r_{i,t-1} = 1).$$
Covariate-dependent missing (CDM):
$$P(r_{it} = 1 \mid r_{i,t-1} = 1, L_{i,T}) = P(r_{it} = 1 \mid r_{i,t-1} = 1, X_i).$$
Missing at random (MAR):
$$P(r_{it} = 1 \mid r_{i,t-1} = 1, L_{i,T}) = P(r_{it} = 1 \mid r_{i,t-1} = 1, L_{i,t-1}).$$
Missing not at random (MNAR): missing at random does not hold.
Application to longitudinal missing
Motivation
Panel attrition is frequently encountered in panel surveys, while classical methods often assume covariate-dependent missingness, which can be unrealistic. We want to develop a PS method under MAR.
We want to make full use of the available information.
Application to longitudinal missing
Modeling Propensity Score
Under MAR, in the longitudinal data case, we consider the conditional probabilities
$$p_{it} := P(r_{it} = 1 \mid r_{i,t-1} = 1, L_{i,t-1}), \quad t = 1, \ldots, T.$$
Then
$$\pi_{it} = \prod_{j=1}^{t} p_{ij}.$$
πt can then be modeled by modeling pt with pt(Lt−1; φt). This kind of modeling is also adopted in Robins et al. (1995).
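A sketch of this sequential modeling step (the data layout and function names are hypothetical): each pt is fit by logistic regression among subjects still responding at t − 1, and the cumulative products give π̂it:

```python
# Sketch: sequential logistic fits p_t(L_{t-1}; phi_t) and cumulative
# response probabilities pi_t = p_1 * ... * p_t under monotone missingness.
import numpy as np
from scipy.optimize import minimize

def fit_logistic(Z, d):
    # ML fit of P(d = 1 | Z) = expit(Z @ phi) by direct likelihood maximization
    def negloglik(phi):
        eta = Z @ phi
        return np.sum(np.logaddexp(0, eta) - d * eta)
    return minimize(negloglik, np.zeros(Z.shape[1]), method="BFGS").x

def cumulative_propensities(L_list, r):
    # L_list[t-1]: n x k design matrix built from L_{t-1}; r: n x (T+1) indicators
    n, T = r.shape[0], r.shape[1] - 1
    pi = np.ones((n, T + 1))
    for t in range(1, T + 1):
        at_risk = r[:, t - 1] == 1                      # responded at t - 1
        phi_t = fit_logistic(L_list[t - 1][at_risk], r[at_risk, t])
        p_t = 1 / (1 + np.exp(-(L_list[t - 1] @ phi_t)))
        pi[:, t] = pi[:, t - 1] * p_t                   # pi_t = prod_{j<=t} p_j
    return pi
```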
Application to longitudinal missing
Score Function for Longitudinal Data
Under parametric models for the pt's, the partial likelihood for φ1, . . . , φT is
$$L(\phi_1, \ldots, \phi_T) = \prod_{i=1}^{n} \prod_{t=1}^{T} \left\{p_{it}^{r_{i,t}} (1 - p_{it})^{1 - r_{i,t}}\right\}^{r_{i,t-1}},$$
and the corresponding score function is (S1(φ1), . . . , ST(φT)), where
$$S_t(\phi_t) = \sum_{i=1}^{n} r_{i,t-1}\{r_{it} - p_{it}(\phi_t)\} q_{it}(\phi_t) = 0$$
with qit(φt) = ∂ logit{pit(φt)}/∂φt. Under a logistic regression model such that pt = 1/{1 + exp(−φt′Lt−1)}, we have qit(φt) = Lt−1.
Application to longitudinal missing
Example 2
Assume T = 3.
Parameters of interest: µx = E(X) and µt = E(Yt), t = 1, 2, 3.
PS estimator of µp at year t:
$$\hat\theta_{p,t} = n^{-1} \sum_{i=1}^{n} r_{it} \hat\pi_{it}^{-1} y_{ip}, \quad p \le t.$$
Estimator at t = 0 (baseline year): θ̂x,0 = n^{-1} Σ_{i=1}^{n} xi.
Estimators at t = 1: θ̂x,1, θ̂1,1.
Estimators at t = 2: θ̂x,2, θ̂1,2, θ̂2,2.
Estimators at t = 3: θ̂x,3, θ̂1,3, θ̂2,3, θ̂3,3.
In total, (T + 1) × p + T(T + 1)/2 estimators for p + T parameters, where p = dim(x).
Application to longitudinal missing
GMM for Longitudinal Data Case
Need to incorporate auxiliary information sequentially.
T = 1 already covered in Example 1.
For t = 2, we have auxiliary information about µx from the t = 0 sample (i.e., x̄n) and additional auxiliary information about µ1 from the t = 1 sample (i.e., θ̂1,opt).
Thus, the optimal estimator of θ2 takes the form
$$\hat\theta_{2,opt} = \hat\theta_{2,2} + \left(\bar{x}_n - \hat\theta_{x,2}\right)'\hat{B}_1 + \left(\hat\theta_{1,opt} - \hat\theta_{1,2}\right)\hat{B}_2$$
for some B̂1 and B̂2.
Application to longitudinal missing
The auxiliary information up to time t can be incorporated by the estimating function
$$\tilde{E}(\xi_{t-1}) = 0,$$
where Ẽ(∆) = n^{-1} Σ_{i=1}^{n} ∆i and
$$\xi_{t-1} := \begin{pmatrix} \dfrac{r_0}{\pi_0} u_1 X \\[6pt] \dfrac{r_1}{\pi_1} u_2 \begin{pmatrix} X \\ Y_1 \end{pmatrix} \\ \vdots \\ \dfrac{r_{t-1}}{\pi_{t-1}} u_t \begin{pmatrix} X \\ Y_1 \\ \vdots \\ Y_{t-1} \end{pmatrix} \end{pmatrix} = \begin{pmatrix} \dfrac{r_0}{\pi_0} u_1 L_0 \\[6pt] \dfrac{r_1}{\pi_1} u_2 L_1 \\ \vdots \\ \dfrac{r_{t-1}}{\pi_{t-1}} u_t L_{t-1} \end{pmatrix},$$
where ut = rt/pt − 1.
Note that E{ξt−1} = 0 because E{ut | Lt−1, rt−1 = 1} = 0.
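A small sketch of the stacked vector ξt−1 for one subject (hypothetical array layout; time-indexed r, p, π and history vectors Lj):

```python
# Sketch: build the stacked auxiliary vector xi_{t-1} for a single subject.
import numpy as np

def xi_vector(t, L, r, p, pi):
    # L[j]: 1-D array holding L_j = (x', y_1, ..., y_j)'; r, p, pi indexed by time
    parts = []
    for j in range(1, t + 1):
        u_j = r[j] / p[j] - 1                     # u_j = r_j / p_j - 1
        parts.append((r[j - 1] / pi[j - 1]) * u_j * L[j - 1])
    return np.concatenate(parts)                  # has expectation zero
```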
Application to longitudinal missing
The score function can be written as
$$\bar{S}_t := (S_1', \ldots, S_t')' = n\,\tilde{E}\{\psi_{t-1}\},$$
where
$$\psi_{t-1} = \begin{pmatrix} (r_1 - p_1 r_0)\, X \\ (r_2 - p_2 r_1) \begin{pmatrix} X \\ Y_1 \end{pmatrix} \\ \vdots \\ (r_t - p_t r_{t-1}) \begin{pmatrix} X \\ Y_1 \\ \vdots \\ Y_{t-1} \end{pmatrix} \end{pmatrix} = \begin{pmatrix} r_0 u_1 p_1 L_0 \\ r_1 u_2 p_2 L_1 \\ \vdots \\ r_{t-1} u_t p_t L_{t-1} \end{pmatrix}.$$
Application to longitudinal missing
GMM for Longitudinal Data Case
This motivates minimizing the following quadratic form:
$$Q_t = \begin{pmatrix} \tilde{E}\{r_t Y_t/\hat\pi_t\} - \mu_t \\ \tilde{E}\{\hat\xi_{t-1}\} - E\{\xi_{t-1}\} \\ \tilde{E}\{\hat\psi_{t-1}\} - E\{\psi_{t-1}\} \end{pmatrix}' \left[\hat{V}\begin{pmatrix} \tilde{E}\{r_t Y_t/\pi_t\} \\ \tilde{E}\{\xi_{t-1}\} \\ \tilde{E}\{\psi_{t-1}\} \end{pmatrix}\right]^{-1} \begin{pmatrix} \tilde{E}\{r_t Y_t/\hat\pi_t\} - \mu_t \\ \tilde{E}\{\hat\xi_{t-1}\} - E\{\xi_{t-1}\} \\ \tilde{E}\{\hat\psi_{t-1}\} - E\{\psi_{t-1}\} \end{pmatrix},$$
where E{ξt−1} = E{ψt−1} = 0.
Application to longitudinal missing
Optimal PS Estimator
Theorem (1)
Assume the logistic-type response model, so that the score function for (φ1, . . . , φT) is given by Ẽ(ψT−1) = 0. For each year t, the optimal estimator of µt = E{Yt} among the class
$$\hat\mu_{t,B_t,C_t} = \tilde{E}\{r_t Y_t/\hat\pi_t\} - B_t' \tilde{E}\{\hat\xi_{t-1}\} - C_t' \tilde{E}\{\hat\psi_{t-1}\}$$
is given by µ̂t,B̂t,Ĉt, where B̂t = (B̂1t′, . . . , B̂tt′)′ and Ĉt = (Ĉ1t′, . . . , Ĉtt′)′ with
$$(\hat{B}_{j,t}', \hat{C}_{j,t}')' = \tilde{E}^{-1}\left\{ r_{j-1}\left(\frac{1}{\hat{p}_j} - 1\right) \begin{pmatrix} \hat\pi_{j-1}^{-1} L_{j-1} \\ \hat{p}_j L_{j-1} \end{pmatrix} \begin{pmatrix} \hat\pi_{j-1}^{-1} L_{j-1} \\ \hat{p}_j L_{j-1} \end{pmatrix}' \right\} \tilde{E}\left\{ \left(\frac{1}{\hat{p}_j} - 1\right) \begin{pmatrix} \hat\pi_{j-1}^{-1} L_{j-1} \\ \hat{p}_j L_{j-1} \end{pmatrix} \frac{r_t Y_t}{\hat\pi_t} \right\}.$$
Application to longitudinal missing
Variance Estimation
Theorem (2)
The µ̂t,opt estimator is asymptotically equivalent to Ẽ{ηt}, where
$$\eta_t = \frac{r_t Y_t}{\pi_t} - \sum_{j=1}^{t} D_{j,t}'\, r_{j-1} u_j \begin{pmatrix} \pi_{j-1}^{-1} L_{j-1} \\ p_j L_{j-1} \end{pmatrix},$$
with
$$D_{j,t} = E^{-1}\left\{ r_{j-1} u_j^2 \begin{pmatrix} \pi_{j-1}^{-1} L_{j-1} \\ p_j L_{j-1} \end{pmatrix} \begin{pmatrix} \pi_{j-1}^{-1} L_{j-1} \\ p_j L_{j-1} \end{pmatrix}' \right\} E\left\{ u_j \begin{pmatrix} \pi_{j-1}^{-1} L_{j-1} \\ p_j L_{j-1} \end{pmatrix} \frac{r_t Y_t}{\pi_t} \right\}.$$
Thus the variance of µ̂t,opt can be consistently estimated by
$$n^{-1}(n-1)^{-1} \sum_{i=1}^{n} \left\{\hat\eta_{it} - \tilde{E}(\hat\eta_t)\right\}^2.$$
Application to longitudinal missing
Properties of our Optimal Estimator
µ̂t,opt is asymptotically normal with mean µt and variance equal to the lower bound of the asymptotic variance over the family
$$\hat\mu_{t,PSA} - B'\tilde{E}\{\hat\xi_{t-1}\}.$$
Computational advantage due to the fact that ri−1 ui Li and rj−1 uj Lj are “orthogonal” (uncorrelated) for i ≠ j.
Variance estimation is also very convenient, as implied by Theorem 2.
Numerical Study
Robins et al. (1995) Estimator
Robins et al. (1995) proposed a class of estimators of µt for longitudinal data with monotone missingness, incorporating a regression model E(Yit | Xi) = m(Xi; βt). When m(Xi; βt) = Xi′βt, the weighted estimating equation method in Robins et al. (1995) gives an estimator µ̂t (out of that family) that solves
$$\tilde{E}\left[\frac{r_t}{\hat\pi_t}\left\{Y_t - \mu_t - \beta_{1,t}'\left(X - \tilde{E}[X]\right)\right\}\begin{pmatrix} 1 \\ X - \tilde{E}(X) \end{pmatrix}\right] = 0,$$
which gives
$$\hat\mu_t = \frac{\tilde{E}\{r_t Y_t/\hat\pi_t\}}{\tilde{E}\{r_t/\hat\pi_t\}} - \hat\beta_{1,t}'\left\{\frac{\tilde{E}\{r_t X/\hat\pi_t\}}{\tilde{E}\{r_t/\hat\pi_t\}} - \bar{X}_n\right\}.$$
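A sketch of this RRZ-type estimator for a single wave (hypothetical names; µ̂t falls out as the intercept of an inverse-probability-weighted least-squares fit on the centered covariate):

```python
# Sketch: RRZ-type estimator for one wave t with a linear model in x.
import numpy as np

def rrz_estimate(x, y, r, pi_hat):
    # Solves E_tilde[(r/pi){y - mu - b1 (x - xbar)} (1, x - xbar)'] = 0,
    # i.e., weighted least squares of y on (1, x - xbar) with weights r/pi.
    w = r / pi_hat
    xc = x - x.mean()                         # center at the full-sample mean
    Z = np.column_stack([np.ones_like(x), xc])
    WZ = w[:, None] * Z
    mu_hat, _beta1 = np.linalg.solve(WZ.T @ Z, WZ.T @ np.where(r == 1, y, 0.0))
    return mu_hat                             # the intercept is mu_t_hat
```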
Numerical Study
Estimators under Study
The estimator using the full sample, i.e., Ẽ{Yt}, for reference.
The naive estimator, i.e., the simple average of the complete cases:
$$\hat\mu_{t,naive} = \tilde{E}\{r_t Y_t\}/\tilde{E}\{r_t\}.$$
The direct propensity-score-adjusted estimator:
$$\hat\mu_{t,PSA} = \tilde{E}\{r_t Y_t/\hat\pi_t\}.$$
The estimator using weighted estimating equations of Robins et al. (1995), denoted by µ̂t,RRZ.
Our estimator given in Theorem 1, denoted by µ̂t,opt.
Numerical Study
Simulation Study I
$$Y_0 = 2(X - 1) + e_0, \qquad Y_t = 2(X - 1) + 2Y_{t-1} + e_t, \quad t \ge 1,$$
where X ∼ N(1, 1), et = 0.5et−1 + vt, e0 ∼ N(0, 1), and vt ∼ N(0, 1) independently across t. The missing indicator rt follows
$$P(r_t = 1 \mid X, Y_{t-1}, r_{t-1} = 1) = \mathrm{expit}\{1 + 2X - Y_{t-1}/(t + 1)\},$$
and there is no missingness in the baseline year.
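A sketch reproducing one Monte Carlo sample from this data-generating process (function and variable names are our own):

```python
# Sketch: generate one sample from the Simulation Study I design.
import numpy as np

def expit(z):
    return 1 / (1 + np.exp(-z))

def simulate_study1(n=300, T=3, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(1, 1, n)
    e = rng.normal(0, 1, n)                       # e_0 ~ N(0, 1)
    y = np.empty((n, T + 1))
    y[:, 0] = 2 * (x - 1) + e
    r = np.ones((n, T + 1), dtype=int)            # no missingness at baseline
    for t in range(1, T + 1):
        e = 0.5 * e + rng.normal(0, 1, n)         # e_t = 0.5 e_{t-1} + v_t
        y[:, t] = 2 * (x - 1) + 2 * y[:, t - 1] + e
        p_t = expit(1 + 2 * x - y[:, t - 1] / (t + 1))
        r[:, t] = r[:, t - 1] * rng.binomial(1, p_t)   # monotone dropout
    return x, y, r
```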
Numerical Study
Simulation Study I
We used B = 10000 Monte Carlo samples of size n = 300 for this simulation. The response rates for t = 1, 2, 3 are 0.93, 0.87, 0.75, respectively.
We also computed the variance estimator of the optimal estimator using the formula in Theorem 2. The relative biases of the variance estimator for t = 1, 2, 3 are −0.0149, −0.0159, −0.0172, respectively.
Numerical Study
Results from Simulation Study 1
Table: Comparison of different methods when n = 300, T = 3, with Monte Carlo sample size 10000, for Simulation Study I, using the full data as baseline. Entries are 100 × RMSE/RMSE(Full).

Method   t = 1   t = 2   t = 3
Full      100     100     100
Naive     134     118     260
PS        101     102     157
RRZ       100     101     105
Opt       100     100     101
Numerical Study
Simulation Study II
$$Y_0 = 2(X - 1)^{1/3} + e_0, \qquad Y_t = 2(X - 1)^{1/3} + 2Y_{t-1} + e_t, \quad t \ge 1,$$
where X ∼ N(1, 1), et = 0.5et−1 + vt, e0 ∼ N(0, 1), and vt ∼ N(0, 1) independently across t. The missing indicator rt follows
$$P(r_t = 1 \mid X, Y_{t-1}, r_{t-1} = 1) = \mathrm{expit}\{1 + 2X - Y_{t-1}/(t + 1)\}.$$
Numerical Study
Simulation Study II
B = 10000, n = 300; the response rates for t = 1, 2, 3 are 0.92, 0.85, 0.74, respectively.
Using the same variance estimation formula, the relative biases of the variance estimator of the optimal estimator for t = 1, 2, 3 are −0.0029, 0.0058, 0.0066, respectively.
Numerical Study
Results from Simulation Study II
Entries are 100 × RMSE/RMSE(Full).

Method   t = 1   t = 2   t = 3
Full      100     100     100
Naive     133     115     229
PS        102     106     144
RRZ       102     103     118
Opt       101     103     108
Concluding Remarks
We adopted the GLS (GMM) technique and constructed an optimal estimator within a class of unbiased estimators.
Under a monotone missing pattern, applying the GLS (GMM) method to estimate µX, µ1, . . . , µT simultaneously is exactly the same as what we proposed (estimating µX, µ1, . . . , µT one by one).
This method is directly applicable to the case where the baseline-year sample is selected by a complex probability sampling design.
Extensions to non-monotone missing patterns and time-dependent covariates are important topics for further investigation.
Thank You!
Questions? jkim@iastate.edu