Bayesian Methods to Handle Missing Data in High-Dimensional Data Sets using Factor Analysis Strategies
Thomas R. Belin
UCLA Department of Biostatistics
Juwon Song
Univ. of Texas-M.D. Anderson Cancer Center
Jianming Wang
Medtronic Inc.
Multiple imputation
Rationale: Useful framework for representing uncertainty due to missingness
Requires imputations to be "proper"
Advice: include available information to the fullest extent possible (Rubin 1996 JASA)
- avoid bias in the imputation
- make the assumption of "ignorable" missing data more plausible
Alternative modeling strategies
Address inestimable or unstable parameters by:
• deleting variables
• using a proper prior distribution
  - ridge prior for multivariate normal (MVN) model (Schafer 1997 text)
• restrictions on the covariance matrix (common factors in MVN model)
Introduction
General Problem: Incomplete high-dimensional longitudinal data
• A large number of variables
• A modest number of cases
• With missing values
• Initially consider cross-sectional data, then
consider longitudinal structure
Overparameterization concerns
With modest sample size and large number of
variables, even a simple model can be
overparameterized
Example: 50 variables ⇒ 50×49/2 = 1225 correlation parameters in a multivariate normal model with a general covariance matrix
Analysis often proceeds based on arbitrary choice of
variables to include or exclude
Factor model for incomplete
multivariate normal data
Idea: ignore factors corresponding to small eigenvalues
Notation:
Y: n×p data matrix with missing items
Z: n×k unobserved factor-score matrix, where k ≤ p
(Yi, Zi): iid (p+k)-variate normal distribution
Zi ~ N(0, Ik), i.e., assuming orthogonal factors
Factor model for incomplete
multivariate normal data (cont’d)
Model:
Yi = α + Zi β + εi ,  for i = 1, 2, …, n,
where α is the 1×p mean vector,
β is the k×p factor-loading matrix,
and εi ~ N(0, τ²), where τ² = diag(τ1², τ2², …, τp²)
Model fitting
Gibbs sampling : based on assumed factor
structure (i.e., k known), draw:
(a) mean vector
(b) factor loadings
(c) uniqueness
(d) factor scores
(e) missing items
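As a rough illustration of steps (a)-(e), here is a minimal one-chain Gibbs sketch for the cross-sectional factor model above, written in Python/NumPy; the flat priors on the means and loadings, the inverse-gamma prior on each uniqueness, the starting values, and the function name are illustrative assumptions, not the authors' implementation (which, as noted below, also uses weakly informative priors, transformations, and multiple chains).

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_factor_imputation(Y, k, n_iter=500, a0=1.0, b0=1.0):
    """Minimal Gibbs sketch for Y_i = alpha + Z_i beta + eps_i with
    Z_i ~ N(0, I_k), eps_i ~ N(0, diag(tau2)), and NaN-coded missing items."""
    Y = np.asarray(Y, dtype=float)
    n, p = Y.shape
    miss = np.isnan(Y)

    # starting values: fill missing entries with column means
    Yc = np.where(miss, np.nanmean(Y, axis=0), Y)
    alpha = Yc.mean(axis=0)
    beta = rng.normal(scale=0.1, size=(k, p))
    tau2 = np.ones(p)

    for _ in range(n_iter):
        # (d) factor scores: Z_i | rest ~ N(V beta D^{-1} (y_i - alpha), V)
        D_inv = 1.0 / tau2
        V = np.linalg.inv(np.eye(k) + (beta * D_inv) @ beta.T)
        M = (Yc - alpha) @ (beta * D_inv).T @ V
        Z = M + rng.multivariate_normal(np.zeros(k), V, size=n)

        # (a)-(c) mean, loadings, uniquenesses via one regression per variable
        X = np.column_stack([np.ones(n), Z])
        XtX_inv = np.linalg.inv(X.T @ X)
        for j in range(p):
            resid = Yc[:, j] - X @ np.concatenate(([alpha[j]], beta[:, j]))
            # (c) uniqueness tau_j^2 | rest ~ Inverse-Gamma(a0 + n/2, b0 + RSS/2)
            tau2[j] = 1.0 / rng.gamma(a0 + n / 2.0, 1.0 / (b0 + 0.5 * resid @ resid))
            # (a)-(b) intercept and loadings | tau_j^2 ~ Normal around least squares
            coef_hat = XtX_inv @ X.T @ Yc[:, j]
            coef = rng.multivariate_normal(coef_hat, tau2[j] * XtX_inv)
            alpha[j], beta[:, j] = coef[0], coef[1:]

        # (e) missing items: draw from their conditional normal given Z and parameters
        draws = alpha + Z @ beta + rng.normal(scale=np.sqrt(tau2), size=(n, p))
        Yc[miss] = draws[miss]

    return Yc, alpha, beta, tau2
```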
Details of model fitting
• Can use weakly informative prior for the uniqueness terms τj² to avoid degenerate variance estimates
• Can use either noninformative or weakly
informative priors for means and factor loadings
• Used transformations to speed convergence
• Multiple modes possible (Rubin and Thayer 1982,
1983 Psychometrika), so simulate multiple chains
• Monitor convergence (Gelman and Rubin 1992
Statistical Science)
Simulation factor structure
Example: Each item loads on one factor; each row of β has loadings of 0.8 on its own block of consecutive items and 0 elsewhere (illustrated here for 5 factors):

β = [ 0.8 … 0.8   0 … 0     0 … 0     0 … 0     0 … 0
      0 … 0      0.8 … 0.8   0 … 0     0 … 0     0 … 0
      0 … 0       0 … 0     0.8 … 0.8   0 … 0     0 … 0
      0 … 0       0 … 0      0 … 0     0.8 … 0.8   0 … 0
      0 … 0       0 … 0      0 … 0      0 … 0     0.8 … 0.8 ]
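For concreteness, a small sketch of how this block loading structure could be generated (the function name and the default p = 100, k = 5, 0.8 values mirror the simulation setup described here; the sketch assumes p is a multiple of k):

```python
import numpy as np

def block_loadings(p=100, k=5, loading=0.8):
    """Each of the p items loads `loading` on exactly one of k factors,
    in consecutive blocks of p // k items (0 elsewhere)."""
    beta = np.zeros((k, p))
    block = p // k
    for f in range(k):
        beta[f, f * block:(f + 1) * block] = loading
    return beta

beta = block_loadings()   # 5 x 100 loading matrix with 0.8 blocks down the "diagonal"
```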
Simulation evaluations
Evaluate bias and coverage when the model is correct, overparameterized, or underparameterized:

n     p     # true factors   # assumed factors
100   100   5                5, 10
100   100   10               5, 10
500   100   5                5, 10
500   100   10               5, 10
Simulation details
Also considered hypothetical scenario where items load on
two factors
200 replications for each combination of simulation
conditions
- simulation standard error of 1.5% for 95% coverage statistics
Percentage of missing data ranged from
5-25% for each variable
Three missing-data mechanisms (MAR where available-case
analysis might do well, MAR where available-case
analysis not expected to do well, and non-ignorable where
method appropriate under MAR might do well)
Simulation results: Factor model, cross-sectional mean
Factor model performs well when the model is correct or overparameterized (coverages range from 93% - 97%)
Factor model coverage is below the nominal level when the model is underparameterized (coverages range from 86% - 93%)
Simulation results: Other methods, cross-sectional mean
MVN frequently fails to converge with n=100 without ridge prior
MVN with ridge prior has good coverage (94% - 98%); interval widths typically wider than for factor model (2-16% wider on average, depending on details such as the missing-data mechanism)
Available-case analysis performs poorly (coverages ranging from 37% - 88%)
Simulation study based on observed covariance matrix
Generate multivariate normal data (200 replicates, SE = 1.5% for 95% coverage statistics) with mean and covariance fixed at published values from Harman (1967) study of 24 psychological tests on 145 school children
Number of factors not known in advance
Consider 4, 5, 7 factors following earlier analyses
Also consider 11 factors based on cumulative variance explained exceeding 80% and the desire not to underparameterize the model
Simulation results: psychological testing example
Coverage rates:
  4-factor model:          93% - 95%
  5-factor model:          93% - 96%
  7-factor model:          93% - 95%
  11-factor model:         93% - 95%
  MVN model:               94% - 95%
  Available-case analysis: 12% - 84%
Interval widths for MVN model within 5% of factor-model widths, usually within 1%
Application: Emergency room intervention study
Specialized emergency room intervention vs. standard emergency room treatment for 140 female adolescents after a suicide attempt
Twenty-seven outcomes measured at baseline, 3, 6, 12, 18 months, plus many baseline characteristics
Most variables 5-25% missing, some 50-60% missing
Main interests:
- effectiveness of the emergency room intervention
- whether baseline psychological impairment is related to outcomes over time
Factor model for emergency room intervention study
135 variables, including 27 longitudinal outcomes
Longitudinal outcomes: measures at different time points treated as separate variables
Assume 30 factors:
- explained about 80% of the variation
- simulation analysis: an insufficient number of factors can cause serious bias
- with 27 longitudinal outcomes, general enough to allow each longitudinal variable to represent a separate factor
Emergency-room intervention study:
evaluations, results
After imputation, related longitudinal outcomes to
baseline predictors using SAS PROC MIXED
Compared imputation under factor model with
growth-curve imputation strategy developed by
Schafer (1997 PAN program)
No substantial differences seen in significance tests
for intervention effect
Some sensitivity seen in significance of impairment
effect, intervention and impairment interactions
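The slides do not show the pooling step, but the post-imputation analyses above are presumably combined across imputed data sets with the standard multiple-imputation (Rubin's) rules; a minimal sketch for a single scalar coefficient (function name and example values are illustrative):

```python
import numpy as np

def combine_rubin(estimates, variances):
    """Combine point estimates and squared standard errors from m completed-data
    analyses (e.g., PROC MIXED fits of each imputed data set) with Rubin's rules."""
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    m = len(est)
    qbar = est.mean()                                     # combined point estimate
    ubar = var.mean()                                     # within-imputation variance
    b = est.var(ddof=1)                                   # between-imputation variance
    total = ubar + (1 + 1 / m) * b                        # total variance
    df = (m - 1) * (1 + ubar / ((1 + 1 / m) * b)) ** 2    # Rubin's degrees of freedom
    return qbar, np.sqrt(total), df

# hypothetical usage with 5 imputed-data estimates and their variances
q, se, df = combine_rubin([6.1, 6.3, 5.9, 6.4, 6.0], [0.80, 0.90, 0.85, 0.80, 0.95])
```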
Imputation for longitudinal data
PAN (Schafer, 1997): Using a Multivariate Linear Mixed-effect Model (MLMM)
• Appropriate for multivariate longitudinal data or clustered data
• Imputation by the multivariate linear mixed-effect model
  Yi = Xi α + Zi γi + δi ,
  with dimensions (t×m) = (t×p)(p×m) + (t×q)(q×m) + (t×m)
• Assume (γi)^V ~ N(0, Φ) and (δi)^V ~ N(0, Σδδ)
Challenge with MI using PAN
MI under PAN can be over-parameterized easily
• Example: 15 variables collected longitudinally five times, modeled with 2 random effects per variable in PAN
• # of parameters in Φ (random effects): 30×31/2 = 465
• # of parameters in Σδδ (error terms): 15×16/2 = 120
• Total # of parameters: 585 (see the parameter-count sketch after this slide)
• Parameter reduction seems sensible when the number of cases is modest, e.g., 300
Potential solution to over-parameterization
If those 15 variables feature sizable correlations, they could be viewed as measuring 3-5 underlying factors.
Strategy:
• Reduce the dimension of the problem by factor analysis
• Model the estimated factor scores by an MLMM
• Factor structure reflects cross-sectional correlations among variables measured at the same time; the MLMM reflects longitudinal correlations
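A quick way to see the contrast, assuming 2 random effects per variable as in the example above; the function name and the k = 5 comparison are illustrative:

```python
def pan_params(n_vars, n_random_effects):
    """Count the covariance parameters in a PAN-style multivariate linear
    mixed-effect model: Phi for the stacked random effects plus the error
    covariance across variables."""
    q = n_vars * n_random_effects          # stacked random-effect dimension
    phi = q * (q + 1) // 2                 # covariance of random effects
    sigma = n_vars * (n_vars + 1) // 2     # error covariance across variables
    return phi + sigma

print(pan_params(15, 2))   # 465 + 120 = 585 parameters for 15 variables
print(pan_params(5, 2))    # same structure on 5 factor scores: 55 + 15 = 70
```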
Ordinary factor analysis model
Factor analysis model:
  Yi = µ + Λ fi + εi ,   i = 1, 2, …, n,
  where fi ~ N(0, Σff) and εi ~ N(0, Ψ)
Because
  ΣYY = Λ Σff Λ^T + Ψ = Λ Σff^{1/2} Σff^{1/2} Λ^T + Ψ = Λ* (Λ*)^T + Ψ ,
we often assume fi ~ N(0, I)
Also assume that Λ is of full rank (Seber, 1977)
Ordinary factor analysis model (continued)
Identifiability
• Solution invariant under orthogonal transformation:
  Yi = µ + Λ T T^{-1} fi + εi = µ + Λ* fi* + εi
• Common restriction: Λ^T Ψ^{-1} Λ = Γ diagonal, which is equivalent to k(k−1)/2 restrictions
• Identifiable if (1/2)[(p − k)² − (p + k)] ≥ 0
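As a quick check of this condition (using, e.g., the 24-test, 4-factor setting that appears later in the psychological-testing example):

```python
p, k = 24, 4
print(((p - k) ** 2 - (p + k)) / 2)   # 186.0 >= 0, so the condition holds
```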
Generalizing factor analysis model
• Standardization of factor scores presents a challenge for generalizing the factor analysis model to a longitudinal setting
• Idea: Use an "error-in-variables" representation of the factor model
Error-in-variables factor model
• Error-in-variables model (Fuller, 1987):
  Yi = (β0; 0) + (β1; I) fi + εi ,
  where (a; b) denotes a stacked above b
• Interpretation: If we partition Yi into (Yi1; Yi2) and let εi = (ei; ui), then
  Yi1 = β0 + β1 fi + ei ,   Yi2 = fi + ui ,
  so that
  Yi = (Yi1; Yi2) = (β0; 0) + (β1; I) fi + (ei; ui)
Error-in-variables factor model (continued)
• Covariance matrix of Y is ΣYY = (β1; I) Σff (β1; I)^T + Ψ
• The total # of distinct parameters is
  (p − k)k + p + k(k+1)/2 = p + pk − k(k−1)/2 ,
  which is exactly the same as the ordinary model with the additional k(k−1)/2 restrictions used to avoid indeterminacy
• No additional restrictions are necessary
A Longitudinal Factor Analysis model
• Extending the error-in-variables model to the longitudinal factor analysis (LFA) setting: stack the occasion-specific measurements Yit = (Yit1; Yit2) over occasions 1, …, t and write
  Yi = (Yi1; Yi2; …; Yit) = Β0 + Β1 fi + εi ,
where
  Β0 = ( (β01; 0); (β02; 0); …; (β0t; 0) ) ,
  Β1 = blockdiag( (β11; I), (β12; I), …, (β1t; I) ) ,
  fi = (fi1; fi2; …; fit) ,
  εi = ( (ei1; ui1); (ei2; ui2); …; (eit; uit) )
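A small sketch of how Β0 and Β1 could be assembled from occasion-specific intercepts β0t and loadings β1t; the function name, the assumed (p1 × k) shape of each β1t with an appended identity block, and the example values are illustrative:

```python
import numpy as np

def lfa_design(beta0_list, beta1_list):
    """Stack the occasion-specific intercepts (beta0_t; 0) into B0 and place the
    blocks (beta1_t; I_k) on the diagonal of B1, as in the LFA model above."""
    k = beta1_list[0].shape[1]
    blocks = [np.vstack([b1, np.eye(k)]) for b1 in beta1_list]
    B0 = np.concatenate([np.concatenate([b0, np.zeros(k)]) for b0 in beta0_list])
    B1 = np.zeros((sum(b.shape[0] for b in blocks), k * len(blocks)))
    row = 0
    for t, b in enumerate(blocks):
        B1[row:row + b.shape[0], t * k:(t + 1) * k] = b
        row += b.shape[0]
    return B0, B1

# e.g., 3 occasions, 10 non-anchor items loading on k = 2 factors per occasion
B0, B1 = lfa_design([np.zeros(10)] * 3, [np.full((10, 2), 0.5)] * 3)
```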
Aspects of LFA model
• The # of factors is the same on each occasion,
but the factor loadings and factor scores may
change
• No constraints on covariance structure of the fi
• The unique-component vectors are uncorrelated
with the factors both within and across
occasions.
• The unique-component errors are uncorrelated
within occasion and across occasions
Advantages of LFA model
• Identifiability problem can easily be handled
• Preserves the mean structure and covariance structure, making it possible to study elevation change and pattern change simultaneously
• Can incorporate linear mixed-effect model
structure for longitudinal data
• Can incorporate baseline covariates
Implementation
• Use data augmentation (I-step: linear regressions; P-step: analog to ML for multivariate normal with complete data)
• Assume conjugate forms (normal, inverse Wishart) for prior distributions of parameters; assume relatively diffuse priors that still produce proper posteriors
• Conditional distributions all in closed form
Evaluations
Simulation design
We generated 100 data sets, with Yi drawn from a multivariate normal distribution with mean
  Β0 + Β1 (Xi α)^RV
and variance
  Β1 [ (Zi ⊗ Ik) Φ (Zi^T ⊗ Ik) + It ⊗ Σδδ ] Β1^T + Σεε ,
for i = 1, 2, …, 350, with p = 15 measurements and k = 5 factors at t = 5 time points, so Yi has dimension (15×5)×1 = 75×1
• X incorporates an intercept, 3 continuous variables, 1 binary variable, and time
• Z allows for random intercepts and slopes
Simulation design (continued)
• (Β)rc = 1/6 + (r+c)/500 (to avoid a singular factor-loading matrix)
• α (6×5) reflects small to moderate covariate effects for predicting factor scores and a linear trend in factor scores:
  αrc = 0.3 + Bernoulli(0.5)/2 for r ≠ 6,  αrc = c/0.5 for r = 6
• Φrc = 4 if r = c, 1 if r ≠ c;  (Σδδ)rc = 20 if r = c, 5 if r ≠ c;  Σεε = diag(σεε²) with σεε² = 5
• Σεε, Σδδ, and Φ incorporate relative variances and covariances describing unique variance, common variance among factor scores, and variance of the random effects
• Missingness introduced using a MAR mechanism (a series of binary draws with probabilities depending on observed values)
• Simulation SE of 95% coverage statistics with 100 replicates = 0.0218; margin of error = 0.0427
Simulation when number of factors is correctly specified
The mean of Y49, which (averaged across simulation replicates) was missing on 27% of individuals

Analysis Method    M.C. Average   M.C. S.E.   Average 95% Interval Length   Actual 95% Coverage
True value         17.074         --          --                            --
All data           17.078         0.426       1.677                         98%
Available data     18.854         0.530       2.091                         7%
5 imputations      17.072         0.567       2.231                         96%
Simulation when number of factors is correctly specified
The mean of Y66, a variable which is missing 100% of the time (i.e., a variable not measured at a given time point)

Analysis Method    M.C. Average   M.C. S.E.   Average 95% Interval Length   Actual 95% Coverage
True value         20.8195        --          --                            --
All data           20.7955        0.5128      2.0170                        94%
Available data     --             --          --                            --
5 imputations      20.7678        0.6503      2.5554                        95%
Simulation when number of factors is incorrectly specified
The mean of Y49 (average missingness rate = 27%)

Analysis Method      M.C. Average   M.C. S.E.   Average 95% Interval Length   Actual 95% Coverage
True value           17.074         --          --                            --
All data             17.078         0.4263      1.677                         98%
Available data       18.854         0.5304      2.091                         7%
F=5 (true number)    17.072         0.5672      2.231                         96%
F=6                  17.055         0.4873      1.9153                        94%
F=4                  17.612         0.5962      2.3429                        89%
F=3                  17.663         0.6213      2.4410                        86%
Simulation when number of factors is incorrectly specified
The mean of Y66, which has a 100% missingness rate

Analysis Method      M.C. Average   M.C. S.E.   Average 95% Interval Length   Actual 95% Coverage
True value           20.8195        --          --                            --
All data             20.7955        0.5128      2.0170                        94%
Available data       --             --          --                            --
F=5 (true number)    20.7678        0.6503      2.5554                        95%
F=6                  20.9565        0.7161      2.8142                        94%
F=4                  20.6473        1.1139      4.3780                        91%
F=3                  20.4091        1.2484      4.9060                        83%
Example using LFA: oral surgery study
Randomized study of two oral surgery treatments
(MMF, RIF) with longitudinal follow-up of
quality-of-life (GOHAI) and psychological
outcomes
Hierarchical growth-curve model using WINBUGS:
  Yij = β0i + β1i (tij − t̄) + εij ,
  β0i = β00 + β01 Si + δ0i ,
  β1i = β10 + β11 Si + δ1i ,
  δ0i ~ N(0, σ0²),  δ1i ~ N(0, σ1²),
  Si = 1 if RIF, Si = 0 if MMF
Summary and future research
Summary
• Factor-analysis methods provide a flexible framework for addressing incomplete high-dimensional longitudinal data
Ongoing and future research
• Rounding continuous imputations to binary
• Determining the number of factors
• Robustness of methods to the normality assumption
• Can the parameters in LFA be estimated by EM or related methods?
• Comparisons with IVEware and related methods, hot-deck approaches
Findings of interest
• Difference in average intercept, average slope between RIF and MMF (β01, β11) significant under MI (NORM or LFA) analysis, not under available-case analysis
• Different interpretations emerge from MI analysis
(RIF starts lower, ends with comparable values)
• Compared to MI using NORM, MI using LFA has
17%-34% narrower interval estimates for
parameters
Goal
To develop general-purpose
multiple imputation procedures
appropriate for high-dimensional
data sets
• Cross-sectional
• Longitudinal
Simulation results: simple regression coefficient
Factor model: coverages 93% - 98% when model correct or overparameterized, 19% - 80% when model underparameterized
MVN model: frequently fails to converge with non-informative prior; coverages 91% - 99% with ridge prior
Available-case analysis: coverages range from 44% - 100%
Simulation missing data mechanisms
M1 (MAR): First 99 variables MCAR; missingness on the last variable follows a logistic regression on the other 99 variables, with normally distributed coefficients (a generation sketch follows this list)
M2 (MAR): First 99 variables MCAR; missingness on the last variable follows a logistic regression on the other variables included in the same factor, with half-normal distributed coefficients
M3 (nonignorable but "close" to MAR): Missingness on each variable depends on two other variables in an overlapping manner
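A rough sketch of how a mechanism like M1 could be generated; the base_rate, coefficient scale, centering step, and function name are illustrative choices rather than the exact simulation settings:

```python
import numpy as np

rng = np.random.default_rng(1)

def impose_m1_missingness(Y, base_rate=0.15):
    """Sketch of mechanism M1: make the first p-1 variables MCAR, then delete the
    last variable via a logistic model on the other columns (coefficients drawn
    from a normal distribution, as described above)."""
    Y = np.array(Y, dtype=float)
    n, p = Y.shape
    # MCAR holes in the first p-1 variables
    mcar = rng.random((n, p - 1)) < base_rate
    Y[:, :p - 1][mcar] = np.nan
    # MAR missingness on the last variable, driven by the other columns
    coefs = rng.normal(scale=0.5, size=p - 1)
    logits = np.nan_to_num(Y[:, :p - 1], nan=0.0) @ coefs
    logits -= logits.mean()                       # center to keep rates moderate
    prob_missing = 1.0 / (1.0 + np.exp(-logits))
    Y[rng.random(n) < prob_missing, p - 1] = np.nan
    return Y
```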
Equivalence of two factor analysis models
One can write
  µ + Λ fi = (µ1; µ2) + (Λ1; Λ2) Λ2^{-1} (Λ2 fi)
           = (µ1; µ2) + (Λ1 Λ2^{-1}; Ik) (Λ2 fi)
           = (µ1; µ2) + (Λ1*; Ik) fi*
           = (µ1 − Λ1* µ2; 0) + (Λ1*; Ik) (fi* + µ2)
           = (µ1*; 0) + (Λ1*; Ik) fi**
Incorporating multivariate linear mixed-effect model for factor scores
• Rearrange fi in matrix form:
  f̃i = (fi1^T; fi2^T; …; fit^T)   (t×k)
• Then f̃i can be modeled by
  f̃i = Xi α + Zi γi + δi ,
  with dimensions (t×k) = (t×m)(m×k) + (t×q)(q×k) + (t×k)
• We assume that the t rows of δi are iid N(0, Σδδ) and (γi)^V ~ N(0, Φ). Thus
  (fi1; fi2; …; fit) ~ N( (Xi α)^V , (Zi ⊗ Ik) Φ (Zi^T ⊗ Ik) + It ⊗ Σδδ )
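A small dimension check of this factor-score distribution; the specific Zi, Φ, and Σδδ values below are placeholders, and the sketch assumes the factor scores are stacked occasion by occasion, matching the Kronecker ordering above:

```python
import numpy as np

# illustrative dimensions: t = 5 occasions, k = 5 factors,
# q = 2 random effects (intercept and slope)
t, k, q = 5, 5, 2
Zi = np.column_stack([np.ones(t), np.arange(t)])   # t x q random-effect design
Phi = np.eye(q * k)                                 # cov of vec(gamma_i), assumed
Sigma_dd = np.eye(k)                                # cov of each row of delta_i, assumed
Ik, It = np.eye(k), np.eye(t)

# Cov of (f_i1, ..., f_it): (Z_i ⊗ I_k) Phi (Z_i' ⊗ I_k) + I_t ⊗ Sigma_dd
cov = np.kron(Zi, Ik) @ Phi @ np.kron(Zi.T, Ik) + np.kron(It, Sigma_dd)
print(cov.shape)                                    # (25, 25) = (t*k, t*k)
```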
Modified LFA with covariates
• Combining the LFA with the linear mixed-effect model for the factor scores, we obtain
  Yi = Β0 + Β1 [ (Xi α + Zi γi)^RV + δi^RV ] + εi ,
  with Β0, Β1, and εi defined as in the LFA model
Linear growth curve model estimates: Available-case analysis, MI using NORM, MI using LFA
Estimate   Available Case Analysis     Multiple Imputation Using NORM   Multiple Imputation Using LFA
           Posterior Mean (95% CI)     Posterior Mean (95% CI)          Posterior Mean (95% CI)
Beta00     28.55 (26.24, 30.92)        29.30 (26.35, 32.33)             28.90 (26.45, 31.20)
Beta01     -0.29 (-4.67, 4.05)         -4.24 (-7.18, -1.44)*            -3.93 (-5.72, -1.95)*
Beta10     7.07 (4.78, 9.24)*          6.15 (1.90, 9.79)*               6.57 (2.24, 9.34)*
Beta11     1.86 (-2.42, 5.96)          2.72 (0.20, 5.38)*               2.69 (0.92, 5.02)*
*p<0.05.