Bayesian Methods to Handle Missing Data in High-Dimensional Data Sets
Using Factor Analysis Strategies
Thomas R. Belin
UCLA Department of Biostatistics
Juwon Song
Univ. of Texas-M.D. Anderson Cancer Center
Jianming Wang
Medtronic Inc.
Introduction
General Problem: Incomplete high-dimensional longitudinal data
• A large number of variables
• A modest number of cases
• With missing values
• Initially consider cross-sectional data, then
consider longitudinal structure
Multiple imputation
Rationale: Useful framework for representing
uncertainty due to missingness
Requires imputations to be "proper"
Advice: Include available information to the
fullest extent possible (Rubin 1996 JASA)
- avoid bias in the imputation
- make the assumption of "ignorable" missing data
more plausible
(A sketch of Rubin's combining rules for pooling estimates across imputations follows below.)
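For reference, a minimal Python sketch of Rubin's combining rules for pooling a scalar estimate across M imputed analyses (an illustrative helper, not part of the original slides):

```python
import numpy as np

def pool_rubin(estimates, std_errors):
    """Combine point estimates and standard errors from M imputed analyses
    using Rubin's rules (Rubin 1987)."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(std_errors, dtype=float) ** 2
    m = len(estimates)

    q_bar = estimates.mean()                 # pooled point estimate
    u_bar = variances.mean()                 # within-imputation variance
    b = estimates.var(ddof=1)                # between-imputation variance
    t = u_bar + (1 + 1 / m) * b              # total variance
    # degrees of freedom for the t reference distribution
    df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    return q_bar, np.sqrt(t), df

# Example: five imputed analyses of the same coefficient
est, se, df = pool_rubin([1.2, 1.4, 1.1, 1.3, 1.25],
                         [0.30, 0.31, 0.29, 0.30, 0.32])
```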
Overparameterization concerns
With modest sample size and large number of
variables, even a simple model can be
overparameterized
Example: 50 variables -> 50*49/2 = 1,225
correlation parameters in a multivariate normal
model with a general covariance matrix
Analysis often proceeds based on arbitrary choice of
variables to include or exclude
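As a quick check of how fast the count p(p-1)/2 grows with the number of variables (an illustrative snippet, not from the slides):

```python
for p in (50, 100):
    print(p, p * (p - 1) // 2)   # 50 -> 1225, 100 -> 4950 correlation parameters
```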
Alternative modeling strategies
Address inestimable or unstable parameters
by:
• deleting variables
• using proper prior distribution
- ridge prior for multivariate normal
(MVN) model (Schafer 1997 text)
• restrictions on covariance matrix
(common factors in MVN model)
Factor model for incomplete
multivariate normal data
Idea: ignore factors corresponding to small eigenvalues
Notation:
Y : n x p data matrix with missing items
Z : n x k unobserved factor-score matrix, where k << p
(Yi, Zi): iid (p+k)-variate normal distribution
Zi ~ N(0, Ik), i.e., assuming orthogonal factors
Factor model for incomplete
multivariate normal data (cont’d)
Model:
Yi = a + Zi b + ei,  for i = 1, 2, ..., n,
where a is the 1 x p mean vector,
b is the k x p factor-loading matrix,
and ei ~ N(0, t2),
where t2 = diag(t1^2, t2^2, ..., tp^2)
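To make the model concrete, a small sketch (sizes and values are illustrative assumptions, not the slides' simulation settings) that simulates data from Yi = a + Zi b + ei:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 100, 5                    # cases, variables, factors

a = rng.normal(size=p)                   # 1 x p mean vector
b = rng.normal(scale=0.5, size=(k, p))   # k x p factor-loading matrix
tau2 = rng.uniform(0.5, 1.5, size=p)     # uniquenesses, diagonal of t2

Z = rng.normal(size=(n, k))              # n x k factor scores, Z_i ~ N(0, I_k)
E = rng.normal(size=(n, p)) * np.sqrt(tau2)
Y = a + Z @ b + E                        # n x p data matrix
```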
Model fitting
Gibbs sampling: based on an assumed factor structure (i.e., k known), draw:
(a) mean vector
(b) factor loadings
(c) uniqueness
(d) factor scores
(e) missing items (a minimal sketch of these draws follows below)
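A minimal Python sketch of these conditional draws, assuming flat priors on the mean and loadings and inverse-gamma priors on the uniquenesses (function names, priors, and starting values are illustrative and not the authors' implementation):

```python
import numpy as np

def factor_gibbs(Y, k, n_iter=1000, a0=1.0, b0=1.0, seed=0):
    """Minimal Gibbs sampler for the factor model Y_i = a + Z_i b + e_i with
    missing entries in Y (NaN) and k assumed factors."""
    rng = np.random.default_rng(seed)
    Y = np.array(Y, dtype=float)
    n, p = Y.shape
    miss = np.isnan(Y)

    # crude starting values
    Y[miss] = np.nanmean(Y, axis=0)[np.where(miss)[1]]   # column-mean fill
    a = Y.mean(axis=0)
    b = rng.normal(scale=0.1, size=(k, p))
    tau2 = np.ones(p)
    Z = rng.normal(size=(n, k))

    for it in range(n_iter):
        # (a)+(b): mean vector and factor loadings, one regression per variable
        X = np.column_stack([np.ones(n), Z])             # n x (k+1)
        XtX_inv = np.linalg.inv(X.T @ X)
        for j in range(p):
            beta_hat = XtX_inv @ X.T @ Y[:, j]
            beta = rng.multivariate_normal(beta_hat, tau2[j] * XtX_inv)
            a[j], b[:, j] = beta[0], beta[1:]
            # (c): uniqueness tau_j^2 | rest ~ inverse-gamma
            resid = Y[:, j] - X @ beta
            tau2[j] = 1.0 / rng.gamma(a0 + n / 2,
                                      1.0 / (b0 + resid @ resid / 2))

        # (d): factor scores Z_i | rest
        Btau = b / tau2                                   # columns scaled by 1/tau_j^2
        V = np.linalg.inv(np.eye(k) + Btau @ b.T)
        mean_Z = (Y - a) @ Btau.T @ V
        Z = mean_Z + rng.multivariate_normal(np.zeros(k), V, size=n)

        # (e): missing items | rest
        fitted = a + Z @ b
        draws = fitted + rng.normal(size=(n, p)) * np.sqrt(tau2)
        Y[miss] = draws[miss]

    return {"a": a, "b": b, "tau2": tau2, "Z": Z, "Y_completed": Y}
```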
Details of model fitting
• Can use weakly informative priors for the uniqueness
terms tj^2 to avoid degenerate variance estimates
• Can use either noninformative or weakly
informative priors for means and factor loadings
• Used transformations to speed convergence
• Multiple modes possible (Rubin and Thayer 1982,
1983 Psychometrika), so simulate multiple chains
• Monitor convergence (Gelman and Rubin 1992
Statistical Science); a small R-hat sketch follows below
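For example, the Gelman-Rubin potential scale reduction factor for a scalar parameter can be computed from multiple chains as follows (a simple version without chain splitting; illustrative only):

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for one scalar parameter,
    given an (m_chains x n_draws) array of post-burn-in draws."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_plus = (n - 1) / n * W + B / n
    return np.sqrt(var_plus / W)              # values near 1 suggest convergence
```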
Simulation evaluations
Evaluate bias, coverage when model is
correct, overparameterized, or
underparameterized
n     p     # true factors   # assumed factors
100   100   5                5, 10
100   100   10               5, 10
500   100   5                5, 10
500   100   10               5, 10
Simulation factor structure
Example: Each item loads on one factor, with a loading of 0.8 on that factor
and 0 on all others; schematically, the loading matrix b has the pattern

    [ 0.8   0    0   ...   0  ]
    [  0   0.8   0   ...   0  ]
    [  0    0   0.8  ...   0  ]
    [ ...                 ... ]
    [  0    0    0   ...  0.8 ]
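A small sketch that builds a loading matrix with this one-factor-per-item structure (the sizes and the assignment of items to factors are illustrative assumptions, not the slides' exact design):

```python
import numpy as np

p, k, loading = 100, 10, 0.8            # illustrative sizes
b = np.zeros((k, p))
for j in range(p):
    b[j * k // p, j] = loading          # items assigned to k consecutive blocks
```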
Simulation details
Also considered hypothetical scenario where items load on
two factors
200 replications for each combination of simulation
conditions
- Monte Carlo standard error of about 1.5% for 95% coverage estimates
Percentage of missing data ranged from
5-25% for each variable
Three missing-data mechanisms (MAR where available-case
analysis might do well, MAR where available-case
analysis not expected to do well, and non-ignorable where
method appropriate under MAR might do well)
Simulation results: Factor
model, cross-sectional mean
Factor model performs well when model
correct or overparameterized (coverages
range from 93% - 97%)
Factor model coverage is below nominal level
when model underparameterized (coverages
range from 86% - 93%)
Simulation results: Other
methods, cross-sectional mean
MVN frequently fails to converge with n=100
without ridge prior
MVN with ridge prior has good coverage (94%-98%);
interval widths typically wider than for the factor
model (2-16% wider on average, depending
on details such as the missing-data mechanism)
Available-case analysis performs poorly (coverages
ranging from 37% - 88%)
Simulation study based on
observed covariance matrix
Generate multivariate normal data (200 replicates,
SE = 1.5% for 95% coverage statistics) with mean
and covariance fixed at published values from
Harman (1967) study of 24 psychological tests on
145 school children
Number of factors not known in advance
Consider 4, 5, 7 factors following earlier analysis
Also consider 11 factors based on cumulative
variance explained exceeding 80% and desire not
to underparameterize model
Simulation results: psychological
testing example
Coverage rates:
  4-factor model:            93% - 95%
  5-factor model:            93% - 96%
  7-factor model:            93% - 95%
  11-factor model:           93% - 95%
  MVN model:                 94% - 95%
  Available-case analysis:   12% - 84%
Interval widths for MVN model within 5% of factor
model widths, usually within 1%
Application: Emergency room
intervention study
Specialized emergency room intervention vs.
standard emergency room treatment for 140
female adolescents after suicide attempt
Twenty-seven outcomes measured at baseline, 3, 6,
12, 18 months + many baseline characteristics
Most variables 5-25% missing; some 50-60% missing
Main interests:
- effectiveness of emergency room intervention
- whether baseline psychological impairment is
related to outcomes over time
Factor model for emergency
room intervention study
135 variables, including 27 longitudinal outcomes
Longitudinal outcomes: measures at different time
points treated as separate variables
Assume 30 factors:
- explained about 80% of the variation
- simulation analysis: insufficient number of
factors can cause serious bias
- with 27 longitudinal outcomes, general enough
to allow each longitudinal variable to represent a
separate factor
Emergency-room intervention study:
evaluations, results
After imputation, related longitudinal outcomes to
baseline predictors using SAS PROC MIXED
Compared imputation under factor model with
growth-curve imputation strategy developed by
Schafer (1997 PAN program)
No substantial differences seen in significance tests
for intervention effect
Some sensitivity seen in significance of impairment
effect, intervention and impairment interactions
Imputation for longitudinal data
PAN (Schafer, 1997): Using Multivariate
Linear Mixed-effect Model (MLMM)
• Appropriate for multivariate longitudinal
data or clustered data
• Imputation by the multivariate linear mixed-effects model
    Yi = Xi α + Zi bi + εi,
  with dimensions (t x m) = (t x p)(p x m) + (t x q)(q x m) + (t x m)
  Assume vec(bi) ~ N(0, Ψ) and the rows of εi are iid N(0, Σ)
Challenge with MI using PAN
MI under PAN can easily become over-parameterized
• Example: 15 variables collected longitudinally
five times, modeled with 2 random effects per variable in PAN
• # of parameters in Ψ (random-effect covariance):
30*31/2 = 465
• # of parameters in Σ (error covariance): 15*16/2 = 120
• Total # of parameters: 585
• Parameter reduction seems sensible when the number
of cases is modest, e.g., 300 (a small counting check follows below)
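A small counting check for this example (the variable counts come from the slide; the helper function itself is illustrative):

```python
def pan_param_count(n_vars, n_random_effects):
    """Count covariance parameters in a PAN-style multivariate mixed model."""
    q = n_vars * n_random_effects           # dimension of stacked random effects
    psi = q * (q + 1) // 2                  # random-effect covariance (Psi)
    sigma = n_vars * (n_vars + 1) // 2      # error covariance (Sigma)
    return psi, sigma, psi + sigma

print(pan_param_count(15, 2))               # (465, 120, 585)
```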
Potential solution to over-parameterization
If those 15 variables feature sizable correlations,
they could be viewed as measuring 3-5 underlying
factors.
Strategy:
• Reduce the dimension of the problem by factor
analysis
• Model the estimated factor scores by a MLMM
• Factor structure reflects cross-sectional
correlations among variables measured at the same
time; MLMM reflects longitudinal correlations
Ordinary factor analysis model
Factor analysis model:
    Yi = μ + Λ fi + εi,   i = 1, 2, ..., n,
where fi ~ N(0, Σff) and εi ~ N(0, Ψ)
Because
    ΣYY = Λ Σff Λ' + Ψ = (Λ Σff^(1/2))(Λ Σff^(1/2))' + Ψ = Λ*(Λ*)' + Ψ,
we often assume fi ~ N(0, I)
Also assume that Λ is of full rank
(Seber, 1977)
Ordinary factor analysis model (continued)
Identifiability
• Solution invariant under orthogonal transformation:
    Yi = μ + (Λ T)(T⁻¹ fi) + εi = μ + Λ* fi* + εi
• Common restriction: Λ' Ψ⁻¹ Λ = diagonal,
  which is equivalent to k(k-1)/2 restrictions
• Identifiable if (1/2)[(p - k)² - (p + k)] ≥ 0
  (a small check of this condition follows below)
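This condition is easy to check numerically (an illustrative helper, not from the slides):

```python
def factor_model_df(p, k):
    """Degrees of freedom left after fitting a k-factor model to p variables;
    a nonnegative value is the usual identifiability condition.
    The quantity (p - k)^2 - (p + k) is always even, so // is exact."""
    return ((p - k) ** 2 - (p + k)) // 2

print(factor_model_df(24, 4))   # e.g., 24 tests, 4 factors -> 186 >= 0
```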
Generalizing factor analysis model
• Standardization of factor scores presents
challenge for generalizing factor analysis
model to longitudinal setting
• Idea: Use “error-in-variables”
representation of factor model
Error-in-variables factor model
• Error-in-variables model (Fuller, 1987):

    Yi = [β0] + [β1] fi + ei
         [ 0]   [ I]

• Interpretation: If we partition Yi into

    Yi = [Yi1]      and let   ei = [εi]
         [Yi2]                     [ui],

  then
    Yi1 = β0 + β1 fi + εi   and   Yi2 = fi + ui,
  i.e.,

    [Yi1]   [β0]   [β1]       [εi]
    [Yi2] = [ 0] + [ I] fi +  [ui]
Error-in-variables factor model (continued)
• The covariance matrix of Y is

    ΣYY = [β1] Σff [β1]' + Ψ
          [ I]     [ I]

• The total # of distinct parameters is
    (p - k)k + p + k(k + 1)/2 = p + pk - k(k - 1)/2,
  which is exactly the same as for the ordinary model
  with the additional k(k-1)/2 restrictions used to
  avoid indeterminacy
• No additional restrictions are necessary
A Longitudinal Factor Analysis model
• Extending the error-in-variables model to the LFA model: stacking
  Yi = (Yi1, ..., Yit) with Yis = (Yis1, Yis2),

    [Yi11]   [β01]   [β11  0  ...  0 ] [fi1]   [εi1]
    [Yi12]   [ 0 ]   [ I   0  ...  0 ] [fi2]   [ui1]
    [Yi21]   [β02]   [ 0  β12 ...  0 ] [ ...]  [εi2]
    [Yi22] = [ 0 ] + [ 0   I  ...  0 ] [fit] + [ui2]
    [ ... ]  [ ...]  [          ...  ]         [ ...]
    [Yit1]   [β0t]   [ 0   0  ... β1t]         [εit]
    [Yit2]   [ 0 ]   [ 0   0  ...  I ]         [uit]

  i.e., Yi = β0 + β1 fi + ei
Aspects of LFA model
• The # of factors is the same on each occasion,
but the factor loadings and factor scores may
change
• No constraints on the covariance structure of the fi
• The unique-component vectors are uncorrelated
with the factors both within and across
occasions.
• The unique-component errors are uncorrelated
within occasion and across occasions
Advantages of LFA model
Advantages of this LFA model:
• Identifiability problem can easily be handled
• Preserves the mean structure and covariance
structure, making it possible to study elevation
change and pattern change simultaneously
• Can incorporate linear mixed-effect model
structure for longitudinal data
• Can incorporate baseline covariates
Implementation
• Use data augmentation (I-step: linear
regressions, P-step: analog to ML for
multivariate normal with complete data)
• Assume conjugate forms (normal, inverse
Wishart) for prior distributions for
parameters, assume relatively diffuse priors
that still produce proper posteriors
• Conditional distributions all in closed form (a generic I-step/P-step sketch follows below)
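To illustrate the I-step/P-step alternation, here is a generic data augmentation sketch for an incomplete multivariate normal model under a Jeffreys prior; it is not the authors' LFA implementation, but it shows the same structure of drawing missing values and parameters in turn:

```python
import numpy as np
from scipy.stats import invwishart

def mvn_data_augmentation(Y, n_iter=500, seed=0):
    """Generic I-step / P-step data augmentation for incomplete MVN data
    (Jeffreys prior; requires n - 1 >= number of variables)."""
    rng = np.random.default_rng(seed)
    Y = np.array(Y, dtype=float)
    n, p = Y.shape
    miss = np.isnan(Y)
    Y[miss] = np.nanmean(Y, axis=0)[np.where(miss)[1]]   # simple starting fill

    for it in range(n_iter):
        # P-step: draw (mu, Sigma) given the completed data
        ybar = Y.mean(axis=0)
        S = (Y - ybar).T @ (Y - ybar)
        Sigma = invwishart.rvs(df=n - 1, scale=S)
        mu = rng.multivariate_normal(ybar, Sigma / n)

        # I-step: redraw each missing value from its conditional normal
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            reg = Sigma[np.ix_(m, o)] @ np.linalg.inv(Sigma[np.ix_(o, o)])
            cond_mean = mu[m] + reg @ (Y[i, o] - mu[o])
            cond_cov = Sigma[np.ix_(m, m)] - reg @ Sigma[np.ix_(o, m)]
            Y[i, m] = rng.multivariate_normal(cond_mean, cond_cov)

    return Y, mu, Sigma
```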
Evaluations
We generated 100 data sets with Yi drawn from a multivariate normal
distribution with mean
    β0 + β1 (Xi α)^RV
and variance
    β1 [(Zi ⊗ Ik) Ψ (Zi' ⊗ Ik) + It ⊗ Σ] β1' + Σee,
for i = 1, 2, ..., 350, with p = 15 measurements and k = 5 factors at
t = 5 time points, so Yi has dimension (15 x 5) x 1 = 75 x 1
Simulation design
• X incorporates an intercept, 3 continuous
variables, 1 binary variable, and time
• Z allows for random intercepts and slopes
• α_rc = 0.3 + Bern(0.5)/2 for r ≠ 6, and α_rc = c/0.5 for r = 6
  (the time row r = 6 of α reflects a linear trend in factor scores;
  the other rows reflect small to moderate covariate effects for
  predicting factor scores)
Simulation design (continued)
  (Σee)_rc = 4 if r = c, 0 if r ≠ c (i.e., Σee = diagonal(σee))
  (Ψ)_rc   = 20 if r = c, 5 if r ≠ c
  (Σ)_rc   = 1 if r = c, (1/6)^|r-c| if r ≠ c
• Factor loadings include a small perturbation (scaled by 1/500)
  to avoid a singular factor-loading matrix
• Missingness introduced using a MAR mechanism (a series of
binary draws with probabilities depending on observed values)
• Σee, Σ, and Ψ incorporate relative variances and covariances
describing the unique variance, the common variance among factor
scores, and the variance of the random effects
• Simulation SE of 95% coverage statistics with 100
replicates = 0.0218; margin of error = 0.0427
Simulation when number of factors is correctly specified
The mean of Y49, which (averaged across simulation replicates) was
missing for 27% of individuals

Analysis Method   M.C. Average   M.C. S.E.   Average 95% Interval Length   Actual 95% Coverage
True value        17.074         --          --                            --
All data          17.078         0.426       1.677                         98%
Available data    18.854         0.530       2.091                         7%
5 imputations     17.072         0.567       2.231                         96%
Simulation when number of factors is correctly specified
The mean of Y66, a variable which is missing 100% of the time
(i.e., a variable not measured at a given time point)

Analysis Method   M.C. Average   M.C. S.E.   Average 95% Interval Length   Actual 95% Coverage
True value        20.8195        --          --                            --
All data          20.7955        0.5128      2.0170                        94%
Available data    --             --          --                            --
5 imputations     20.7678        0.6503      2.5554                        95%
Simulation when number of factors is incorrectly specified
The mean of Y49 (average missingness rate = 27%)

Analysis Method     M.C. Average   M.C. S.E.   Average 95% Interval Length   Actual 95% Coverage
True value          17.074         --          --                            --
All data            17.078         0.4263      1.677                         98%
Available data      18.854         0.5304      2.091                         7%
F=5 (true number)   17.072         0.5672      2.231                         96%
F=6                 17.055         0.4873      1.9153                        94%
F=4                 17.612         0.5962      2.3429                        89%
F=3                 17.663         0.6213      2.4410                        86%
Simulation when number of factors is incorrectly specified
The mean of Y66, which has a 100% missingness rate

Analysis Method     M.C. Average   M.C. S.E.   Average 95% Interval Length   Actual 95% Coverage
True value          20.8195        --          --                            --
All data            20.7955        0.5128      2.0170                        94%
Available data      --             --          --                            --
F=5 (true number)   20.7678        0.6503      2.5554                        95%
F=6                 20.9565        0.7161      2.8142                        94%
F=4                 20.6473        1.1139      4.3780                        91%
F=3                 20.4091        1.2484      4.9060                        83%
Example using LFA: oral surgery study
Randomized study of two oral surgery treatments
(MMF, RIF) with longitudinal follow-up of
quality-of-life (GOHAI) and psychological
outcomes
Hierarchical growth-curve model fit using WinBUGS:
    Yij = β0i + β1i (tij - t̄) + εij,
    β0i = β00 + β01 Si + δ0i,
    β1i = β10 + β11 Si + δ1i,
where εij ~ N(0, V), δ0i ~ N(0, σ0²), δ1i ~ N(0, σ1²),
and Si = 1 if RIF, Si = 0 if MMF
(a sketch of an analogous frequentist fit follows below)
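An analogous frequentist fit to one completed (imputed) data set could look like the sketch below, using statsmodels rather than WinBUGS; the synthetic data frame and column names are illustrative assumptions, and results from the M imputed data sets would be pooled with Rubin's rules:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Stand-in for one completed data set: one row per subject-visit with
# columns id, time_c (time centered at its mean), rif (1 = RIF, 0 = MMF),
# and gohai (outcome). All names and values here are illustrative.
rng = np.random.default_rng(0)
n_subj, times = 60, np.array([-7.0, -4.0, 2.0, 9.0])
ids = np.repeat(np.arange(n_subj), len(times))
rif = np.repeat(rng.integers(0, 2, n_subj), len(times))
time_c = np.tile(times, n_subj)
subj_int = np.repeat(rng.normal(0, 2, n_subj), len(times))   # random intercepts
gohai = 29 - 4 * rif + (0.7 + 0.3 * rif) * time_c + subj_int \
        + rng.normal(0, 3, len(ids))
df = pd.DataFrame({"id": ids, "time_c": time_c, "rif": rif, "gohai": gohai})

# Random-intercept, random-slope growth curve with treatment effects on
# both (fixed effects correspond to beta00, beta01, beta10, beta11).
model = smf.mixedlm("gohai ~ time_c * rif", data=df,
                    groups="id", re_formula="~time_c")
result = model.fit()
print(result.summary())
```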
Findings of interest
• Difference in average intercept and average slope
between RIF and MMF (β01, β11) significant under
MI (NORM or LFA) analysis, not under available-case analysis
• Different interpretations emerge from MI analysis
(RIF starts lower, ends with comparable values)
• Compared to MI using NORM, MI using LFA has
17%-34% narrower interval estimates for
parameters
Summary and future research
Summary
• Factor-analysis methods provide a flexible framework for
addressing incomplete high-dimensional longitudinal data
Ongoing and future research
• Rounding continuous to binary imputations
• Determining the number of factors
• Robustness of methods to the normality assumption
• Can the parameters in LFA be estimated by EM or
related methods?
• Comparisons with IVEware and related methods, hot
deck approaches
Goal
To develop general-purpose
multiple imputation procedures
appropriate for high-dimensional
data sets
• Cross-sectional
• Longitudinal
Simulation missing data mechanisms
M1 (MAR): First 99 variables MCAR, missingness
on last variable according to logistic regression on
other 99 with normally distributed coefficients
M2 (MAR): First 99 variables MCAR, missingness
on last variable according to logistic regression on
other variables included in the same factor, with half-normal distributed coefficients
M3 (nonignorable but "close" to MAR):
Missingness on each variable depends on two
other variables in an overlapping manner
(An M1-style sketch of generating MAR missingness follows below.)
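A small sketch of an M1-style MAR mechanism (the missingness rates, intercept, and coefficient scale are illustrative assumptions, not the slides' exact settings):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 100
Y = rng.normal(size=(n, p))                      # stand-in for complete data

# First p-1 variables: light MCAR missingness
mcar_mask = rng.random((n, p - 1)) < 0.10

# Last variable: missingness probability from a logistic model in the
# other variables, with normally distributed coefficients
coefs = rng.normal(scale=0.1, size=p - 1)
logit = -1.5 + Y[:, :p - 1] @ coefs
last_mask = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

Y_incomplete = Y.copy()
Y_incomplete[:, :p - 1][mcar_mask] = np.nan
Y_incomplete[last_mask, p - 1] = np.nan
```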
Simulation results: simple
regression coefficient
Factor model: coverages 93%-98% when
model correct or overparameterized, 19%-80%
when model underparameterized
MVN model: frequently fails to converge
with non-informative prior; coverages 91%-99%
with ridge prior
Available-case analysis: coverages range
from 44% - 100%
Equivalence of two factor analysis models
One can write:

    [μ1]   [Λ1]        [μ1]   [Λ1 Λ2⁻¹]
    [μ2] + [Λ2] fi  =  [μ2] + [  Ik   ] (Λ2 fi)

                       [μ1]   [Λ1*]
                    =  [μ2] + [ Ik ] fi*

                       [μ1 - Λ1* μ2]   [Λ1*]
                    =  [     0     ] + [ Ik ] (fi* + μ2)

                       [μ1*]   [Λ1*]
                    =  [ 0 ] + [ Ik ] fi**
Incorporating multivariate linear
mixed-effect model for factor scores
• Rearrange fi in matrix form:

    fi = [fi1']
         [fi2']     (t x k)
         [ ...]
         [fit']

  Then fi can be modeled by
    fi = Xi α + Zi bi + εi,
  with dimensions (t x k) = (t x m)(m x k) + (t x q)(q x k) + (t x k)
• We assume that the t rows of εi are iid N(0, Σ) and that
  vec(bi) ~ N(0, Ψ). Thus

    [fi1]
    [fi2]  ~ N( (Xi α)^V , (Zi ⊗ Ik) Ψ (Zi' ⊗ Ik) + It ⊗ Σ )
    [ ...]
    [fit]
Modified LFA with covariates
• Combining the LFA with the linear mixed-effects model, we obtain

    [Yi11]   [β01]   [β11  0  ...  0 ]                              [ei1]
    [Yi12]   [ 0 ]   [ I   0  ...  0 ]                              [ui1]
    [Yi21]   [β02]   [ 0  β12 ...  0 ]                              [ei2]
    [Yi22] = [ 0 ] + [ 0   I  ...  0 ] [(Xi α + Zi bi) + εi]^RV  +  [ui2]
    [ ... ]  [ ...]  [          ...  ]                              [ ...]
    [Yit1]   [β0t]   [ 0   0  ... β1t]                              [eit]
    [Yit2]   [ 0 ]   [ 0   0  ...  I ]                              [uit]
Linear growth curve model estimates: Available-case
analysis, MI using NORM, MI using LFA
                 Available-Case Analysis        MI Using NORM                  MI Using LFA
Estimate         Post. Mean   95% CI            Post. Mean   95% CI            Post. Mean   95% CI
Beta00           28.55        (26.24, 30.92)    29.30        (26.35, 32.33)    28.90        (26.45, 31.20)
Beta01           -0.29        (-4.67, 4.05)     -4.24        (-7.18, -1.44)*   -3.93        (-5.72, -1.95)*
Beta10           7.07         (4.78, 9.24)*     6.15         (1.90, 9.79)*     6.57         (2.24, 9.34)*
Beta11           1.86         (-2.42, 5.96)     2.72         (0.20, 5.38)*     2.69         (0.92, 5.02)*
*p<0.05.