Bayesian Methods to Handle Missing Data in High-Dimensional Data Sets using Factor Analysis Strategies

Thomas R. Belin, UCLA Department of Biostatistics
Juwon Song, Univ. of Texas-M.D. Anderson Cancer Center
Jianming Wang, Medtronic Inc.

Multiple imputation
Rationale: Useful framework for representing uncertainty due to missingness
Requires imputations to be “proper”
Advice: include available information to the fullest extent possible (Rubin 1996 JASA)
- avoid bias in the imputation
- make assumption of “ignorable” missing data more plausible

Alternative modeling strategies
Address inestimable or unstable parameters by:
• deleting variables
• using proper prior distribution
  - ridge prior for multivariate normal (MVN) model (Schafer 1997 text)
• restrictions on covariance matrix (common factors in MVN model)

Introduction
General Problem: Incomplete high-dimensional longitudinal data
• A large number of variables
• A modest number of cases
• With missing values
• Initially consider cross-sectional data, then consider longitudinal structure

Overparameterization concerns
With modest sample size and large number of variables, even a simple model can be overparameterized
Example: 50 variables ⇒ 50×49/2 = 1225 correlation parameters in multivariate normal model with general covariance matrix
Analysis often proceeds based on arbitrary choice of variables to include or exclude

Factor model for incomplete multivariate normal data
Idea: ignore factors corresponding to small eigenvalues
Notation:
$Y$: $n \times p$ data matrix with missing items
$Z$: $n \times k$ unobserved factor-score matrix, where $k \le p$
$(Y_i, Z_i)$: iid $(p+k)$-variate normal distribution
$Z_i \sim N(0, I_k)$, i.e., assuming orthogonal factors

Factor model for incomplete multivariate normal data (cont’d)
Model: $Y_i = \alpha + Z_i \beta + \varepsilon_i$, for $i = 1, 2, \ldots, n$, where $\alpha$ is the $1 \times p$ mean vector, $\beta$ is the $k \times p$ factor-loading matrix, and $\varepsilon_i \sim N(0, \tau^2)$, where $\tau^2 = \mathrm{diag}(\tau_1^2, \tau_2^2, \ldots, \tau_p^2)$

Model fitting
Gibbs sampling: based on assumed factor structure (i.e., k known), draw:
(a) mean vector
(b) factor loadings
(c) uniqueness
(d) factor scores
(e) missing items

Details of model fitting
• Can use weakly informative prior for uniqueness terms $\tau_j^2$ to avoid degenerate variance estimates
• Can use either noninformative or weakly informative priors for means and factor loadings
• Used transformations to speed convergence
• Multiple modes possible (Rubin and Thayer 1982, 1983 Psychometrika), so simulate multiple chains
• Monitor convergence (Gelman and Rubin 1992 Statistical Science)

Simulation evaluations
Evaluate bias, coverage when model is correct, overparameterized, or underparameterized

  n     p     # true factors   # assumed factors
  100   100   5                5, 10
  100   100   10               5, 10
  500   100   5                5, 10
  500   100   10               5, 10

Simulation factor structure
Example: Each item loads on one factor
\[
\beta =
\begin{bmatrix}
0.8 \cdots 0.8 & 0 \cdots 0 & 0 \cdots 0 & 0 \cdots 0 & 0 \cdots 0 \\
0 \cdots 0 & 0.8 \cdots 0.8 & 0 \cdots 0 & 0 \cdots 0 & 0 \cdots 0 \\
0 \cdots 0 & 0 \cdots 0 & 0.8 \cdots 0.8 & 0 \cdots 0 & 0 \cdots 0 \\
0 \cdots 0 & 0 \cdots 0 & 0 \cdots 0 & 0.8 \cdots 0.8 & 0 \cdots 0 \\
0 \cdots 0 & 0 \cdots 0 & 0 \cdots 0 & 0 \cdots 0 & 0.8 \cdots 0.8
\end{bmatrix}
\]

Simulation details
Also considered hypothetical scenario where items load on two factors
200 replications for each combination of simulation conditions (simulation standard error of 1.5% for 95% coverage)
Percentage of missing data ranged from 5-25% for each variable
Three missing-data mechanisms (MAR where available-case analysis might do well, MAR where available-case analysis not expected to do well, and non-ignorable where method appropriate under MAR might do well)

Simulation results: Factor model, cross-sectional mean
Factor model performs well when model correct or overparameterized (coverages range from 93% - 97%)
Factor model coverage is below nominal level when model underparameterized (coverages range from 86% - 93%)

Simulation results: Other methods, cross-sectional mean
MVN frequently fails to
converge with n=100 without ridge prior
MVN with ridge prior has good coverage (94% - 98%), interval widths typically wider than for factor model (2-16% wider on average, depending on details such as missing data mechanism)
Available-case analysis performs poorly (coverages ranging from 37% - 88%)

Simulation study based on observed covariance matrix
Generate multivariate normal data (200 replicates, SE = 1.5% for 95% coverage statistics) with mean and covariance fixed at published values from Harman (1967) study of 24 psychological tests on 145 school children
Number of factors not known in advance
Consider 4, 5, 7 factors following earlier analysis
Also consider 11 factors based on cumulative variance explained exceeding 80% and desire not to underparameterize model

Simulation results: psychological testing example
Coverage rates:
  4-factor model:            93% - 95%
  5-factor model:            93% - 96%
  7-factor model:            93% - 95%
  11-factor model:           93% - 95%
  MVN model:                 94% - 95%
  Available-case analysis:   12% - 84%
Interval widths for MVN model within 5% of factor model widths, usually within 1%

Application: Emergency room intervention study
Specialized emergency room intervention vs. standard emergency room treatment for 140 female adolescents after suicide attempt
Twenty-seven outcomes measured at baseline, 3, 6, 12, 18 months + many baseline characteristics
Most variables 5-25% missing, some 50-60% missing
Main interests:
- effectiveness of emergency room intervention
- whether baseline psychological impairment is related to outcomes over time

Factor model for emergency room intervention study
135 variables, including 27 longitudinal outcomes
Longitudinal outcomes: measures at different time points treated as separate variables
Assume 30 factors:
- explained about 80% of the variation
- simulation analysis: insufficient number of factors can cause serious bias
- with 27 longitudinal outcomes, general enough to allow each longitudinal variable to represent a separate factor

Emergency-room intervention study: evaluations, results
After imputation, related longitudinal outcomes to baseline predictors using SAS PROC MIXED
Compared imputation under factor model with growth-curve imputation strategy developed by Schafer (1997 PAN program)
No substantial differences seen in significance tests for intervention effect
Some sensitivity seen in significance of impairment effect, intervention and impairment interactions

Imputation for longitudinal data
PAN (Schafer, 1997): Using Multivariate Linear Mixed-effect Model (MLMM)
• Appropriate for multivariate longitudinal data or clustered data
• Imputation by multivariate linear mixed-effect model
\[
\underset{t \times m}{Y_i} = \underset{t \times p}{X_i}\,\underset{p \times m}{\alpha} + \underset{t \times q}{Z_i}\,\underset{q \times m}{\gamma_i} + \underset{t \times m}{\delta_i}
\]
Assume $(\gamma_i)^V \sim N(0, \Phi)$ and $(\delta_i)^V \sim N(0, \Sigma_{\delta\delta})$

Challenge with MI using PAN
MI under PAN can be over-parameterized easily
• Example: 15 variables collected longitudinally five times, modeled with 2 random effects in PAN
• # of parameters in $\Phi$, random effects (15 variables × 2 random effects = 30 dimensions): 30×31/2 = 465
• # of parameters in $\Sigma_{\delta\delta}$, error terms: 15×16/2 = 120
• Total # of parameters: 585
• Parameter reduction seems sensible when number of cases is modest, e.g. 300

Potential solution to over-parameterization
If those 15 variables feature sizable correlations, they could be viewed as measuring 3-5 underlying factors.
Strategy:
• Reduce the dimension of the problem by factor analysis
• Model the estimated factor scores by a MLMM
• Factor structure reflects cross-sectional correlations among variables measured at the same time; MLMM reflects longitudinal correlations

Ordinary factor analysis model
Factor analysis model
\[
Y_i = \mu + \Lambda f_i + \varepsilon_i, \quad i = 1, 2, \ldots, n
\]
where $f_i \sim N(0, \Sigma_{ff})$ and $\varepsilon_i \sim N(0, \Psi)$
Because
\[
\Sigma_{YY} = \Lambda \Sigma_{ff} \Lambda^T + \Psi = \Lambda \Sigma_{ff}^{1/2} \Sigma_{ff}^{1/2} \Lambda^T + \Psi = \Lambda^* (\Lambda^*)^T + \Psi, \quad \Lambda^* = \Lambda \Sigma_{ff}^{1/2},
\]
we often assume $f_i \sim N(0, I)$
Also assume that $\Lambda$ is of full rank (Seber, 1977)

Ordinary factor analysis model (continued)
Identifiability
• Solution invariant under orthogonal transformation
\[
Y_i = \mu + \Lambda T T^{-1} f_i + \varepsilon_i = \mu + \Lambda^* f_i^* + \varepsilon_i
\]
• Common restrictions: $\Lambda^T \Psi^{-1} \Lambda = \Gamma$ = Diagonal, which is equivalent to $k(k-1)/2$ restrictions
• Identifiable if $\frac{1}{2}\left[(p-k)^2 - (p+k)\right] \ge 0$

Generalizing factor analysis model
• Standardization of factor scores presents challenge for generalizing factor analysis model to longitudinal setting
• Idea: Use “error-in-variables” representation of factor model
• Error-in-variables model (Fuller, 1987)
\[
Y_i = \begin{pmatrix} \beta_0 \\ 0 \end{pmatrix} + \begin{pmatrix} \beta_1 \\ I \end{pmatrix} f_i + \varepsilon_i
\]

Error-in-variables factor model
Interpretation: If we partition $Y_i$ into $\begin{pmatrix} Y_{i1} \\ Y_{i2} \end{pmatrix}$ and let
\[
Y_{i2} = f_i + u_i, \quad Y_{i1} = \beta_0 + \beta_1 f_i + e_i, \quad \varepsilon_i = \begin{pmatrix} e_i \\ u_i \end{pmatrix},
\]
then
\[
Y_i = \begin{pmatrix} Y_{i1} \\ Y_{i2} \end{pmatrix} = \begin{pmatrix} \beta_0 \\ 0 \end{pmatrix} + \begin{pmatrix} \beta_1 \\ I \end{pmatrix} f_i + \begin{pmatrix} e_i \\ u_i \end{pmatrix}
\]

Error-in-variables factor model (continued)
• Covariance matrix of Y is
\[
\Sigma_{YY} = \begin{pmatrix} \beta_1 \\ I \end{pmatrix} \Sigma_{ff} \begin{pmatrix} \beta_1 \\ I \end{pmatrix}^T + \Psi
\]
• The total # of distinct parameters is
\[
(p-k)k + p + \tfrac{1}{2}k(k+1) = p + pk - \tfrac{1}{2}k(k-1),
\]
which is exactly the same as the ordinary model with the additional $k(k-1)/2$ restrictions used to avoid indeterminacy
• No additional restrictions necessary

A Longitudinal Factor Analysis model
• Extending Error-in-variables Model to LFA
\[
Y_i = \begin{pmatrix} Y_{i1} \\ Y_{i2} \\ \vdots \\ Y_{it} \end{pmatrix}
= \begin{pmatrix} \begin{pmatrix} Y_{i11} \\ Y_{i12} \end{pmatrix} \\ \begin{pmatrix} Y_{i21} \\ Y_{i22} \end{pmatrix} \\ \vdots \\ \begin{pmatrix} Y_{it1} \\ Y_{it2} \end{pmatrix} \end{pmatrix}
= \begin{pmatrix} \begin{pmatrix} \beta_{01} \\ 0 \end{pmatrix} \\ \begin{pmatrix} \beta_{02} \\ 0 \end{pmatrix} \\ \vdots \\ \begin{pmatrix} \beta_{0t} \\ 0 \end{pmatrix} \end{pmatrix}
+ \begin{pmatrix}
\begin{pmatrix} \beta_{11} \\ I \end{pmatrix} & 0 & \cdots & 0 \\
0 & \begin{pmatrix} \beta_{12} \\ I \end{pmatrix} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \begin{pmatrix} \beta_{1t} \\ I \end{pmatrix}
\end{pmatrix}
\begin{pmatrix} f_{i1} \\ f_{i2} \\ \vdots \\ f_{it} \end{pmatrix}
+ \begin{pmatrix} \begin{pmatrix} e_{i1} \\ u_{i1} \end{pmatrix} \\ \begin{pmatrix} e_{i2} \\ u_{i2} \end{pmatrix} \\ \vdots \\ \begin{pmatrix} e_{it} \\ u_{it} \end{pmatrix} \end{pmatrix}
= \mathrm{B}_0 + \mathrm{B}_1 f_i + \varepsilon_i
\]

Aspects of LFA model
• The # of factors is the same on each occasion, but the factor loadings and factor scores may change
• No constraints on covariance structure of the $f_i$
• The unique-component vectors are uncorrelated with the factors both within and across occasions
• The unique-component errors are uncorrelated within occasion and across occasions

Advantages of LFA model
Advantages of this LFA model:
• Identifiability problem can easily be handled
• Preserves the mean structure and covariance structure, making it possible to study elevation change and pattern change simultaneously
• Can incorporate linear mixed-effect model structure for longitudinal data
• Can incorporate baseline covariates

Implementation
• Use data augmentation (I-step: linear regressions, P-step: analog to ML for multivariate normal with complete data)
• Assume conjugate forms (normal, inverse Wishart) for prior distributions for parameters, assume relatively diffuse priors that still produce proper posteriors
• Conditional distributions all in closed form

Evaluations

Simulation design
We generated 100 data sets with $Y_i$ from a MVN with mean
\[
\mathrm{B}_0 + \mathrm{B}_1 (X_i \alpha)^{RV}
\]
and variance
\[
\mathrm{B}_1 \left[ (Z_i \otimes I_k)\, \Phi\, (Z_i^T \otimes I_k) + I_t \otimes \Sigma_{\delta\delta} \right] \mathrm{B}_1^T + \Sigma_{\varepsilon\varepsilon}
\]
for i = 1, 2, …, 350, p = 15 measurements, k = 5 factors at t = 5 time points; $Y_i$ has dimension (15×5)×1 = 75×1
• X incorporates intercept, 3 continuous variables, 1 binary variable and time
• Z allows for random intercepts, slopes

Simulation design (continued)
• $\alpha_{rc} = \begin{cases} 0.3 + \mathrm{Bern}(0.5)/2 & \text{for } r \ne 6 \\ c/0.5 & \text{for } r = 6 \end{cases}$
  ($\alpha_{6 \times 5}$ reflects small to moderate covariate effects for predicting factor scores and a linear trend in factor scores)
• $(\mathrm{B})_{rc} = 1/6 + (r+c)/500$ (to avoid singular factor loading matrix)
• $\Sigma_{\varepsilon\varepsilon} = \mathrm{diagonal}(\sigma^2_{\varepsilon\varepsilon})$ with $\sigma^2_{\varepsilon\varepsilon} = 5$
• $\Phi_{rc} = \begin{cases} 4 & \text{if } r = c \\ 1 & \text{if } r \ne c \end{cases}$
• $(\Sigma_{\delta\delta})_{rc} = \begin{cases} 20 & \text{if } r = c \\ 5 & \text{if } r \ne c \end{cases}$
• Missingness introduced using MAR mechanism (a series of binary draws with probabilities depending on observed values)
• $\Sigma_{\varepsilon\varepsilon}$, $\Sigma_{\delta\delta}$ and $\Phi$ incorporate relative variances, covariance describing unique variance, common variance among factor scores, and variance of random effects
• Simulation SE of 95% coverage statistics with 100 replicates = 0.0218, margin of error = 0.0427

Simulation when number of factors is correctly specified
The mean of Y49, which (averaged across simulation replicates) was missing on 27% of individuals

  Analysis Method   M.C. Average   M.C. S.E.   Average 95% Interval length   Actual 95% Coverage
  True value        17.074
  All data          17.078         0.426       1.677                         98%
  Available data    18.854         0.530       2.091                         7%
  5 imputations     17.072         0.567       2.231                         96%

Simulation when number of factors is correctly specified
The mean of Y66, a variable which is missing 100% of the time (i.e. a variable not measured at a given time point)

  Analysis Method   M.C. Average   M.C. S.E.   Average 95% Interval length   Actual 95% Coverage
  True value        20.8195
  All data          20.7955        0.5128      2.0170                        94%
  Available data    --             --          --                            --
  5 imputations     20.7678        0.6503      2.5554                        95%

Simulation when number of factors is incorrectly specified
The mean of Y49 (average missingness rate = 27%)

  Analysis Method     M.C. Average   M.C. S.E.   Average 95% Interval length   Actual 95% Coverage
  True value          17.074
  All data            17.078         0.4263      1.677                         98%
  Available data      18.854         0.5304      2.091                         7%
  F=5 (true number)   17.072         0.5672      2.231                         96%
  F=6                 17.055         0.4873      1.9153                        94%
  F=4                 17.612         0.5962      2.3429                        89%
  F=3                 17.663         0.6213      2.4410                        86%

Simulation when number of factors is incorrectly specified
The mean of Y66, which has a 100% missingness rate

  Analysis Method     M.C. Average   M.C. S.E.   Average 95% Interval length   Actual 95% Coverage
  True value          20.8195
  All data            20.7955        0.5128      2.0170                        94%
  Available data      --             --          --                            --
  F=5 (true number)   20.7678        0.6503      2.5554                        95%
  F=6                 20.9565        0.7161      2.8142                        94%
  F=4                 20.6473        1.1139      4.3780                        91%
  F=3                 20.4091        1.2484      4.9060                        83%

Example using LFA: oral surgery study
Randomized study of two oral surgery treatments (MMF, RIF) with longitudinal follow-up of quality-of-life (GOHAI) and psychological outcomes
Hierarchical growth-curve model using WINBUGS:
\[
Y_{ij} = \beta_{0i} + \beta_{1i}(t_{ij} - \bar{t}) + \varepsilon_{ij},
\]
\[
\beta_{0i} = \beta_{00} + \beta_{01} S_i + \delta_{0i}, \quad \beta_{1i} = \beta_{10} + \beta_{11} S_i + \delta_{1i},
\]
\[
\delta_{0i} \sim N(0, \sigma_0^2), \quad \delta_{1i} \sim N(0, \sigma_1^2),
\]
where $S_i = 1$ if RIF and $S_i = 0$ if MMF

Summary and future research
Summary
• Factor-analysis methods provide flexible framework for addressing incomplete high-dimensional longitudinal data
Ongoing and future research
• Rounding continuous to binary imputations
• Determining number of factors
• Robustness of methods to normality assumption
• Can the parameters in LFA be estimated by EM or related methods?
• Comparisons with IVEWare and related methods, hot deck approaches

Findings of interest
• Difference in average intercept, average slope between RIF and MMF ($\beta_{01}$, $\beta_{11}$) significant under MI (NORM or LFA) analysis, not under available-case analysis
• Different interpretations emerge from MI analysis (RIF starts lower, ends with comparable values)
• Compared to MI using NORM, MI using LFA has 17%-34% narrower interval estimates for parameters

Goal
To develop general-purpose multiple imputation procedures appropriate for high-dimensional data sets
• Cross-sectional
• Longitudinal

Simulation results: simple regression coefficient
Factor model: coverages 93% - 98% when model correct or overparameterized, 19% - 80% when model underparameterized
MVN model: Frequently fails to converge with non-informative prior, coverages 91% - 99% with ridge prior
Available-case analysis: coverages range from 44% - 100%

Simulation missing data mechanisms
M1 (MAR): First 99 variables MCAR, missingness on last variable according to logistic regression on other 99 with normally distributed coefficients
M2 (MAR): First 99 variables MCAR, missingness on last variable according to logistic regression on other variables included in same factor with half-normal distributed coefficients
M3 (nonignorable but “close” to MAR): Missingness on each variable depends on two other variables in overlapping manner

Equivalence of two factor analysis models
One can write:
\[
\begin{aligned}
\mu + \Lambda f_i
&= \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} + \begin{pmatrix} \Lambda_1 \\ \Lambda_2 \end{pmatrix} \Lambda_2^{-1} (\Lambda_2 f_i) \\
&= \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} + \begin{pmatrix} \Lambda_1 \Lambda_2^{-1} \\ I_k \end{pmatrix} (\Lambda_2 f_i) \\
&= \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} + \begin{pmatrix} \Lambda_1^* \\ I_k \end{pmatrix} f_i^* \\
&= \begin{pmatrix} \mu_1 - \Lambda_1^* \mu_2 \\ 0 \end{pmatrix} + \begin{pmatrix} \Lambda_1^* \\ I_k \end{pmatrix} (f_i^* + \mu_2) \\
&= \begin{pmatrix} \mu_1^* \\ 0 \end{pmatrix} + \begin{pmatrix} \Lambda_1^* \\ I_k \end{pmatrix} f_i^{**}
\end{aligned}
\]

Incorporating multivariate linear mixed-effect model for factor scores
• Rearrange $f_i$ in a matrix form
\[
\tilde{f}_i = \begin{pmatrix} f_{i1}^T \\ f_{i2}^T \\ \vdots \\ f_{it}^T \end{pmatrix}
\]
Then $\tilde{f}_i$ can be modeled by
\[
\underset{t \times k}{\tilde{f}_i} = \underset{t \times m}{X_i}\,\underset{m \times k}{\alpha} + \underset{t \times q}{Z_i}\,\underset{q \times k}{\gamma_i} + \underset{t \times k}{\delta_i}
\]
We assume that the t rows of $\delta_i$ are iid $N(0, \Sigma_{\delta\delta})$ and $(\gamma_i)^V \sim N(0, \Phi)$. Thus
\[
\begin{pmatrix} f_{i1} \\ f_{i2} \\ \vdots \\ f_{it} \end{pmatrix} \sim N\left( (X_i \alpha)^V,\; (Z_i \otimes I_k)\, \Phi\, (Z_i^T \otimes I_k) + I_t \otimes \Sigma_{\delta\delta} \right)
\]

Modified LFA with covariates
• Combining the LFA with the linear mixed-effect model, we obtain
\[
Y_i = \begin{pmatrix} \begin{pmatrix} Y_{i11} \\ Y_{i12} \end{pmatrix} \\ \begin{pmatrix} Y_{i21} \\ Y_{i22} \end{pmatrix} \\ \vdots \\ \begin{pmatrix} Y_{it1} \\ Y_{it2} \end{pmatrix} \end{pmatrix}
= \begin{pmatrix} \begin{pmatrix} \beta_{01} \\ 0 \end{pmatrix} \\ \begin{pmatrix} \beta_{02} \\ 0 \end{pmatrix} \\ \vdots \\ \begin{pmatrix} \beta_{0t} \\ 0 \end{pmatrix} \end{pmatrix}
+ \begin{pmatrix}
\begin{pmatrix} \beta_{11} \\ I \end{pmatrix} & 0 & \cdots & 0 \\
0 & \begin{pmatrix} \beta_{12} \\ I \end{pmatrix} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \begin{pmatrix} \beta_{1t} \\ I \end{pmatrix}
\end{pmatrix}
\left[ (X_i \alpha + Z_i \gamma_i)^{RV} + \delta_i^{RV} \right]
+ \begin{pmatrix} \begin{pmatrix} e_{i1} \\ u_{i1} \end{pmatrix} \\ \begin{pmatrix} e_{i2} \\ u_{i2} \end{pmatrix} \\ \vdots \\ \begin{pmatrix} e_{it} \\ u_{it} \end{pmatrix} \end{pmatrix}
\]

Linear growth curve model estimates: Available-case analysis, MI using NORM, MI using LFA

             Available Case Analysis       Multiple Imputation Using NORM   Multiple Imputation Using LFA
  Estimate   Posterior Mean   95% CI           Posterior Mean   95% CI              Posterior Mean   95% CI
  Beta00     28.55            (26.24, 30.92)   29.30            (26.35, 32.33)      28.90            (26.45, 31.20)
  Beta01     -0.29            (-4.67, 4.05)    -4.24            (-7.18, -1.44)*     -3.93            (-5.72, -1.95)*
  Beta10     7.07             (4.78, 9.24)*    6.15             (1.90, 9.79)*       6.57             (2.24, 9.34)*
  Beta11     1.86             (-2.42, 5.96)    2.72             (0.20, 5.38)*       2.69             (0.92, 5.02)*

*p<0.05.
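
Arithmetic check of quoted counts and simulation standard errors
The parameter counts and Monte Carlo standard errors quoted in these slides can be reproduced with a few lines of arithmetic. The sketch below is illustrative only and is not part of the original presentation: the function names are hypothetical, and the standard error uses the usual binomial formula sqrt(c(1-c)/R) for a coverage rate c estimated from R replicates, which is assumed (not stated in the slides) to be how the 1.5% and 0.0218 figures were obtained.

```python
import math


def cov_params(d: int) -> int:
    """Distinct entries in a general d x d covariance matrix."""
    return d * (d + 1) // 2


def corr_params(d: int) -> int:
    """Distinct correlations among d variables."""
    return d * (d - 1) // 2


def ordinary_factor_params(p: int, k: int) -> int:
    """Loadings (p*k) plus uniquenesses (p), minus k(k-1)/2 rotation restrictions."""
    return p + p * k - k * (k - 1) // 2


def eiv_factor_params(p: int, k: int) -> int:
    """Error-in-variables form: beta_1 ((p-k)*k), uniquenesses (p), Sigma_ff (k(k+1)/2)."""
    return (p - k) * k + p + k * (k + 1) // 2


def coverage_se(nominal: float, replicates: int) -> float:
    """Binomial standard error of an estimated coverage rate (assumed formula)."""
    return math.sqrt(nominal * (1.0 - nominal) / replicates)


if __name__ == "__main__":
    # Overparameterization example: 50 variables under a general MVN model.
    print("correlations, 50 variables:", corr_params(50))

    # PAN example: 15 variables, 2 random effects, so Phi is 30 x 30.
    phi = cov_params(15 * 2)
    sigma = cov_params(15)
    print("Phi:", phi, "Sigma_dd:", sigma, "total:", phi + sigma)

    # The two factor-model parameterizations give the same count.
    print("ordinary vs EIV (p=15, k=5):",
          ordinary_factor_params(15, 5), eiv_factor_params(15, 5))

    # Monte Carlo standard errors for 95% coverage statistics.
    se_200 = coverage_se(0.95, 200)
    se_100 = coverage_se(0.95, 100)
    print("SE with 200 replicates:", round(se_200, 4))
    print("SE with 100 replicates:", round(se_100, 4),
          "margin of error:", round(1.96 * se_100, 4))
```

Under these assumptions the script gives 1225 correlations, 465 + 120 = 585 PAN parameters, equal counts for the two factor parameterizations, and standard errors of about 0.015 and 0.0218 (margin of error about 0.0427), matching the figures on the slides.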