Analysis of incomplete data regression models in health services studies

Nicholas J. Horton
Department of Mathematics and Statistics
Smith College
nhorton@email.smith.edu
http://www.math.smith.edu/~nhorton
June 25th, 2006

Acknowledgements
Joint work with Ken Kleinman, Department of Ambulatory Care Policy, Harvard Medical School
Funding from NIH MH54693

Plan for talk
• Introduction and motivation
• Example dataset
• Missing data nomenclature
• Missing data methods
• Application
• Concluding thoughts

Motivation
• missing data are a common problem
• may be due to design or happenstance
• ignoring missing data may lead to inefficiency
• ignoring missing data may lead to bias

Motivation (cont.)
• particularly salient for health services research
  – more opportunity for missingness in larger studies
  – administrative datasets may not have complete coverage
  – some items may be intentionally censored
• software to fit incomplete data regression models is improving (but not yet entirely there!)

Example Dataset
• Kids' Inpatient Database (KID)
• developed by the Healthcare Cost and Utilization Project (HCUP)
• sponsored by the Agency for Healthcare Research and Quality (AHRQ)
• Year 2000 dataset contains data from 27 State Inpatient Databases

Inferential Goal of Analysis
What factors predict
• whether 10-20 year old subjects
• with a primary, secondary or tertiary diagnosis of mental health or substance abuse issues
• are discharged from a hospitalization in a routine fashion (e.g. not against medical advice (AMA), transferred to another facility, or died)

Predictors with Complete Data
• AGE (in years)
• LOS (length of stay, in days)
• NDX (number of medical diagnoses)
• WEEKEND (=1 if admitted on a weekend)
• FEMALE (=1 if female)

Predictors with Missing Data
• RACE (1=Caucasian, 2=Black, 3=Hispanic, 4=Other)
• SEASON (Winter, Spring, Summer, Fall)
• ATYPE (Admission type: 1=emergency, 2=urgent, 3=elective, 4=other)
• TOTCHG (Total charges, in dollars)
• reasons for missingness?
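Asking why values are missing usually starts with a tabulation of which variables are missing together. The sketch below shows one way such a tabulation might be done. The talk itself used the S-Plus missing data library; plain Python is swapped in here. The function names and toy records are hypothetical illustrations, not KID data.

```python
from collections import Counter

def pattern_string(row, variables):
    """Return a string such as '..m.' with 'm' where a variable is missing (None)."""
    return "".join("m" if row.get(v) is None else "." for v in variables)

def tabulate_patterns(rows, variables):
    """Count each missingness pattern, most frequent first."""
    counts = Counter(pattern_string(row, variables) for row in rows)
    return counts.most_common()

# Hypothetical toy records (not KID data); None marks a missing value.
VARIABLES = ["TOTCHG", "ATYPE", "SEASON", "RACE"]
toy_rows = [
    {"TOTCHG": 5200.0, "ATYPE": 1, "SEASON": "Winter", "RACE": 1},
    {"TOTCHG": 830.0, "ATYPE": 2, "SEASON": "Fall", "RACE": None},
    {"TOTCHG": 1200.0, "ATYPE": 1, "SEASON": None, "RACE": None},
    {"TOTCHG": None, "ATYPE": 3, "SEASON": "Spring", "RACE": 2},
    {"TOTCHG": 9100.0, "ATYPE": 1, "SEASON": "Summer", "RACE": 3},
]

patterns = tabulate_patterns(toy_rows, VARIABLES)
print("".join(str(i + 1) for i in range(len(VARIABLES))), "count")
for pattern, count in patterns:
    print(pattern, count)
```

Each character position corresponds to one variable in the order given by VARIABLES, so '..mm' means SEASON and RACE are both missing for that record.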
Missing Data Patterns (S-Plus missing data library)
10 variables, 135344 observations, 12 patterns
4 vars. (40%) have at least one missing value
55770 obs. (41%) have at least one missing value

Breakdown by variable
V   O    name     Missing   % missing
1   8    TOTCHG      5021       4
2   2    ATYPE      15093      11
3   10   SEASON     15616      12
4   7    RACE       21888      16

Missing Data Patterns (cont.)
      1234    count
 1    ....    79574   <- complete cases
 2    ...m    21335   <- missing RACE
 3    ..m.    15354   <- missing SEASON
 4    .m..    13601   <- missing ATYPE
 5    m...     3665   <- missing TOTCHG
 6    ..mm      213   <- missing SEASON + RACE
 7    .m.m      234
11    mm..     1213

Missing Data Nomenclature: monotonicity
• hierarchy exists such that completeness in one variable determines completeness of another
• monotone patterns simplify analysis
      1234
 1    ....
 2    ...m
 3    ..mm
 4    .mmm
 5    mmmm

Missing Data Nomenclature: monotonicity
• KID dataset is decidedly non-monotone
• To create a monotone pattern would require arbitrarily dropping some observations
      1234    count
 1    ....    79574   <- complete cases
 2    ...m    21335   <- missing RACE
 6    ..mm      213   <- missing SEASON + RACE

Notation
• Y: outcome of regression model
• X: predictor in regression model (typically a vector, X1, X2, ..., Xp)
• f(Y | X, β): regression model of interest

Missing data nomenclature: mechanisms
• Introduced by Little and Rubin (text, 1987, 2002)
• Let R = 1 denote that a particular variable (say Y2) is missing in a longitudinal study
• What assumptions are we willing to make regarding the missingness law: f(R | Y1, Y2, X, γ)?
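One way to make the missingness law f(R | Y1, Y2, X, γ) concrete is a logistic model for R. The sketch below is a toy simulation under assumptions that are not from the talk: covariates X are omitted, (Y1, Y2) are standard normal with correlation 0.5, and P(R = 1) = expit(γ0 + γ1·Y1 + γ2·Y2). The function names, parameter values and seed are hypothetical. Setting γ1 = γ2 = 0 makes missingness independent of the data, γ2 = 0 lets it depend only on the observed Y1, and γ2 ≠ 0 lets it depend on the possibly unobserved Y2.

```python
import math
import random

def expit(z):
    """Logistic function, mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def observed_mean_y2(gamma1, gamma2, n=20000, gamma0=-1.0, seed=2006):
    """Simulate (Y1, Y2), set Y2 missing with probability
    expit(gamma0 + gamma1*Y1 + gamma2*Y2), and return the mean of the
    Y2 values that remain observed (a complete-case estimate)."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n):
        y1 = rng.gauss(0.0, 1.0)
        # Y2 has true mean 0, variance 1, and correlation 0.5 with Y1.
        y2 = 0.5 * y1 + rng.gauss(0.0, math.sqrt(0.75))
        r = rng.random() < expit(gamma0 + gamma1 * y1 + gamma2 * y2)  # True = missing
        if not r:
            kept.append(y2)
    return sum(kept) / len(kept)

mean_none = observed_mean_y2(0.0, 0.0)     # missingness ignores the data
mean_on_y1 = observed_mean_y2(1.5, 0.0)    # missingness depends on observed Y1
mean_on_y2 = observed_mean_y2(0.0, 1.5)    # missingness depends on Y2 itself

print("true mean of Y2: 0")
print("observed mean, no dependence:  %.3f" % mean_none)
print("observed mean, depends on Y1:  %.3f" % mean_on_y1)
print("observed mean, depends on Y2:  %.3f" % mean_on_y2)
```

In this toy setup the complete-case mean should sit near zero only when missingness ignores the data. When missingness depends on Y1, the bias can in principle be corrected because Y1 is observed for everyone; when it depends on Y2 itself, the observed data alone do not identify the correction.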
Missing data nomenclature: MCAR (Missing Completely at Random)
• f(R | Y1, Y2, X) = f(R)
• Missingness does not depend on observed or unobserved quantities
• Example: data fell from the truck

Missing data nomenclature: MAR (Missing at Random)
• f(R | Y1, Y2, X) = f(R | Y1, X)
• Missingness does not depend on unobserved quantities
• Example: doctor took a subject off a longitudinal trial because they were too sick (based on observed Y1)

Missing data nomenclature: NINR (Nonignorable nonresponse)
• f(R | Y1, Y2, X) = f(R | Y1, Y2, X) (no simplification)
• Missingness depends on unobserved quantities
• Example: subject missed their observation Y2 because they were too sick to get out of bed

Missing data nomenclature
• Little and Rubin showed that if missingness is MAR, then likelihood-based approaches can ignore the missing data mechanism and still yield the right answer
• MAR is impossible to verify without auxiliary information
• NINR models require a lot of work modeling missingness; best used for sensitivity analyses

(Partial) Taxonomy of methods
• Complete case
• Multiple imputation methods
• Maximum likelihood methods
• Excellent review by Ibrahim and colleagues (JASA 2005)

Complete case approach
• Simple
• Main drawback: inefficient
• Uses only 59% of the KID dataset!
• May yield bias

Multiple imputation
• 'fill in' the missing values with some 'appropriate' value to give a completed dataset
• repeat this process multiple times
• combine results from each of these multiple imputations
• requires a model to 'fill in' the values
• Originally due to Rubin (1978)

Specifying imputation model
• Most complicated task (since running the separate analyses is fast and cheap)
• Simpler when the predictors and outcome are plausibly multivariate normal
• Harder with categorical missing values
• Even harder if non-monotone

Specifying imputation model for dichotomous variable (cont.)
• Use a normal model (but how to include in analysis?)
• Use a normal model and round (leads to bias, Horton et al 2003)
• Use a discriminant model (slightly harder but feasible)

Models for dichotomous imputation
Correctly specifying the imputation model for dichotomous variables is feasible (Rubin, 1987, p.169)
• estimate the probability that Yi = 1
• for each imputation, generate a uniform (0,1) random variable
• set Yi = 1 if the uniform random variable is less than the estimated probability (and 0 otherwise)

Specifying imputation model (cont.)
• What if there are multiple categorical variables with missing values?
• Straightforward if monotone pattern (SAS PROC MI)
• Use of MICE in R or Stata (Multivariate Imputation by Chained Equations, van Buuren et al 1999, Royston 2005): impute one value, use that to impute the next, and repeat

Maximum likelihood
• Typically we are interested in f(Y | X, β) where the covariates are assumed fixed
• To gain information from partially observed subjects, posit a distribution f(X | α) for the covariates
• Maximize the likelihood of f(Y, X | β, α), typically through use of the EM (Expectation-Maximization) algorithm
• Originally due to Ibrahim (1990)

Maximum likelihood (via EM)
Alternate between:
• calculating the Expected value of the missing observations
• Maximizing the complete data log likelihood given those values
• formalized by Dempster, Laird and Rubin (1977)

Maximum likelihood implementations in LogXact
• Supports up to 10 categorical covariates with missing values, allows covariates to take up to 5 values (e.g. 0,1,2,3,4)
• Uses a simplifying approach due to Lipsitz and Ibrahim (1996) to partition the joint distribution of the missing values

Maximum likelihood implementations in LogXact (cont.)
f(X1, X2, X3, X4) = f(X1) f(X2 | X1) f(X3 | X1, X2) f(X4 | X1, X2, X3)
• Support for continuous missing values feasible in future versions (but requires use of MCEM and further modeling assumptions)

Maximum likelihood implementations in S-Plus
• Supports multivariate normal for continuous random variables conditional on categorical ones (conditional Gaussian) (Schafer, 1997)
• Requires full specification of a log-linear model for the categorical random variables

Results for KID (descriptive statistics)
variable    percentage
ROUTINE     86%
WEEKEND     20%
FEMALE      54%
WHITE       57%

Results for KID (descriptive statistics)
variable    mean (SD)
AGE         16.3 (2.7)
LOS         6.4 (12.7)
TOTCHG      $9,230 ($17,371)
NDX         3.5 (2.0)

Results for KID (complete case model)
parameter   OR      p-value
AGE         0.96    <0.001
WEEKEND     0.94    0.025
FEMALE      1.09    <0.001
LOS         0.997   <0.001
TOTCHG      0.999   <0.001
NDX         0.90    <0.001

Results for KID (CC, cont.)
parameter   df   p-value
SEASON      3    0.006
RACE        3    <0.001
ATYPE       3    <0.001
SEASON: winter and fall least likely non-routine
RACE: white more likely routine
ATYPE: non-emergency most likely routine

Results for KID (comparisons)
How does accounting for missingness (in this case using SAS PROC MI and LogXact) affect our estimates (log OR)?
param               CC                PROC MI           LogXact
AGE est (se)        -0.040 (0.0040)   -0.036 (0.0032)   -0.039 (0.0031)
FEMALE est (se)     0.088 (0.0210)    0.118 (0.0170)    0.106 (0.0161)

Discussion
• Complete case estimator simple, but may be inefficient and biased (particularly when missingness depends on Y)
• Missing data methods are available, require imposition of assumptions (MAR) and additional effort, but yield efficiency gains (of approximately 25% in our example)

Discussion (cont.)
• a variety of models have been proposed in the statistical literature; many of these make simplifying assumptions or have been coded specifically for a given situation
• handling missingness in general in this setting remains difficult (requires compromises)

Future work
• further work is needed to assess sensitivity to assumptions and areas where these methods have greatest potential
• use of NINR models in this setting
• accounting for clustering and survey design (straightforward in Stata)

References
Dempster AP et al (1977) Maximum likelihood from incomplete data via the EM algorithm, JRSS-B, 39:1-22.
Horton NJ and Laird NM (1999) Maximum likelihood analysis of generalized linear models with missing covariates, SMIMR, 8:37-50.
Horton NJ and Lipsitz SR (2001) Multiple imputation in practice: comparison of software packages for regression models with missing variables, TAS, 55:244-254.
Ibrahim JG (1990) Incomplete data in generalized linear models, JASA, 85:765-769.
Ibrahim JG et al (2005) Missing-data methods for generalized linear models: a comparative review, JASA, 100:332-346.
Lipsitz SR and Ibrahim JG (1996) A conditional model for incomplete covariates in parametric regression, Biometrika, 83:916-922.
Little RJA (1992) Regression with missing X's: a review, JASA, 87:1227-1237.
Little RJA and Rubin DB (2002) Statistical analysis with missing data, 2nd edition, Wiley.
Royston P (2005) Multiple imputation of missing values, Stata Technical Journal, 5(4):527-536.
Rubin DB (1987) Multiple imputation for nonresponse in surveys, Wiley.
Rubin DB (1996) Multiple imputation after 18+ years, JASA, 91:473-489.
Schafer JL (1997) Analysis of incomplete multivariate data, Chapman and Hall.
van Buuren S et al (1999) Multiple imputation of missing blood pressure covariates in survival analysis, Statistics in Medicine, 18:681-694.