Missing Data in Research Studies
Joseph A. Olsen

What do I do about missing data?

Introduction
• What is certain in life?
  – Death
  – Taxes
• What is certain in research?
  – Measurement error
  – Missing data
• Missing data can be:
  – Due to preventable errors, mistakes, or lack of foresight by the researcher
  – Due to problems outside the control of the researcher
  – Deliberate, intended, or planned by the researcher to reduce cost or respondent burden
  – Due to differential applicability of some items to subsets of respondents
  – Etc.

Some Characteristics of Missing Data
• Facets of missing data
  – Persons
  – Variables
  – Occasions
• Type of non-response
  – Unit non-response
  – Block non-response
  – Wave non-response
  – Item non-response
• Special non-response problems in longitudinal and clustered data
  – Attrition/drop-out
  – Group (e.g., family) member non-response

Missing Data Mechanisms (1)
• Preliminaries:
  – Yobs: The non-missing or observed data
  – Ymiss: The missing or unobserved data
  – M: Whether the data on a given item for a given case is missing (1) or not (0)
• Missing Completely at Random (MCAR)
  – The probability that an item is missing (M) is unrelated to either the observed (Yobs) or the unobserved (Ymiss) data
• Missing at Random (MAR)
  – The probability that an item is missing (M) may be related to the observed data (Yobs) but is unrelated to the unobserved data (Ymiss)
• Missing Not at Random (MNAR)
  – The probability that an item is missing (M) is related to the (unknown) value of the unobserved data (Ymiss), even after conditioning on the observed data (Yobs)

Missing Data Mechanisms (2)
• The appropriateness of different missing data treatments depends (among other things) on the underlying missing data mechanism
• “Real” missing data can seldom be classified into just one of the three (MCAR, MAR, MNAR)
• Because we don’t have access to the missing data (Ymiss), we cannot empirically test whether or not the data is MNAR
• If we know (or can convincingly argue) that the data is not MNAR, a test of whether the data
  is MCAR is available (e.g., in SPSS Missing Values Analysis).

Missing Data in Research Studies
• Missing data mechanism
  – Missing completely at random (MCAR): ignorable
  – Missing at random (MAR): conditionally ignorable
  – Missing not at random (MNAR): nonignorable
• Amount of missing data
  – Percent of cases with missing data
  – Percent of variables having missing data
  – Percent of data values that are missing
• Pattern of missing data
  – Missing by design
  – Missing data patterns: univariate, monotonic, file matching, general

Goals of a Missing Data Treatment
• Preserve the essential characteristics of the data
  – Distributions of the variables
  – Relationships among the variables
• Maintain the representativeness of the analyzed data
• Provide valid statistical inference (control Type I error)
• Maximize the statistical power of the study and its statistical analyses (minimize Type II error)
• Avoid bias and instability in the parameter estimates and standard errors for statistical models

Older Missing Data Treatments (1)
• Deletion methods
  – Listwise deletion (complete case analysis)
  – Pairwise deletion (available case analysis)
• Cold deck imputation
• Deterministic, logical, or rule-based imputation
• Treat missing data for nominal predictors as an additional category
• Hot deck (donor case) imputation
  – Cluster based methods
  – Distance based (e.g.
    nearest neighbor) methods
• Mean substitution
  – (Variable) mean substitution
  – Mean substitution with added random error
  – Predictor mean substitution with missing data dichotomy

Older Missing Data Treatments (2)
• Regression imputation
  – Regression predicted value imputation
  – Regression imputation with added random error
• Special methods for longitudinal studies and randomized controlled trials
  – Endpoint only analysis
  – Last observation carried forward (LOCF)
  – Intent to treat worst (best) case imputation
  – Summary growth parameters
• Special methods for multi-item scales
  – Available item method of scale construction
  – Person mean imputation
  – Two-way imputation
  – Two-way imputation with added random error

Newer Missing Data Treatments
• Modern state-of-the-art missing data treatments for MAR data
  – Maximum likelihood
  – Multiple imputation
• Cutting edge investigational missing data treatments for MNAR data
  – Pattern mixture models
  – Selection models
  – Shared parameter models
  – Inverse probability weighting

Statistical Analysis with Missing Data
• What do you get when you don’t specify what you want?
• What choices do you have within a given analysis procedure?
  – Often, listwise deletion is the default (and only) option (SPSS Reliability and GLM)
  – Listwise default with pairwise and mean substitution as options (SPSS Factor and Regression Analysis)
  – Pairwise default with listwise option (SPSS Correlation)
• Modeling approaches that incorporate missing data handling
  – Survival models
  – Mixed effects models
  – Structural equation models
• Missing data treatments carried out prior to analysis
  – Ad hoc methods (listwise, pairwise, single imputation, etc.)
  – Modern methods (maximum likelihood, multiple imputation)

Modern Missing Data Treatments
• Maximum likelihood (ML)
  – Estimates summary statistics or statistical models using all available data
  – Available in modern structural equation modeling software (Amos, EQS, LISREL, Mplus, Mx, etc.)
  – The ML covariance matrix and mean vector can also be obtained from SPSS MVA and used for standard Regression, Factor analysis, Reliability, and other procedures
  – There are also freeware and open source programs that can produce the ML covariance matrix and mean vector, usually by using the Expectation Maximization (EM) algorithm (e.g., EMCOV)
• Multiple imputation
  – Imputes individual data values in multiple complete datasets, averaging the results of the statistical analyses across these datasets
  – Available in the current versions of certain SEM software (Amos, Mplus)
  – Also available in SPSS (MVA), SAS (Proc MI and MIANALYZE), Stata (mi impute and mi estimate), and stand-alone missing data packages such as SOLAS

Why do social scientists use modern missing data treatments so infrequently?
• Lack of awareness or familiarity
• They are not convinced of the problems with older methods
• The statistical literature on missing data is technically daunting
• The techniques aren’t incorporated into the standard statistical analysis procedures used by social scientists
• Journal reviewers and editors have not required it
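
Illustration: listwise deletion versus multiple imputation under MAR

The contrast between an older treatment (listwise deletion) and a modern one (multiple imputation) can be made concrete with a small simulation. The sketch below is not from the original slides. It assumes a hypothetical two-variable dataset in which y is missing with a probability that depends only on the fully observed x (so the data are MAR), and it uses a simplified imputation model: a linear regression refitted on a bootstrap sample of the observed cases, with random error added to each imputed value. The function names (simulate_mar, impute_once, pool_rubin, run_demo) and all numeric settings are illustrative assumptions. Results are combined with Rubin's rules, which average the estimates and add the between-imputation variance to the average within-imputation variance.

```python
import numpy as np


def simulate_mar(n=2000, seed=0):
    """Simulate (x, y) where y is missing at random given x (hypothetical data)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 1.0 + 0.8 * x + rng.normal(scale=0.6, size=n)  # true mean of y is 1.0
    # The probability that y is missing depends only on the observed x (MAR).
    p_miss = 1.0 / (1.0 + np.exp(-1.5 * x))
    m = rng.random(n) < p_miss  # M = True where y is missing
    y_obs = np.where(m, np.nan, y)
    return x, y_obs


def impute_once(x, y_obs, rng):
    """Create one completed copy of y using regression imputation with added error."""
    obs = np.flatnonzero(~np.isnan(y_obs))
    mis = np.flatnonzero(np.isnan(y_obs))
    # Bootstrap the observed cases so that each imputation also reflects
    # uncertainty in the regression coefficients (a simplifying assumption).
    boot = rng.choice(obs, size=obs.size, replace=True)
    design = np.column_stack([np.ones(boot.size), x[boot]])
    beta, *_ = np.linalg.lstsq(design, y_obs[boot], rcond=None)
    resid = y_obs[boot] - design @ beta
    sigma = np.sqrt(resid @ resid / (boot.size - 2))
    y_imp = y_obs.copy()
    noise = rng.normal(scale=sigma, size=mis.size)
    y_imp[mis] = beta[0] + beta[1] * x[mis] + noise
    return y_imp


def pool_rubin(estimates, variances):
    """Combine per-dataset estimates and sampling variances with Rubin's rules."""
    m = len(estimates)
    q_bar = np.mean(estimates)            # pooled point estimate
    within = np.mean(variances)           # average within-imputation variance
    between = np.var(estimates, ddof=1)   # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, np.sqrt(total)


def run_demo(n_imputations=20):
    """Compare the listwise-deletion mean of y with the multiply imputed mean."""
    x, y_obs = simulate_mar()
    rng = np.random.default_rng(1)
    listwise_mean = np.nanmean(y_obs)  # complete cases only
    estimates, variances = [], []
    for _ in range(n_imputations):
        y_imp = impute_once(x, y_obs, rng)
        estimates.append(y_imp.mean())
        variances.append(y_imp.var(ddof=1) / y_imp.size)
    mi_mean, mi_se = pool_rubin(estimates, variances)
    return {"listwise_mean": listwise_mean, "mi_mean": mi_mean, "mi_se": mi_se}


result = run_demo()
print(f"Listwise deletion mean of y: {result['listwise_mean']:.3f}")
print(f"Multiple imputation mean of y: {result['mi_mean']:.3f} (SE {result['mi_se']:.3f})")
```

Because cases with high x are more likely to be missing y, the complete cases under-represent large values of y, and the listwise mean should fall below the true value of 1.0, while the multiply imputed mean should land close to it. For real analyses, the established procedures named above (e.g., SAS Proc MI and MIANALYZE, Stata mi impute and mi estimate) are the appropriate tools; this sketch only shows the logic.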