Research and Missing Data

Missing Data in Research Studies
Joseph A. Olsen
What do I do about missing data?
Introduction
• What is certain in life?
– Death
– Taxes
• What is certain in research?
– Measurement error
– Missing data
• Missing data can be:
– Due to preventable errors, mistakes, or lack of foresight by the
researcher
– Due to problems outside the control of the researcher
– Deliberate, intended, or planned by the researcher to reduce cost or
respondent burden
– Due to differential applicability of some items to subsets of
respondents
– Etc.
Some Characteristics of Missing Data
• Facets of missing data
– Persons
– Variables
– Occasions
• Type of non-response
– Unit non-response
– Block non-response
– Wave non-response
– Item non-response
• Special non-response problems in longitudinal and
clustered data
– Attrition/drop-out
– Group (e.g. family) member non-response
Missing Data Mechanisms (1)
• Preliminaries:
– Yobs: The non-missing or observed data
– Ymiss: The missing or unobserved data
– M: Whether the data on a given item for a given case is missing (1)
or not (0)
• Missing Completely at Random (MCAR)
– The probability that an item is missing (M) is unrelated to either the
observed (Yobs) or the unobserved (Ymiss) data
• Missing at Random (MAR)
– The probability that an item is missing (M) may be related to the
observed data (Yobs) but is unrelated to the unobserved data (Ymiss)
• Missing Not at Random (MNAR)
– The probability that an item is missing (M) is related to the
(unknown) value of the unobserved data (Ymiss), even after
conditioning on the observed data (Yobs)
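The three mechanisms can be illustrated with a small simulation. This is a minimal sketch, not part of the original slides: the variable names, the 0.7 slope, and the missingness probabilities are all assumptions chosen for illustration. Here x is always observed, y is subject to missingness, and the true mean of y is 0.

```python
import random
import statistics

def simulate(mechanism, n=20000, seed=0):
    """Return the values of y that remain observed under a given mechanism."""
    rng = random.Random(seed)
    observed = []
    for _ in range(n):
        x = rng.gauss(0, 1)                # always observed (part of Yobs)
        y = 0.7 * x + rng.gauss(0, 1)      # subject to missingness
        if mechanism == "MCAR":
            p_missing = 0.3                # unrelated to x or y
        elif mechanism == "MAR":
            p_missing = 0.6 if x > 0 else 0.1   # depends only on observed x
        elif mechanism == "MNAR":
            p_missing = 0.6 if y > 0 else 0.1   # depends on y itself
        else:
            raise ValueError(mechanism)
        m = 1 if rng.random() < p_missing else 0   # M = 1 means missing
        if m == 0:
            observed.append(y)
    return observed

# Complete-case mean of y under each mechanism (true mean is 0).
means = {mech: statistics.fmean(simulate(mech))
         for mech in ("MCAR", "MAR", "MNAR")}
```

In this setup the complete-case mean of y stays close to 0 under MCAR and drifts negative under MAR and MNAR. Under MAR the distortion can in principle be corrected using x; under MNAR it cannot be corrected from the observed data alone.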
Missing Data Mechanisms (2)
• The appropriateness of different missing data
treatments depends (among other things) on the
underlying missing data mechanism
• “Real” missing data can seldom be classified into
just one of the three (MCAR, MAR, MNAR)
• Because we don’t have access to the missing data
(Ymiss), we cannot empirically test whether or not
the data are MNAR
• If we know (or can convincingly argue) that the
data are not MNAR, a test of whether the data are
MCAR is available (e.g. in SPSS Missing Values
Analysis).
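The idea behind a check for MCAR can be sketched informally: if missingness on y is unrelated to the observed data, cases with y missing and cases with y observed should look alike on an observed variable x. The code below is a simplified illustration, not the formal test provided by SPSS (Little's MCAR test). The function names and the simulated data are assumptions.

```python
import math
import random
import statistics

def welch_t(a, b):
    """Welch t statistic for the difference in means of two samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    diff = statistics.fmean(a) - statistics.fmean(b)
    return diff / math.sqrt(va / len(a) + vb / len(b))

def missingness_t(p_missing_given_x, n=5000, seed=1):
    """Compare observed x between cases with y missing and y observed."""
    rng = random.Random(seed)
    x_when_missing, x_when_observed = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        if rng.random() < p_missing_given_x(x):
            x_when_missing.append(x)
        else:
            x_when_observed.append(x)
    return welch_t(x_when_missing, x_when_observed)

t_mcar = missingness_t(lambda x: 0.3)                    # missingness ignores x
t_mar = missingness_t(lambda x: 0.6 if x > 0 else 0.1)   # missingness depends on x
```

A t statistic near zero is consistent with MCAR, while a large one indicates that missingness is related to the observed data. Neither outcome says anything about MNAR, since that depends on values that were never observed.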
Missing Data in Research Studies
• Missing data mechanism
– Missing completely at random (MCAR): Ignorable
– Missing at random (MAR): Conditionally ignorable
– Missing not at random (MNAR): Nonignorable
• Amount of missing data
– Percent of cases with missing data
– Percent of variables having missing data
– Percent of data values that are missing
• Pattern of missing data
– Missing by design
– Missing data patterns: univariate, monotonic, file matching, general
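The amount and pattern of missing data can be summarized directly from a data matrix. The following is a minimal sketch under assumed conventions: rows are cases, columns are variables, and None marks a missing value. The function names and example data are hypothetical.

```python
from collections import Counter

def missing_summary(data):
    """Percent of cases, variables, and data values with missing data."""
    n_cases, n_vars = len(data), len(data[0])
    m = [[value is None for value in row] for row in data]  # missingness indicators
    pct_cases = 100 * sum(any(row) for row in m) / n_cases
    pct_vars = 100 * sum(any(row[j] for row in m) for j in range(n_vars)) / n_vars
    pct_values = 100 * sum(sum(row) for row in m) / (n_cases * n_vars)
    return pct_cases, pct_vars, pct_values

def pattern_counts(data):
    """Count distinct missingness patterns ('O' observed, 'M' missing)."""
    return Counter("".join("M" if v is None else "O" for v in row) for row in data)

def is_monotone(data):
    """True if variables can be ordered so that, within every case,
    no observed value follows a missing one."""
    n_vars = len(data[0])
    n_missing = [sum(row[j] is None for row in data) for j in range(n_vars)]
    order = sorted(range(n_vars), key=lambda j: n_missing[j])
    for row in data:
        flags = [row[j] is None for j in order]
        if flags != sorted(flags):   # an observed value follows a missing one
            return False
    return True

general = [
    [1.0, 2.0, 3.0],
    [1.5, None, 2.5],
    [2.0, None, None],
    [0.5, 1.0, None],
    [1.2, 2.2, 3.1],
]
dropout = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, None],
    [1.0, None, None],
]

pct_cases, pct_vars, pct_values = missing_summary(general)
```

The `dropout` example mimics attrition in a longitudinal study, which produces a monotonic pattern; the `general` example does not, because one case is missing the second variable but observed on the third.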
Goals of a Missing Data Treatment
• Preserve the essential characteristics of the data
– Distributions of the variables
– Relationships among the variables
• Maintain the representativeness of the analyzed
data
• Provide valid statistical inference (control Type I
error)
• Maximize the statistical power of the study and its
statistical analyses (minimize Type II error)
• Avoid bias and instability in the parameter
estimates and standard errors for statistical
models
Older Missing Data Treatments (1)
• Deletion methods
– Listwise deletion (complete case analysis)
– Pairwise deletion (available case analysis)
• Cold deck imputation
• Deterministic, logical, or rule-based imputation
• Treat missing data for nominal predictors as an additional category
• Hot deck (donor case) imputation
– Cluster-based methods
– Distance-based (e.g. nearest neighbor) methods
• Mean substitution
– (Variable) mean substitution
– Mean substitution with added random error
– Predictor mean substitution with missing data dichotomy
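A small example may help show what the deletion methods and plain mean substitution do to a dataset. This is an illustrative sketch with made-up scores (None marks a missing value); the helper names are hypothetical.

```python
import statistics

# Hypothetical scores on three variables for six cases.
x = [4.0, 5.0, None, 7.0, 8.0, 6.0]
y = [2.0, None, 3.0, 5.0, 6.0, None]
z = [1.0, 2.0, 2.0, None, 4.0, 3.0]

def complete_rows(*columns):
    """Indices of cases observed on every listed variable."""
    return [i for i, row in enumerate(zip(*columns)) if None not in row]

# Listwise deletion: one common sample of fully observed cases.
listwise_n = len(complete_rows(x, y, z))

# Pairwise deletion: each pair of variables uses its own available cases.
pairwise_n = {
    "xy": len(complete_rows(x, y)),
    "xz": len(complete_rows(x, z)),
    "yz": len(complete_rows(y, z)),
}

def mean_substitute(values):
    """Replace each missing value with the mean of the observed values."""
    mean = statistics.fmean(v for v in values if v is not None)
    return [mean if v is None else v for v in values]

y_observed = [v for v in y if v is not None]
y_filled = mean_substitute(y)

var_observed = statistics.variance(y_observed)
var_filled = statistics.variance(y_filled)
```

Listwise deletion keeps only 2 of the 6 cases here, while pairwise deletion bases each statistic on a different subsample. Mean substitution leaves the mean of y unchanged but shrinks its variance, which is one reason the variants with added random error were proposed.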
Older Missing Data Treatments (2)
• Regression imputation
– Regression predicted value imputation
– Regression imputation with added random error
• Special methods for longitudinal studies and randomized
controlled trials
– Endpoint only analysis
– Last observation carried forward (LOCF)
– Intent to treat worst (best) case imputation
– Summary growth parameters
• Special methods for multi-item scales
– Available item method of scale construction
– Person mean imputation
– Two-way imputation
– Two-way imputation with added random error
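Two of the methods listed above are simple enough to sketch in a few lines. The code below is an illustration under assumptions, not a reference implementation: None marks a missing value, the function names are hypothetical, and two-way imputation is taken in its usual form of person mean plus item mean minus overall mean, without the added random error.

```python
import statistics

def locf(series):
    """Last observation carried forward for one person's repeated measures."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)   # stays None if nothing has been observed yet
    return filled

def two_way_impute(scores):
    """Fill each missing item score with person mean + item mean - overall mean.

    Assumes every person and every item has at least one observed score.
    """
    overall = statistics.fmean(v for row in scores for v in row if v is not None)
    person_means = [statistics.fmean(v for v in row if v is not None)
                    for row in scores]
    item_means = [statistics.fmean(row[j] for row in scores if row[j] is not None)
                  for j in range(len(scores[0]))]
    return [[person_means[i] + item_means[j] - overall if v is None else v
             for j, v in enumerate(row)]
            for i, row in enumerate(scores)]

# Hypothetical data: one trial participant over five waves,
# and three respondents answering a three-item scale.
waves = [10, None, None, 7, None]
scale = [
    [4, 5, None],
    [2, None, 3],
    [3, 4, 5],
]

waves_filled = locf(waves)
scale_filled = two_way_impute(scale)
```

Person mean imputation would fill the first respondent's missing item with 4.5, that person's own average. Two-way imputation adjusts this for how the missing item compares with the overall mean across all items.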
Newer Missing Data Treatments
• Modern state-of-the-art missing data
treatments for MAR data
– Maximum likelihood
– Multiple imputation
• Cutting-edge investigational missing data
treatments for MNAR data
– Pattern mixture models
– Selection models
– Shared parameter models
– Inverse probability weighting
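The mechanics of inverse probability weighting can be shown with a toy example. This sketch rests on a strong simplifying assumption: the probability of being observed is estimated from an observed grouping variable only, so it illustrates the weighting step rather than a full treatment of MNAR data. The records and function name are hypothetical.

```python
from collections import defaultdict

def ipw_mean(records):
    """Weighted complete-case mean with weights 1 / P(observed | group)."""
    totals, observed = defaultdict(int), defaultdict(int)
    for group, y in records:
        totals[group] += 1
        observed[group] += y is not None
    # Estimated probability of being observed within each group.
    propensity = {g: observed[g] / totals[g] for g in totals}
    weighted_sum = sum(y / propensity[g] for g, y in records if y is not None)
    weight_total = sum(1 / propensity[g] for g, y in records if y is not None)
    return weighted_sum / weight_total

# Hypothetical records: (group, y), with y = None when missing.
records = [
    ("a", 10.0), ("a", 12.0), ("a", None), ("a", None),
    ("b", 20.0), ("b", 22.0), ("b", 21.0), ("b", None),
]

complete = [y for _, y in records if y is not None]
unweighted = sum(complete) / len(complete)
weighted = ipw_mean(records)
```

Group "a" responds less often, so its observed cases receive larger weights. That pulls the estimate from the complete-case mean of 17 toward 16, the value obtained when each group's mean counts in proportion to its full size.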
Statistical Analysis with Missing Data
• What do you get when you don’t specify what you want? What
choices do you have within a given analysis procedure?
– Often, listwise deletion is the default (and only) option (SPSS
Reliability and GLM)
– Listwise default with pairwise and mean substitution as options
(SPSS Factor and Regression Analysis)
– Pairwise default with listwise option (SPSS Correlation)
• Modeling approaches that incorporate missing data handling
– Survival models
– Mixed effects models
– Structural equation models
• Missing data treatments carried out prior to analysis
– Ad hoc methods (Listwise, pairwise, single imputation, etc.)
– Modern methods (Maximum Likelihood, Multiple Imputation)
Modern Missing Data Treatments
• Maximum likelihood (ML)
– Estimates summary statistics or statistical models using all available data
– Available in modern structural equation modeling software (Amos, EQS,
LISREL, Mplus, Mx, etc.)
– The ML covariance matrix and mean vector can also be obtained from
SPSS MVA, and used for standard Regression, Factor analysis, Reliability,
and other procedures
– There are also freeware and open source programs that can produce the
ML covariance matrix and mean vector, usually by using the Expectation
Maximization (EM) algorithm (e.g. EMCOV)
• Multiple imputation
– Imputes individual data values in multiple complete datasets, then pools
(averages) the results of the statistical analyses across these datasets
– Available in the current versions of certain SEM software (Amos, Mplus).
– Also available in SPSS (MVA), SAS (PROC MI and PROC MIANALYZE), Stata (mi
impute and mi estimate), and stand-alone missing data packages such as
SOLAS
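The pooling step of multiple imputation can be sketched with the combining rules commonly known as Rubin's rules: the point estimates are averaged, and the pooled variance adds the between-imputation spread to the average within-imputation variance. The function name and the example numbers below are hypothetical; the packages listed above perform this step for you.

```python
import math
import statistics

def pool(estimates, standard_errors):
    """Combine one parameter's results from m imputed datasets (Rubin's rules)."""
    m = len(estimates)
    pooled_estimate = statistics.fmean(estimates)
    within = statistics.fmean(se ** 2 for se in standard_errors)  # mean sampling variance
    between = statistics.variance(estimates)                      # spread across imputations
    total = within + (1 + 1 / m) * between
    return pooled_estimate, math.sqrt(total)

# Hypothetical regression coefficient estimated in m = 5 imputed datasets.
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
standard_errors = [0.10, 0.10, 0.10, 0.10, 0.10]

pooled_estimate, pooled_se = pool(estimates, standard_errors)
```

The pooled standard error is larger than any single-dataset standard error because it also reflects uncertainty about the imputed values. Single imputation methods ignore that component.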
Why do social scientists use modern
missing data treatments so infrequently?
• Lack of awareness or familiarity
• They are not convinced of the problems with older
methods
• The statistical literature on missing data is technically
daunting
• The techniques aren’t incorporated into the standard
statistical analysis procedures used by social
scientists
• Journal reviewers and editors have not required it