(Article Plus material for the JAACAP web site)

Croy C, Novins D. Methods for Addressing Missing Data in Psychiatric and
Developmental Research. J Am Acad Child Adolesc Psychiatry
RESOURCE APPENDIX
Reference Materials – General
For an introduction to methods for handling missing data, we recommend the book
Missing Data by Paul Allison (2002). For a more detailed treatment see “Missing Data:
Our View of the State of the Art” by Schafer and Graham (2002). Readers wanting a
comprehensive review should consult Statistical Analysis with Missing Data (2nd ed.) by
Little and Rubin (2002).
Software
Calculation of Means and Covariance Matrix Using EM
S-Plus 6.0 http://www.insightful.com/products/splus/default.asp
SPSS For Windows 11.5 and higher, MVA Package
http://www.spss.com/missing_value/
SAS/STAT For Windows 8.2 and higher, Proc MI
http://support.sas.com/rnd/app/da/new/dami.html
EM-based Imputation and Conditional Mean Substitution
SPSS For Windows 11.5 and higher, MVA Package
http://www.spss.com/missing_value/
SAS/STAT 9.1 and higher, Proc MI
http://support.sas.com/91doc/docMainpage.jsp
(click the refresh button on the browser if the contents of the navigation window don’t appear)
Estimation of Linear Models Using Direct (Full Information) Maximum Likelihood
SAS (stand alone commercial software)
www.sas.com/technologies/analytics/statistics/stat/index.html
(look for SAS/STAT product, Proc Mixed Procedure)
SPSS (stand alone commercial software)
www.spss.com/advanced_models/brochures.htm
(download PDF spec sheet, look for Linear Mixed Models Procedure)
Amos (stand alone or module for SPSS) www.spss.com/amos/index.htm
Mplus (stand alone commercial software) www.statmodel.com
LISREL (stand alone commercial software)
http://www.ssicentral.com/lisrel/mainlis.htm
Mx (free stand-alone software for matrix algebra and numerical optimization)
http://www.vcu.edu/mx/
Multiple Imputation
Two Internet sites provide accessible explanations, citations for further
reading, links to some source texts, and free software:
http://www.stat.psu.edu/~jls/misoftwa.html
(the link for Frequently Asked Questions is especially helpful)
http://www.multiple-imputation.com/
(the link for Literature is especially helpful)
Horton and Lipsitz (2001) provide a detailed review of the following software for multiple
imputation:
SOLAS 3.0 www.statsol.ie/solas/solas.htm
(note that the propensity score method in SOLAS is invalid for many applications.
See Allison (2000) or Schafer and Graham (2002))
SAS Version 8.2 (Procs MI and MIANALYZE)
http://support.sas.com/rnd/app/da/new/dami.html
Missing Data Library for S-Plus 6.0 www.insightful.com
MICE www.multiple-imputation.com (free software)
Additional free software for multiple imputation:
At http://www.stat.psu.edu/~jls/misoftwa.html
NORM (S-Plus library or stand-alone Windows version for continuous normal
data)
CAT (S-Plus library for categorical data under log linear model)
MIX (S-Plus library for mixed continuous and categorical data)
PAN (S-Plus library for panel or clustered data)
At http://gking.harvard.edu/stats.shtml
Amelia (stand-alone program based on King et al.’s (2001) alternative algorithm)
At http://www.isr.umich.edu/src/smp/ive/
IVEware (stand-alone and SAS programs for imputing data of diverse types and
calculating descriptive statistics including model coefficients. Uses sequential
regression (Raghunathan et al., 2001)).
Stata users can download user-written package st0067_1 for multiple
imputation. Royston (2005) describes this software and provides examples of
usage. To download this software Stata users should open Stata, open the Help
menu, click SJ and User-written Programs, then click Search, and enter the word
“imputation”. Click on Package st0067_1 in the results from the search to
display a link to install the software.
TECHNICAL APPENDIX
Issues Relating to Whether Missing Data are Missing at Random and Informative Dropouts in Longitudinal Studies
Psychiatric and developmental researchers may contemplate using imputation
algorithms in which the missing values for each study participant are predicted from the
observed values for that person. From our perspective, most of the algorithms in
imputation software that are easy for psychiatric and developmental researchers to
access and use, and that are the focus of recent statistical research, are of this type.
These algorithms are based on the assumption that the data are Missing at Random.
Indeed “MAR [Missing at Random] is the formal assumption that allows us to first
estimate the relationships among variables from the observed data, and then use these
relationships to obtain unbiased predictions of the missing values from the observed
values” (Schafer and Olsen, 1998, p. 552). Some researchers have gone so far as to
say that “when the assumption of ignorable missing data is not met [i.e. when the data
are not Missing at Random], imputation is usually not appropriate” (McCleary, 2002, p.
340, citing Rubin (1987)). Whether this advice is warranted or is an overgeneralization
depends on the specific algorithms being considered.
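The role of the Missing at Random assumption can be illustrated with a small simulation. The Python sketch below is an illustration only: the variable names, sample size, coefficients, and missingness probabilities are assumptions chosen for the example and do not come from the sources cited. It deletes values of y under three mechanisms (completely at random, depending only on the observed x, and depending on y itself) and compares the complete-case mean of y with an estimate that predicts y from x using the relationship estimated in the observed data.

```python
import random
import statistics

random.seed(2024)
N = 20000

# x is always observed; y depends on x plus noise, so the true mean of y is 0.
x = [random.gauss(0, 1) for _ in range(N)]
y = [0.7 * xi + random.gauss(0, 1) for xi in x]


def is_observed(mechanism, xi, yi):
    """Decide whether y is observed for one case under the named mechanism."""
    if mechanism == "MCAR":      # missingness unrelated to the data
        p_missing = 0.4
    elif mechanism == "MAR":     # missingness depends only on observed x
        p_missing = 0.7 if xi > 0 else 0.1
    else:                        # "MNAR": missingness depends on y itself
        p_missing = 0.7 if yi > 0 else 0.1
    return random.random() >= p_missing


def regression_mean(xs_obs, ys_obs, x_all):
    """Fit y on x in the observed cases, then predict y at the mean of all x."""
    mx, my = statistics.fmean(xs_obs), statistics.fmean(ys_obs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs_obs, ys_obs))
    sxx = sum((a - mx) ** 2 for a in xs_obs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept + slope * statistics.fmean(x_all)


complete_case_mean = {}
regression_based_mean = {}
for mechanism in ("MCAR", "MAR", "MNAR"):
    kept = [(xi, yi) for xi, yi in zip(x, y) if is_observed(mechanism, xi, yi)]
    xs_obs = [pair[0] for pair in kept]
    ys_obs = [pair[1] for pair in kept]
    complete_case_mean[mechanism] = statistics.fmean(ys_obs)
    regression_based_mean[mechanism] = regression_mean(xs_obs, ys_obs, x)
    print(mechanism,
          "complete-case mean:", round(complete_case_mean[mechanism], 3),
          "regression-based mean:", round(regression_based_mean[mechanism], 3))
```

In this setup the regression-based estimate should recover the true mean when missingness depends only on x, while the complete-case mean should not. When missingness depends on y itself, both estimates should remain biased, because the information needed to correct them was never observed.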
Other researchers may be more comfortable estimating replacement values for a
person from his/her observed data though they know the true values are related to data
they have not collected. When such researchers impute using algorithms that are
based on the Missing at Random assumption, they cannot take advantage of
information that may be critical to obtaining accurate estimated values. Consequently
any derived statistics may be biased by an unknown amount. Therefore, researchers
concerned that their missing data may not be Missing at Random should note the
following points.
1. Whether imputation with algorithms that assume data are Missing at
Random is acceptable when the assumption may be violated to a minor degree is
controversial. Schafer and Olsen (1998) report that “In the vast majority of studies,
principled methods that assume MAR [Missing at Random] will tend to perform better
than ad hoc procedures such as listwise deletion or imputation of means” (p. 553).
Collins et al. (2001) demonstrate that erroneous assumptions of Missing at Random
may have only minor impact on estimates and standard errors. Additionally, Schafer
and Graham (2002) suggest that standard maximum likelihood approaches may be
useful because “in many psychological research settings the departures from MAR
[Missing at Random] are probably not serious” (p. 154). Schafer and Graham (2002)
also note that when the missing data were never intended to be collected (e.g. cohort-sequential designs for longitudinal studies and use of multiple questionnaire forms
containing different subsets of items) the missing values are either Missing Completely
at Random or Missing at Random.
2. Informative drop-out in longitudinal (repeated measures) studies
is a special case in which the data are likely to violate the Missing at Random
assumption. Advanced methods for dealing with missing data that are not Missing at
Random are as appropriate for this situation as they are for cross-sectional studies.
Most of these methods are either selection models or pattern mixture models
(Fairclough, 2004; Little and Rubin, 2002; Schafer and Graham, 2002). In selection
models one first specifies a frequency distribution for the complete data, and then one
models how the probability of drop-out depends on the data (Schafer and Graham,
2002). In pattern mixture models, the distribution of a variable’s values is estimated as
a mixture of its distributions in sets of observations grouped according to their missing
data pattern (Fairclough, 2004). Verbeke and Molenberghs (2000) and Little (1995)
have reviewed both selection models and pattern mixture models for longitudinal
studies with drop-out. The following references may also be of interest: Diggle and
Kenward (1994), Fitzmaurice et al. (1995), Follmann and Wu (1995), Hogan et al.
(2004), Jansen et al. (2003), Lin et al. (2004), Streiner (2002), Stubbendick and Ibrahim
(2003), Ten Have et al. (1998), and Wu and Bailey (1989).
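In symbols, the two families differ in how they factor the joint distribution of the complete data and the drop-out pattern. The notation below is introduced here for illustration only (Y for the complete data, R for the missing-data or drop-out pattern, and Greek letters for parameters) and follows the general form used in the missing-data literature. It is a sketch, not a full model specification.

```latex
% Selection model: a distribution for the complete data, then a model for
% how the probability of the drop-out pattern depends on the data.
f(Y, R \mid \theta, \psi) = f(Y \mid \theta)\, f(R \mid Y, \psi)

% Pattern mixture model: a distribution for the data within each
% missing-data pattern, then a model for how often each pattern occurs.
f(Y, R \mid \phi, \pi) = f(Y \mid R, \phi)\, f(R \mid \pi)

% Under the pattern mixture factorization, the overall distribution of Y
% is a mixture over patterns r, weighted by the pattern probabilities.
f(Y \mid \phi, \pi) = \sum_{r} \pi_r \, f(Y \mid R = r, \phi_r)
```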
Issues Relating to Imputed Values from SPSS MVA Package Using EM
von Hippel (2004) has pointed out that the imputed values from the SPSS
Missing Value Analysis package (MVA, 2004) using EM do not show sufficient variation
because the regression equations used with the SPSS EM algorithm do not add random
variance to the predicted variables. The SPSS EM algorithm compensates for this
reduced variance of predicted values when calculating the final printed covariance
matrix. However, since the special compensation adjustments are not built into SPSS
Base software or other analysis software, von Hippel concludes that “it is inadvisable to
use the EM-imputed data outside the EM module” (p. 163).
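The underlying problem, predicted values that lack random variation, can be seen in a small simulation. The Python sketch below does not reproduce the SPSS algorithm. As an assumption made for illustration, a single-predictor regression stands in for the regression equations, and all names, coefficients, and the 40% missingness rate are hypothetical. It compares the variance of a variable after filling missing values with predicted values alone and after adding a random residual to each prediction.

```python
import math
import random
import statistics

random.seed(7)
N = 5000

x = [random.gauss(0, 1) for _ in range(N)]
y_true = [0.5 * xi + random.gauss(0, 1) for xi in x]
# About 40% of y is set missing completely at random.
missing = [random.random() < 0.4 for _ in range(N)]

# Fit y on x using only the cases where y is observed.
obs = [(xi, yi) for xi, yi, m in zip(x, y_true, missing) if not m]
mx = statistics.fmean(p[0] for p in obs)
my = statistics.fmean(p[1] for p in obs)
slope = (sum((xi - mx) * (yi - my) for xi, yi in obs)
         / sum((xi - mx) ** 2 for xi, _ in obs))
intercept = my - slope * mx
residual_sd = math.sqrt(
    sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in obs) / (len(obs) - 2)
)

# Deterministic imputation: the predicted value only.
y_deterministic = [intercept + slope * xi if m else yi
                   for xi, yi, m in zip(x, y_true, missing)]
# Stochastic imputation: predicted value plus a random residual.
y_stochastic = [intercept + slope * xi + random.gauss(0, residual_sd) if m else yi
                for xi, yi, m in zip(x, y_true, missing)]

var_true = statistics.variance(y_true)
var_deterministic = statistics.variance(y_deterministic)
var_stochastic = statistics.variance(y_stochastic)
print("variance of complete data:      ", round(var_true, 3))
print("variance, deterministic filling:", round(var_deterministic, 3))
print("variance, stochastic filling:   ", round(var_stochastic, 3))
```

The deterministic version should show a clearly smaller variance than the complete data, which is why analyses run on such a filled-in data set without a compensating adjustment can be misleading. Note that this sketch treats the regression coefficients as known; proper multiple imputation also reflects uncertainty in those coefficients.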
Using Multiple Imputation Methods For Continuous Data When Data are Not Normal or
Continuous
Many researchers may need to impute continuous data, and some research
suggests that software intended for multivariate normal data will often work fine even if
the continuous data have a substantially nonnormal distribution (Graham and Schafer,
1999). Allison (2002) notes: “… multiple imputation under the multivariate normal
model is reasonably straightforward under a wide variety of data types and missing data
patterns. As a routine method for handling missing data, it is probably the best that is
currently available” (pp. 55-56).
Should software for imputing multivariate normal data ever be used to impute
unordered (nominal) categorical data? This is a topic of active discussion and evolution
of opinion. Sinharay et al. (2001, p. 321) cite Schafer (1997) as providing evidence
that “the multivariate normal model gives quite acceptable results even when the
variables are binary or categorical…and the imputed values [are] rounded off to the
nearest category.” Similarly, Allison (2002) says that methods for multiply imputing
categorical data and mixtures of categorical and continuous data are “typically much
more difficult to use and often break down completely” (p. 39) and that “many users will
do just as well by applying the normal methods [i.e. methods assuming a normal
distribution] with some minor alterations” (p. 39). He then shows ways to round the
results from multiple imputation under the normal model to impute dichotomous
variables and variables with multiple categories coded as dummy (0/1) variables.
Schafer and Graham (2002) take a markedly different view, indicating “there are
situations in which the normal model should be avoided – to impute variables that are
nominal (unordered categories), for example” (p. 168). Horton et al. (2003) show that
rounding values imputed under the normal model to achieve discrete values can yield
biased estimates and recommend against such rounding.
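A simplified calculation shows how rounding can introduce bias. The Python sketch below is not a reproduction of the analysis in Horton et al. (2003). It assumes, for illustration, that every value of a 0/1 variable with true proportion p is imputed from a normal distribution with the matching mean p and variance p(1 - p), and is then rounded to the nearer of 0 and 1. The function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_rounded_proportion(p):
    """Expected share of 1s after rounding normal draws with mean p and
    variance p(1 - p) to the nearer of 0 and 1 (cut point at 0.5)."""
    sd = math.sqrt(p * (1.0 - p))
    return 1.0 - normal_cdf((0.5 - p) / sd)


for p in (0.05, 0.2, 0.5):
    rounded = expected_rounded_proportion(p)
    print(f"true proportion {p:.2f} -> proportion after rounding {rounded:.4f}")
```

Under these assumptions the rounded proportion equals the true proportion only at p = 0.5. At p = 0.2 it is too high and at p = 0.05 it is too low, so the direction of the distortion depends on the proportion being estimated.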
Before deciding to impute categorical data with algorithms intended for
continuous data, researchers should examine the software available at these sites:
http://www.stat.psu.edu/~jls/misoftwa.html and http://www.isr.umich.edu/src/smp/ive/ .
Methods For Preserving Interactions in Multiple Imputation
Special steps can be taken to impute in a way that takes interactions into
account. However, the researcher must know prior to imputation which interactions will
be tested later in models. When the effect of a variable on a dependent variable is
thought to vary across the levels of a second variable (e.g. gender), Allison (2002) and
Schafer and Graham (2002) recommend splitting the data into groups corresponding to
the levels of the second variable. The researcher then runs separate imputations on the
groups (e.g. impute the missing data on the male and female cases separately and then
combine the male and female cases back into a single dataset). A less preferred
method is to create an interaction variable by multiplying together the variables that
interact and to impute the missing values for this product variable along with the
other variables (Allison, 2002). Allison (2002) likewise recommends creating variables
that are squares or other powers of variables and imputing them with the other data
rather than merely squaring values prior to analysis.
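The effect of imputing groups separately can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure of any particular package: the data are simulated, the group labels and slopes are hypothetical, and a single stochastic regression imputation stands in for a full multiple imputation routine. The effect of x on y differs by group, and the sketch compares the group-specific slopes after imputing within each group and after imputing the pooled data with one model that ignores the interaction.

```python
import math
import random
import statistics

random.seed(11)


def simulate(group, slope, n):
    """Simulate one group; about 30% of y is missing completely at random."""
    rows = []
    for _ in range(n):
        x = random.gauss(0, 1)
        y = slope * x + random.gauss(0, 0.5)
        if random.random() < 0.3:
            y = None
        rows.append({"group": group, "x": x, "y": y})
    return rows


data = simulate("female", 1.0, 2000) + simulate("male", -0.5, 2000)


def fit_line(rows):
    """Regress y on x using complete cases; return intercept, slope, residual SD."""
    obs = [r for r in rows if r["y"] is not None]
    mx = statistics.fmean(r["x"] for r in obs)
    my = statistics.fmean(r["y"] for r in obs)
    sxx = sum((r["x"] - mx) ** 2 for r in obs)
    sxy = sum((r["x"] - mx) * (r["y"] - my) for r in obs)
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((r["y"] - (intercept + slope * r["x"])) ** 2 for r in obs)
    return intercept, slope, math.sqrt(sse / (len(obs) - 2))


def impute(rows):
    """Return copies of rows with missing y filled by prediction plus noise."""
    intercept, slope, sd = fit_line(rows)
    filled = []
    for r in rows:
        y = r["y"]
        if y is None:
            y = intercept + slope * r["x"] + random.gauss(0, sd)
        filled.append({**r, "y": y})
    return filled


# Split by group, impute within each group, then recombine.
by_group = {}
for r in data:
    by_group.setdefault(r["group"], []).append(r)
imputed_separately = [r for rows in by_group.values() for r in impute(rows)]

# For comparison: one imputation model for everyone, ignoring the interaction.
imputed_pooled = impute(data)


def group_slope(rows, group):
    return fit_line([r for r in rows if r["group"] == group])[1]


slopes_separate = {g: group_slope(imputed_separately, g) for g in by_group}
slopes_pooled = {g: group_slope(imputed_pooled, g) for g in by_group}
print("imputed within groups:", slopes_separate)
print("imputed pooled:       ", slopes_pooled)
```

With separate imputations the two slopes should stay close to the values used to generate the data. With the pooled model the imputed values follow a single common slope, so both group slopes should be pulled toward each other and the interaction is understated.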
Which Variables Should Be Used in Multiple Imputation?
Researchers should include a variety of variables in the imputation process. The
imputation process should include all the variables that will be used in later models.
When variables are used in models but not used in imputation, the model parameters
for those variables will be biased (Allison, 2002; Sinharay et al., 2001). Furthermore,
the imputation process should include variables that predict whether the values of other
variables may be missing, as well as variables that are correlated with the variables
having the most missing data. This will increase the chances that the data are Missing
at Random (an assumption for many multiple imputation procedures) and help reduce
standard errors, thereby increasing statistical power (Collins et al., 2001; Sinharay et al.,
2001).
How Many Imputations Should be Used in Multiple Imputation?
Rubin (1996) says “as few as five multiple imputations (or even three in some
cases) is adequate under each model for nonresponse” (p. 480). Allison (2003) says
that five imputed data sets are widely regarded as sufficient for small to moderate
amounts of missing data, but says “achieving optimal confidence intervals and
hypothesis tests may require substantially more imputations” (p. 553). Schafer and
Olsen (1998) show that with 30% missing information, 3 imputations yield
standard errors that are 91% efficient, 5 imputations yield standard errors that are 94%
efficient, and 10 imputations yield standard errors that are 97% efficient. A statistic that
is 100% efficient has a sampling variance that is at least as small as that of any other
estimator (Allison, 2003).
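The percentages quoted above agree with the approximation usually attributed to Rubin (1987), in which the efficiency of an estimate based on m imputations, relative to an infinite number of imputations, is 1 / (1 + γ/m), where γ is the fraction of missing information. The short Python sketch below evaluates that formula; the function name is ours and is used only for illustration.

```python
def relative_efficiency(gamma, m):
    """Approximate efficiency of an estimate based on m imputations when the
    fraction of missing information is gamma: 1 / (1 + gamma / m)."""
    return 1.0 / (1.0 + gamma / m)


efficiencies = {m: relative_efficiency(0.30, m) for m in (3, 5, 10)}
for m, eff in efficiencies.items():
    print(f"m = {m:2d}: {100 * eff:.0f}% efficient")
# m =  3: 91% efficient
# m =  5: 94% efficient
# m = 10: 97% efficient
```

Because the gain from each additional imputation shrinks quickly, the formula also shows why a small number of imputations is often described as adequate when the fraction of missing information is modest.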
REFERENCES
Allison PD (2000), Multiple imputation for missing data: a cautionary tale. Sociol Methods
Res 28: 301-309
Allison PD (2002), Missing Data. Thousand Oaks, CA: Sage Publications, Inc.
Allison PD (2003), Missing data techniques for structural equation modeling. J Abnorm
Psychol 112: 545-557
Collins LM, Schafer JL, Kam C (2001), A comparison of inclusive and restrictive
strategies in modern missing data procedures. Psychol Methods 6: 330-351
Diggle P, Kenward MG (1994), Informative drop-out in longitudinal data analysis. Appl
Stat 43: 49-93
Fairclough DL (2004), Patient reported outcomes as endpoints in medical research. Stat
Methods Med Res 13: 115-138
Fitzmaurice GM, Molenberghs G, Lipsitz SR (1995), Regression models for longitudinal
binary responses with informative drop-outs. J R Stat Soc Series B 57: 691-704
Follmann D, Wu M (1995), An approximate generalized linear model with random
effects for informative missing data. Biometrics 51: 151-168
Graham JW, Schafer JL (1999), On the performance of multiple imputation for
multivariate data with small sample size. In: Statistical Strategies for Small
Sample Research, Hoyle R, ed. Thousand Oaks, CA: Sage Publications, Inc., pp
1-29
Hogan JW, Roy J, Korkontzelou C (2004), Tutorial in biostatistics. Handling drop-out in
longitudinal studies. Stat Med 23: 1455-1497
Horton NJ, Lipsitz SR (2001), Multiple imputation in practice: comparison of software
packages for regression models with missing variables. Am Stat 55: 244-254
Horton NJ, Lipsitz SR, Parzen M (2003), A potential for bias when rounding in multiple
imputation. Am Stat 57: 229-232
Jansen I, Molenberghs G, Aerts M, Thijs H, Van Steen K (2003), A local influence
approach applied to binary data from a psychiatric study. Biometrics 59: 410-419
King G, Honaker J, Joseph A, Scheve K (2001), Analyzing incomplete political science
data: an alternative algorithm for multiple imputation. Am Polit Sci Rev 95: 49-69
Lin H, McCulloch CE, Rosenheck RA (2004), Latent pattern mixture models for
informative intermittent missing data in longitudinal studies. Biometrics 60: 295-305
Little RJA (1995), Modeling the dropout mechanism in repeated-measures studies. J
Am Stat Assoc 90: 1112-1121
Little RJA, Rubin DB (2002), Statistical Analysis with Missing Data. 2nd edition
Hoboken, New Jersey: John Wiley & Sons
McCleary L (2002), Using multiple imputation for analysis of incomplete data in clinical
research. Nurs Res 51: 339-343
MVA (2004), SPSS Missing Value Analysis 13.0 For Windows. Chicago, IL: SPSS, Inc.
Raghunathan TE, Lepkowski JM, van Hoewyk M, Solenberger PW (2001), A
multivariate technique for multiply imputing missing values using a sequence of
regression models. Survey Methodol 27: 85-95
Royston P (2005), Multiple imputation of missing values: update. The Stata Journal 5:
188-201
Rubin DB (1987), Multiple Imputation for Nonresponse in Surveys. New York: John
Wiley & Sons
Rubin DB (1996), Multiple imputation after 18+ years. J Am Stat Assoc 91: 473-489
Schafer JL (1997), Analysis of Incomplete Multivariate Data. London: Chapman & Hall
Schafer JL, Graham JW (2002), Missing data: our view of the state of the art. Psychol
Methods 7: 147-177
Schafer JL, Olsen MK (1998), Multiple imputation for multivariate missing-data
problems: a data analyst's perspective. Multivariate Behav Res 33: 545-571
Sinharay S, Stern HS, Russell D (2001), The use of multiple imputation for the analysis
of missing data. Psychol Methods 6: 317-329
Streiner DL (2002), The case of the missing data: methods of dealing with dropouts and
other research vagaries. Can J Psychiatry 47: 68-75
Stubbendick AL, Ibrahim JG (2003), Maximum likelihood methods for nonignorable
missing responses and covariates in random effects models. Biometrics 59:
1140-1150
Ten Have TR, Kunselman AR, Pulkstenis EP, Landis JR (1998), Mixed effects logistic
regression models for longitudinal binary response data with informative dropout. Biometrics 54: 367-383
Verbeke G, Molenberghs G (2000), Linear Mixed Models for Longitudinal Data. New
York: Springer-Verlag
von Hippel PT (2004), Biases in SPSS 12.0 Missing Value Analysis. Am Stat 58: 160-164
Wu MC, Bailey KR (1989), Estimation and comparison of changes in the presence of
informative right censoring: conditional linear model. Biometrics 45: 939-955