Croy C, Novins D. Methods for Addressing Missing Data in Psychiatric and Developmental Research. J Am Acad Child Adolesc Psychiatry

RESOURCE APPENDIX

Reference Materials – General

For an introduction to methods for handling missing data, we recommend the book Missing Data by Paul Allison (2002). For a more detailed treatment see “Missing Data: Our View of the State of the Art” by Schafer and Graham (2002). Readers wanting a comprehensive review should consult Statistical Analysis with Missing Data (2nd ed.) by Little and Rubin (2002).

Software

Calculation of Means and Covariance Matrix Using EM

S-Plus 6.0 http://www.insightful.com/products/splus/default.asp
SPSS For Windows 11.5 and higher, MVA Package http://www.spss.com/missing_value/
SAS/STAT For Windows 8.2 and higher, Proc MI http://support.sas.com/rnd/app/da/new/dami.html

EM-based Imputation and Conditional Mean Substitution

SPSS For Windows 11.5 and higher, MVA Package http://www.spss.com/missing_value/
SAS/STAT 9.1 and higher, Proc MI http://support.sas.com/91doc/docMainpage.jsp (click the refresh button on the browser if the contents of the navigation window don’t appear)

Estimation of Linear Models Using Direct (Full Information) Maximum Likelihood

SAS (stand alone commercial software) www.sas.com/technologies/analytics/statistics/stat/index.html (look for SAS/STAT product, Proc Mixed Procedure)
SPSS (stand alone commercial software) www.spss.com/advanced_models/brochures.htm (download PDF spec sheet, look for Linear Mixed Models Procedure)
Amos (stand alone or module for SPSS) www.spss.com/amos/index.htm
Mplus (stand alone commercial software) www.statmodel.com
LISREL (stand alone commercial software) http://www.ssicentral.com/lisrel/mainlis.htm
Mx (free stand-alone software for matrix algebra and numerical optimization) http://www.vcu.edu/mx/

Multiple Imputation

Two Internet sites provide many easy-to-digest explanations, citations for further reading, some source text available via links, and free software:
http://www.stat.psu.edu/~jls/misoftwa.html (the link for Frequently Asked Questions is especially helpful)
http://www.multiple-imputation.com/ (the link for Literature is especially helpful)

Horton and Lipsitz (2001) provide a detailed review of the following software for multiple imputation:

SOLAS 3.0 www.statsol.ie/solas/solas.htm (note that the propensity score method in SOLAS is invalid for many applications; see Allison (2000) or Schafer and Graham (2002))
SAS Version 8.2 (Procs MI and MIANALYZE) http://support.sas.com/rnd/app/da/new/dami.html
Missing Data Library for S-Plus 6.0 www.insightful.com
MICE www.multiple-imputation.com (free software)

Additional free software for multiple imputation:

At http://www.stat.psu.edu/~jls/misoftwa.html
NORM (S-Plus library or stand-alone Windows version for continuous normal data)
CAT (S-Plus library for categorical data under log linear model)
MIX (S-Plus library for mixed continuous and categorical data)
PAN (S-Plus library for panel or clustered data)

At http://gking.harvard.edu/stats.shtml
Amelia (stand-alone program based on King et al.’s (2001) alternative algorithm)

At http://www.isr.umich.edu/src/smp/ive/
IVEware (stand-alone and SAS programs for imputing data of diverse types and calculating descriptive statistics including model coefficients; uses sequential regression (Raghunathan et al., 2001))

Stata users can download the user-written package st0067_1 for multiple imputation. Royston (2005) describes this software and provides examples of usage. To download this software, Stata users should open Stata, open the Help menu, click SJ and User-written Programs, then click Search, and enter the word “imputation”. Clicking on Package st0067_1 in the search results displays a link to install the software.

TECHNICAL APPENDIX

Issues Relating to Whether Missing Data are Missing at Random and Informative Dropouts in Longitudinal Studies

Psychiatric and developmental researchers may contemplate using imputation algorithms in which the missing values for each study participant are predicted from the observed values for that person. From our perspective, most of the algorithms in imputation software that are both easy for psychiatric and developmental researchers to access and use and that are the focus of recent statistical research are of this type. These algorithms are based on the assumption that the data are Missing at Random. Indeed, “MAR [Missing at Random] is the formal assumption that allows us to first estimate the relationships among variables from the observed data, and then use these relationships to obtain unbiased predictions of the missing values from the observed values” (Schafer and Olsen, 1998, p. 552). Some researchers have gone so far as to say that “when the assumption of ignorable missing data is not met [i.e. when the data are not Missing at Random], imputation is usually not appropriate” (McCleary, 2002, p. 340, citing Rubin, 1987). Whether this advice is warranted or is an overgeneralization depends on the specific algorithms being considered.

Other researchers may be more comfortable estimating replacement values for a person from his/her observed data even though they know the true values are related to data they have not collected. When such researchers impute using algorithms that are based on the Missing at Random assumption, they cannot take advantage of information that may be critical to obtaining accurate estimated values. Consequently, any derived statistics may be biased by an unknown amount. Therefore, researchers concerned that their missing data may not be Missing at Random should note the following points.

1.
Whether imputation with algorithms that assume data are Missing at Random is acceptable when the assumption may be violated to a minor degree is controversial. Schafer and Olsen (1998) report that “In the vast majority of studies, principled methods that assume MAR [Missing at Random] will tend to perform better than ad hoc procedures such as listwise deletion or imputation of means” (p. 553). Collins et al. (2001) demonstrate that erroneous assumptions of Missing at Random may have only minor impact on estimates and standard errors. Additionally, Schafer and Graham (2002) suggest that standard maximum likelihood approaches may be useful because “in many psychological research settings the departures from MAR [Missing at Random] are probably not serious” (p. 154). Schafer and Graham (2002) also note that when the missing data were never intended to be collected (e.g. cohort-sequential designs for longitudinal studies and use of multiple questionnaire forms containing different subsets of items) the missing values are either Missing Completely at Random or Missing at Random.

2. Having informative drop-out in longitudinal (repeated measures) studies is a special case of when data are likely to violate the Missing at Random assumption. Advanced methods for dealing with missing data that are not Missing at Random are as appropriate for this situation as they are for cross-sectional studies. Most of these methods are either selection models or pattern mixture models (Fairclough, 2004; Little and Rubin, 2002; Schafer and Graham, 2002). In selection models one first specifies a frequency distribution for the complete data, and then one models how the probability of drop-out depends on the data (Schafer and Graham, 2002). In pattern mixture models, the distribution of a variable’s values is estimated as a mixture of its distributions in sets of observations grouped according to their missing data pattern (Fairclough, 2004).
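
The pattern mixture idea can be sketched numerically. In the minimal Python sketch below, the marginal mean of a variable is computed as a weighted average of pattern-specific means, with weights equal to the share of participants in each missing data pattern. The function name and all numbers are hypothetical. In practice the mean for a drop-out pattern cannot be estimated from the observed data alone and has to be supplied through an identifying assumption, so this is an illustration of the mixture step only, not a full pattern mixture analysis.

```python
from typing import Sequence


def pattern_mixture_mean(pattern_means: Sequence[float],
                         pattern_counts: Sequence[int]) -> float:
    """Weighted average of pattern-specific means.

    Each weight is the proportion of participants that fall into the
    corresponding missing data pattern (hypothetical helper).
    """
    if len(pattern_means) != len(pattern_counts):
        raise ValueError("need one count per pattern mean")
    total = sum(pattern_counts)
    if total <= 0:
        raise ValueError("pattern counts must sum to a positive number")
    return sum(mean * count / total
               for mean, count in zip(pattern_means, pattern_counts))


# Hypothetical example: 60 completers with mean 10.0 and 40 drop-outs
# whose mean of 14.0 is ASSUMED (it is not identified by observed data).
overall_mean = pattern_mixture_mean([10.0, 14.0], [60, 40])
print(f"Marginal mean under the assumed drop-out mean: {overall_mean:.1f}")
```

Changing the assumed drop-out mean and recomputing is a simple way to see how sensitive the marginal estimate is to what is assumed about the unobserved values.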
Verbeke and Molenberghs (2000) and Little (1995) have reviewed both selection models and pattern mixture models for longitudinal studies with drop-out. The following references may also be of interest: Diggle and Kenward (1994), Fitzmaurice et al. (1995), Follmann and Wu (1995), Hogan et al. (2004), Jansen et al. (2003), Lin et al. (2004), Streiner (2002), Stubbendick and Ibrahim (2003), Ten Have et al. (1998), and Wu and Bailey (1989).

Issues Relating to Imputed Values from SPSS MVA Package Using EM

von Hippel (2004) has pointed out that the imputed values from the SPSS Missing Value Analysis package (MVA, 2004) using EM do not show sufficient variation because the regression equations used with the SPSS EM algorithm do not add random variance to the predicted variables. The SPSS EM algorithm compensates for this reduced variance of predicted values when calculating the final printed covariance matrix. However, since the special compensation adjustments are not built into SPSS Base software or other analysis software, von Hippel concludes that “it is inadvisable to use the EM-imputed data outside the EM module” (p. 163).

Using Multiple Imputation Methods For Continuous Data When Data are Not Normal or Continuous

Many researchers may need to impute continuous data, and some research suggests that software intended for multivariate normal data will often work fine even if the continuous data have a substantially nonnormal distribution (Graham and Schafer, 1999). Allison (2002) notes: “… multiple imputation under the multivariate normal model is reasonably straightforward under a wide variety of data types and missing data patterns. As a routine method for handling missing data, it is probably the best that is currently available” (pp. 55-56).

Should software for imputing multivariate normal data ever be used to impute unordered (nominal) categorical data? This is a topic of active discussion and evolution of opinion. Sinharay et al. (2001, p.
321) cite Schafer (1997) as providing evidence that “the multivariate normal model gives quite acceptable results even when the variables are binary or categorical…and the imputed values [are] rounded off to the nearest category.” Similarly, Allison (2002) says that methods for multiply imputing categorical data and mixtures of categorical and continuous data are “typically much more difficult to use and often break down completely” (p. 39) and that “many users will do just as well by applying the normal methods [i.e. methods assuming a normal distribution] with some minor alterations” (p. 39). He then shows ways to round the results from multiple imputation under the normal model to impute dichotomous variables and variables with multiple categories coded as dummy (0/1) variables. Schafer and Graham (2002) take a markedly different view, indicating “there are situations in which the normal model should be avoided – to impute variables that are nominal (unordered categories), for example” (p. 168). Horton et al. (2003) show that rounding values imputed under the normal model to achieve discrete values can yield biased estimates and recommend against such rounding. Before deciding to impute categorical data with algorithms intended for continuous data, researchers should examine the software available at these sites: http://www.stat.psu.edu/~jls/misoftwa.html and http://www.isr.umich.edu/src/smp/ive/.

Methods For Preserving Interactions in Multiple Imputation

Special steps can be taken to impute in a way that takes interactions into account. However, the researcher must know prior to imputation which interactions will be tested later in models. When the effect of a variable on a dependent variable is thought to vary across the levels of a second variable (e.g. gender), Allison (2002) and Schafer and Graham (2002) recommend splitting the data into groups corresponding to the levels of the second variable.
The researcher then runs separate imputations on the groups (e.g. impute the missing data on the male and female cases separately and then combine the male and female cases back into a single dataset). A less preferred method is to create an interaction indicator variable by multiplying the variables with an interaction and imputing the missing values for this interaction variable along with the other variables (Allison, 2002). Allison (2002) likewise recommends creating variables that are squares or other powers of variables and imputing them with the other data rather than merely squaring values prior to analysis.

Which Variables Should Be Used in Multiple Imputation?

Researchers should include a variety of variables in the imputation process. The imputation process should include all the variables that will be used in later models. When variables are used in models but not used in imputation, the model parameters for those variables will be biased (Allison, 2002; Sinharay et al., 2001). Furthermore, the imputation process should include variables that predict whether the values of other variables may be missing, as well as variables that are correlated with the variables having the most missing data. This will increase the chances that the data are Missing at Random (an assumption for many multiple imputation procedures) and help reduce standard errors, thereby increasing statistical power (Collins et al., 2001; Sinharay et al., 2001).

How Many Imputations Should be Used in Multiple Imputation?

Rubin (1996) says “as few as five multiple imputations (or even three in some cases) is adequate under each model for nonresponse” (p. 480). Allison (2003) says that five imputed data sets is widely regarded as sufficient for small to moderate amounts of missing data, but says “achieving optimal confidence intervals and hypothesis tests may require substantially more imputations” (p. 553).
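
A rough sense of how efficiency grows with the number of imputations comes from Rubin's (1987) approximation, in which the efficiency of m imputations relative to an infinite number is (1 + γ/m)^-1, where γ is the fraction of missing information. The Python sketch below simply evaluates that expression; the function name is hypothetical, and the approximation is stated on the variance scale.

```python
def relative_efficiency(missing_info_fraction: float, n_imputations: int) -> float:
    """Rubin's (1987) approximation (1 + gamma / m) ** -1.

    gamma is the fraction of missing information and m is the number of
    imputations; the result is relative to infinitely many imputations.
    """
    if not 0.0 <= missing_info_fraction <= 1.0:
        raise ValueError("fraction of missing information must lie in [0, 1]")
    if n_imputations < 1:
        raise ValueError("at least one imputation is required")
    return 1.0 / (1.0 + missing_info_fraction / n_imputations)


# Illustrative case: 30% missing information, increasing numbers of imputations.
for m in (3, 5, 10):
    print(f"m = {m:2d}: {relative_efficiency(0.30, m):.0%} efficient")
```

Under this approximation the gain from each additional imputation shrinks quickly, which is one reason small numbers of imputations are often treated as adequate when the fraction of missing information is modest.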
Schafer and Olsen (1998) show that with 30% missing information, 3 imputations yield standard errors that are 91% efficient, 5 imputations yield standard errors that are 94% efficient, and 10 imputations yield standard errors that are 97% efficient. A statistic that is 100% efficient has a sampling variance that is at least as small as that of any other estimator (Allison, 2003).

REFERENCES

Allison P (2000), Multiple imputation for missing data: a cautionary tale. Sociol Methods Res 28: 301-309
Allison PD (2002), Missing Data. Thousand Oaks, CA: Sage Publications, Inc.
Allison PD (2003), Missing data techniques for structural equation modeling. J Abnorm Psychol 112: 545-557
Collins LM, Schafer JL, Kam C (2001), A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychol Methods 6: 330-351
Diggle P, Kenward MG (1994), Informative drop-out in longitudinal data analysis. Appl Stat 43: 49-93
Fairclough DL (2004), Patient reported outcomes as endpoints in medical research. Stat Methods Med Res 13: 115-138
Fitzmaurice GM, Molenberghs G, Lipsitz SR (1995), Regression models for longitudinal binary responses with informative drop-outs. J R Stat Soc Series B 57: 691-704
Follmann D, Wu M (1995), An approximate generalized linear model with random effects for informative missing data. Biometrics 51: 151-168
Graham JW, Schafer JL (1999), On the performance of multiple imputation for multivariate data with small sample size. In: Statistical Strategies for Small Sample Research, Hoyle R, ed. Thousand Oaks, CA: Sage Publications, Inc., pp 1-29
Hogan JW, Roy J, Korkontzelou C (2004), Tutorial in biostatistics. Handling drop-out in longitudinal studies. Stat Med 23: 1455-1497
Horton NJ, Lipsitz SR (2001), Multiple imputation in practice: comparison of software packages for regression models with missing variables. Am Stat 55: 244-254
Horton NJ, Lipsitz SR, Parzen M (2003), A potential for bias when rounding in multiple imputation.
Am Stat 57: 229-232
Jansen I, Molenberghs G, Aerts M, Thijs H, Van Steen K (2003), A local influence approach applied to binary data from a psychiatric study. Biometrics 59: 410-419
King G, Honaker J, Joseph A, Scheve K (2001), Analyzing incomplete political science data: an alternative algorithm for multiple imputation. Am Polit Sci Rev 95: 49-69
Lin H, McCulloch CE, Rosenheck RA (2004), Latent pattern mixture models for informative intermittent missing data in longitudinal studies. Biometrics 60: 295-305
Little RJA (1995), Modeling the dropout mechanism in repeated-measures studies. J Am Stat Assoc 90: 1112-1121
Little RJA, Rubin DB (2002), Statistical Analysis with Missing Data. 2nd edition. Hoboken, New Jersey: John Wiley & Sons
McCleary L (2002), Using multiple imputation for analysis of incomplete data in clinical research. Nurs Res 51: 339-343
MVA (2004), SPSS Missing Value Analysis 13.0 For Windows. Chicago, IL: SPSS, Inc.
Raghunathan TE, Lepkowski JM, van Hoewyk M, Solenberger PW (2001), A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodol 27: 85-95
Royston P (2005), Multiple imputation of missing values: update. The Stata Journal 5: 188-201
Rubin DB (1987), Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons
Rubin DB (1996), Multiple imputation after 18+ years. J Am Stat Assoc 91: 473-489
Schafer JL (1997), Analysis of Incomplete Multivariate Data. London: Chapman & Hall
Schafer JL, Graham JW (2002), Missing data: our view of the state of the art. Psychol Methods 7: 147-177
Schafer JL, Olsen MK (1998), Multiple imputation for multivariate missing-data problems: a data analyst's perspective. Multivariate Behav Res 33: 545-571
Sinharay S, Stern HS, Russell D (2001), The use of multiple imputation for the analysis of missing data. Psychol Methods 6: 317-329
Streiner DL (2002), The case of the missing data: methods of dealing with dropouts and other research vagaries.
Can J Psychiatry 47: 68-75
Stubbendick AL, Ibrahim JG (2003), Maximum likelihood methods for nonignorable missing responses and covariates in random effects models. Biometrics 59: 1140-1150
Ten Have TR, Kunselman AR, Pulkstenis EP, Landis JR (1998), Mixed effects logistic regression models for longitudinal binary response data with informative dropout. Biometrics 54: 367-383
Verbeke G, Molenberghs G (2000), Linear Mixed Models for Longitudinal Data. New York: Springer-Verlag
von Hippel PT (2004), Biases in SPSS 12.0 Missing Value Analysis. Am Stat 58: 160-164
Wu MC, Bailey KR (1989), Estimation and comparison of changes in the presence of informative right censoring: conditional linear model. Biometrics 45: 939-955