Treatment of missing values

1.11) Missing values in autoregressive and cross-lagged models: diagnostics and therapy

• What is a missing value? We distinguish unit non-response from item non-response.
• Answers such as "Don't know", "Refused", or "no opinion" are also often treated as missing.
• In longitudinal studies the problem of missing values is especially disturbing because of panel mortality, that is, unit non-response. This is also called wave non-response, attrition, or drop-out.

Different patterns of non-response

• 1) Univariate pattern: some indicators or items are fully observed, while other items have missing values. These items may be fully or partly missing.
• 2) Monotone pattern: typically arises in longitudinal studies with attrition. Once an item is missing in some wave, it remains missing in all later waves.
• 3) Arbitrary pattern: any set of variables may be missing for any unit.

[Figure: schematic data matrices (units 1, …, N by variables y1, …, yp) with "?" marking missing entries, illustrating 1) the univariate, 2) the monotone, and 3) the arbitrary missing-data pattern.]

• The literature distinguishes three kinds of missing values:
• 1) MCAR (missing completely at random) means that whether the data are missing is entirely unrelated statistically to the values that would have been observed. MCAR is the most restrictive assumption. MCAR can sometimes be established by randomly assigning test booklets or blocks of survey questions to different respondents.
• 2) MAR (missing at random) is a somewhat more relaxed condition. It means that missingness is statistically unrelated to the variable itself; however, it may be related to other variables in the data set. One way to establish MAR processes is to include completely observed variables that are highly predictive of the incomplete data.
• 3) MNAR (missing not at random), or nonignorable missing data, means that missingness conveys probabilistic information about the values that would have been observed.

Example

• Take two variables, education and income. Education has no missing values, income does.
• MCAR would mean that the missing values of income depend neither on education nor on income.
• MAR would mean that the missing values of income depend on education. That is, education can predict the missingness in income.
• MNAR would mean that the missingness in income is not independent of the values that are missing, even after controlling for the prediction by education. For example, high income values are more often missing than low income values.

• Diagnosing the kind of missingness is very tricky and cannot always be established, so researchers often assume the kind of missingness in their data. Fortunately, there are solutions that are independent of the kind of missing data, even for MNAR.

• MAR: logistic regression of a missingness indicator on observed variables is a possible test. But even a significant effect of some non-missing variables on the missingness cannot exclude MNAR (a simulation sketch of the three mechanisms and this diagnostic follows below).
• Only experimental designs that obtain a representative, fully observed subsample of the missing cases can help build a model for the missingness in the full data set.

So far diagnostics, now therapy

Traditional methods:
• 1) Listwise deletion (LD): deleting every case that has any missing value.
• Advantage: a consistent solution.
• Disadvantage: not efficient, and it often causes a drastic reduction in sample size, especially in studies with multiple indicators and sensitive questions such as income.
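The three missingness mechanisms and the logistic-regression diagnostic can be made concrete with simulated data. The following Python sketch is not part of the original slides; it is a minimal illustration with invented parameter values for the education/income example. It deletes income under MCAR, MAR, and MNAR rules, regresses a missingness indicator on the fully observed education variable, and shows the effect of listwise deletion on the income mean.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 2000

# Simulated complete data: income depends on education (values are arbitrary).
education = rng.normal(12, 3, n)                       # years of schooling
income = 1000 + 150 * education + rng.normal(0, 300, n)

def make_missing(mechanism):
    """Return income with values deleted under MCAR, MAR, or MNAR."""
    y = income.copy()
    if mechanism == "MCAR":            # unrelated to education and income
        drop = rng.random(n) < 0.3
    elif mechanism == "MAR":           # depends on observed education only
        p = 1 / (1 + np.exp(-(education - 12)))
        drop = rng.random(n) < p
    else:                              # MNAR: depends on income itself
        p = 1 / (1 + np.exp(-(y - y.mean()) / y.std()))
        drop = rng.random(n) < p
    y[drop] = np.nan
    return y

for mech in ["MCAR", "MAR", "MNAR"]:
    df = pd.DataFrame({"education": education, "income": make_missing(mech)})
    # Diagnostic: logistic regression of the missingness indicator on the
    # fully observed variable.  A significant effect rules out MCAR, but
    # (as noted above) it cannot distinguish MAR from MNAR.
    miss = df["income"].isna().astype(int)
    logit = sm.Logit(miss, sm.add_constant(df["education"])).fit(disp=0)
    # Listwise deletion: drop every case with any missing value.
    complete = df.dropna()
    print(f"{mech}: missing rate = {miss.mean():.2f}, "
          f"education p-value = {logit.pvalues['education']:.3f}, "
          f"income mean after listwise deletion = {complete['income'].mean():.0f}")
```

In this setup the education coefficient is significant under MAR and, because income and education are correlated, usually also under MNAR, which is exactly why the logistic test cannot separate the two; the complete-case income mean is biased under MAR and MNAR but not under MCAR.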
Therapy (2)

• 2) Pairwise deletion (PD), also called available-case (AC) analysis: calculates each correlation separately and excludes an observation from a calculation only when it is missing a value needed for that particular correlation.
• Advantage: smaller loss of cases than with LD.
• Disadvantages: not efficient, and it can create estimation problems because the resulting correlation matrix may not be positive definite.
• There is no well-defined sample size N, since N depends on the computed pair.

• Advantage of the case-deletion methods: simplicity.
• If a missing-data problem can be resolved by discarding only a small part of the sample, then case deletion can be quite effective.
• Even in that situation, however, one should explore the data to make sure that the discarded cases are not influential.

Reweighting

• In some non-MCAR situations it is possible to reduce biases by applying weights. After incomplete cases are removed, the remaining complete cases are reweighted so that their distribution more closely resembles that of the full sample or population with respect to auxiliary variables (Little and Rubin 1987).
• It requires some model for the response probabilities in order to calculate the weights. It works better for the univariate and monotone missing patterns and becomes complicated to apply when missingness follows an arbitrary pattern.

Older methods of imputation

• Imputation means filling in missing values with plausible values and then continuing with the analysis.
• Advantages: potentially more efficient than discarding the unit; prevents loss of power due to a decreasing sample size.
• Disadvantage: imputation may be difficult to implement well.

• 1) Imputing unconditional means: the average is preserved, but distributional aspects such as the variance are distorted.
• 2) Hot-deck imputation: filling in a nonrespondent's value with a value drawn at random from actual respondents. Advantage: it preserves the variable's distribution. Disadvantage: it still distorts correlations and other measures of association.
• 3) Imputing conditional means by regression: the model is first fitted to the cases for which Y is known. With the regression parameters from X to Y, missing values of Y are predicted from the known values of X. It is almost optimal with some corrections for the standard errors (Schafer & Schenker 2000). It is not recommended for analyses of covariances or correlations, since it overstates the relation between Y and X.
• 4) Imputing from a conditional distribution: the distortion of covariances can be eliminated if each missing value of Y is replaced not by a regression prediction but by a random draw from the conditional (predictive) distribution of Y given X, i.e. the regression prediction plus a random error term.

[Figure: scatterplots of Y against X under the four imputation methods: 1) mean substitution, 2) hot deck, 3) conditional mean, 4) predictive distribution.]

• 1) Mean substitution causes all imputed values to fall on a horizontal line; according to simulation studies it produces biased estimates for any type of missingness.
• 2) Conditional mean substitution causes the imputed values to fall on the regression line and thereby introduces bias.
• 3) The hot deck produces an elliptical cloud with too little correlation; it produces biased estimates for any type of missingness.
• 4) The only method that produces a reasonable point cloud is imputation from the conditional distribution of Y given X; it is unbiased (a small simulation comparing the four methods follows below).
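The bias pattern described in the figure can be reproduced with a small simulation. The sketch below is illustrative only (the data, the MAR rule, and all parameter values are invented): it imputes Y under the four strategies and compares the variance of Y and the correlation between X and Y with the complete-data values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Complete data: Y depends linearly on X.
x = rng.normal(0, 1, n)
y_true = 2.0 * x + rng.normal(0, 1, n)

# MAR missingness in Y: probability of being missing depends on X only.
missing = rng.random(n) < 1 / (1 + np.exp(-x))
y = np.where(missing, np.nan, y_true)
obs = ~missing

def impute(method):
    y_imp = y.copy()
    if method == "mean":                      # 1) unconditional mean
        y_imp[missing] = np.nanmean(y)
    elif method == "hot deck":                # 2) random observed donor
        y_imp[missing] = rng.choice(y[obs], size=missing.sum())
    else:
        # Regression of Y on X fitted to the complete cases.
        b1, b0 = np.polyfit(x[obs], y[obs], 1)
        resid_sd = np.std(y[obs] - (b0 + b1 * x[obs]))
        pred = b0 + b1 * x[missing]
        if method == "conditional mean":      # 3) regression prediction
            y_imp[missing] = pred
        else:                                 # 4) prediction + random error
            y_imp[missing] = pred + rng.normal(0, resid_sd, missing.sum())
    return y_imp

print(f"{'complete data':<24}: var={y_true.var():.2f}, "
      f"corr={np.corrcoef(x, y_true)[0, 1]:.2f}")
for m in ["mean", "hot deck", "conditional mean", "predictive distribution"]:
    y_imp = impute(m)
    print(f"{m:<24}: var={y_imp.var():.2f}, "
          f"corr={np.corrcoef(x, y_imp)[0, 1]:.2f}")
```

Mean substitution and the hot deck shrink the correlation, conditional-mean imputation shrinks the variance and overstates the correlation, and only the draw from the predictive distribution reproduces both quantities approximately.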
• However, for all of these methods the confidence-interval coverage is very low (see Schafer and Graham 2002, p. 161, for details).
• Solution: modern methods of handling missing data, multiple imputation (MI) and maximum likelihood (ML).

Therapy (3)

Modern methods:
• 1) FIML (full information maximum likelihood). The FIML discrepancy function maximizes the sum of N casewise contributions to the likelihood function, each measuring the discrepancy between the observed data and the current parameter estimates using all available data for that case. FIML is a direct method in the sense that model parameters and standard errors are estimated directly from the available data. Missing data (MD) points are not estimated or imputed; they are essentially treated as values that were never intended to be sampled. (A sketch of the casewise-likelihood idea follows at the end of this section.)
• Advantage: the algorithm uses all the available information, and the method is both consistent and efficient under MAR.
• Disadvantage: the method is model dependent, as it uses information only from the variables in the model (different variables in the model lead to different results).

Assumptions of ML estimates

• 1) They assume that the sample is large enough and normally distributed for the ML estimates to be approximately unbiased.
• 2) They assume some model for the complete data and MAR.
• 3) However, in many realistic applications, and according to simulation studies, departures from the last two assumptions are not large enough to effectively invalidate the results.
• 4) According to simulations, non-normality is not a crucial problem. Graham and Schafer (1999) report excellent performance when imputing non-normal missing data, even with small samples.
• 5) For large enough N (over 250), MI and ML estimates are very similar.
• 6) Conclusion: FIML and Bayesian methods are state of the art and should be used.

• However, the FIML procedure is attractive, as it is very easy to implement. FIML is available in most SEM programs (AMOS, LISREL, Mplus). Data imputation is available in LISREL, Mplus and AMOS.
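To illustrate the casewise-likelihood idea behind FIML outside of a SEM program, the sketch below fits a saturated bivariate-normal model by summing one normal log-likelihood contribution per case, using whichever variables are observed for that case. This is a minimal, hand-rolled illustration with invented data and a MAR rule chosen for the example; it is not the FIML implementation of AMOS, LISREL or Mplus.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(1)
n = 1000

# Bivariate normal data: x fully observed, y missing at random given x.
mu_true = np.array([0.0, 0.0])
cov_true = np.array([[1.0, 0.6], [0.6, 1.0]])
data = rng.multivariate_normal(mu_true, cov_true, n)
miss = rng.random(n) < 1 / (1 + np.exp(-data[:, 0]))   # MAR: depends on x
data[miss, 1] = np.nan

def neg_loglik(theta):
    """Sum of casewise normal log-likelihood contributions (the FIML idea)."""
    mu = theta[:2]
    sd = np.exp(theta[2:4])                 # log scale keeps the sds positive
    rho = np.tanh(theta[4])                 # tanh keeps the correlation in (-1, 1)
    cov = np.array([[sd[0] ** 2, rho * sd[0] * sd[1]],
                    [rho * sd[0] * sd[1], sd[1] ** 2]])
    complete = data[~miss]                  # cases with both variables observed
    partial = data[miss, 0]                 # cases with only x observed
    ll = multivariate_normal(mu, cov).logpdf(complete).sum()
    ll += norm(mu[0], sd[0]).logpdf(partial).sum()
    return -ll

res = minimize(neg_loglik, x0=np.zeros(5), method="Nelder-Mead")
mu_hat, rho_hat = res.x[:2], np.tanh(res.x[4])

cc = data[~miss]
print("complete-case corr   :", round(np.corrcoef(cc[:, 0], cc[:, 1])[0, 1], 3))
print("casewise-ML estimates: mu =", np.round(mu_hat, 3), " rho =", round(rho_hat, 3))
```

Because the missingness depends on x, the complete-case correlation is attenuated, while the estimate that also uses the partially observed cases recovers mu and rho close to their true values, which is the sense in which direct ML "uses all the available information".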