Treatment of Missing Data
Wayne Jiang, FCAS
Safeco Insurance Companies

Why missing-data handling is important
- If not properly handled, missing data can lead to biased, invalid, or insignificant results.

Different kinds of missing data
- Missing completely at random (MCAR).
- The probability that an observation is missing is unrelated to the value of the variable or to the value of any other variable; that is, missing values are randomly distributed across all observations.

Different kinds of missing data
- Missing at random (MAR).
- The probability that a value is missing does not depend on the value of the variable itself after controlling for other variables.
- Equivalently, the missingness is random once the data are split into subgroups.

Different kinds of missing data
- Missing not at random.
- Neither MCAR nor MAR.
- Very hard to analyze.

Pattern of missing data
- Monotone: when more than one variable can be missing, the variables can be ordered so that once a variable is missing for a record, every later variable is missing as well (1 = observed, . = missing).

V1  V2  V3  V4  V5
1   .   .   .   .
1   1   .   .   .
1   1   1   .   .
1   1   1   .   .
1   1   1   1   .
1   1   1   1   1

Dealing with missing data
- If the data set is large and only a few random points are missing, the problem is not serious.
- In a smaller data set with a non-random distribution of missing values, the problem may be serious.

Some ways to deal with the missing data problem (separate category)
- Treat "missing" as its own category.
- This could group very dissimilar classes together.
- Severe bias could result.

Some ways to deal with the missing data problem (deletion)
- Listwise deletion.
- Any record with a missing value is deleted.
- Yields unbiased parameter estimates if the data are MCAR.
- Sacrifices predictive power because fewer data points are used.
- This is the default in SAS Proc REG.

Some ways to deal with the missing data problem (deletion)
- Pairwise deletion.
- All available data are used in calculating the correlation matrix, so each correlation is based on the records where both variables are present.
- Creates a sample size problem (each correlation can rest on a different number of records) and possibly a non-positive definite matrix.
- This is the default in SAS Proc CORR.
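The difference between listwise and pairwise deletion can be illustrated with a small sketch. This is not from the original slides: the toy records, the use of None to mark a missing value, and the helper names listwise_delete, pearson, and pairwise_corr are all assumptions made for illustration, and Python stands in for the SAS procedures the slides mention.

```python
from math import sqrt

# Hypothetical toy data: three variables per record, None marks a missing value.
records = [
    (1.0, 2.0, 3.0),
    (2.0, None, 5.0),
    (3.0, 6.0, None),
    (4.0, 8.0, 9.0),
    (5.0, 10.0, 12.0),
]

def listwise_delete(rows):
    """Keep only records that have no missing value in any variable."""
    return [r for r in rows if all(v is not None for v in r)]

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def pairwise_corr(rows, i, j):
    """Correlate variables i and j using every record where both are present.

    Returns the correlation and the number of records it was based on.
    """
    pairs = [(r[i], r[j]) for r in rows if r[i] is not None and r[j] is not None]
    xs, ys = zip(*pairs)
    return pearson(xs, ys), len(pairs)

complete = listwise_delete(records)
print("records kept by listwise deletion:", len(complete))
for i, j in [(0, 1), (0, 2), (1, 2)]:
    r, n = pairwise_corr(records, i, j)
    print(f"variables {i},{j}: n = {n}, r = {r:.3f}")
```

In this toy data, listwise deletion keeps fewer records than any single pairwise correlation uses, and the pairwise correlations rest on different numbers of records, which is the sample size problem noted in the slide.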
Some ways to deal with the missing data problem (substitution)
- Mean substitution
  - Replace missing data with the global mean.
  - Simple approach.
  - Underestimates the error.
- Hot deck substitution
  - Simple approach.
  - Replace each missing value with the value from a similar record.
  - Has randomness built in.
  - Still underestimates the error.

Some ways to deal with the missing data problem (imputation)
- Regression
  - Replace missing data with values predicted from other variables.
  - An improvement over the global mean.
  - Still underestimates the error.
- Multiple imputation
  - A Monte Carlo technique in which the missing values are replaced by 3-10 simulated versions, each simulated data set is analyzed, and the results are combined to produce estimates that incorporate missing-data uncertainty.
  - More complicated, but far less biased.
  - SAS users can use Proc MI and Proc MIAnalyze.

Three steps of multiple imputation
1. Impute data
- The data are assumed to be multivariate normal.
- Parameters are first estimated from the complete cases.
- Imputed values are drawn at random from the resulting distribution.
- Parameters are estimated again and another imputation follows.
- This is repeated until the parameters converge.
- Multiple data sets are then drawn at random from the distribution.

Three steps of multiple imputation
2. Analyze data
- Each data set is analyzed using any preferred method.
  Proc ####; BY _Imputation_; …; Run;
- Save the parameter estimates in a data set.

Three steps of multiple imputation
3. Combine results
- Estimate = mean of all estimates.
- Total variance = (average within-imputation variance) + (1 + 1/m) × (between-imputation variance), where m is the number of imputed data sets.
  Proc MIAnalyze parms=####; Run;

Reference
- SAS online manual: http://support.sas.com/rnd/app/papers/miv802.pdf
- Carpenter, J. and Kenward, M.: http://www.lshtm.ac.uk/msu/missingdata/start.html

Questions?

Thank you!
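As a supplementary note on the "Combine results" step, the combination rule stated in the slides can be sketched in a few lines. This is an illustrative sketch, not the Proc MIAnalyze implementation: the function name pool_estimates and the numbers are hypothetical, and the between-imputation variance is assumed to be the sample variance (divisor m - 1) of the per-imputation estimates, which the slides do not spell out.

```python
from statistics import mean, variance

def pool_estimates(estimates, within_variances):
    """Combine per-imputation results for one parameter.

    estimates: point estimate of the parameter from each imputed data set
    within_variances: squared standard error of that estimate in each data set
    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need matching lists from at least two imputations")
    pooled = mean(estimates)           # estimate = mean of all estimates
    within = mean(within_variances)    # average within-imputation variance
    between = variance(estimates)      # assumed: sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical results from m = 5 imputed data sets.
est = [1.0, 1.2, 0.8, 1.1, 0.9]
var = [0.04, 0.05, 0.04, 0.05, 0.04]
pooled, total = pool_estimates(est, var)
print("pooled estimate:", round(pooled, 4))
print("total variance:", round(total, 4))
```

The total variance exceeds the average within-imputation variance whenever the estimates differ across imputed data sets; that extra term is how the missing-data uncertainty enters the final standard error.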