Treatment of Missing Data Wayne Jiang, FCAS Safeco Insurance Companies

Why handling missing data is important
If not properly handled, missing data can lead to biased, invalid, or insignificant results.
Different kinds of missing data
Missing completely at random (MCAR).
The probability that an observation is missing is unrelated to the value of the variable or to the value of any other variable, i.e. missing values are randomly distributed across all observations.
Different kinds of missing data
Missing at random (MAR).
The probability that a value is missing does not depend on the value of the variable itself after controlling for other variables.
Equivalently, the missingness is random once the data is split into subgroups.
Different kinds of missing data
Missing not at random (MNAR).
Neither MCAR nor MAR: the probability that a value is missing depends on the missing value itself.
Very hard to analyze.
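A small simulation can make the three mechanisms concrete. The sketch below is illustrative only: the variables x and y, the missingness probabilities, and the sample size are all assumptions, not taken from the slides. It compares the mean of the observed y values under each mechanism against the true mean of 0.

```python
import random
import statistics

def observed_mean(mechanism, n=20000, seed=0):
    """Mean of the y values that remain observed under one missingness mechanism."""
    rng = random.Random(seed)
    observed = []
    for _ in range(n):
        x = rng.gauss(0, 1)        # covariate, always observed (assumed)
        y = x + rng.gauss(0, 1)    # variable that may go missing; true mean is 0
        if mechanism == "MCAR":
            p_missing = 0.3                      # same for every record
        elif mechanism == "MAR":
            p_missing = 0.6 if x > 0 else 0.1    # depends only on the observed x
        elif mechanism == "MNAR":
            p_missing = 0.6 if y > 0 else 0.1    # depends on y itself
        else:
            raise ValueError(mechanism)
        if rng.random() >= p_missing:
            observed.append(y)
    return statistics.fmean(observed)

means = {m: observed_mean(m) for m in ("MCAR", "MAR", "MNAR")}
for name, value in means.items():
    print(name, round(value, 3))
```

In this setup the MCAR mean stays close to 0, while the MAR and MNAR means are both pulled below 0. Under MAR the shift can be corrected by conditioning on x, which is observed. Under MNAR it cannot, because the missingness depends on the unobserved y.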
Pattern of missing data

Monotone:
When more than one variable can be missing, the variables can be ordered so that if a variable is missing for a record, every later variable is missing for that record too.
V1  V2  V3  V4  V5
1   .   .   .   .
1   1   .   .   .
1   1   1   .   .
1   1   1   .   .
1   1   1   1   .
1   1   1   1   1
(1 = observed, . = missing)
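A monotone pattern can be checked mechanically. The following is a minimal sketch, assuming missing values are encoded as None and the columns are already in the candidate order V1 to V5. The function name is_monotone is hypothetical.

```python
def is_monotone(rows):
    """True if, in every row, once a value is missing all later values are missing too."""
    for row in rows:
        seen_missing = False
        for value in row:
            if value is None:
                seen_missing = True
            elif seen_missing:
                # An observed value follows a missing one: not monotone.
                return False
    return True

M = None  # shorthand for a missing value
pattern = [
    [1, M, M, M, M],
    [1, 1, M, M, M],
    [1, 1, 1, M, M],
    [1, 1, 1, M, M],
    [1, 1, 1, 1, M],
    [1, 1, 1, 1, 1],
]
print(is_monotone(pattern))       # True
print(is_monotone([[1, M, 1]]))   # False
```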
Dealing with missing data


If the data set is large and only a few points are missing at random, the problem is not serious.
In a smaller data set with a non-random distribution of missing values, the problem may be serious.
Some ways to deal with the missing data problem (separate category)
Treat missing as its own category.
This can group very dissimilar classes together.
Severe bias can result.
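The grouping problem is easy to see on a toy example. The records below are hypothetical (vehicle class and claim cost are invented for illustration), with None marking a missing class.

```python
from collections import defaultdict

# Hypothetical records: (vehicle_class, claim_cost); None marks a missing class.
records = [
    ("sedan", 100), ("sedan", 120), (None, 110),
    ("sports", 900), ("sports", 1100), (None, 1000),
]

totals = defaultdict(lambda: [0.0, 0])   # level -> [sum of costs, count]
for vehicle_class, cost in records:
    level = "Missing" if vehicle_class is None else vehicle_class
    totals[level][0] += cost
    totals[level][1] += 1

group_means = {level: total / count for level, (total, count) in totals.items()}
print(group_means)
```

Here the "Missing" level averages 555, a value that describes neither the low-cost nor the high-cost class, because it pools records from both.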
Some ways to deal with the missing data problem (deletion)
Listwise deletion.
Any record with a missing value is deleted.
Yields unbiased parameter estimates if the data are MCAR.
Sacrifices predictive power because fewer data points are used.
SAS Proc REG uses this as its default.
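As a minimal sketch of listwise deletion, assuming records are tuples and missing values are encoded as None (the function name listwise_delete is hypothetical):

```python
def listwise_delete(rows):
    """Keep only the rows in which no value is missing (None)."""
    return [row for row in rows if all(value is not None for value in row)]

rows = [
    (1.0, 2.0, 3.0),
    (4.0, None, 6.0),
    (7.0, 8.0, None),
    (10.0, 11.0, 12.0),
]
complete = listwise_delete(rows)
print(len(rows), "->", len(complete))   # 4 -> 2
```

Two of the four records are lost even though only two of the twelve values are missing, which is the cost in data points noted above.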
Some ways to deal with the missing data problem (deletion)
Pairwise deletion.
All available data are used to calculate each entry of the correlation matrix.
Sample sizes then differ from entry to entry, and the resulting matrix may not be positive definite.
SAS Proc CORR uses this as its default.
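The sample size issue can be seen by computing each correlation from whichever records have both values present. This is a sketch on made-up data, with None marking a missing value. The helper pairwise_corr is hypothetical and is not how Proc CORR is implemented.

```python
import math

def pairwise_corr(xs, ys):
    """Pearson correlation over the records where both values are present.
    Returns (correlation, number of pairs used)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy), n

a = [1.0, 2.0, 3.0, 4.0, None, 6.0]
b = [2.0, None, 6.0, 8.0, 10.0, 12.0]
c = [None, 1.0, 0.0, 2.0, 1.0, None]

r_ab, n_ab = pairwise_corr(a, b)
r_ac, n_ac = pairwise_corr(a, c)
r_bc, n_bc = pairwise_corr(b, c)
print(n_ab, n_ac, n_bc)   # 4 3 3
```

Each entry of the correlation matrix rests on a different subset of records, so the entries are not guaranteed to be mutually consistent.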
Some ways to deal with the missing data problem (substitution)
Mean substitution.
Replace missing data with the global mean.
Simple approach.
Underestimates the error.
Hot deck method.
Simple approach: replace each missing value with the value from a similar record.
Has randomness built in.
Still underestimates the error.
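Both substitution methods can be sketched in a few lines. The code below assumes None marks a missing value and, for the hot deck, that "similar" means "in the same group". That matching rule and the function names are assumptions made for illustration.

```python
import random
import statistics

def mean_substitute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]

def hot_deck(records, rng):
    """Fill a missing value with one drawn at random from observed donors in the
    same group. Each record is a (group, value) pair; a group with no observed
    donor raises KeyError in this sketch."""
    donors = {}
    for group, value in records:
        if value is not None:
            donors.setdefault(group, []).append(value)
    return [(g, rng.choice(donors[g]) if v is None else v) for g, v in records]

values = [10.0, 12.0, None, 14.0, None, 20.0]
filled = mean_substitute(values)
observed_sd = statistics.stdev([v for v in values if v is not None])
filled_sd = statistics.stdev(filled)
print(round(observed_sd, 2), round(filled_sd, 2))

records = [("A", 1.0), ("A", None), ("B", 5.0), ("B", 7.0), ("B", None)]
decked = hot_deck(records, random.Random(0))
```

The standard deviation drops after mean substitution because every filled value sits exactly at the mean, which is one way to see why the error is underestimated.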
Some ways to deal with the missing data problem (imputation)
Regression imputation.
Replace missing data with values predicted from other variables.
An improvement over the global mean.
Still underestimates the error.
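A minimal sketch of regression imputation with a single predictor follows. It assumes x is fully observed, y has missing values encoded as None, and a straight-line fit on the complete cases is adequate. The function names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

def regression_impute(xs, ys):
    """Fill each missing y (None) with the fitted value from the complete cases."""
    cx = [x for x, y in zip(xs, ys) if y is not None]
    cy = [y for y in ys if y is not None]
    a, b = fit_line(cx, cy)
    return [a + b * x if y is None else y for x, y in zip(xs, ys)]

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, None, 8.1, None]
completed = regression_impute(xs, ys)
print([round(v, 3) for v in completed])
```

Every imputed value lies exactly on the fitted line, with no residual scatter, so the completed data look less variable than real data would.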
Multiple imputation
A Monte Carlo technique in which the missing values are replaced by 3-10 simulated versions. Each of the simulated data sets is analyzed, and the results are combined to produce estimates that incorporate missing data uncertainty.
More complicated, but a lot less bias.
SAS users can use Proc MI and Proc MIAnalyze.
Three steps of multiple imputation
Impute data.
Data is assumed to be multivariate normal.
Parameters are first estimated from the complete cases.
The imputed data are drawn at random from the resulting distribution.
Parameters are estimated again and another imputation follows.
This is repeated until the parameters converge.
Then multiple sets of data are drawn at random from the distribution.
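The loop described in this step can be sketched for a single normally distributed variable. This is a deliberately simplified illustration, not the algorithm Proc MI runs: it handles one variable instead of a multivariate normal, uses a fixed number of iterations in place of a convergence check, and re-estimates the parameters by plugging in the sample mean and standard deviation. The function name and defaults are assumptions.

```python
import random
import statistics

def fill(values, mu, sigma, rng):
    """Copy of values with each None replaced by a draw from Normal(mu, sigma)."""
    return [rng.gauss(mu, sigma) if v is None else v for v in values]

def impute_normal(values, m=5, iterations=20, seed=0):
    """Return m completed copies of values (None = missing)."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    # Initial parameter estimates from the observed values only.
    mu, sigma = statistics.fmean(observed), statistics.stdev(observed)
    # Alternate between imputing and re-estimating the parameters.
    for _ in range(iterations):
        completed = fill(values, mu, sigma, rng)
        mu, sigma = statistics.fmean(completed), statistics.stdev(completed)
    # Draw the m imputed data sets from the final distribution.
    return [fill(values, mu, sigma, rng) for _ in range(m)]

data = [5.0, None, 7.0, 6.0, None, 8.0]
imputed_sets = impute_normal(data, m=3)
for completed in imputed_sets:
    print([round(v, 2) for v in completed])
```

The observed values are identical across the three completed data sets. Only the filled-in positions differ, and that variation is what carries the missing data uncertainty into the later steps.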
Three steps of multiple imputation
Analyze data.
Each set of data is analyzed using any preferred method.
Proc ####; BY _Imputation_; …; Run;
Save the parameter estimates in a data set.
Three steps of multiple imputation
Combine results.
Estimate = mean of all m estimates.
Total variance = (average within-imputation variance) + (1 + 1/m) × (between-imputation variance), where m is the number of imputed data sets.
Proc MIAnalyze parms=####; Run;
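The combining formulas are short enough to write out directly. The sketch below applies them to one parameter. The three estimates and variances are made-up numbers, and the function name combine is hypothetical. The between-imputation variance is taken as the sample variance of the m estimates (divisor m - 1).

```python
import statistics

def combine(estimates, variances):
    """Combine m point estimates and their within-imputation variances.
    Returns (combined estimate, total variance)."""
    m = len(estimates)
    estimate = statistics.fmean(estimates)       # mean of all estimates
    within = statistics.fmean(variances)         # average within variance
    between = statistics.variance(estimates)     # between variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return estimate, total

est, total_var = combine([10.0, 12.0, 11.0], [4.0, 5.0, 6.0])
print(est, round(total_var, 4))   # 11.0 6.3333
```

With these numbers the within variance is 5 and the between variance is 1, so the total is 5 + (4/3)(1). The extra 4/3 reflects the disagreement among the imputed data sets.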
Reference
SAS online manual: http://support.sas.com/rnd/app/papers/miv802.pdf
Carpenter, J. and Kenward, M.: http://www.lshtm.ac.uk/msu/missingdata/start.html
Questions?
Thank you!