HOW TO DEAL WITH MISSING DATA: INTRODUCTION
LI QI
UNC CHARLOTTE
GENERAL STEPS FOR ANALYSIS WITH MISSING DATA
1. Identify patterns of and reasons for missingness, and recode missing values correctly
2. Decide on the best method of analysis
3. Make an inference about some aspect (parameter) of the
distribution of the “full” data when some of the data are
missing
STEP 1: UNDERSTAND YOUR DATA

Attrition due to social/natural processes
E.g.: school graduation, dropout, death…

Skip pattern in survey
E.g.: certain questions are asked only of respondents who indicate they are married

Intentional missing as part of data collection process

Respondent refusal/Non-response

Observations are not sampled with the same frequency.
UNDERSTAND YOUR DATA (CONT.)
Are certain groups more likely to have missing values?
Example: Respondents in service occupations less likely to
report income
Are certain responses more likely to be missing?
Example: Respondents with high income less likely to report
income
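The first question can be checked by tabulating the share of missing values within each group. A minimal sketch in Python, assuming a small made-up data set (the records and the helper name `missing_rate_by_group` are illustrative, not part of the original material):

```python
from collections import defaultdict

def missing_rate_by_group(records, group_key, value_key):
    """Return {group: share of records whose value is None}."""
    counts = defaultdict(lambda: [0, 0])  # group -> [n_missing, n_total]
    for rec in records:
        counts[rec[group_key]][1] += 1
        if rec[value_key] is None:
            counts[rec[group_key]][0] += 1
    return {g: n_miss / n_tot for g, (n_miss, n_tot) in counts.items()}

# Hypothetical survey records; None marks an unreported income.
survey = [
    {"occupation": "service", "income": None},
    {"occupation": "service", "income": 28000},
    {"occupation": "service", "income": None},
    {"occupation": "service", "income": 31000},
    {"occupation": "professional", "income": 72000},
    {"occupation": "professional", "income": 65000},
    {"occupation": "professional", "income": None},
    {"occupation": "professional", "income": 80000},
]

rates = missing_rate_by_group(survey, "occupation", "income")
print(rates)  # {'service': 0.5, 'professional': 0.25}
```

The second question cannot be answered by a table like this one, because the values that would reveal the pattern (the high incomes) are exactly the ones that are missing.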
MISSING DATA MECHANISM
MCAR (missing completely at random): The probability of
missingness is independent of the data.
If the data are MCAR, then the complete-case estimator is
unbiased and consistent, as our intuition would suggest.
In general, the observed data alone cannot confirm that the
missing data are MCAR: they can reveal dependence of missingness on observed
variables, but not dependence on the unobserved values themselves (an identifiability problem).
MISSING DATA MECHANISM (CONT.)
MAR (missing at random): The probability of missingness
depends only on the observed data.
Under MAR, missingness δ and the variable Y may still be
dependent, but only through the observed data W:
P(δ = 1 | Y, W) = P(δ = 1 | W) = f(W)
Example: Respondents in service occupations less likely to
report income
MISSING DATA MECHANISM (CONT.)
NMAR (not missing at random): The probability of
missingness may also depend on the unobservable part
of the data.
Difficult to deal with
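The practical difference between the three mechanisms can be seen in a small simulation. The sketch below is an illustration under made-up assumptions (a binary observed covariate W, a normally distributed income Y, and arbitrary observation probabilities); it compares the complete-case mean with the mean of the full data under each mechanism.

```python
import random
import statistics

random.seed(0)
n = 20000

# W: always-observed covariate (1 = service occupation); Y: income in $1000s.
data = []
for _ in range(n):
    w = 1 if random.random() < 0.5 else 0
    y = random.gauss(40 if w else 60, 10)
    data.append((w, y))

def observed_mean(prob_observed):
    """Complete-case mean of Y when each Y is observed with prob_observed(w, y)."""
    kept = [y for w, y in data if random.random() < prob_observed(w, y)]
    return statistics.mean(kept)

full_mean = statistics.mean(y for _, y in data)
mcar_mean = observed_mean(lambda w, y: 0.7)                     # ignores the data
mar_mean = observed_mean(lambda w, y: 0.4 if w else 0.9)        # depends on W only
nmar_mean = observed_mean(lambda w, y: 0.9 if y < 50 else 0.4)  # depends on Y itself

print(round(full_mean, 1), round(mcar_mean, 1), round(mar_mean, 1), round(nmar_mean, 1))
```

In this setup the complete-case mean stays close to the full-data mean only under MCAR. Under MAR the bias can be removed by using W; under NMAR the observed data carry no information that would allow such a correction.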
STEP 2: DEAL WITH MISSING DATA
Use what you know about why data is missing. Decide on
the best analysis strategy to yield the best estimates.
TRADITIONAL APPROACHES

Deletion methods: listwise deletion, pairwise deletion
Single imputation methods: mean/mode substitution, regression substitution
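A minimal sketch of two of these approaches, listwise deletion and mean substitution, on a made-up income variable (the values are illustrative assumptions):

```python
import statistics

# Hypothetical incomes in $1000s; None marks a missing value.
income = [30, None, 45, 50, None, 62, 38, None, 55, 41]

# Listwise deletion: drop every case with a missing value.
complete = [v for v in income if v is not None]

# Mean substitution: replace each missing value with the observed mean.
observed_mean = statistics.mean(complete)
mean_filled = [observed_mean if v is None else v for v in income]

sd_complete = statistics.stdev(complete)
sd_filled = statistics.stdev(mean_filled)
print(len(complete), round(sd_complete, 2), round(sd_filled, 2))
```

Mean substitution leaves the mean unchanged but shrinks the standard deviation, because every filled-in value sits exactly at the center of the distribution.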
ADVANCED METHODS
Maximum Likelihood method (ML)
Weighting method (inverse probability weighting, IPW)
Multiple imputation method (MI)
MODEL-BASED METHODS: ML
Identify the set of parameter values that produces the highest
log-likelihood.
Advantages: Use both complete cases and incomplete
cases; Enjoy the optimality properties afforded to an MLE.
Disadvantages: We need to correctly specify the two
parametric models; Difficult to compute.
INVERSE PROBABILITY WEIGHTING
Little and Rubin (1987) proposed this method for missing
data problems in surveys.
INVERSE PROBABILITY WEIGHTING
Idea: A subject with a weight of 4 has a probability of
observation of 0.25 (π = 0.25, so the weight is 1/π = 4). As a result, data from
this subject should count once for herself and 3 times for
similar subjects whose data are missing.
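The weighting idea can be sketched for the simplest target, a population mean. The example below is an illustration under assumptions that are not in the original: the data are simulated, the observation probabilities π are treated as known, and the weighted mean is normalized by the sum of the weights.

```python
import random

random.seed(1)
n = 20000

sample = []
for _ in range(n):
    w = 1 if random.random() < 0.5 else 0
    y = random.gauss(40 if w else 60, 10)   # true overall mean is 50
    pi = 0.25 if w else 0.8                 # assumed known probability of observation
    observed = random.random() < pi
    sample.append((w, y if observed else None, pi))

# Each observed subject is weighted by 1/pi, so a subject with pi = 0.25
# counts for 4 subjects: herself and 3 similar subjects who are missing.
num = sum(y / pi for _, y, pi in sample if y is not None)
den = sum(1 / pi for _, y, pi in sample if y is not None)
ipw_mean = num / den

cc = [y for _, y, _ in sample if y is not None]
complete_case_mean = sum(cc) / len(cc)
print(round(complete_case_mean, 1), round(ipw_mean, 1))
```

The complete-case mean is pulled toward the group that is observed more often, while the weighted mean recovers a value close to 50.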
INVERSE PROBABILITY WEIGHTING
(CONT.)
Advantages: A full likelihood is not necessary; estimating equations (GEE) can be used.
It can be applied widely in different models.
Disadvantages: If the selection probability model is not
correctly specified, the IPW estimator is biased. If
the true selection probability is very small, the weights become very large and the estimator can be
very unstable.
INVERSE PROBABILITY WEIGHTING
(CONT.)
Robins, Rotnitzky and Zhao (1994) discussed the idea of
adding an augmentation term to a simple weighted
estimation equation.
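For a population mean, a common textbook form of an augmented estimator adds the term −(δ − π)/π · m(W) to each weighted contribution δY/π, where m(W) is a working model for E[Y | W]. The sketch below illustrates that form on simulated data; it is not claimed to be the exact estimator of Robins, Rotnitzky and Zhao (1994), and the data, the known π, and the group-mean outcome model are all assumptions made for the example.

```python
import random

random.seed(2)
n = 20000

rows = []
for _ in range(n):
    w = 1 if random.random() < 0.5 else 0
    y = random.gauss(40 if w else 60, 10)   # true overall mean is 50
    pi = 0.25 if w else 0.8                 # assumed known probability of observation
    delta = 1 if random.random() < pi else 0
    rows.append((w, y if delta else None, delta, pi))

# Working outcome model m(W): mean of the observed Y within each level of W.
m = {}
for level in (0, 1):
    ys = [y for w, y, d, _ in rows if w == level and d == 1]
    m[level] = sum(ys) / len(ys)

# Simple weighted terms and augmentation terms, one per subject.
ipw_terms = [(y / pi if d else 0.0) for _, y, d, pi in rows]
aug_terms = [-(d - pi) / pi * m[w] for w, _, d, pi in rows]

ipw_mean = sum(ipw_terms) / n
aipw_mean = sum(a + b for a, b in zip(ipw_terms, aug_terms)) / n
print(round(ipw_mean, 2), round(aipw_mean, 2))
```

The augmentation term has mean zero when π is correct, so it does not change what is being estimated; its role is to use information from subjects with missing Y through m(W).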
INVERSE PROBABILITY WEIGHTING
(CONT.)
SAS: The GEE and CAUSALTRT Procedures
R: The ipw and CausalGAM packages
SINGLE IMPUTATION
Since some of the Y are missing, a natural strategy is to
impute or “estimate” a value for such missing data and then
estimate the parameter of interest behaving as if the
imputed values were the true values.
SINGLE IMPUTATION (CONT.)

For monotone missing data patterns

Regression Method

Propensity Score Method


For arbitrary missing data patterns
MCMC method
All these options are available in the SAS MI procedure.
MULTIPLE IMPUTATION (MI)
Single imputation does not reflect the uncertainty about the
predictions of the unknown missing values, and the resulting
estimated variances of the parameter estimates will be
biased toward zero.
Multiple imputation does not attempt to estimate each
missing value through simulated values but rather to
represent a random sample of the missing values.
MI PROCEDURE
Multiple imputation inference involves three distinct phases:
The missing data are filled in m times to generate m
complete data sets.
The m complete data sets are analyzed by using standard
procedures.
The results from the m complete data sets are combined for
the inference.
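The three phases can be sketched end to end for a single variable. This is a simplified illustration, not the algorithm used by PROC MI: the data are simulated, missingness is completely at random, and the imputation step draws from a resampled pool of observed values (an assumption chosen to keep the example short). The combining step uses Rubin's rules: the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```python
import random
import statistics

random.seed(3)

# Hypothetical data: 200 values, about 30% set to None completely at random.
full = [random.gauss(50, 10) for _ in range(200)]
data = [None if random.random() < 0.3 else v for v in full]
observed = [v for v in data if v is not None]

m = 20
estimates, variances = [], []
for _ in range(m):
    # Phase 1: fill in the missing values. Resampling the donor pool first
    # lets the imputations reflect uncertainty about the observed distribution.
    donors = random.choices(observed, k=len(observed))
    completed = [random.choice(donors) if v is None else v for v in data]
    # Phase 2: analyze the completed data set with a standard procedure
    # (here: the sample mean and its estimated variance).
    estimates.append(statistics.mean(completed))
    variances.append(statistics.variance(completed) / len(completed))

# Phase 3: combine the m results with Rubin's rules.
q_bar = statistics.mean(estimates)        # pooled point estimate
within = statistics.mean(variances)       # average within-imputation variance
between = statistics.variance(estimates)  # between-imputation variance
total_var = within + (1 + 1 / m) * between
print(round(q_bar, 2), round(total_var ** 0.5, 3))
```

The between-imputation component is what a single imputation cannot provide, which is why its variance estimates are too small.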
MULTIPLE IMPUTATION PROCESS
MULTIPLE IMPUTATION (CONT.)
SAS
PROC MI
R
mi package
TIME SERIES DATA
Idea: aggregation and interpolation
SAS: PROC EXPAND
http://support.sas.com/rnd/app/examples/ets/missval
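The interpolation idea can be sketched with plain linear interpolation between the nearest observed neighbors. This is an illustration only and does not reproduce the methods of PROC EXPAND; the function name `interpolate_linear` and the series are made up for the example.

```python
def interpolate_linear(series):
    """Fill interior None gaps by linear interpolation between neighbors.

    Leading and trailing gaps are left as None because they have only one neighbor.
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled

# Hypothetical equally spaced series with missing observations.
monthly = [10.0, None, None, 16.0, 18.0, None, 22.0, None]
print(interpolate_linear(monthly))
# [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, None]
```

The trailing gap is deliberately left unfilled: extending the series beyond the last observation would be extrapolation rather than interpolation.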
Missing Spatial Data
Thank you!