Principles: Missing Data

Statistical Methods
Principles
Missing Data
Dr Eleni Matechou
matechou@stats.ox.ac.uk
References:
• R.J.A. Little and D.B. Rubin, “Statistical Analysis with Missing Data”, 2nd edition
• J.L. Schafer and J.W. Graham (2002), “Missing Data: Our View of the State of the Art”
Missing data indicator matrix
Usually, the data set consists of a matrix Y with n rows and p columns.
A row traditionally corresponds to a case and a column to a variable i.e.
yij is the value of variable j for individual i.
With real data sets, it is not uncommon for an entry to be missing.
Denote by M the missing data indicator matrix with entry Mij
equal to 1 if observation yij is missing and 0 otherwise.
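As a minimal sketch of this definition, the indicator matrix can be built directly from a data matrix. The toy values and the use of `None` to mark a missing entry are assumptions made for illustration, and `missing_indicator` is a hypothetical helper name.

```python
# Toy data matrix Y: 4 cases (rows) by 3 variables (columns).
# None marks a missing entry (a convention chosen for this sketch).
Y = [
    [23, 1.70, 65.0],
    [35, None, 80.5],
    [41, 1.82, None],
    [None, 1.65, None],
]

def missing_indicator(Y):
    """Return M with M[i][j] = 1 if Y[i][j] is missing and 0 otherwise."""
    return [[1 if y is None else 0 for y in row] for row in Y]

M = missing_indicator(Y)
for row in M:
    print(row)
```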
Missing data mechanisms
Question: Are the missing values related to the underlying values of
the variables in the data set?
The data can be:
• missing completely at random (MCAR)
• missing at random (MAR)
• not missing at random (NMAR)
Denote by P(M | Y, θ) the distribution of M, where θ are unknown
parameters.
MCAR
If missingness does not depend on the values of the data set, observed or
unobserved, then:
P(M | Y, θ) = P(M | θ)
and the data are MCAR.
Example: n individuals had their blood pressure measured and a random
sample of size n′ < n also had their weight measured.
MAR
If missingness does not depend on the unobserved values of the data set
but may depend on the observed values, then:
P(M | Y, θ) = P(M | Yobs, θ)
and the data are MAR.
Example: n individuals had their blood pressure measured and only those
individuals with high blood pressure also had their weight measured.
NMAR
If missingness depends on the unobserved values of the data set, so that
P(M | Y, θ) cannot be reduced to P(M | Yobs, θ), for example
P(M | Y, θ) = P(M | Ymiss, θ)
then the data are NMAR.
Example: n individuals had their blood pressure measured but only
overweight individuals also had their weight measured.
Example
Suppose Yi1 is the age and Yi2 is the income of individual i.
Define Mi = 1 if information on income of individual i is missing and 0
otherwise.
If the probability that Mi = 1
• is the same for all individuals ⇒ MCAR.
• depends only on Yi1 ⇒ MAR (given Yi1 is observed).
• depends on Yi2 ⇒ NMAR.
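The three cases of the age and income example can be simulated to see their effect on the observed incomes. This is only a sketch under assumed values: the population model, the logistic missingness probabilities and the helper names (`logistic`, `draw_missing`, `observed_mean`) are all invented for illustration.

```python
import math
import random

random.seed(1)
n = 20000

# Hypothetical population: age (Y1) is always observed, income (Y2) rises with age.
age = [random.uniform(20, 70) for _ in range(n)]
income = [20 + 0.5 * a + random.gauss(0, 5) for a in age]

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def draw_missing(probs):
    """Draw M_i ~ Bernoulli(p_i); M_i = 1 means income is missing."""
    return [1 if random.random() < p else 0 for p in probs]

# MCAR: the same probability for every individual.
m_mcar = draw_missing([0.3] * n)
# MAR: the probability depends only on the observed age.
m_mar = draw_missing([logistic((a - 45) / 10) for a in age])
# NMAR: the probability depends on the income value itself.
m_nmar = draw_missing([logistic((y - 42.5) / 5) for y in income])

def observed_mean(values, m):
    """Mean of the values whose indicator is 0 (i.e. observed)."""
    obs = [v for v, mi in zip(values, m) if mi == 0]
    return sum(obs) / len(obs)

true_mean = sum(income) / n
for name, m in [("MCAR", m_mcar), ("MAR", m_mar), ("NMAR", m_nmar)]:
    # Difference between the mean of the observed incomes and the full-data mean.
    print(name, round(observed_mean(income, m) - true_mean, 2))
```

In this set-up the mean of the observed incomes stays close to the full-data mean under MCAR but falls below it under MAR and NMAR, because older or higher-earning individuals are more likely to have a missing income.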
Before collecting the data
• In the design of the data collection, take care to avoid or minimise
missingness; missingness by design is acceptable.
• If missingness is unavoidable, then collect variables that are
predictive of missingness, and of the unobserved values. This makes
the MAR assumption more plausible.
• If there are missing values, try to understand how they arose, and
describe their frequency and patterns.
When there are missing data
Part of the descriptive analysis of the data set should always be an
informal investigation of the missingness patterns:
• how many missing values are there?
• are they clustered in certain variables?
• are there systematic differences with respect to observed variables
between cases with and without missing values?
Knowing what is predictive for missingness within the available data
can help understand the processes leading to missingness.
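A minimal sketch of such an informal investigation, assuming a small made-up data set of (age, income) records in which `None` marks a missing value:

```python
from collections import Counter

# Hypothetical records (age, income); None marks a missing value.
data = [
    (25, 30.0), (31, None), (38, 41.5), (44, None),
    (52, None), (58, 50.0), (63, None), (29, 33.0),
]

# How many missing values are there, per variable?
n_missing = [sum(row[j] is None for row in data) for j in range(2)]

# Which missingness patterns occur, and how often? (1 = missing)
patterns = Counter(tuple(int(v is None) for v in row) for row in data)

def mean(xs):
    return sum(xs) / len(xs)

# Do cases with and without a missing income differ in the observed variable age?
age_if_missing = mean([a for a, inc in data if inc is None])
age_if_observed = mean([a for a, inc in data if inc is not None])

print(n_missing, dict(patterns))
print(age_if_missing, age_if_observed)
```

Here the cases with a missing income are older on average, which would suggest that age is predictive of missingness in this toy example.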
How to deal with the missing values?
A naive approach is to delete the cases that have missing values and
only analyse the complete data set.
If the data are not MCAR the results can be seriously biased because the
complete cases are probably not a representative sample of the
population.
Even if the data are MCAR, case deletion can result in a large portion of
the data set being discarded, even if the proportion of missing values is
not that high.
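A small calculation illustrates the point. Assuming, purely for illustration, that each of p variables is missing independently with the same probability q, the expected fraction of complete cases is (1 − q)^p; `expected_complete_fraction` is a hypothetical helper name.

```python
def expected_complete_fraction(p, q):
    """Expected fraction of complete cases when each of p variables is
    missing independently with probability q (an assumption of this sketch)."""
    return (1.0 - q) ** p

for p in (5, 10, 20):
    print(p, round(expected_complete_fraction(p, 0.05), 3))
```

With 20 variables and only 5% of values missing per variable, roughly 36% of the cases are expected to be complete, so case deletion would discard almost two thirds of the data set.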
How to deal with the missing values?
A frequently used approach is to base inference on the likelihood
function for the incomplete data by treating M as a random variable
and specifying the joint distribution of M and Y .
Generally, closed form expressions for the maximum likelihood estimates
cannot be found and iterative processes are required.
The Expectation-Maximisation algorithm is usually the method of
choice.
If MAR holds (and the parameters of the missingness mechanism are distinct
from those of the data model), then the missingness mechanism does not
need to be explicitly modelled.
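As an illustration of the iterative idea, here is a sketch of EM for one simple special case: a bivariate normal in which y1 is fully observed and y2 is missing for some cases. The function name `em_bivariate_normal`, the simulated data and the missingness rule are assumptions of this sketch, and the missingness mechanism is ignored, which is only appropriate under MAR.

```python
import random

def em_bivariate_normal(y1, y2, n_iter=500):
    """EM for a bivariate normal where y1 is complete and y2 has None entries.
    Returns estimates (mu1, mu2, s11, s12, s22), ignoring the mechanism."""
    n = len(y1)
    mu1 = sum(y1) / n
    s11 = sum((a - mu1) ** 2 for a in y1) / n
    obs = [b for b in y2 if b is not None]
    # Starting values taken from the observed part of y2.
    mu2 = sum(obs) / len(obs)
    s22 = sum((b - mu2) ** 2 for b in obs) / len(obs)
    s12 = 0.0
    for _ in range(n_iter):
        # E-step: expected sufficient statistics given y1 and current parameters.
        beta = s12 / s11
        resid_var = s22 - s12 ** 2 / s11
        e_y2 = e_y2sq = e_y1y2 = 0.0
        for a, b in zip(y1, y2):
            if b is None:
                pred = mu2 + beta * (a - mu1)   # E[y2 | y1]
                e_y2 += pred
                e_y2sq += pred ** 2 + resid_var  # E[y2^2 | y1]
                e_y1y2 += a * pred
            else:
                e_y2 += b
                e_y2sq += b ** 2
                e_y1y2 += a * b
        # M-step: update the parameters from the expected statistics.
        mu2 = e_y2 / n
        s22 = e_y2sq / n - mu2 ** 2
        s12 = e_y1y2 / n - mu1 * mu2
    return mu1, mu2, s11, s12, s22

random.seed(0)
n = 500
y1 = [random.gauss(0, 1) for _ in range(n)]
y2_full = [1.0 + 0.8 * a + random.gauss(0, 0.6) for a in y1]
# MAR mechanism assumed for this sketch: y2 is missing whenever y1 > 0.3.
y2 = [None if a > 0.3 else b for a, b in zip(y1, y2_full)]

mu1, mu2, s11, s12, s22 = em_bivariate_normal(y1, y2)
print(round(mu2, 3))
```

In this particular pattern the maximum likelihood estimate of the mean of y2 also has a closed form (the complete-case mean of y2 adjusted by the complete-case regression slope), which gives a convenient check on the iterations. Closed forms of this kind are not available for general missingness patterns.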
How to deal with the missing values?
An attractive approach is to impute the missing values i.e. to fill them
in, but with what?
• In mean substitution the missing values are replaced by the
average of the observed values of that variable.
• In hot deck imputation the missing values of one or more variables
for a non-respondent (called the recipient) are replaced with observed
values from a respondent (the donor) that is similar to the
non-respondent with respect to characteristics observed for both
cases.
• In conditional mean imputation the missing values are predicted
using a model which has as a response the variable with missing
values and the rest of the variables as predictors (of course using
only the complete cases).
• In conditional distribution imputation the missing values are
replaced by random draws from the conditional distribution of the
variable to be imputed given the other variables.
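Three of these methods can be sketched on simulated data with one fully observed variable x and one partly missing variable y. Everything here is an illustrative assumption: the data-generating model, the MCAR mechanism, the linear regression used as the imputation model and the helper name `impute`. The conditional distribution draw is simplified in that it treats the fitted regression parameters as known.

```python
import random
import statistics

random.seed(2)
# Hypothetical complete data: x is always observed, y depends linearly on x.
x = [random.gauss(50, 10) for _ in range(400)]
y_full = [10 + 0.6 * xi + random.gauss(0, 4) for xi in x]
# Make about 40% of y missing completely at random (None marks missing).
y = [None if random.random() < 0.4 else yi for yi in y_full]

pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
xs, ys = zip(*pairs)

# Least-squares fit of y on x using the complete cases only.
x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
slope = sum((a - x_bar) * (b - y_bar) for a, b in pairs) / sum((a - x_bar) ** 2 for a in xs)
intercept = y_bar - slope * x_bar
# Rough residual standard deviation (n - 1 denominator, for simplicity).
resid_sd = statistics.stdev([b - (intercept + slope * a) for a, b in pairs])

def impute(method):
    """Return y with missing entries filled in by the chosen method."""
    out = []
    for xi, yi in zip(x, y):
        if yi is not None:
            out.append(yi)
        elif method == "mean":        # mean substitution
            out.append(y_bar)
        elif method == "cond_mean":   # conditional mean (regression) imputation
            out.append(intercept + slope * xi)
        else:                         # draw from the fitted conditional distribution
            out.append(random.gauss(intercept + slope * xi, resid_sd))
    return out

for method in ("mean", "cond_mean", "cond_dist"):
    print(method, round(statistics.stdev(impute(method)), 2))
print("complete data", round(statistics.stdev(y_full), 2))
```

Comparing the standard deviations shows one source of bias: mean substitution and, to a lesser extent, conditional mean imputation understate the spread of y, whereas draws from the conditional distribution roughly preserve it.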
How do these imputation methods perform?
Schafer and Graham report on their simulations:
“Mean substitution and the hot deck produce biased estimates for many
parameters under any type of missingness. Conditional mean imputation
performs slightly better but still may introduce bias. Imputing from a
conditional distribution is essentially unbiased under MCAR or MAR but
potentially biased under NMAR”
Conditional distribution imputation
Adds random variability to reflect the additional uncertainty caused by
the imputation.
For example, if Y has a multivariate normal distribution, then for each
case i we may substitute missing values by random draws from the
conditional normal distribution of the missing data, given the observed
data.
If MAR holds, then this will be a reasonable procedure.
The multivariate normality assumption is not very critical if the number
of missing values is not too high.
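For reference, the conditional distribution used for these draws is the standard result for a partitioned multivariate normal. The notation below is an assumption of this note rather than the slides' own: the subscripts o and m denote the observed and missing components of case i, with the mean vector μ and covariance matrix Σ partitioned accordingly.

```latex
Y_i = \begin{pmatrix} Y_{i,o} \\ Y_{i,m} \end{pmatrix}
\sim N\!\left(
\begin{pmatrix} \mu_o \\ \mu_m \end{pmatrix},
\begin{pmatrix} \Sigma_{oo} & \Sigma_{om} \\ \Sigma_{mo} & \Sigma_{mm} \end{pmatrix}
\right)
\quad\Longrightarrow\quad
Y_{i,m} \mid Y_{i,o} = y_o \;\sim\;
N\!\left( \mu_m + \Sigma_{mo}\Sigma_{oo}^{-1}(y_o - \mu_o),\;
\Sigma_{mm} - \Sigma_{mo}\Sigma_{oo}^{-1}\Sigma_{om} \right)
```

In practice μ and Σ are unknown and must be estimated, so draws based on plug-in estimates understate the uncertainty to some extent.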
Multiple imputation
For a given incomplete data set, the missing data are imputed
independently D times by drawing from the conditional distribution of
the missing data given the observed data.
This leads to D complete data sets, which differ only with respect to the
imputed values.
For each complete data set the desired analysis is executed; standard
errors of parameters are a combination of the within-data set standard
errors, and the variability of estimates between the data sets.
How are the data sets combined?
Suppose the parameter of interest is a scalar γ.
The estimate obtained for the parameter of interest from the dth data set
is $\hat{\gamma}_d$ and its standard error is $\sqrt{U_d}$.
The overall estimate is simply the average over the D data sets:
\[ \bar{\gamma} = \frac{1}{D} \sum_{i=1}^{D} \hat{\gamma}_i \]
The uncertainty in $\bar{\gamma}$ is:
\[ T_D = \bar{U}_D + (1 + D^{-1}) B_D \]
where $\bar{U}_D$ is the average within-imputation variance:
\[ \bar{U}_D = \frac{1}{D} \sum_{i=1}^{D} U_i \]
and $B_D$ is the between-imputations variance:
\[ B_D = \frac{1}{D-1} \sum_{i=1}^{D} (\hat{\gamma}_i - \bar{\gamma})^2 \]
See MissingData.R for an example.
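Independently of that R script, the combining rules of this section can be sketched in a few lines. The function name `pool_estimates` and the input numbers are made up for illustration; the inputs are the D estimates and their squared standard errors.

```python
import math

def pool_estimates(estimates, variances):
    """Combine D point estimates and their squared standard errors.
    Returns the pooled estimate and its pooled standard error."""
    D = len(estimates)
    gamma_bar = sum(estimates) / D
    u_bar = sum(variances) / D                                   # within-imputation variance
    b = sum((g - gamma_bar) ** 2 for g in estimates) / (D - 1)   # between-imputations variance
    t = u_bar + (1 + 1 / D) * b                                  # total variance
    return gamma_bar, math.sqrt(t)

# Made-up results from D = 3 imputed data sets.
gamma_bar, se = pool_estimates([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(gamma_bar, round(se, 3))
```

The pooled standard error is larger than the within-imputation standard error alone whenever the estimates differ between data sets, which is how the extra uncertainty due to imputation enters.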
What about NMAR data?
In this case the distribution of the missingness must be explicitly
specified.
• In selection models a distribution for the complete data is specified
first and then a distribution for the missingness is specified
conditional on that of the complete data.
• In pattern-mixture models individuals are classified by their
missingness and the observed data are fitted within each missing
group.
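The two approaches correspond to two factorisations of the joint distribution of Y and M. The parameter symbols φ, ξ and ω below are introduced here for illustration only; θ is the parameter of the missingness distribution as before.

```latex
% Selection model: complete data first, then missingness given the data
P(Y, M \mid \varphi, \theta) = P(Y \mid \varphi)\, P(M \mid Y, \theta)

% Pattern-mixture model: missingness pattern first, then data within each pattern
P(Y, M \mid \xi, \omega) = P(Y \mid M, \xi)\, P(M \mid \omega)
```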
R packages useful for inference with missing data
• Amelia II: Bootstrap EM imputation.
• cat: Missing data methods for categorical data.
• mi: Missing Data Imputation and Model Checking.
• mice: Multiple Imputation and generalized linear regression by
Chained Equations.
• mix: Missing data methods for mixed categorical and continuous
data.
• mlmmm: Estimation for mixed linear models with missing data.
• mvnmle: MLE for multivariate normal with missing data.
• norm: Estimation and imputation for multivariate normal data with
missing values.
• VIM: Visualization and Imputation of Missing Values.
There are many more!