Week 10, Hour 3: Imputation, an MCAR Example

Week 10 Hour 3:
- Imputation, an MCAR Example
- Imputation and the amount of missingness
- The mice() function and the mice package
Italics denote direct quotes from 'Flexible Imputation of Missing
Data' by Stef van Buuren.
Recall from last time.
Missing, or incomplete, data is any information that is recorded
for some observations, but not others.
Sometimes data is missing for reasons unrelated to the study.
Often, this is a technology-based problem.
Examples:
Telescope surveys missing days due to cloud cover.
'A weighing scale that has run out of batteries'
Tracking something online when the Wi-Fi fails occasionally.
We usually assume any observation in the data is equally likely
to be subject to these sorts of technical failures. 'Some of the
data will be missing simply because of bad luck.'
This sort of missingness is called MCAR, Missing Completely At
Random
This is the easiest to impute, because the distribution of the
variable with missing info should be the same even with some
of the data missing.
What does 'the distribution should match' mean?
Distribution with MCAR:
Consider 100 randomly generated data points.
(For those interested, this distribution is the exponential with mean 1)
40% of these data points are 0,
another 41% are 1,
another 10% are 2,
about another 5% are 3 and 4 each,
and a few are higher....
These proportions are an estimate of the distribution of the data points.
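As a minimal R sketch of how such a sample could be generated and its
distribution estimated (assuming the values are rounded to the nearest
whole number; the seed is an arbitrary choice, so the exact counts will
differ slightly from those above):

# Generate 100 values from an exponential distribution with mean 1,
# rounded to the nearest whole number.
set.seed(302)                    # arbitrary seed, for reproducibility only
x <- round(rexp(100, rate = 1))

# Estimate the distribution by the proportion of points taking each value.
table(x) / length(x)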
If we remove 50 of these data points completely at random,
this distribution stays the same, and the estimates stay close.
After removal,
21/50 = 42% of the remaining data points are 0, 44% are 1,
10% are 2, none are 3, and 4% are 4.
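A small sketch of this removal step, continuing the same idea (again,
the exact proportions depend on the seed):

set.seed(302)
x <- round(rexp(100, rate = 1))            # the 'complete' data

keep <- sample(1:100, size = 50)           # keep 50 points completely at random
x_observed <- x[keep]

table(x) / length(x)                       # distribution of the complete data
table(x_observed) / length(x_observed)     # distribution of the observed half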
Why does this happen?
Because when we remove one data point, it has about a...
...40% chance of being in the '0' bin,
...40% chance of being in the '1' bin, and so on...
So, over the long run, about
40% of the removed data points are '0',
40% are '1', and so on...
In other words, the more common a value is, the more of those
values will be missing.
Therefore, if we were to impute the missing values (that is, fill
them in in such a way as to restore whatever we can about the
original data), the imputed data should have the same
distribution as the missing data.
Let's say we only had the half-missing data set.
How could we use this 'equal distribution' property to make a
good set of replacement values?
Since we know the existing values have a similar distribution to
the missing ones, we will use that distribution to recreate the
data.
For each missing value,
fill it in with a '0' with a 42% chance, the same as the
proportion of observed values that are '0' to begin with.
The chances of the other values are found similarly.
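A minimal sketch of this 'draw from the observed distribution' idea:
sampling the replacements, with replacement, from the observed values
automatically matches the observed proportions (about a 42% chance of
imputing a '0' in the example above).

# Make a vector with 50 observed values and 50 values missing at random.
set.seed(302)
x <- round(rexp(100, rate = 1))
x[sample(1:100, size = 50)] <- NA

# Fill each missing value with a random draw from the observed values.
missing <- is.na(x)
x[missing] <- sample(x[!missing], size = sum(missing), replace = TRUE)

table(x) / length(x)    # imputed data has roughly the observed distribution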
Why 42% and not 40%?
Assume that we were just given the data with all the NAs in
place of the real values.
We wouldn't actually know the original 40% proportion to go by,
so the observed 42% is our best guess.
Also, the rare case where the value was '5' doesn't show up in
the data with missingness, so we have no real way to impute
a 5.
This is all part of the additional uncertainty that having missing
data brings you.
You've dealt with uncertainty thus far. That's cause for
celebration.
Going back to these original 100 points, if we were to remove
only 10 of those 100...
35/90 = 39% of the remaining data is 0
37/90 = 41% is 1, and
9/90 = 10% is 2.
When fewer data points are removed, the data set with
missingness more closely resembles the original, complete
data.
This is one of the big trends of imputation:
having close-to-complete data produces good estimates of the
complete data.
And this should make intuitive sense:
Imputation works better if you have more to work with.
Consider the opposite extreme, where 80 of the 100 data points
are removed.
Now 5/20 = 25% of the remaining data is 0
10/20 = 50% is 1, and
2/20 = 10% is 2.
When we remove 80% of the data, the estimates of the
complete-data distribution are worse than the 50%-removed
case.
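A small simulation sketch of this trend, tracking only the estimated
proportion of zeroes for illustration (the seed and the removal amounts
are arbitrary choices):

set.seed(302)
x <- round(rexp(100, rate = 1))
p0_complete <- mean(x == 0)                # complete-data estimate of P(0)

for (n_missing in c(10, 50, 80)) {
  observed <- x[-sample(1:100, size = n_missing)]
  cat(n_missing, "missing: estimated P(0) =", mean(observed == 0),
      "vs complete-data estimate", p0_complete, "\n")
}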
Typically, imputation is considered when less than 20% of the
data is missing. The quality of the imputation depends on both
the proportion of data that is missing, and the pattern, if any,
to the missingness.
In other words: Imputation is only as reliable and valid as the
data it draws from. It isn't a magic method that makes real
information out of nothing.
Sometimes data is missing for reasons related to the study, but
we have all the information to infer why the data is missing.
Scale Example:
'A weighing scale that fails often when it is on a soft surface,
but rarely on a hard surface'
In this case, the chance of not having the weight data isn't the
same across all observations...
...but it is the same within each surface type.
This sort of missingness is not MCAR because of the unequal
chances of observations being missing.
However, it is Missing At Random (MAR), so we can still
impute, but more care is needed.
Because we can assume an equal chance of missingness within
the observations on each surface type, we will use two
distributions to impute...
1. The distribution of weights on hard surfaces.
2. The distribution of weights on soft surfaces.
Missing values on hard surfaces would be filled according to
the first of these two distributions.
Missing values on soft surfaces would be filled according to the
second.
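A minimal sketch of this two-distribution approach, using hypothetical
'surface' and 'weight' variables (the data values, the failure rates,
and the variable names are all made up for illustration):

# Hypothetical data: weight is missing more often on soft surfaces (MAR).
set.seed(302)
surface <- rep(c("hard", "soft"), each = 50)
weight  <- round(rnorm(100, mean = 70, sd = 10), 1)
weight[surface == "soft" & runif(100) < 0.4] <- NA   # frequent failures on soft
weight[surface == "hard" & runif(100) < 0.1] <- NA   # rare failures on hard

# Impute within each surface type, using that group's observed distribution.
for (s in c("hard", "soft")) {
  missing <- is.na(weight) & surface == s
  donors  <- weight[!is.na(weight) & surface == s]
  weight[missing] <- sample(donors, size = sum(missing), replace = TRUE)
}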
Medical example:
Values for an 'exercising heart rate' variable would be missing
for patients at a hospital who are too sick to exercise.
If we want to impute the data for the missing heart rates, we
would need similar cases to draw from. Unfortunately, all of
those similar cases are the heart problem cases, which are
missing this value as well.
Since the chance of missingness relies on information that is
itself missing, we call this pattern NMAR, Not Missing At
Random
The problem with NMAR data is that we can't tell if the missing
data has a different distribution from the observed data.
In this medical example, the exercising heart rates for those
that are too sick to exercise would probably be a lot higher
than those not too sick. However, we don't have any way of
knowing how much higher.
If we were to use the distribution of heart rates that we have
as a basis for imputation, we would underestimate the heart
rates of those that are too sick.
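A small simulation sketch of that bias; every number here is made up
purely for illustration. The sickest patients have the highest true
exercising heart rates, and exactly those values go missing, so the
mean of the observed values understates the true mean:

set.seed(302)
too_sick   <- runif(500) < 0.3                   # 30% of patients too sick to exercise
heart_rate <- rnorm(500, mean = 120, sd = 10) +
              25 * too_sick                      # true rates are higher for the sickest

observed <- heart_rate
observed[too_sick] <- NA                         # NMAR: missing exactly when too sick

mean(heart_rate)                # true mean (unknowable in practice)
mean(observed, na.rm = TRUE)    # observed mean, noticeably too low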
Irritating, but there is no fully satisfactory way to deal with NMAR data.
Using the group distributions (like in the scale example) is one
way to impute MAR data, but it's by no means the only way.
For MAR and MCAR data, we can impute by regression.
1. Using the complete observations, make a regression model
that uses the variable to be imputed as a response.
2. Take the predictions from the model as the imputed values
for that dataset.
This is like using the group (e.g. hard and soft surface)
distributions except with a more flexible and usable definition
of 'similar'.
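A minimal sketch of regression imputation, using R's built-in airquality
data (which also appears below); the choice of Ozone as the response and
of Wind and Temp as predictors is just an illustration, not a
recommended model:

data(airquality)

# 1. Using the complete observations, fit a regression with the variable
#    to be imputed (Ozone) as the response. Wind and Temp have no missing
#    values in airquality, so they are safe predictors here.
fit <- lm(Ozone ~ Wind + Temp, data = airquality)

# 2. Take the model's predictions as the imputed values.
missing <- is.na(airquality$Ozone)
airquality$Ozone[missing] <- predict(fit, newdata = airquality[missing, ])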
Consider the dataset airquality and the imputation method
mice() found in the package 'mice'.
'mice' stands for 'Multivariate Imputation by Chained Equations'.
The multiple-imputation side of mice is not covered here, but
'Chained' reveals something very clever:
The mice() function uses the observed data to impute for one
variable, then it uses the data AND those imputations to
impute the NEXT variable with missing data, and so on.
It literally chains together imputations.
Imputed values have their own uncertainty. We won't cover it
here, but there is an extension of imputation called 'multiple
imputation' that handles this uncertainty.
We can use the mice() function to impute all the missing data
in the airquality dataset.
First, as it is raw:
3 of the first 10 observations are made unusable by missing data.
Now consider the same data after using the mice() function and
then the complete() function on it.
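A sketch of those two steps, assuming the mice package is installed;
the settings m, seed, and printFlag are just illustrative choices:

library(mice)

head(airquality, 10)          # raw data: NAs in Ozone and Solar.R

imp <- mice(airquality, m = 5, seed = 302, printFlag = FALSE)
completed <- complete(imp, 1) # extract the first completed data set

head(completed, 10)           # the same rows, now with no NAs

summary(airquality)           # distributions BEFORE imputation
summary(completed)            # distributions AFTER imputation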
Finally, we can compare summaries of each variable's distribution
BEFORE imputation and AFTER imputation to check that the
imputed values have not distorted the overall distributions.
Who knew mice could be so capable?
One final comment:
Sometimes data is missing on purpose, usually because the
responses have no meaningful value, or can be inferred.
Example 1: Asking someone who doesn't drink coffee their
favourite brand of coffee.
'Favourite brand' doesn't mean anything. Impute 'no
preference', or just ignore.
Example 2: Asking a male about the number of times they have
been pregnant.
'Times pregnant' can be imputed to be zero.
This sort of missingness is not a concern. (Technically, it's MAR)
Take home messages about missing data and imputation:
- Missing data is a problem because it makes other variables for
an observation harder or impossible to use.
- Missing data can also introduce biases if the observations
with missing values are somehow different from the complete
observations.
- Imputation methods fill in missing data with reasonable
replacements.
- Imputation works by taking information from similar
observations.
- Imputation becomes less reliable as the proportion of data
that's missing increases.
- NMAR data can introduce biases if the observations with
missing values are somehow different from the complete
observations.