Week 10, Hour 3:
- Imputation: an MCAR example
- Imputation and the amount of missingness
- The mice() function and the mice package

Some passages below are direct quotes from 'Flexible Imputation of Missing Data' by Stef van Buuren.

Recall from last time: missing, or incomplete, data is any information that is recorded for some observations, but not others.

Sometimes data is missing for reasons unrelated to the study. Often, this is a technology-based problem. Examples:
- Telescope surveys missing days due to cloud cover.
- 'A weighing scale that has run out of batteries.'
- Tracking something online when the Wi-Fi fails occasionally.

We usually assume any observation in the data is equally likely to be subject to these sorts of technical failures. 'Some of the data will be missing simply because of bad luck.'

This sort of missingness is called MCAR, Missing Completely At Random. It is the easiest to impute, because the distribution of the variable with missing information should be the same even with some of the data missing. What does 'the distribution should match' mean?

Distribution with MCAR: consider 100 randomly generated data points. (For those interested, the distribution is the exponential with mean 1.) About 40% of these data points are 0, another 41% are 1, another 10% are 2, about 5% each are 3 and 4, and a few are higher. These proportions are an estimate of the distribution of the data points.

If we remove 50 of these data points completely at random, the distribution stays the same, and the estimates stay close. After removal, 21/50 = 42% of the remaining data points are 0, 44% are 1, 10% are 2, none are 3, and 4% are 4.

Why does this happen? Because when we remove one data point, it has about a 40% chance of being in the '0' bin, about a 40% chance of being in the '1' bin, and so on. So, over the long run, about 40% of the removed data points are '0', about 40% are '1', and so on. In other words, the more common a value is, the more of those values will be missing.

Therefore, if we were to impute the missing values (that is, fill them in in such a way as to restore whatever we can about the original data), the imputed data should have the same distribution as the missing data.

Let's say we only had the half-missing data. How could we use this 'equal distribution' property to make a good set of replacement values? Since we know the existing values have a similar distribution to the missing ones, we will use that distribution to recreate the data. For each missing value, fill it in with a '0' with a 42% chance, the same as the proportion of observed values that are '0' to begin with. The chances for other values are found similarly.

Why 42% and not 40%? Assume that we were just given the data, with NAs in place of the real values. We wouldn't actually know the true 40% proportion to go by, so 42% is our best guess. Also, the rare case where the value was '5' doesn't show up in the data with missingness, so we have no real way to predict a 5. This is all part of the additional uncertainty that having missing data brings you.

You've dealt with uncertainty thus far. That's cause for celebration.
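To make the sampling idea concrete, here is a minimal R sketch of it: generate 100 values, knock out 50 completely at random, and impute each missing value by drawing from the observed values. The data are freshly simulated (and rounded to whole numbers so they fall into bins), so the exact counts won't match the 40/41/10 split quoted above, and the seed is arbitrary.

set.seed(302)                       # arbitrary seed, for reproducibility
x <- round(rexp(100, rate = 1))     # 100 exponential (mean 1) values, rounded to whole numbers

x_miss <- x
x_miss[sample(1:100, 50)] <- NA     # MCAR: remove 50 values completely at random

observed <- x_miss[!is.na(x_miss)]
prop.table(table(observed))         # distribution estimated from the observed half

# Fill each NA by drawing from the observed values (with replacement),
# which reproduces the observed proportions on average.
x_imp <- x_miss
x_imp[is.na(x_imp)] <- sample(observed, sum(is.na(x_imp)), replace = TRUE)

prop.table(table(x))                # complete-data distribution
prop.table(table(x_imp))            # distribution after imputation: should look similar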
Going back to the original 100 points, if we were to remove only 10 of the 100:
35/90 = 39% of the remaining data is 0,
37/90 = 41% is 1, and
9/90 = 10% is 2.

When fewer data points are removed, the data set with missingness more closely resembles the original, complete data. This is one of the big trends of imputation: having close-to-complete data produces good estimates of the complete data. And this should make intuitive sense: imputation works better when you have more to work with.

Consider the opposite extreme, where 80 of the 100 data points are removed. Now:
5/20 = 25% of the remaining data is 0,
10/20 = 50% is 1, and
2/20 = 10% is 2.

When we remove 80% of the data, the estimates of the complete-data distribution are worse than in the 50%-removed case. Typically, imputation is considered when less than 20% of the data is missing.

The quality of the imputation depends on both the proportion of data that is missing and the pattern, if any, to the missingness. In other words: imputation is only as reliable and valid as the data it draws from. It isn't a magic method that makes real information out of nothing.

Sometimes data is missing for reasons related to the study, but we have all the information needed to infer why the data is missing.

Scale example: 'A weighing scale that fails often when it is on a soft surface, but rarely on a hard surface.' In this case, the chance of not having the weight data isn't the same across all observations, but it is the same within each surface type.

This sort of missingness is not MCAR, because of the unequal chances of observations being missing. However, it is Missing At Random (MAR), so we can still impute, but more care is needed. Because we can assume an equal chance of missingness within the observations on a given surface, we will use two distributions to impute:
1. The distribution of weights on hard surfaces.
2. The distribution of weights on soft surfaces.

Missing values on hard surfaces would be filled in using the first of these distributions, and missing values on soft surfaces using the second. (This approach is sketched in code below, after the regression discussion.)

Medical example: values for an 'exercising heart rate' variable would be missing for patients at a hospital who are too sick to exercise. If we want to impute the missing heart rates, we would need similar cases to draw from. Unfortunately, the similar cases are exactly the other patients who are too sick to exercise, and their heart rates are missing as well. Since the chance of missingness relies on information that is itself missing, we call this pattern NMAR, Not Missing At Random.

The problem with NMAR data is that we can't tell if the missing data has a different distribution from the observed data. In this medical example, the exercising heart rates of those who are too sick to exercise would probably be a lot higher than those of patients who are not. However, we don't have any way of knowing how much higher. If we were to use the distribution of heart rates that we do have as a basis for imputation, we would underestimate the heart rates of those who are too sick.

Irritating, but there's nothing satisfactory about NMAR data.

Using the group distributions (like in the scale example) is one way to impute MAR data, but it's by no means the only way. For MAR and MCAR data, we can also impute by regression:
1. Using the complete observations, build a regression model with the variable to be imputed as the response.
2. Take the model's predictions for the incomplete observations as the imputed values.

This is like using the group (e.g. hard and soft surface) distributions, except with a more flexible and usable definition of 'similar'.
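Here is a minimal R sketch of the group-distribution approach from the scale example. Everything here is invented for illustration: the data frame, the column names (surface, weight), the weight distributions, and the missingness rates.

set.seed(302)
scale_data <- data.frame(
  surface = rep(c("hard", "soft"), each = 50),
  weight  = c(rnorm(50, mean = 70, sd = 10),    # weighings on hard surfaces
              rnorm(50, mean = 75, sd = 10))    # weighings on soft surfaces
)

# MAR missingness: readings fail far more often on soft surfaces.
miss_prob <- ifelse(scale_data$surface == "soft", 0.40, 0.05)
scale_data$weight[runif(100) < miss_prob] <- NA

# Impute each missing weight by sampling from the OBSERVED weights
# on the SAME surface type.
for (s in c("hard", "soft")) {
  in_group <- scale_data$surface == s
  observed <- scale_data$weight[in_group & !is.na(scale_data$weight)]
  gaps     <- in_group & is.na(scale_data$weight)
  scale_data$weight[gaps] <- sample(observed, sum(gaps), replace = TRUE)
}

Sampling within each surface type keeps each group's distribution intact, instead of dragging the soft-surface gaps toward the hard-surface weights.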
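Here is a minimal R sketch of the regression approach, again on invented data (the variables height and weight and the relationship between them are made up). The point is just the two-step recipe: fit on the complete observations, then predict for the incomplete ones.

set.seed(302)
n <- 100
dat <- data.frame(height = rnorm(n, mean = 170, sd = 10))
dat$weight <- 0.9 * dat$height - 80 + rnorm(n, sd = 5)   # a made-up relationship
dat$weight[sample(1:n, 20)] <- NA                        # 20% of the weights go missing

# Step 1: using the complete observations, fit a regression with the
# variable to be imputed (weight) as the response.
# (lm() drops rows with a missing response by default.)
fit <- lm(weight ~ height, data = dat)

# Step 2: take the model's predictions for the incomplete observations
# as the imputed values.
gaps <- is.na(dat$weight)
dat$weight[gaps] <- predict(fit, newdata = dat[gaps, ])

Here 'similar' means 'has a similar predicted value from the model', which is the more flexible definition mentioned above.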
Consider the dataset airquality and the imputation function mice() found in the package 'mice'. 'mice' stands for 'Multiple Imputation by Chained Equations'. The 'Multiple' part is not covered here, but 'Chained' reveals something very clever: the mice() function uses the observed data to impute one variable, then it uses the observed data AND those imputations to impute the NEXT variable with missing data, and so on. It literally chains imputations together.

Imputed values have their own uncertainty. We won't cover it here, but there is an extension of imputation called 'multiple imputation' that handles this uncertainty.

We can use the mice() function to impute all the missing data in the airquality dataset. First, look at the raw data, e.g. with head(airquality, 10): 3 of the first 10 observations are made unusable by missing data (the NAs are in the Ozone and Solar.R variables). After running the mice() function and then the complete() function on its output, the NAs are filled in with imputed values. We can also compare the summary() of each variable's distribution before and after imputation. (A minimal code sketch of this workflow is given at the end of these notes.)

Who knew mice could be so capable?

One final comment: sometimes data is missing on purpose, usually because the responses have no meaningful value, or can be inferred.

Example 1: Asking someone who doesn't drink coffee for their favourite brand of coffee. 'Favourite brand' doesn't mean anything here. Impute 'no preference', or just ignore it.

Example 2: Asking a male respondent the number of times they have been pregnant. 'Times pregnant' can be imputed to be zero.

This sort of missingness is not a concern. (Technically, it's MAR.)

Take-home messages about missing data and imputation:
- Missing data is a problem because it makes the other variables for an observation harder or impossible to use.
- Missing data can also introduce biases if the observations with missing values are somehow different from the complete observations.
- Imputation methods fill in missing data with reasonable replacements.
- Imputation works by taking information from similar observations.
- Imputation becomes less reliable as the proportion of data that's missing increases.
- NMAR data is especially prone to these biases, because the chance of missingness depends on values that are themselves missing, so we can't tell how different the incomplete observations really are.
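As promised above, here is a minimal sketch of the mice() workflow on airquality. The specific argument values (m = 5, seed = 302, printFlag = FALSE) are just illustrative choices, and complete() returns a single filled-in copy of the data (nothing 'multiple' is used here).

# install.packages("mice")            # run once if the package isn't installed
library(mice)

head(airquality, 10)                  # rows 5, 6 and 10 are incomplete (NAs in Ozone and Solar.R)

imp <- mice(airquality, m = 5, seed = 302, printFlag = FALSE)
air_complete <- complete(imp)         # one filled-in version of the dataset

head(air_complete, 10)                # no NAs remain
summary(airquality)                   # distributions BEFORE imputation
summary(air_complete)                 # distributions AFTER imputation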