STAT 3130 Statistical Methods II
Missing Data and Imputation

STAT3130 – Missing Data

“Real” data rarely comes to you without any missing values, and “missing” can take several forms:

1. Truly “missing” – no value is present;
2. Coded – there is a “value”, but it means something different from the scale of the data;
3. Miscoded – there is a “value”, but it is wrong.

Let's take each one individually…

1) Truly Missing Data

Consider the GSS08 dataset. You will see missing data in both character and numeric variables. SAS identifies a missing value for a character variable as a blank and a missing value for a numeric variable as a “.”. This is the most obvious form of missing data.

You can check for truly missing data using the following SAS code:

    /* Count the missing values of a numeric variable */
    Proc Means data = data nmiss;
      Var var;
    Run;

    /* Tabulate a variable (character or numeric); the MISSING option
       lists missing values as their own row in the table */
    Proc Freq data = data;
      Tables var / missing;
    Run;

2) Coded Data

When data is entered into a database, values that were missing, incorrect, illegible, etc. are frequently coded at the time of entry. These codes are typically (but not always) provided to you in a data dictionary.

Coded values are sometimes easy to spot (the codes are character when the rest of the data is numeric) and sometimes hard to spot (the codes are numeric but fall outside the “true” range of the data).

Consider the GSS08 dataset again. Take a look at the age variable – there are coded values there. What are they, and how would you know?

3) Miscoded Data

Humans make mistakes, and sometimes, in weird ways, computers make mistakes too. Data that is entered incorrectly can really mess things up when you are trying to run a model or a test.
Consider the age variable again in the GSS08 dataset…

With all of these issues, you also need to determine how the data is missing:

1) Completely at random – also called MCAR. This means that the missing values have no pattern. In other words, the missingness cannot be predicted in any way.
2) Missing at random – also called MAR. This means that the missingness can be predicted using the other data available for an observation. In these instances, when the variable is categorical, you may want to assign a category of “MISSING” so that these observations are identified differently.
3) Missing that depends upon latent variables – also called MNAR (missing not at random). For example, there could be a latent (unobserved) variable which is highly correlated with the missingness. A familiar example from medical studies: if a particular treatment causes discomfort, a patient is more likely to drop out of the study. This “missingness” is not at random (unless “discomfort” is measured and observed for all patients).

In all instances, the data values need to be replaced, or “imputed”, with a logical, meaningful value.

Before we discuss the strategies for imputation, let's make a quick point regarding why this needs to be done…

By default, most modeling procedures in analytical software packages – including SAS – require a “complete case” for an observation to be included in the analysis. This means that if there are 100 variables and an observation is missing just ONE value, the entire case is removed from the analysis, and you lose the other 99 perfectly good values.

Think about this…if you have 1,000,000 observations and 50 variables, and each variable is missing only 1% of its values (independently of the other variables), you could lose roughly 395,000 observations when you go to model:

    observations lost = total observations – ((1 – percent missing)^variables × total observations)
                      = 1,000,000 – ((1 – 0.01)^50 × 1,000,000)
                      = 394,994

That is A LOT of valid data that you would lose! And it could bias your results.
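
The arithmetic above can be checked with a short DATA step. This is a minimal sketch, not part of the original slides: the variable names (n_obs, p_missing, n_vars) are made up for illustration, and it assumes every variable is missing at the same rate, independently of the others.

```sas
/* Sketch: rows expected to survive complete-case analysis, assuming each
   variable is missing independently at the same rate (an assumption). */
data _null_;
  n_obs     = 1000000;  /* total observations                  */
  p_missing = 0.01;     /* missing rate for each variable      */
  n_vars    = 50;       /* variables that enter the model      */

  /* A row survives only if all n_vars values are present */
  n_complete = n_obs * (1 - p_missing)**n_vars;

  /* Rows dropped because at least one value is missing */
  n_lost = n_obs - n_complete;

  /* Write both figures to the SAS log */
  put n_complete= comma12. n_lost= comma12.;
run;
```

With these inputs the log should reproduce the 394,994 figure. Changing p_missing or n_vars shows how quickly the loss grows as more variables enter the model.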
STAT3130 – Imputation

We need a way to replace those values logically. Many options for imputation exist. Here are four of the primary methods:

1. Mean-based imputation
2. Median-based imputation
3. Stratified imputation
4. Regressed imputation – difficult with MCAR

Each of these will be discussed briefly in turn.

Imputation Strategies:

1) Mean-Based Imputation – this process is the simplest. It involves replacing the missing values with the mean of the variable.

But…before you do this, think through these questions:
a. How would the distribution of the variable affect, and be affected by, this imputation decision?
b. What happens to the mean of the variable?
c. What happens to the standard deviation of the variable?
d. How might the results be biased?

2) Median-Based Imputation – this process is also very simple. It involves replacing the missing values with the median of the variable.

But…before you do this, think through these questions:
a. How would the distribution of the variable affect, and be affected by, this imputation decision?
b. What happens to the mean of the variable?
c. What happens to the standard deviation of the variable?
d. How might the results be biased?

3) Stratified Imputation – this process is slightly more involved. It involves replacing the missing values with the mean or median of the variable, computed within a stratum of similar observations rather than over the whole dataset.

But…before you do this, think through these questions:
a. How would the distribution of the variable affect, and be affected by, this imputation decision?
b. What happens to the mean of the variable?
c. What happens to the standard deviation of the variable?
d. How might the results be biased?

4) Regressed Imputation – this process involves actually predicting the missing values using regression.
It works well if the variables are related to each other and only one or two variables have missing data.

But…before you do this, think through these questions:
a. How would the distribution of the variable affect, and be affected by, this imputation decision?
b. What happens to the mean of the variable?
c. What happens to the standard deviation of the variable?
d. How might the results be biased?
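
The four strategies could be sketched in SAS roughly as follows. This is one possible approach, not code from the original slides. Every dataset and variable name (work.have, income, age, educ, region) is hypothetical, and the sketch assumes SAS/STAT is available for PROC STDIZE and PROC REG.

```sas
/* Hypothetical names throughout: work.have, income, age, educ, region. */

/* 1) Mean-based: REPONLY replaces only the missing values and leaves
      the observed values untouched (no standardization). */
proc stdize data=work.have out=work.imp_mean method=mean reponly;
  var income;
run;

/* 2) Median-based: same idea, with the median as the fill value. */
proc stdize data=work.have out=work.imp_median method=median reponly;
  var income;
run;

/* 3) Stratified: fill with the mean of the observation's own stratum.
      BY-group processing requires the data to be sorted first. */
proc sort data=work.have out=work.sorted;
  by region;
run;

proc stdize data=work.sorted out=work.imp_strat method=mean reponly;
  by region;
  var income;
run;

/* 4) Regressed: the model is fit on rows where income, age, and educ are
      all present; rows missing only income still get a predicted value. */
proc reg data=work.have noprint;
  model income = age educ;
  output out=work.scored p=income_hat;
run;
quit;

data work.imp_reg;
  set work.scored;
  income_imputed = missing(income);            /* 1 = value was imputed */
  if income_imputed then income = income_hat;  /* fill with the prediction */
  drop income_hat;
run;
```

Running Proc Means with the n, nmiss, mean, and std statistics on income before and after each step is a direct way to explore questions b and c above. The income_imputed flag in the last step keeps imputed rows identifiable in later analysis.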