Statistical Methods Principles
Missing Data
Dr Eleni Matechou
matechou@stats.ox.ac.uk

References:
• R.J.A. Little and D.B. Rubin, “Statistical Analysis with Missing Data”, 2nd edition
• J.L. Schafer and J.W. Graham (2002), “Missing Data: Our View of the State of the Art”

Missing data indicator matrix

Usually, the data set consists of a matrix Y with n rows and p columns. A row traditionally corresponds to a case and a column to a variable, i.e. y_ij is the value of variable j for individual i.

With real data sets, it is not uncommon for an entry to be missing. Denote by M the missing data indicator matrix, with entry M_ij equal to 1 if observation y_ij is missing and 0 otherwise.

Missing data mechanisms

Question: Are the missing values related to the underlying values of the variables in the data set?

The data can be:
• missing completely at random (MCAR)
• missing at random (MAR)
• not missing at random (NMAR)

Denote by P(M | Y, θ) the conditional distribution of M given Y, where θ are unknown parameters.

MCAR

If missingness does not depend on the values of the data set, observed or unobserved, then
P(M | Y, θ) = P(M | θ)
and the data are MCAR.

Example: n individuals had their blood pressure measured and a random sample of size n' < n also had their weight measured.

MAR

If missingness does not depend on the unobserved values of the data set but does depend on the observed values, then
P(M | Y, θ) = P(M | Y_obs, θ)
and the data are MAR.

Example: n individuals had their blood pressure measured and only those individuals with high blood pressure also had their weight measured.

NMAR

If missingness depends on the unobserved values of the data set, so that P(M | Y, θ) cannot be reduced to P(M | Y_obs, θ), for instance
P(M | Y, θ) = P(M | Y_miss, θ)
then the data are NMAR.

Example: n individuals had their blood pressure measured but only overweight individuals also had their weight measured.

Example

Suppose Y_i1 is the age and Y_i2 is the income of individual i. Define M_i = 1 if information on the income of individual i is missing and 0 otherwise.
If the probability that M_i = 1
• is the same for all individuals ⇒ MCAR.
• depends on Y_i1 ⇒ MAR (given that Y_i1 is observed).
• depends on Y_i2 ⇒ NMAR.

Before collecting the data

• In the design of the data collection, take care that missingness is avoided or minimized; missingness by design, however, is allowed.
• If missingness is unavoidable, collect variables that are predictive of missingness and of the unobserved values. This makes the MAR assumption more plausible.
• If there are missing values, try to understand how they arose, and describe their frequency and patterns.

When there are missing data

Part of the descriptive analysis of the data set should always be an informal investigation of the missingness patterns:
• how many missing values are there?
• are they clustered in certain variables?
• are there systematic differences with respect to observed variables between cases with and without missing values?

Knowing what is predictive of missingness within the available data can help in understanding the processes leading to missingness.

How to deal with the missing values?

A naive approach is to delete the cases that have missing values and analyse only the complete cases. If the data are not MCAR, the results can be seriously biased because the complete cases are probably not a representative sample of the population. Even if the data are MCAR, case deletion can discard a large portion of the data set even when the proportion of missing values is not that high.

A frequently used approach is to base inference on the likelihood function for the incomplete data by treating M as a random variable and specifying the joint distribution of M and Y. Generally, closed-form expressions for the maximum likelihood estimates cannot be found and iterative procedures are required. The Expectation-Maximisation (EM) algorithm is usually the method of choice.
If MAR holds, the missingness mechanism does not need to be explicitly modelled.

An attractive approach is to impute the missing values, i.e. to fill them in, but with what?
• In mean substitution, the missing values are replaced by the average of the observed values.
• In hot deck imputation, the missing values of one or more variables for a non-respondent (called the recipient) are replaced with observed values from a respondent (the donor) that is similar to the non-respondent with respect to characteristics observed for both cases.
• In conditional mean imputation, the missing values are predicted using a model that has the variable with missing values as the response and the remaining variables as predictors (fitted, of course, using only the complete cases).
• In conditional distribution imputation, the missing values are replaced by random draws from the conditional distribution of the variable to be imputed given the other variables.

How do these imputation methods perform?

Schafer and Graham report on their simulations: “Mean substitution and the hot deck produce biased estimates for many parameters under any type of missingness. Conditional mean imputation performs slightly better but still may introduce bias. Imputing from a conditional distribution is essentially unbiased under MCAR or MAR but potentially biased under NMAR”.

Conditional distribution imputation

Conditional distribution imputation adds random variability to reflect the additional uncertainty caused by the imputation.

For example, if Y has a multivariate normal distribution, then for each case i we may substitute the missing values by random draws from the conditional normal distribution of the missing data, given the observed data.

If MAR holds, this is a reasonable procedure. The multivariate normality assumption is not very critical if the number of missing values is not too high.
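
The contrast between conditional mean and conditional distribution imputation can be sketched in a few lines of Python (standard library only). This is a minimal illustration under assumptions that are not part of the original notes: two jointly normal variables, the first fully observed, a MAR mechanism chosen arbitrarily, and regression parameters estimated from the complete cases and then treated as fixed (a full procedure would also reflect the uncertainty in those estimates). The helper names `fit_line` and `impute` are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, (rss / (len(xs) - 2)) ** 0.5


def impute(y1, y2, rng, draw=True):
    """Fill the None entries of y2 using the fitted regression of y2 on y1.

    draw=True  -> conditional distribution imputation (mean plus random noise)
    draw=False -> conditional mean imputation (fitted mean only)
    """
    complete = [(u, v) for u, v in zip(y1, y2) if v is not None]
    a, b, sd = fit_line([u for u, _ in complete], [v for _, v in complete])
    filled = []
    for u, v in zip(y1, y2):
        if v is None:
            v = a + b * u + (rng.gauss(0.0, sd) if draw else 0.0)
        filled.append(v)
    return filled


rng = random.Random(1)
n = 2000
y1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
y2_full = [0.5 * u + rng.gauss(0.0, 1.0) for u in y1]

# Assumed MAR mechanism: the chance that y2 is missing depends only on the observed y1.
y2 = [None if rng.random() < (0.7 if u > 0 else 0.1) else v
      for u, v in zip(y1, y2_full)]

mean_imp = impute(y1, y2, rng, draw=False)
dist_imp = impute(y1, y2, rng, draw=True)

print("sd of y2 before deletion:            ", statistics.stdev(y2_full))
print("sd after conditional mean imputation:", statistics.stdev(mean_imp))
print("sd after conditional dist imputation:", statistics.stdev(dist_imp))
```

In this sketch the conditional mean imputation places every imputed value exactly on the fitted line, so the spread of the completed variable is understated, whereas adding the random draw restores variability of roughly the right size.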
Multiple imputation

For a given incomplete data set, the missing data are imputed independently D times by drawing from the conditional distribution of the missing data given the observed data. This leads to D complete data sets that differ only with respect to the imputed values.

For each complete data set the desired analysis is carried out; standard errors of parameters combine the within-data-set standard errors and the variability of the estimates between the data sets.

How are the data sets combined?

Suppose the parameter of interest is a scalar γ. The estimate obtained for the parameter of interest from the dth data set is γ̂_d and its standard error is √U_d.

The overall estimate is simply the average over the D data sets:
\[ \bar{\gamma} = \frac{1}{D} \sum_{d=1}^{D} \hat{\gamma}_d \]

The uncertainty in γ̄ is measured by the total variance
\[ T_D = \bar{U}_D + (1 + D^{-1}) B_D \]
where Ū_D is the average within-imputation variance
\[ \bar{U}_D = \frac{1}{D} \sum_{d=1}^{D} U_d \]
and B_D is the between-imputations variance
\[ B_D = \frac{1}{D-1} \sum_{d=1}^{D} (\hat{\gamma}_d - \bar{\gamma})^2 \]

See MissingData.R for an example.

What about NMAR data?

In this case the distribution of the missingness must be explicitly specified.
• In selection models, a distribution for the complete data is specified first and then a distribution for the missingness is specified conditional on the complete data.
• In pattern-mixture models, individuals are classified by their missingness pattern and a model is fitted to the observed data within each pattern.

R packages useful for inference with missing data

• Amelia II: Bootstrap EM imputation.
• cat: Missing data methods for categorical data.
• mi: Missing Data Imputation and Model Checking.
• mice: Multiple Imputation and generalized linear regression by Chained Equations.
• mix: Missing data methods for mixed categorical and continuous data.
• mlmmm: Estimation for mixed linear models with missing data.
• mvnmle: MLE for multivariate normal with missing data.
• norm: Estimation and imputation for multivariate normal data with missings.
• VIM: Visualization and Imputation of Missing Values.

There are many more!
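
As a small worked illustration of the combining rules for multiple imputation given earlier (overall estimate γ̄, within-imputation variance Ū_D, between-imputations variance B_D and total variance T_D), the following Python sketch pools D estimates. It is not the content of MissingData.R and does not use any of the packages listed above; the function name `pool` and the input numbers are made up purely for illustration.

```python
import statistics


def pool(estimates, variances):
    """Combine D point estimates and their variances U_d from D imputed data sets.

    Returns (gamma_bar, T) where
      gamma_bar = average of the D estimates,
      T         = U_bar + (1 + 1/D) * B,
      U_bar     = average within-imputation variance,
      B         = between-imputations variance (divisor D - 1).
    """
    D = len(estimates)
    gamma_bar = statistics.fmean(estimates)
    U_bar = statistics.fmean(variances)
    B = statistics.variance(estimates)  # sample variance, divides by D - 1
    T = U_bar + (1 + 1 / D) * B
    return gamma_bar, T


# Hypothetical results from D = 5 imputed data sets (illustrative numbers only).
estimates = [10.0, 11.0, 9.0, 10.5, 9.5]
variances = [4.0, 4.0, 4.0, 4.0, 4.0]  # U_d, i.e. squared standard errors

gamma_bar, T = pool(estimates, variances)
print("pooled estimate:", gamma_bar)
print("total variance: ", T)
print("pooled std err: ", T ** 0.5)
```

With these inputs the between-imputations variance is 2.5 / 4 = 0.625, so the pooled estimate is 10 and the total variance is 4 + 1.2 × 0.625 = 4.75, up to floating-point rounding. The pooled standard error is larger than the within-imputation standard error of 2 because it also carries the variability between imputations.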