What do we mean by missing data?
Missing data are simply observations that we intended to be made but did not. For example, an individual may
only respond to certain questions in a survey, or may not respond at all to a particular wave of a longitudinal
survey.
In the presence of missing data, our goal remains to make inferences that apply to the population targeted by
the complete sample - i.e. the goal remains what it would have been had we seen the complete data.
However, both making inferences and performing the analysis are now more complex. We will see we need to
make assumptions in order to draw inferences, and then use an appropriate computational approach for the
analysis.
We will avoid adopting computationally simple solutions (such as just analysing complete data or carrying
forward the last observation in a longitudinal study) which generally lead to misleading inferences.
In practice the data consist of (a) the observations actually made (where '?' denotes a missing observation):
Figure 1: Typical partially observed data set
and (b) the pattern of missing values:
Figure 2: Pattern of missing values for the data in Figure 1. A '1' indicates that an observation is seen, a '0'
that it is missing
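The pattern in Figure 2 can be computed directly from the data matrix; a minimal sketch in Python (the data values here are hypothetical, not those of Figure 1):

```python
import numpy as np

# Hypothetical partially observed dataset: np.nan plays the role of '?'
data = np.array([
    [3.1, 5.2],
    [2.4, np.nan],
    [np.nan, 4.8],
    [1.9, 6.0],
])

# Pattern of missing values, as in Figure 2: 1 = observed, 0 = missing
pattern = (~np.isnan(data)).astype(int)
print(pattern)
```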
Inferential framework
When it comes to analysis, whether we adopt a frequentist approach (Figure 3) or a Bayesian approach (Figure
4), the likelihood is central. In these notes, for convenience, we discuss issues from a frequentist perspective,
although often we use appropriate Bayesian computational strategies to approximate frequentist analyses.
Figure 3: Schematic for frequentist (sometimes termed traditional) paradigm of inference
The actual sampling process involves the 'selection' of the missing values, as well as the units. So to complete
the process of inference in a justifiable way we need to take this into account.
Figure 4: Schematic for Bayesian paradigm of inference
The likelihood is a measure of comparative support for different models given the data. It requires a model for
the observed data, and as with classical inference this must involve aspects of the way in which the missing
data have been selected (i.e. the missingness mechanism).
Assumptions
We distinguish between item and unit nonresponse (missingness). For item missingness, values can be missing
on response (i.e. outcome) variables and/or on explanatory (i.e. design/covariate/exposure/confounder)
variables.
Missing data can affect the properties of estimators (for example, means, percentages, percentiles, variances,
ratios, regression parameters and so on). Missing data can also affect inferences, i.e. the properties of tests and
confidence intervals, and of Bayesian posterior distributions.
A critical determinant of these effects is the way in which the probability of an observation being missing (the
missingness mechanism) depends on other variables (measured or not) and on its own value.
In contrast with the sampling process, which is usually known, the missingness mechanism is usually
unknown.
The data alone cannot usually definitively tell us the sampling process.
Likewise, the missingness pattern, and its relationship to the observations, cannot definitively identify the
missingness mechanism.
The additional assumptions needed to allow the observed data to be the basis of inferences that would have
been available from the complete data can usually be expressed in terms of either
1. the relationship between selection of missing observations and the values they would have taken, or
2. the statistical behaviour of the unseen data.
These additional assumptions are not subject to assessment from the data under analysis; their plausibility
cannot be definitively determined from the data at hand.
The issues surrounding the analysis of data sets with missing values therefore centre on assumptions. We
have to
1. decide which assumptions are reasonable and sensible in any given setting;
- contextual/subject matter information will be central to this
2. ensure that the assumptions are transparent;
3. explore the sensitivity of inferences/conclusions to the assumptions, and
4. understand which assumptions are associated with particular analyses.
Getting computation out of the way
The above implies it is sensible to use approaches that make weak assumptions, and to seek computational
strategies to implement them.
However, computationally simple strategies that make strong assumptions are often adopted, and these
assumptions are subsequently hard to justify.
Classic examples are completers analysis (i.e. only including units with fully observed data in the analysis)
and last observation carried forward. The latter is sometimes advocated in longitudinal studies, and replaces a
unit's unseen observations at a particular wave with their last observed values, irrespective of the time that has
elapsed between the two waves.
Simple, ad-hoc methods and their shortcomings
In contrast to principled methods, these usually create a single 'complete' dataset, which is analysed as if it
were the fully observed data.
Unless certain, fairly strong, assumptions are true, the answers are invalid.
We briefly review the following methods:
- Analysis of completers only
- Imputation of simple mean
- Imputation of regression mean
- Last observation carried forward
Completers analysis
The dataset below has one missing observation: variable 2 on unit 10.
Completers analysis deletes all units with incomplete data from the analysis (here unit 10).
It is inefficient.
It is problematic in regression when covariate values are
missing and models with several sets of explanatory
variables need to be compared. Either we keep changing
the size of the data set, as we add/remove explanatory
variables with missing observations, or we use the
(potentially very small, and unrepresentative) subset of
the data with no missing values.
When the missing observations are not a completely
random selection of the data, a completers analysis will
give biased estimates and invalid inferences.
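A small simulation illustrates this bias; this is a hypothetical sketch (the values and missingness rates are invented), not an analysis of the dataset above:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=10, scale=2, size=100_000)  # true population mean is 10

# Not-completely-random missingness: values above 10 are dropped half the time
p_missing = np.where(y > 10, 0.5, 0.0)
observed = y[rng.random(y.size) >= p_missing]

# The completers-only mean is biased downwards
print(y.mean(), observed.mean())
```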
Simple mean imputation
The dataset below has one missing observation: variable 2 on unit 10.
We replace this with the arithmetic average of the observed data for that variable. This value is shown in red
in the table below.
This approach is clearly inappropriate for categorical
variables.
It does not lead to proper estimates of measures of
association or regression coefficients. Rather,
associations tend to be diluted.
In addition, variances will be wrongly estimated
(typically underestimated) if the imputed values are
treated as real. Thus inferences will be wrong too.
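The variance understatement is easy to demonstrate; a hypothetical sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, size=10_000)  # true variance 1

# Delete 30% of the values completely at random, then impute the observed mean
miss = rng.random(y.size) < 0.3
imputed = y.copy()
imputed[miss] = y[~miss].mean()

# Treating imputed values as real shrinks the variance estimate towards
# the observed fraction (about 0.7 here) of the true variance
print(y.var(), imputed.var())
```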
Regression mean imputation
Here, we use the completers to calculate the regression of the incomplete variable on the other complete
variables. Then, we substitute the predicted mean for each unit with a missing value. In this way we use
information from the joint distribution of the variables to make the imputation.
Example
Consider again our dataset with two variables, which is missing variable 2 on unit 10:
To perform regression imputation, we first regress variable 2 on variable 1 (note, it doesn't matter which of
these is the 'response' in the model of interest). In our example, we use simple linear regression:

V2 = beta0 + beta1 V1 + e.

Using units 1-9, we find estimates beta0 = 6.56 and beta1 = -0.366, so the regression relationship is

Expected value of V2 = 6.56 - 0.366 V1.

For unit 10, this gives

6.56 - 0.366 x 3.6 = 5.24.

This value is shown in red in the table of results of regression mean imputation below.
Regression mean imputation can generate unbiased
estimates of means, associations and regression
coefficients in a much wider range of settings than simple
mean imputation.
However, one important problem remains. The variability
of the imputations is too small, so the estimated precision
of regression coefficients will be wrong and inferences
will be misleading.
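The arithmetic of the worked example above can be checked directly:

```python
# Estimated coefficients from the worked example (fitted to units 1-9)
b0, b1 = 6.56, -0.366

# Predicted mean of V2 for unit 10, whose V1 value is 3.6
v2_imputed = b0 + b1 * 3.6
print(round(v2_imputed, 2))  # 5.24
```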
Creating an extra category
When a categorical variable has missing values it is common practice to add an extra 'missing value' category.
In the example below, the missing values, denoted '?', have been given the category 3.
This is bad practice because:
- the impact of this strategy depends on how the missing values are divided among the real categories, and on
how the probability of a value being missing depends on other variables;
- very dissimilar classes can be lumped into one group;
- severe bias can arise, in any direction, and
- when used to stratify for adjustment (or to correct for confounding), the completed categorical variable will
not do its job properly.
Last observation carried forward (LOCF)
This method is specific to longitudinal data problems.
For each individual, missing values are replaced by the last observed value of that variable. For example:
Here the three missing values for unit 1, at times 4, 5 and 6 are replaced by the value at time 3, namely 2.0.
Likewise the two missing values for unit 3, at times 5 and 6, are replaced by the value at time 4, which is 3.5.
Using LOCF, once the data set has been completed in this way it is analysed as if it were fully observed.
For full longitudinal data analyses this is clearly disastrous: means and covariance structure are seriously
distorted. For single time point analyses the means are still likely to be distorted, measures of precision are
wrong and hence inferences are wrong. Note this is true even if the mechanism that causes the data to be
missing is completely random. For a full discussion download the talk 'LOCF - time to stop carrying it
forward' from the preprints page of this site.
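A sketch of LOCF on data matching the example above (units 1 and 3), using pandas; the observed values at the earlier waves are invented:

```python
import pandas as pd

# Long-format longitudinal data; None marks a missing wave
df = pd.DataFrame({
    "unit": [1, 1, 1, 1, 1, 1, 3, 3, 3, 3],
    "time": [1, 2, 3, 4, 5, 6, 3, 4, 5, 6],
    "y":    [1.5, 1.8, 2.0, None, None, None, 3.0, 3.5, None, None],
})

# LOCF: within each unit, carry the last observed value forward
df["y_locf"] = df.groupby("unit")["y"].ffill()
print(df)
```

Unit 1's waves 4-6 all receive 2.0 and unit 3's waves 5-6 receive 3.5, exactly as in the text, regardless of the time elapsed between waves.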
Conclusions
Unless the proportion missing is so small as to be unlikely to affect inferences, these simple ad-hoc methods
should be avoided. However, note that 'small' is hard to define: estimates of the chances of rare events can be
very sensitive to just a few missing observations; likewise, a sample mean can be sensitive to missing
observations which are in the tails of the distribution.
They usually conflict with the statistical model that underpins the analysis (however simple and implicit this
might be) So they introduce bias.
As the assumptions about the reason for the data being missing that they implicitly make are often difficult to
describe (e.g. with LOCF), they can make it very hard to know what assumptions are being made in the
analysis.
They do not properly reflect statistical uncertainty: data are effectively 'made up' and no subsequent account is
taken of this.
Some notation
The data
We denote the data we intended to collect by Y, and we partition this into

Y = {Yo, Ym},

where Yo is observed and Ym is missing.
Note that some variables in Y may be outcomes/responses, some may be explanatory variables/covariates.
Depending on the context these may all refer to one unit, or to an entire dataset.
Missing value indicator
Corresponding to every observation Y, there is a missing value indicator R, defined as:

R = 1 if Y is observed, R = 0 if Y is missing.
Missing value mechanism
The key question for analyses with missing data is, under what circumstances, if any, do the analyses we
would perform if the data set were fully observed lead to valid answers?
As before, 'valid' means that effects and their standard errors are consistently estimated, tests have the correct
size, and so on, so that inferences are correct.
The answer depends on the missing value mechanism.
This is the probability that a set of values are missing given the values taken by the observed and missing
observations, which we denote by
Pr(R | yo, ym)
Examples of missing value mechanisms
1. The chance of nonresponse to questions about income usually depends on the person's income.
2. Someone may not be at home for an interview because they are at work.
3. The chance of a subject leaving a clinical trial may depend on their response to treatment.
4. A subject may be removed from a trial if their condition is insufficiently controlled.
Missing Completely at Random (MCAR)
Suppose the probability of an observation being missing does not depend on observed or unobserved
measurements. In mathematical terms, we write this as
Pr(r | yo, ym) = Pr(r)
Then we say that the observation is Missing Completely At Random, which is often abbreviated to MCAR.
Note that in a sample survey setting MCAR is sometimes called uniform non-response.
If data are MCAR, then consistent results can be obtained by performing the analyses we would have used had
there been no missing data, although there will generally be some loss of information. In practice this means
that, under MCAR, the analysis of only those units with complete data gives valid inferences.
An example of a MCAR mechanism would be that a laboratory sample is dropped, so the resulting
observation is missing.
However, many mechanisms that initially seem to be MCAR may turn out not to be. For example, a patient in
a clinical trial may be lost to follow up after 'falling' under a bus; however if it is a psychiatric trial, this may
be an indication of poor response to treatment. Likewise, if a response to a postal questionnaire is missing
because the questionnaire was lost or stolen in the post, this may not be random but rather reflect the area in
which the sorting office is located.
As we have already said, under MCAR analyses of completers only (a short hand for including in the analysis
only units with fully observed data) give valid inferences.
So do analyses based on moment based estimators (for example, generalised estimating equations), and other
estimators derived from consistent estimating equations.
By consistent estimating equations we mean functions of the data and unknown parameters whose
expectation, taken over the complete data at the population parameter values, is zero. Under MCAR, they still
have expectation zero, and so still lead to valid inferences.
Saying the same thing mathematically, an estimating equation can be written as U(y, theta), and at the estimate
theta-hat we have U(y, theta-hat) = 0. The estimating equation is consistent because E U(Y, theta) = 0 (where
theta is the population parameter value). It remains consistent if the data are missing completely at random
(MCAR) because, even then, E U(Yo, theta) = 0.
A simple example of a consistent estimating equation is that for the sample mean, U(y, theta) = sum_i (yi - theta).
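For instance, the estimating equation for the sample mean is zero exactly at theta = y-bar; a minimal sketch with made-up numbers:

```python
import numpy as np

y = np.array([2.0, 4.0, 9.0])

def U(y, theta):
    # Estimating equation for the mean: sum of (y_i - theta)
    return np.sum(y - theta)

theta_hat = y.mean()
print(theta_hat, U(y, theta_hat))  # the equation is solved at the sample mean
```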
Missing At Random (MAR)
After considering MCAR, a second question naturally arises. That is, what are the most general conditions
under which a valid analysis can be done using only the observed data, and no information about the missing
value mechanism, Pr(r | yo, ym)?
The answer to this is when, given the observed data, the missingness mechanism does not depend on the
unobserved data. Mathematically,
Pr(r | yo, ym) = Pr(r | yo).
This is termed Missing At Random, abbreviated MAR.
This is equivalent to saying that two units which share the same observed values have the same statistical
behaviour on the remaining variables, whether observed or not.
For example:
As units 1 and 2 have the same values where both are observed, given these observed values, under MAR,
variables 3, 5 and 6 from unit 2 have the same distribution (NB not the same value!) as variables 3, 5 and 6
from unit 1.
Note that under MAR the probability of a value being missing will generally depend on observed values, so it
does not correspond to the intuitive notion of 'random'. The important idea is that the missing value
mechanism can be expressed solely in terms of data that are observed.
Unfortunately, this can rarely be definitively determined from the data at hand!
Examples of MAR mechanisms
- A subject may be removed from a trial if his/her condition is not controlled sufficiently well (according to
pre-defined criteria on the response).
- Two measurements of the same variable are made at the same time. If they differ by more than a given
amount a third is taken. This third measurement is missing for those that do not differ by the given amount.
A special case of MAR is uniform non-response within classes. For example, suppose we seek to collect data
on income and property tax band. Typically, those with higher incomes may be less willing to reveal them.
Thus, a simple average of incomes from respondents will be downwardly biased.
However, now suppose we have everyone's property tax band, and given property tax band non-response to
the income question is random. Then, the income data is missing at random; the reason, or mechanism, for it
being missing depends on property band. Given property band, missingness does not depend on income itself.
Therefore, to get an unbiased estimate of income, we first average the observed income within each property
band. As data are missing at random given property band, these estimates will be valid. To get an estimate of
the overall income, we simply combine these estimates, weighting by the proportion in each property band.
In this example, a simple summary statistic (average of observed incomes) was biased. Conversely, a simple
model (estimate of income conditional on property band), where we condition on the variable that makes the
data MAR, led to a valid result.
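A sketch of this correction on simulated data (the two bands, income levels and response rates are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical population: two property tax bands with different mean incomes
band = rng.random(50_000) < 0.4            # True = high band (40% of units)
income = np.where(band, 60_000, 30_000) + rng.normal(0, 5_000, band.size)

# MAR: response depends only on band -- the high band responds half as often
responds = rng.random(band.size) < np.where(band, 0.4, 0.8)

naive = income[responds].mean()            # biased downwards

# Valid under MAR: average within each band, then weight by band proportions
means = [income[responds & (band == b)].mean() for b in (False, True)]
props = [np.mean(band == b) for b in (False, True)]
corrected = sum(m * p for m, p in zip(means, props))
print(round(naive), round(corrected))      # true overall mean is 42,000
```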
This is an example of a more general result: methods based on the likelihood are valid under MAR. In general,
however, non-likelihood methods (e.g. those based on completers, moments or estimating equations, including
generalised estimating equations) are not valid under MAR, although some can be 'fixed up'. In particular,
ordinary means and other simple summary statistics from the observed data will be biased.
Finally, note that in a likelihood setting the term ignorable is often used to refer to an MAR mechanism. It is
the mechanism (i.e. the model for Pr(R | yo)) which is ignorable - not the missing data!
Missing Not At Random (MNAR)
When neither MCAR nor MAR hold, we say the data are Missing Not At Random, abbreviated MNAR.
In the likelihood setting (see end of previous section) the missingness mechanism is termed non-ignorable.
What this means is
1. Even accounting for all the available observed information, the reason for observations being missing
still depends on the unseen observations themselves.
2. To obtain valid inference, a joint model of both Y and R is required (that is a joint model of the data
and the missingness mechanism).
Unfortunately
1. We cannot tell from the data at hand whether the missing observations are MCAR, MAR or MNAR
(although we can distinguish between MCAR and MAR).
2. In the MNAR setting it is very rare to know the appropriate model for the missingness mechanism.
Hence the central role of sensitivity analysis; we must explore how our inferences vary under assumptions of
MAR, MNAR, and under various models. Unfortunately, this is often easier said than done, especially under
the time and budgetary constraints of many applied projects.
Summary
We have defined, in non-technical language, the commonly used terms MCAR, MAR and MNAR, together
with ignorable and non-ignorable.
We have seen that
1. The implications of missingness for the analysis depend on the missing value mechanism, which is
rarely known.
2. The intuitive notion of randomness for the missing value mechanism is called Missing Completely at
Random (MCAR).
A wide range of analyses are valid under the assumption of MCAR.
3. An intermediate case between 'missing completely at random' and 'missing not at random' is
Missing at Random (MAR).
Under MAR, particular analyses that ignore the missing value mechanism remain valid (e.g.
likelihood-based analyses), and others can be fixed up (e.g. estimating equations can be fixed up by weighting).
4. In most situations, the true mechanism is probably MNAR.
Principled methods
These all have the following in common:
- No attempt is made to replace a missing value directly; i.e. we do not pretend to 'know' the missing values.
- Rather, available information (from the observed data and other contextual considerations) is combined
with assumptions not dependent on the observed data.
This is used to
1. either generate statistical information about each missing value (e.g. distributional information: given what
we have observed, the missing observation has a normal distribution whose mean and variance can be
estimated from the data),
2. and/or generate information about the missing value mechanism.
The great range of ways in which these can be done leads to the plethora of approaches to missing values.
Here are some broad classes of approach:
- Wholly model based methods.
- Simple stochastic imputation.
- Multiple stochastic imputation.
- Weighting methods.
Wholly model based methods
A full statistical model is written down for the complete data.
Analysis (whether frequentist or Bayesian) is based on the likelihood.
Assumptions must be made about the missing data mechanism:
- If it is assumed MCAR or MAR, no explicit model is needed for it.
- Otherwise this model must be included in the overall formulation.
Such likelihood analyses require some form of integration (averaging) over the missing data. Depending on
the setting this can be done implicitly or explicitly, directly or indirectly, analytically or numerically. The
statistical information on the missing data is contained in the model.
Examples of this would be the use of linear mixed models under MAR in SAS PROC MIXED or MLwiN.
Simple stochastic imputation
- Instead of replacing a value with a mean, a random draw is made from some suitable distribution.
- Provided the distribution is chosen appropriately, consistent estimators can be obtained from methods
that would work with the whole data set.
- This is very important in the large survey setting, where draws are made from units with complete data
that are 'similar' to the one with missing values (donors).
- There are many variations on this hot-deck approach.
- Implicitly these methods use non-parametric estimates of the distribution of the missing data, and typically
need very large samples.
Although the resulting estimators can behave well, for precision (and inference) account must be taken of the
source of the imputations (i.e. there is no 'extra' data). This implies that the usual complete data estimators of
precision can't be used. Thus, for each particular class of estimator (e.g. mean, ratio, percentile) each type of
imputation has an associated variance estimator that may be design based (i.e. using the sampling structure of
the survey) or model based, or model assisted (i.e. using some additional modelling assumptions). These
variance estimators can be very complicated and are not convenient for generalization.
Multiple (stochastic) imputation
This is very similar to the single stochastic imputation method, except there are many ways in which draws
can be made (e.g. hot-deck non-parametric, model based).
The crucial difference is that, instead of completing the data once, the imputation process is repeated a small
number of times (typically 5-10). Provided the draws are done properly, variance estimation (and hence
constructing valid inferences) is much more straightforward.
As is discussed more in the 'introduction to multiple imputation' document, the observed variability among the
estimates from each imputed data set is used in modifying the complete data estimates of precision. In this
way, valid inferences are obtained under missing at random.
Weighting methods
We give a simple illustration of weighting methods and contrast them with likelihood-based methods.
Example: simple continuous problem
Consider a simple linear regression setting:

E(Yi) = beta0 + beta1 xi = xiT beta,    i = 1, ..., n,

where the errors are independent and identically distributed N(0, sigma^2). A typical data set might look like this:
The ordinary least squares regression line (in this case maximum likelihood) is obtained by solving the normal
equations for beta:

sum_i xi (yi - xiT beta) = 0.

More generally, we can get parameter estimates by solving estimating equations:

U(Y; beta) = sum_i Ui(yi; beta) = 0.
In this example, the estimates of the slope and intercept give the following line:
Suppose now that some response (i.e. Y) observations are missing. The implications are (i) possible bias in the
estimates of the intercept and slope and (ii) loss of precision in those estimates. Suppose in particular that the
responses are MNAR; specifically, that all observations greater than y = 13 are unobserved.
In other words, we lose all observations above the horizontal line in the left-hand picture, leaving the observed
data in the right-hand picture:
The 'completers' regression line is now biased (and inconsistent). However, because in this case we know the
missing value mechanism and the distribution involved (which is unlikely in real applications) we can do a
valid analysis using likelihood methods. In this special case the likelihood method is known as Tobit
regression. Both the 'completers' and Tobit regression line are shown in the figure below, where the
completers line is bottom line at the right hand end:
To make it a little more realistic, suppose now that an observation greater than 13 has a probability of 0.25 of
being observed; in other words, instead of seeing the left-hand plot below, we see the right-hand plot.
The completers line is still inconsistent (lower
line at right hand end):
We could use Tobit regression to correct for
this (top line at right hand end: original
regression line; middle line at right hand end:
Tobit regression line; bottom line at right hand
end: 'completers' regression line)
But there now exists an alternative correction, which requires only that we know the probability of Yi being
missing given its value. In other words, we don't need to know the distribution of the observations as we do for
the Tobit regression.
Let Ri be a random variable indicating whether Yi is missing or not, so Ri = 0 implies Yi missing, and Ri = 1
implies Yi is observed.
The following weighted estimating equation is unbiased for the regression parameters:

sum_i (Ri / pi_i) xi (yi - xiT beta) = 0,  where pi_i = Pr(Ri = 1 | yi).

In this (artificial) example Pr(Ri = 1 | yi) = 0.25 when yi > 13 and 1 otherwise, so we can use simple weighted
least squares to make the correction.
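A sketch of this weighted correction on simulated data matching the setup in the text; the true intercept and slope here (3 and 2) are invented, and observations with y > 13 are seen with probability 0.25:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x = rng.uniform(0, 10, n)
y = 3 + 2 * x + rng.normal(0, 2, n)         # true intercept 3, slope 2

# Observations with y > 13 are seen with probability 0.25, others always
p_obs = np.where(y > 13, 0.25, 1.0)
seen = rng.random(n) < p_obs

# Completers-only least squares: biased slope
X = np.column_stack([np.ones(n), x])
b_cc = np.linalg.lstsq(X[seen], y[seen], rcond=None)[0]

# Inverse-probability weighting: solve sum (Ri/pi) xi (yi - xi'b) = 0,
# i.e. weighted least squares with weights 1/pi on the observed units
w = 1.0 / p_obs[seen]
Xw = X[seen] * np.sqrt(w)[:, None]
yw = y[seen] * np.sqrt(w)
b_ipw = np.linalg.lstsq(Xw, yw, rcond=None)[0]
print(b_cc[1], b_ipw[1])
```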
Comparison of weighting with other methods.
At right hand end, top line is from weighted
regression; second line is original regression
line; third line is tobit regression and fourth
line is completers analysis.
We now look at the performance of these two methods in this simple regression setting where the probability
of observations greater than 13 being seen is 0.25. For sample sizes of 30, 100 and 1000, the table below
shows the mean and standard deviation of the slope estimators (true value 2) over 10,000 simulations.

Estimator          Expected value   SE
n = 30
  Completers only  1.73             0.39
  Tobit            1.99             0.33
  Weighted         1.95             0.45
n = 100
  Completers only  1.75             0.20
  Tobit            1.98             0.18
  Weighted         1.99             0.23
n = 1000
  Completers only  1.74             0.063
  Tobit            1.98             0.055
  Weighted         2.00             0.070
We see that both Tobit and weighted regression are unbiased, but that estimates from a weighted analysis are
more variable.
Conclusion
Our simple examples have illustrated that there are broadly two forms of principled analysis:
1. likelihood methods, which make distributional assumptions about the unseen data, and assumptions
about the form of the dropout mechanism;
2. weighting methods, which use the inverse of Pr(Ri = 1 | Yi) as weights.
In its simple form, weighting is much less precise. However, in the session on weighting, we will see
that this can be addressed, albeit with difficulty.
In summary, in contrast to ad-hoc methods, principled methods are based on a well-defined statistical model
for the complete data, together with explicit assumptions about the missing value mechanism. The subsequent
analysis, inferences and conclusions are valid under these assumptions. This doesn't mean the assumptions are
necessarily true, but it does allow the dependence of the conclusions on these assumptions to be investigated.
Modelling R
If we have one partially observed variable, define the 'missingness indicator' Ri as before, and construct a
logistic model:

logit Pr(Ri = 1) = alpha0 + alpha1 xi1 + alpha2 xi2 + ...
We can compare models using standard methods, and so select a final model for dropout. We should consider
interactions if we suspect different mechanisms are causing missing observations in different data subgroups.
Such models are not only useful guides to interpreting analyses, they also indicate which variables we should
include for our models to be valid under missing at random (MAR) and provide estimates of the weights for
methods that use inverse probability weights.
We can generalise this approach to cope with the situation where we have two partially observed variables, and
the second is always unobserved when the first is (i.e. loss to follow-up):
1. Construct a logistic model for the probability of the first variable being observed.
2. For those units for which the first variable is observed, construct a logistic model for the probability of
the second variable being observed.
Then

Pr(second variable observed)
= Pr(second variable observed | first variable observed) x Pr(first variable observed).
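Under this factorisation the overall observation probability is the product of the two fitted logistic probabilities; a sketch with hypothetical coefficients:

```python
import math

def inv_logit(t):
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical fitted coefficients from the two logistic models
# (alpha: first variable observed; beta: second observed given first observed)
alpha = (1.2, -0.3)   # intercept, coefficient on x
beta = (0.8, 0.1)

x = 2.0
p1 = inv_logit(alpha[0] + alpha[1] * x)        # Pr(first observed)
p2_given_1 = inv_logit(beta[0] + beta[1] * x)  # Pr(second | first observed)

p2 = p2_given_1 * p1                           # Pr(second variable observed)
print(p1, p2_given_1, p2)
```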
Introduction
The aim of this document is to:
1. introduce the ideas of multiple imputation;
2. outline how to carry out multiple imputation, and
3. provide an intuitive justification for multiple imputation.
Why do multiple imputation?
One of the main problems with the single stochastic imputation methods is the need for developing
appropriate variance formulae for each different setting.
Multiple imputation attempts to provide a procedure that can get the appropriate measures of precision
relatively simply in (almost) any setting.
It was developed by Rubin in a survey setting (where it feels very natural) but has more recently been used
much more widely.
Below, we assume we have an established method for fitting our model, had the data been completely
observed.
- e.g. regression, glm, ...
Some notation
For simplicity, suppose we have only two variables in our data set. Suppose one of them is observed on every
unit. Call this Y1. Suppose one is only observed on some units. Call this Y2.
The key idea
The key idea is to use the data from units where both (Y1, Y2) are observed to learn about the relationship
between Y1 and Y2. Then, we use this relationship to complete the data set by drawing the missing observations
from the distribution of Y2| Y1. We do this K (typically 5) times, giving rise to K complete data sets.
We analyse each of these data sets in the usual way.
We combine the results using particular rules.
Intuition behind multiple imputation
First, we model observed (Y1, Y2) pairs. These are shown below, with a regression line through them. It's
crucial that the variable with the missing values is the response, whether or not it is going to be the response in
the final model of interest. The '?' indicates we have the value of Y1, but that for Y2 is missing.
Next, we draw the missing Y2 values by (i) drawing from the distribution of the regression line and (ii) drawing
from the variability about that line. In the picture below, the dotted line is the regression line from the observed
data (as on the previous picture) and the red line is drawn from the estimated distribution of the regression line
(i.e. the red line's intercept and slope are drawn from the estimated bivariate normal distribution of the
intercept and slope). Then, a draw is made from the estimated normal distribution of the residuals, and added
to the line, to give the imputed points, shown by red triangles.
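The draw just described can be sketched as follows; the data, the fitted regression of Y2 on Y1, and the unit's Y1 value of 4.0 are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical observed (Y1, Y2) pairs
n = 50
y1 = rng.uniform(0, 10, n)
y2 = 1.0 + 0.5 * y1 + rng.normal(0, 1, n)

# Imputation regression of Y2 on Y1 (Y2 is the response here,
# whether or not it is the response in the model of interest)
X = np.column_stack([np.ones(n), y1])
beta_hat = np.linalg.lstsq(X, y2, rcond=None)[0]
resid = y2 - X @ beta_hat
sigma2 = resid @ resid / (n - 2)
cov = sigma2 * np.linalg.inv(X.T @ X)

# One stochastic imputation for a unit with Y1 = 4.0 and Y2 missing:
# draw a regression line, then add a residual draw about that line
beta_draw = rng.multivariate_normal(beta_hat, cov)
y2_imp = beta_draw @ np.array([1.0, 4.0]) + rng.normal(0, np.sqrt(sigma2))
print(y2_imp)
```

Repeating the final two lines K times, each with fresh draws, gives the K imputed values for this unit.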
From this graph we can see straight away why replacing the missing observations with the mean of Y2 is a bad
idea. For instance, the leftmost '?' in the first picture above would be given a value far above the regression line
(which represents its expected value given Y1).
We can also see why a single imputation on the regression line - i.e. where the imputed data (triangles in the
graph above) lies on the regression line - is inadequate. This would be an over-confident prediction of the
missing value. Systematically doing this would lead to estimates of standard errors that were too small, and
inferences that were therefore over-confident.
However, a single imputation of each missing value is not adequate, because we only know the distribution of
the missing values. Thus, we need to repeat the imputation process a number of times, each time drawing a
new regression line, and new residuals about that regression line. We thus end up with a number of completed
data sets as follows:
Notation for analyses of imputed data sets
As described above, we have imputed K complete data sets. Analysing each of them in the usual way (i.e.
using the model intended for the complete data) gives us K estimates of the original quantity of interest, Q.
Denote these estimates Q1,..., QK. So, each Q could represent a regression coefficient from a regression model
of interest which we fit to each imputed data set in turn.
The analysis of each imputed data set will also give an estimate of the variance of Qk, say Vk. Again, this is
the usual variance estimate from the model.
We combine these quantities to get our overall estimate and its variance using certain rules.
Intuition for combining the estimates
Consider the imputation of just one missing observation.
Imagine a 3-d representation, with the Ymiss axis going back into the screen. Given a particular value of
Yobs, the imputations (numbered 1, 2, 3, 4) combine with the observed data to give the estimates of Q shown
by the black dots. Each of these estimates also has a variance, which is represented by the line through the
black dot.
Now we project this into two dimensions, over Ymiss.
The multiple imputation estimate is going to be the average of the black dots. In other words, the average over
the distribution of YM given YO of Q, which is itself calculated from the observed and 'missing' data:
QMI = E(YM|YO) E[Q(YO, YM)].
The variance has to reflect two components; the variance of the Q's from the imputed datasets about their
average and also the variance of each Q estimate. In fact, it is the sum of these two; i.e. in this case (with
Q1,..., Q4) the sample variability of Q1,..., Q4 about their mean, plus the average of the variance of Q1,..., Q4.
These are known respectively as the between imputation variance and the within imputation variance.
Mathematically,
V[QMI] = E(YM|YO) V[Q(YO, YM)] + V(YM|YO) E[Q(YO, YM)].
This motivates the formulae for combining the estimates and calculating the variance, which are given in the
next section.
Testing hypotheses
We assume that, if the data were all observed, then our estimator Q would have a normal distribution.
If this is so, we can compare
t = (QMI - Q) / sqrt(VMI)
with a t-distribution with v degrees of freedom, where
v = (K - 1)(1 + 1/r)^2
and r is the relative increase in variance due to the missing data, defined in the next section.
The rate of missing information
If there were no missing data, and we used multiple imputation, we should find that the between-imputation
variance B, and hence (1 + 1/K)B, is close to zero. Accordingly, the relative increase in variance due to the
missing data is
r = (1 + 1/K)B / W,
where B and W are the between- and within-imputation variances defined in the next section.
Alternatively, the 'rate of missing information' is
lambda = r / (1 + r).
It turns out a better estimate of this quantity is
lambda-hat = (r + 2/(v + 3)) / (r + 1).
Combining the estimates
Let the multiple imputation estimate of Q be QMI. Then, following from the above,
QMI = (1/K) Sum_k Qk.
Further define the within-imputation and between-imputation components of variance by
W = (1/K) Sum_k Vk,
where we recall our definition of Vk as the usual estimate of V[Qk] from the k-th imputed data set, and
B = (1/(K - 1)) Sum_k (Qk - QMI)^2.
Then the variance of QMI is
VMI = W + (1 + 1/K) B.
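The combination rules above, together with the degrees of freedom and rate of missing information from the previous sections, can be sketched in a few lines. This is a minimal illustration (the function name is ours), and it assumes there is some between-imputation variability (B > 0).

```python
import statistics

def combine(estimates, variances):
    """Combine K point estimates Qk and their variances Vk using the rules above.

    Returns the pooled estimate QMI, its total variance VMI, the degrees of
    freedom for the t reference distribution, and the estimated rate of
    missing information. Assumes B > 0.
    """
    K = len(estimates)
    q_mi = statistics.fmean(estimates)                      # pooled estimate QMI
    W = statistics.fmean(variances)                         # within-imputation variance
    B = sum((q - q_mi) ** 2 for q in estimates) / (K - 1)   # between-imputation variance
    v_mi = W + (1 + 1 / K) * B                              # total variance VMI
    r = (1 + 1 / K) * B / W                                 # relative increase in variance
    nu = (K - 1) * (1 + 1 / r) ** 2                         # degrees of freedom
    lam = (r + 2 / (nu + 3)) / (r + 1)                      # rate of missing information
    return q_mi, v_mi, nu, lam
```

For example, five estimated regression coefficients with their five model-based variances would be passed in directly, and the pooled coefficient compared with a t-distribution on the returned degrees of freedom.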
How do we draw YM| YO?
In the pictures above, we described a regression method for drawing YM | YO. This should work reasonably
well if the data set is large, as it is then an approximation to a Bayesian rule.
This rule says that, if theta is the parameter describing the joint distribution of (YO, YM):
Posterior distn of (YM, theta) given YO is proportional to the joint distn of (YM, YO) given theta, times the
distn of theta.
We put an uninformative distribution on theta, and discard the values of theta drawn from the posterior,
leaving a sample from YM | YO.
Frequently asked questions
- How many imputations?
o With 50% missing information, an estimate based on 5 imputations has a standard deviation about 5%
wider than one based on an infinite number of imputations.
- What if not MAR?
o Most software implementations assume MAR, but this is not necessary.
- Why not compute just one imputation?
o A single imputation underestimates the variance, because the between-imputation variance B cannot be
estimated.
- What if I am interested in more than one parameter?
o Imputation proceeds in the same way, as does finding the overall estimate of Q. However, estimating the
covariance matrix can be tricky. Typically more imputations will be needed. See Schafer (2000) for a
discussion.
Bibliography
Allison, P. D. (2000) Multiple imputation for missing data: a cautionary tale.
Sociological methods and Research, 28, 301-309.
Buuren, S. van, Boshuizen, H. C. and Knook, D. L. (1999) Multiple imputation of missing blood pressure
covariates in survival analysis.
Statistics in Medicine, 18, 681-694.
Gelman, A. and Raghunathan, T. E. (2001) Using conditional distributions for missing-data imputation, in
discussion of `using conditional distributions for missing-data imputation' by Arnold et al.
Statistical Science, 3, 268-269.
Horton, N. J. and Lipsitz, S. R. (2001) Multiple imputation in practice: comparison of software packages for
regression models with missing variables.
The American Statistician, 55, 244-254.
Little, R. J. A. and Rubin, D. B. (2002) Statistical analysis with missing data (second edition).
Chichester: Wiley.
Raghunathan, T. E., Lepkowski, J. M., Van Hoewyk, J. and Solenberger, P. (2001) A multivariate technique
for multiply imputing missing values using a sequence of regression models.
Survey Methodology, 27, 85-95.
Royston, P. (2004) Multiple imputation of missing values.
The Stata Journal, 3, 227-241.
Rubin, D. (1996) Multiple imputation after 18 years.
Journal of the American Statistical Association, 91, 473-490.
Rubin, D. B. (1976) Inference and missing data.
Biometrika, 63, 581-592.
Schafer, J. L. (1997) Analysis of incomplete multivariate data.
London: Chapman and Hall.
Schafer, J. L. (1999) Multiple imputation: a primer.
Statistical Methods in Medical Research, 8, 3-15.
Taylor, J. M. G., Cooper, K. L., Wei, J. T., Sarma, R. V., Raghunathan, T. E. and Heeringa, S. G. (2002) Use
of Multiple Imputation to Correct for Nonresponse Bias in a Survey of Urologic Symptoms among
African-American Men.
American Journal of Epidemiology, 156, 774-782.
Software
Software for drawing YM| YO.
We can use Markov chain Monte Carlo (MCMC) methods to draw from this posterior distribution; we then
discard the draws of theta and use the YM's as described above.
This approach is implemented in MLwiN - see the software page on this website.
Other options include WinBUGS (see the example analyses page on this website) or PROC MI in SAS.
Note that drawing from YM| YO and then doing the analysis in WinBUGS can be unfeasibly slow even for
moderate data sets.
One alternative is to use 'chained equations' also known as 'regression switching' or 'sequential regression
imputation' (all variants of the same approach) (see the links page of this website).
Chained equations: some comments
Roughly, multiple imputation using chained equations proceeds as follows. (We say 'roughly', as
implementations vary):
1. To get started, for each variable in turn fill in missing values with randomly chosen observed values
2. The 'filled-in' values in the first variable are discarded, leaving the original missing values. These missing
values are then imputed using regression imputation on all other variables.
3. The 'filled-in' values in the second variable are discarded. These missing values are then imputed using
'proper' regression imputation on all other variables.
4. This process is repeated for each variable in turn. Once each variable has been imputed using the
regression method we have completed one 'cycle'.
5. The process is continued for several cycles, typically 10.
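The cycle above can be sketched for two numeric variables as follows. This is hypothetical illustrative code, not any package's implementation; for simplicity it uses stochastic regression imputation without redrawing the regression coefficients, so it is not fully 'proper' in the sense discussed earlier.

```python
import random
import statistics

def _draw_from_regression(y, x, x_new, rng):
    """Stochastic regression imputation: fit y on x, predict at x_new, add noise."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((u - mx) ** 2 for u in x)
    b = sum((u - mx) * (v - my) for u, v in zip(x, y)) / sxx
    a = my - b * mx
    sigma = (sum((v - (a + b * u)) ** 2 for u, v in zip(x, y)) / (len(x) - 2)) ** 0.5
    return [rng.gauss(a + b * u, sigma) for u in x_new]

def chained_equations(data, n_cycles=10, seed=1):
    """One imputed data set via chained equations, for a dict holding exactly
    two numeric variables; None marks a missing value."""
    rng = random.Random(seed)
    names = list(data)
    filled = {v: list(data[v]) for v in names}
    missing = {v: [i for i, x in enumerate(data[v]) if x is None] for v in names}
    # Step 1: initialise missing entries with randomly chosen observed values.
    for v in names:
        observed = [x for x in data[v] if x is not None]
        for i in missing[v]:
            filled[v][i] = rng.choice(observed)
    # Steps 2-5: cycle, re-imputing each variable by regression on the other.
    for _ in range(n_cycles):
        for v, other in ((names[0], names[1]), (names[1], names[0])):
            idx = missing[v]
            if not idx:
                continue
            obs_idx = [i for i in range(len(data[v])) if i not in idx]
            draws = _draw_from_regression(
                [filled[v][i] for i in obs_idx],
                [filled[other][i] for i in obs_idx],
                [filled[other][i] for i in idx], rng)
            for i, d in zip(idx, draws):
                filled[v][i] = d
    return filled
```

Running the whole procedure K times with different seeds gives the K completed data sets required for multiple imputation.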
Comments on chained equation method
This was first published by Raghunathan et al. (2001); see also the SAS implementation at
www.isr.urmich.edu/src/smp/ive
For a medical example see Taylor et al. (2002).
A Dutch group has developed related software; see van Buuren et al. (1999), and associated S+ software at
www.multiple-imputation.com.
This has been implemented in Stata; see Royston (2004), and www.stata.com/support.
All the implementations are slightly different!
Although MICE is an attractive approach, overcoming some of the issues with binary and ordinal data that are
difficult for proper multiple imputation, the lack of a well-established theoretical basis means even those who
propose it suggest it be used cautiously.
To quote van Buuren and Oudshoorn (MICE):
'It is hard to establish convergence in the general case, but simulation studies suggest that the coverage
properties in some important practical cases are quite good.'
The problem is that you are in effect defining many conditional distributions, and this does not guarantee the
existence of a joint distribution. Further discussion is given by Raghunathan et al. (2001) (the original paper),
Gelman and Raghunathan (2001) and, briefly, in Little and Rubin (2002).
Note further that, as implemented in Stata, it is inappropriate for hierarchical data; generally, if the data are
hierarchical, so should the imputation be. See the article by Carpenter and Goldstein for the multilevel
modelling newsletter, downloadable from the preprints page on this site. More generally, we think the
application of this approach to hierarchical data is problematic.
Summary and conclusions
- Untestable assumptions are unavoidable with missing data.
- Shun unprincipled methods.
- MI is most convenient under MAR.
o To increase the chance that this is approximately true, we may wish to include several predictors of
missingness that we do not want to adjust for in the final analysis.
- Multiple imputation is particularly useful for missing covariates, especially in:
o survey settings where there is a separate imputer and analyst;
o large and messy problems, where a full likelihood or Bayesian analysis is impractical.
- For models with missing responses, provided the covariates predictive of dropout are included, similar
results are obtained to those from regression models (or mixed models, for longitudinal data).
o In most missing outcome situations it is preferable not to use multiple imputation, as it wastes
information.
- Ideally, we should consider a form of sensitivity analysis, though this is often not straightforward.
o Proper MI analyses are awkward under MNAR; it is necessary to make proper imputations from the
posterior conditional on the missing value indicator.
o Instead we can modify the imputation model to assess sensitivity, for example by using a postulated
accept-reject mechanism on imputations.
- Often, serious thought is unavoidable!
Introduction
The aim of this document is to:
- give an intuitive justification for Inverse Probability Weighting (IPW);
- look at a simple example;
- discuss methods to improve efficiency; and
- contrast with multiple imputation.
Idea behind inverse probability weighting
Suppose the full data are:

Group:    A A A B B B C C C
Response: 1 1 1 2 2 2 3 3 3

The average response is 2. However, we observe:

Group:    A A A B B B C C C
Response: 1 ? ? 2 2 2 ? 3 3

From the observed data, the average response is 13/6, which is biased.
Notice the probability of response is 1/3 in group A, 1 in group B and 2/3 in group C.
Calculate the weighted average, where each observation is weighted by 1/{probability of response}:
(3 x 1 + 1 x 2 + 1 x 2 + 1 x 2 + (3/2) x 3 + (3/2) x 3) / (3 + 1 + 1 + 1 + 3/2 + 3/2) = 18/9 = 2.
IPW has eliminated the bias in this case; more generally it will give estimators the property they 'home in' on
the truth as the sample size increases (i.e. they are consistent).
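The worked example above can be checked with a few lines of code; the observed responses and their group response probabilities are taken directly from the tables, and the function name is ours.

```python
def ipw_mean(responses, probs):
    """Inverse-probability-weighted mean: each observed response is weighted
    by the reciprocal of its probability of being observed."""
    weights = [1 / p for p in probs]
    return sum(w * y for w, y in zip(weights, responses)) / sum(weights)

# Observed responses from the example, with their group response probabilities:
# group A: one response seen (Pr = 1/3); group B: three (Pr = 1); group C: two (Pr = 2/3).
observed = [1, 2, 2, 2, 3, 3]
probs = [1/3, 1, 1, 1, 2/3, 2/3]
```

Here `ipw_mean(observed, probs)` reproduces the weighted average of 2 from the example, recovering the full-data mean.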
A more mathematical view
Most estimators are the solution of an equation like
Sum_i Ui(xi, theta) = 0.
For example, if Ui(xi, theta) = (xi - theta), solving
Sum_i (xi - theta) = 0
gives
theta-hat = Sum_i xi / n.
Theory says that if the average of Ui(Xi, theta) over samples from the population is zero, our estimate will
'home in' on the truth as the sample gets large (this is called consistency).
If some of the observations xi are unobserved, then it follows that the corresponding U's are missing from the
above sum. Thus the average of Sum_i Ui(xi, theta) is not zero, so estimates won't 'home in' on the truth.
However, now let Ri = 1 if xi is observed and Ri = 0 otherwise, and let pi_i = Pr(Ri = 1).
Then, the average (over repeated samples from the population) of
Sum_i (Ri / pi_i) Ui(xi, theta)
is zero, so parameter estimates solving Sum_i (Ri / pi_i) Ui(xi, theta) = 0 will 'home in' on the truth as the
sample size gets large.
In general, inverse probability weighting recovers consistent estimates when data are missing at random.