Missing Values
Adapting to missing data

Sources of Missing Data
• People refuse to answer a question
• Responses are indistinct or ambiguous
• Numeric data are obviously wrong
• Broken objects cannot be measured
• Equipment failure or malfunction
• Detailed analysis of only a subsample

Assumptions 1
• Missing Completely at Random (MCAR) – the probability that data are missing on X is unrelated to the value of X or to the values of other variables in the data set
• Missing at Random (MAR) – the probability that data are missing on X is unrelated to the value of X after controlling for other variables in the analysis

Assumptions 2
• Ignorable – MAR, plus the parameters governing the missing data process are unrelated to the parameters being estimated
• Nonignorable – if the data are not MAR, the missing data mechanism must be modeled to get good estimates of the parameters

Methods
1. Listwise Deletion
2. Pairwise Deletion
3. Dummy Variable Adjustment
4. Imputation

Listwise Deletion 1
• Delete any cases with missing data
  – Can be used for any statistical analysis
  – No special computational methods
• If data are MCAR, the complete cases are effectively a random subsample of the full data set, so they give unbiased estimates

Listwise Deletion 2
• If data are MAR, can produce biased estimates if missingness in the independent variables depends on the dependent variable
• Main issue is the loss of observations and the resulting increase in standard errors (meaning a decrease in the power of the test)

Listwise Deletion 3
• In anthropology listwise deletion often includes removal of variables (columns) as well as cases (rows)
• Finding an optimal complete data set involves removing variables with many missing values and then removing the rows that still have missing values

Pairwise Deletion 1
• Compute means using all available data, and compute each covariance using the cases that have observations on both variables of the pair
• Uses more of the data
• If MCAR, reasonably unbiased estimates, but if MAR, estimates may be seriously biased

Pairwise Deletion 2
• Covariance/correlation matrix may be singular or not positive definite
• Less of an issue with distance matrices

Dummy Variable Adjustment
• Create a variable to flag observations missing on a particular variable
• Used in regression analysis, but gives biased estimates

Imputation
• Replace missing values with an estimate:
1. Mean for that variable – gives biased estimates of variances and covariances
2. Multiple regression to predict the value – complicated when several variables contain missing values, and can still lead to underestimated standard errors

Maximum Likelihood
• Select the parameter values (e.g. means and covariances) that maximize the probability of observing the data that were actually observed
• Works with categorical and continuous data
• Expectation-maximization (EM) algorithm gives estimates of means and covariances

Expectation Maximization
• Iterative steps of expectation and maximization produce estimates that converge on the ML estimates
• Standard errors from regression and other statistical models based on these estimates will generally be underestimated

Multiple Imputation 1
• Has the same optimal properties as ML, plus several advantages
• Can be used with any kind of data and any kind of statistical model
• But produces multiple estimates, which must be combined
• Random component used to give unbiased estimates

Multiple Imputation 2
• Multivariate normal model (relatively robust to deviations from normality)
• Each variable represented as a linear function of the other variables
• Methods
  – Data Augmentation, package norm
  – Sampling Importance/Resampling, package Amelia

Multiple Imputation 3
• Categorical data, multinomial model, package cat
• Categorical and interval/ratio data, package mix
• Can also use multivariate normal models with dummy variables

Multiple Imputation 4
• Predictive mean matching – use regression to predict values for a particular variable. Find complete cases that have predictions similar to the case with a missing value on that variable and randomly select one of their actual values (package Hmisc, function aregImpute)

Analysis
• The analysis is run on each imputed data set and the estimates (e.g. regression coefficients) are combined
• Packages such as Zelig provide ways of combining the estimates from the imputed data sets for generalized linear models

Missing Data with R 1
• NA is used to identify a missing value
• is.na() is used to test for a missing value: is.na(c(1:4, NA, 6:10))
• na.omit(dataframe) will delete all cases with missing data (Rcmdr: Data | Active data set | Remove cases with missing values)

Missing Data with R 2
• Some functions have an na.rm= option. TRUE means remove missing values before computing; FALSE means do not remove them, so the function returns NA if there are any missing values.

Missing Data with R 3
• Other functions (e.g. lm, princomp, glm) have an na.action= option that can be set to one of the following: na.fail (the analysis fails if there are missing values), or na.omit or na.exclude (cases with missing values are removed)

Missing Data with R 4
• Other functions (e.g. cor, cov, var) have a use= option:
  – everything (NA’s propagate)
  – all.obs (NA causes error)
  – complete.obs (delete cases with NA’s; error if no complete cases remain)
  – na.or.complete (delete cases with NA’s; returns NA if no complete cases remain)
  – pairwise.complete.obs (complete pairs of observations)

Example 1
• ErnestWitte data set has missing values among the 242 cases and 38 variables
• Using R to remove all cases with missing values reduces the number of cases to 52!
• If we don’t need all of the variables we can retain more cases

Example 2
• Total NA’s in ErnestWitte (815):
  sum(is.na(ErnestWitte))
• Check missing values by variable:
  sort(apply(ErnestWitte, 2, function(x) sum(is.na(x))), decreasing=TRUE)
• Looking has 171, SkullPos 126, Depos 112
• Removing these three variables leaves 112 complete cases

Multiple Imputation with R
• A wide variety of options:
  – Packages norm, cat, mix
  – Package Amelia
  – Package mi (relatively new, but flexible)
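The base R tools from the "Missing Data with R" slides can be tried out on a small example. This is only a sketch: the data frame `toy` and its columns are made-up values for illustration, not the ErnestWitte data, and the variable names are hypothetical.

```r
# Hypothetical toy data (invented values, for illustration only)
toy <- data.frame(
  age    = c(23, 35, NA, 41, 52, 29, 47, 60),
  height = c(170, NA, 165, 180, 175, NA, 168, 172),
  weight = c(65, 80, 58, NA, 77, 70, 72, 81)
)

# Count missing values overall and by variable (most missing first)
total_na  <- sum(is.na(toy))
na_by_var <- sort(colSums(is.na(toy)), decreasing = TRUE)

# Listwise deletion: keep only cases with no missing values
complete_toy <- na.omit(toy)

# Dropping the variable with the most NA's first retains more cases
worst_var   <- names(na_by_var)[1]
reduced_toy <- na.omit(toy[, names(toy) != worst_var])

# na.rm= option: without it, a single NA makes the result NA
mean_default <- mean(toy$age)
mean_narm    <- mean(toy$age, na.rm = TRUE)

# use= option: listwise versus pairwise deletion for correlations
cor_listwise <- cor(toy, use = "complete.obs")
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

# na.action= option: na.exclude drops incomplete cases when fitting,
# but pads residuals with NA so they line up with the original rows
fit <- lm(weight ~ age, data = toy, na.action = na.exclude)

print(na_by_var)
print(c(all_cases = nrow(toy),
        listwise  = nrow(complete_toy),
        reduced   = nrow(reduced_toy)))
```

Comparing `cor_listwise` with `cor_pairwise` shows how the two deletion methods can give different correlations from the same data, because each pairwise entry is computed from a different subset of cases.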