Lecture 22

Missing Values
Adapting to missing data
Sources of Missing Data
• People refuse to answer a question
• Responses are indistinct or
ambiguous
• Numeric data are obviously wrong
• Broken objects cannot be measured
• Equipment failure or malfunction
• Detailed analysis carried out only
on a subsample
Assumptions 1
• Missing Completely at Random
– probability of data missing on X is
unrelated to the value of X or to values
on other variables in data set
• Missing at Random
– the probability of missing data on X is
unrelated to the value of X after
controlling for other variables in the
analysis
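The difference between MCAR and MAR can be seen in a small simulation. This is only a sketch: the variables age and income, the 30% missing rate, and the logistic missingness rule are all made up for illustration.

```r
set.seed(1)
n <- 10000
age <- rnorm(n, mean = 40, sd = 10)
income <- 20 + 0.5 * age + rnorm(n, sd = 5)

# MCAR: every income value has the same 30% chance of being missing
income_mcar <- income
income_mcar[runif(n) < 0.3] <- NA

# MAR: the chance that income is missing depends only on age, which is observed
income_mar <- income
income_mar[runif(n) < plogis((age - 40) / 5)] <- NA

mean(income)                        # full-data mean
mean(income_mcar, na.rm = TRUE)     # close to the full-data mean
mean(income_mar, na.rm = TRUE)      # too low: older, higher-income cases are missing
```

Under MAR the observed incomes are no longer a random subsample, but the missingness is explained by age, so a method that conditions on age can still recover the mean.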
Assumptions 2
• Ignorable
– MAR plus parameters governing
missing data process unrelated to
parameters being estimated
• Nonignorable
– If not MAR, missing data mechanism
must be modeled to get good estimates
of parameters
Methods
1. Listwise Deletion
2. Pairwise Deletion
3. Dummy Variable Adjustment
4. Imputation
Listwise Deletion 1
• Delete any samples with missing
data
– Can be used for any statistical analysis
– No special computational methods
• If data are MCAR, the complete
cases are effectively a random
subsample of the full data set and
give unbiased estimates
Listwise Deletion 2
• If data are MAR, estimates can be
biased when missingness in the
independent variables depends on
the dependent variable
• Main issue is the loss of
observations and the increase in
standard errors (meaning a
decrease in the power of the test)
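The loss of observations and the larger standard errors can be illustrated with simulated data. A minimal sketch, assuming a made-up regression of y on x in which half of the x values are removed completely at random:

```r
set.seed(2)
n <- 200
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)

d_miss <- d
d_miss$x[sample(n, 100)] <- NA             # MCAR: half of x removed at random

d_cc <- d_miss[complete.cases(d_miss), ]   # listwise deletion

se_full <- coef(summary(lm(y ~ x, data = d)))["x", "Std. Error"]
se_cc   <- coef(summary(lm(y ~ x, data = d_cc)))["x", "Std. Error"]
c(n_full = nrow(d), n_complete = nrow(d_cc), se_full = se_full, se_cc = se_cc)
```

The slope estimate stays unbiased here because the data are MCAR; only its precision suffers.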
Listwise Deletion 3
• In anthropology listwise deletion
often includes removal of variables
(columns) as well as cases (rows)
• Finding an optimal complete data
set involves removing variables with
many missing values and then
rows still having missing values
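Removing columns first and rows second might be sketched as follows. The helper trim_missing and its max_na threshold are hypothetical, and the toy data frame is invented for illustration.

```r
# Hypothetical helper: drop columns with more than max_na missing values,
# then drop any rows that still contain a missing value
trim_missing <- function(df, max_na) {
  na_per_col <- colSums(is.na(df))
  kept <- df[, na_per_col <= max_na, drop = FALSE]
  kept[complete.cases(kept), , drop = FALSE]
}

toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(NA, NA, NA, NA, 5, 6),   # mostly missing
  c = c(1, NA, 3, 4, 5, 6)
)

nrow(na.omit(toy))                   # 2 complete cases if every variable is kept
dim(trim_missing(toy, max_na = 2))   # 5 rows, 2 columns after dropping b
```

The threshold is a judgment call: a variable that matters for the analysis may be worth keeping even if it costs cases.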
Pairwise Deletion 1
• Compute means using available data
and covariances using cases with
observations for the pair being
computed
• Uses more of the data
• If MCAR, reasonably unbiased
estimates, but if MAR, estimates may
be seriously biased
Pairwise Deletion 2
• Covariance/Correlation matrix may
be singular
• Less of an issue with distance
matrices
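A pairwise correlation matrix can be checked for this problem by inspecting its eigenvalues. The sketch below uses an invented matrix in which no case is complete, so listwise deletion would leave nothing while pairwise deletion still has 6 or 7 cases per pair.

```r
set.seed(3)
m <- matrix(rnorm(60), ncol = 3, dimnames = list(NULL, c("x", "y", "z")))
m[1:7, "x"]   <- NA
m[8:14, "y"]  <- NA
m[15:20, "z"] <- NA

sum(complete.cases(m))                       # 0: no complete cases at all

r_pair <- cor(m, use = "pairwise.complete.obs")
n_pair <- crossprod(ifelse(is.na(m), 0, 1))  # cases behind each matrix entry

# A negative or zero smallest eigenvalue means the matrix is not positive definite
min(eigen(r_pair, symmetric = TRUE)$values)
```

Because each entry is computed from a different subset of cases, nothing guarantees that the entries are mutually consistent.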
Dummy Variable
• Create variable to flag observations
missing on a particular variable
• Used in regression analysis but
provides biased estimators
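A minimal sketch of the adjustment in a regression, using simulated data. Filling the missing x values with a constant (here the observed mean) is the usual companion to the flag variable and is an assumption of this sketch; the names x_missing and x_filled are made up.

```r
set.seed(4)
d <- data.frame(x = rnorm(50))
d$y <- 1 + 2 * d$x + rnorm(50)
d$x[sample(50, 10)] <- NA

d$x_missing <- as.numeric(is.na(d$x))                            # 1 where x is missing
d$x_filled  <- ifelse(is.na(d$x), mean(d$x, na.rm = TRUE), d$x)  # constant fill

fit <- lm(y ~ x_filled + x_missing, data = d)
nobs(fit)   # all 50 cases are kept
```

All cases stay in the model, which is the appeal of the method, but the coefficients are generally biased even under MCAR.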
Imputation
• Replace missing values with an
estimate:
1. Mean for that variable – biased
estimates of variances and covariances
2. Multiple regression to predict value –
complicated when several variables
contain missing values, and it still
leads to underestimated standard errors
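The bias from mean imputation is easy to demonstrate: every filled-in value sits exactly at the mean, so the variance shrinks. A short sketch with simulated values:

```r
set.seed(5)
x <- rnorm(100, mean = 10, sd = 2)
x_miss <- x
x_miss[sample(100, 30)] <- NA

x_mean <- x_miss
x_mean[is.na(x_mean)] <- mean(x_miss, na.rm = TRUE)   # mean imputation

var(x_miss, na.rm = TRUE)   # variance of the observed values
var(x_mean)                 # smaller: imputed values add no spread
```

The mean itself is unchanged, but anything built on variances or covariances (correlations, standard errors) is distorted.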
Maximum Likelihood
• Try to reconstruct the complete data
set by selecting values that would
maximize the probability of observing
the actually observed data
• Categorical and continuous data
• Expectation-maximization algorithm
gives estimates of means and
covariances
Expectation Maximization
• Iterative steps of expectation and
maximization to produce estimates
that converge on the ML estimates
• These estimates will generally
underestimate the standard errors
in regression and other statistical
models
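The iteration can be sketched for the simplest case: two normal variables where x is fully observed and y is missing for some cases. This is an illustrative toy under those assumptions, not the algorithm as implemented in any package; the data and the missingness rule are simulated.

```r
set.seed(6)
n <- 500
x <- rnorm(n, mean = 5, sd = 1)
y <- 1 + 0.8 * x + rnorm(n, sd = 0.5)
y[x > 5.5] <- NA                  # MAR: y is missing only when observed x is large
obs <- !is.na(y)

# x is fully observed, so its moments are fixed; start y moments from complete cases
mu_x <- mean(x); s_xx <- mean((x - mu_x)^2)
mu_y <- mean(y[obs]); s_yy <- var(y[obs]); s_xy <- cov(x[obs], y[obs])

for (iter in 1:500) {
  # E-step: expected y and y^2 for missing cases, given x and current estimates
  b   <- s_xy / s_xx
  ey  <- ifelse(obs, y, mu_y + b * (x - mu_x))
  ey2 <- ifelse(obs, y^2, ey^2 + s_yy - b * s_xy)
  # M-step: update the moments from the completed sufficient statistics
  mu_y <- mean(ey)
  s_xy <- mean(x * ey) - mu_x * mu_y
  s_yy <- mean(ey2) - mu_y^2
}

c(em = mu_y, complete_case = mean(y[obs]))
```

The complete-case mean of y is too low because the high-x cases are missing; the EM estimate corrects for this by using the x values of the incomplete cases.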
Multiple Imputation 1
• Has the same optimal properties as
ML but several advantages
• Can be used with any kind of data
and any kind of statistical model
• But produces multiple estimates
which must be combined
• Random component used to give
unbiased estimates
Multiple Imputation 2
• Multivariate normal model (relatively
resistant to deviations)
• Each variable represented as a
linear function of the other variables
• Methods
– Data Augmentation, package norm
– Sampling Importance/Resampling,
package Amelia
Multiple Imputation 3
• Categorical data, multinomial model,
package cat
• Categorical and interval/ratio data,
package mix
• Also can use multivariate normal
models with dummy variables
Multiple Imputation 4
• Predictive mean matching – use
regression to predict values for a
particular variable. Find complete
cases that have predictions similar
to the case with a missing value on
that variable and randomly select one
of their actual values, package Hmisc,
function aregImpute
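The matching idea can be sketched in base R for a single variable. This is a simplified illustration, not what aregImpute does internally: the number of donors k is an assumed setting, and a full multiple-imputation version would also add random variation to the regression coefficients for each imputed data set.

```r
set.seed(7)
n <- 100
d <- data.frame(x = rnorm(n))
d$y <- 2 + 3 * d$x + rnorm(n)
d$y[sample(n, 20)] <- NA

obs  <- !is.na(d$y)
fit  <- lm(y ~ x, data = d[obs, ])
pred <- predict(fit, newdata = d)      # predictions for every case

k <- 5                                 # assumed number of closest donors
y_imp <- d$y
for (i in which(!obs)) {
  # complete cases whose predictions are closest to the prediction for case i
  donors <- which(obs)[order(abs(pred[obs] - pred[i]))[1:k]]
  y_imp[i] <- d$y[donors[sample.int(k, 1)]]   # borrow one donor's actual value
}
```

Every imputed value is a value that was actually observed, so impossible values (negative counts, out-of-range scores) cannot be produced.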
Analysis
• The analysis is run on each imputed
data set and the estimates (e.g.
regression coefficients) are
combined
• Packages such as Zelig provide
ways of combining the results across
imputed data sets for generalized
linear models
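The standard way to combine the estimates is Rubin's rules: average the estimates, and build the variance from the average within-imputation variance plus the between-imputation variance. The numbers below are hypothetical, standing in for one coefficient and its standard error from five imputed data sets.

```r
est <- c(1.92, 2.10, 2.03, 1.98, 2.07)   # hypothetical coefficient estimates
se  <- c(0.21, 0.20, 0.22, 0.21, 0.20)   # hypothetical standard errors

m        <- length(est)
pooled   <- mean(est)                    # combined estimate
within   <- mean(se^2)                   # average within-imputation variance
between  <- var(est)                     # variance between imputations
total_se <- sqrt(within + (1 + 1 / m) * between)

c(estimate = pooled, se = total_se)
```

The between-imputation term is what single imputation leaves out, which is why its standard errors come out too small.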
Missing Data with R 1
• NA is used to identify a missing
value
• is.na() is used to test for a missing
value: is.na(c(1:4, NA, 6:10))
• na.omit(dataframe) will delete all
cases with missing data (Rcmdr:
Data | Active data set | Remove
cases with missing values)
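A short illustration of both functions; the data frame df is made up for the example.

```r
x <- c(1:4, NA, 6:10)
is.na(x)          # TRUE only in the fifth position
which(is.na(x))   # 5
sum(is.na(x))     # 1 missing value

df <- data.frame(a = c(1, NA, 3), b = c("u", "v", NA))
na.omit(df)       # keeps only the first row
```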
Missing Data with R 2
• Some functions (e.g. mean, sum, sd)
have an na.rm= option. TRUE means
remove missing values before
computing; FALSE (usually the
default) means keep them, so that the
function returns NA if there are
missing values.
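For example, with a small made-up vector:

```r
x <- c(2, 4, NA, 6)
mean(x)                 # NA, because na.rm defaults to FALSE
mean(x, na.rm = TRUE)   # 4
sum(x, na.rm = TRUE)    # 12
```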
Missing Data with R 3
• Other functions (e.g. lm, princomp,
glm) have an na.action= option that
can be set to one of the following:
na.fail (the analysis fails if there are
missing values), na.omit or
na.exclude (cases with missing values
are removed; na.exclude pads
residuals and fitted values with NA)
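The three settings can be compared on a small made-up data frame with one missing x value:

```r
d <- data.frame(x = c(1, 2, NA, 4, 5),
                y = c(2.1, 3.9, 6.2, 8.1, 9.8))

fit_omit <- lm(y ~ x, data = d, na.action = na.omit)
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)
fit_fail <- try(lm(y ~ x, data = d, na.action = na.fail), silent = TRUE)

length(residuals(fit_omit))   # 4: the incomplete case is dropped
length(residuals(fit_excl))   # 5: an NA is kept in position 3
inherits(fit_fail, "try-error")   # TRUE: na.fail stops with an error
```

Both na.omit and na.exclude give the same fitted model; na.exclude is convenient when residuals need to line up with the rows of the original data frame.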
Missing Data with R 4
• Other functions (e.g. cor, cov, var)
have a use= option:
– everything (NA’s propagate)
– all.obs (NA causes error)
– complete.obs (delete cases with NA’s;
error if no complete cases remain)
– na.or.complete (delete cases with NA’s;
NA if no complete cases remain)
– pairwise.complete.obs (complete pairs of
observations)
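The effect of the main use= settings can be seen on a small made-up matrix in which x and y each have one missing value:

```r
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 6, NA, 10)
z <- c(1, 3, 2, 5, 4)
m <- cbind(x, y, z)

r_all  <- cor(m, use = "everything")             # NA for any pair involving x or y
r_cc   <- cor(m, use = "complete.obs")           # rows 1-3 only
r_pair <- cor(m, use = "pairwise.complete.obs")  # each pair uses its own rows

r_cc["x", "z"]     # based on 3 cases
r_pair["x", "z"]   # based on 4 cases
```

With complete.obs every entry rests on the same three cases; with pairwise.complete.obs the x-z entry uses four cases and the y-z entry a different four.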
Example 1
• ErnestWitte data set has missing
values among the 242 cases and 38
variables
• Using R to remove all cases with
missing values reduces the number
of cases to 52!
• If we don’t need all of the variables
we can retain more cases
Example 2
• Total NA’s in ErnestWitte (815)
• sum(is.na(ErnestWitte))
• Check missing values by variable:
• sort(apply(ErnestWitte, 2, function(x)
sum(is.na(x))), decreasing=TRUE)
• Looking has 171, SkullPos 126,
Depos 112
• Removing these three variables
gives 112 complete cases
Multiple Imputation with R
• A wide variety of options:
– Packages norm, cat, mix
– Package Amelia
– Package mi (relatively new, but flexible)