Treatment of missing values
1.11) Missing values in autoregressive and cross-lagged models: diagnostics and therapy.
• What is a missing value? We distinguish unit nonresponse and item nonresponse.
• Answers such as "Don't know", "Refused", or "No opinion" are also often treated as missing.
• In longitudinal studies the problem of missing values is especially serious because of panel mortality, that is, unit nonresponse. This is also called wave nonresponse, attrition, or dropout.
Different patterns of nonresponse
• 1) Univariate pattern: some indicators or items are fully observed, while others have missing values (no answers); those items may be fully or partly missing.
• 2) A monotone pattern may arise in longitudinal studies with attrition: once an item is missing in some wave, it remains missing in all later waves.
• 3) An arbitrary pattern: any set of variables may be missing for any unit (see the figure and the code sketch below).
Figure (schematic): data matrices of units 1…N by variables y1…yp illustrating 1) the univariate pattern, 2) the monotone pattern, and 3) the arbitrary pattern of missing values.
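• As an illustration (not in the original slides), the pattern of nonresponse in a data set can be tabulated directly from the indicator matrix of missing values. The Python sketch below assumes a small, hypothetical data frame with variables y1–y3 and pandas available.

import numpy as np
import pandas as pd

# Hypothetical toy data: y1 fully observed, y2 and y3 partly missing
df = pd.DataFrame({
    "y1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y2": [1.1, np.nan, 3.1, 4.1, np.nan],
    "y3": [np.nan, np.nan, 3.2, 4.2, np.nan],
})

# Each distinct row of the True/False missingness indicator matrix is one pattern;
# inspecting the counts shows whether the pattern is univariate, monotone, or arbitrary.
pattern_counts = df.isna().value_counts()
print(pattern_counts)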
• The literature differentiates three kinds of missing values:
• 1) MCAR (missing completely at random) means that whether the data are missing is entirely unrelated statistically to the values that would have been observed. MCAR is the most restrictive assumption. MCAR can sometimes be established by randomly assigning test booklets or blocks of survey questions to different respondents.
• 2) MAR (missing at random) is a somewhat more relaxed condition. It means that missingness is statistically unrelated to the variable itself; however, it may be related to other variables in the data set. One way to establish MAR processes is to include completely observed variables that are highly predictive of the incomplete data.
• 3) MNAR (missing not at random), or nonignorable missing data: missingness conveys probabilistic information about the values that would have been observed.
Example:
• Let us take two variables, education and income. Education has no missing values; income does.
• MCAR would mean that the missing values of income depend neither on education nor on income itself.
• MAR would mean that the missingness of income depends on education. That is, education can predict which income values are missing.
• MNAR would mean that the missingness of income is not independent of the missing values themselves, even after controlling for the prediction from education. That is, for example, high income values are more often missing than low income values. (A small simulation of the three mechanisms is sketched below.)
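• To make the three mechanisms concrete, here is a minimal simulation sketch (illustrative, not from the slides; all numbers and variable names are assumptions) that generates MCAR, MAR, and MNAR missingness for an education–income pair.

import numpy as np

rng = np.random.default_rng(0)
n = 1000
education = rng.normal(12, 3, n)                          # years of schooling (illustrative)
income = 2000 + 300 * education + rng.normal(0, 1500, n)

# MCAR: whether income is missing depends on nothing at all
mcar = rng.random(n) < 0.3

# MAR: the probability of missing income depends only on observed education
mar = rng.random(n) < 1 / (1 + np.exp(-(education - 12)))

# MNAR: the probability of missing income depends on income itself
mnar = rng.random(n) < 1 / (1 + np.exp(-(income - income.mean()) / 1000))

income_mcar = np.where(mcar, np.nan, income)
income_mar = np.where(mar, np.nan, income)
income_mnar = np.where(mnar, np.nan, income)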
• Diagnosing the kind of missingness is very tricky and cannot always be done, so researchers often simply assume a kind of missingness for their data. Fortunately, there are solutions that work regardless of the kind of missing data, even if we have MNAR.
• MAR: logistic regression of a missingness indicator on the observed variables is a possible test (see the sketch below). But a significant effect of some fully observed variables on the missingness cannot exclude MNAR.
• Only an experimental design that obtains a representative subsample of the nonrespondents, with no missing values, can help build a model for the missingness in the full data set.
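• A minimal sketch of such a diagnostic regression, assuming statsmodels and simulated education–income data (all names are illustrative): regress an indicator of missing income on the fully observed variable. A significant effect speaks against MCAR but, as noted, cannot rule out MNAR.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
education = rng.normal(12, 3, n)
income = 2000 + 300 * education + rng.normal(0, 1500, n)
income[rng.random(n) < 1 / (1 + np.exp(-(education - 12)))] = np.nan   # MAR mechanism

df = pd.DataFrame({"education": education, "income": income})
df["miss_income"] = df["income"].isna().astype(int)

# Logistic regression of the missingness indicator on the fully observed variable(s)
fit = sm.Logit(df["miss_income"], sm.add_constant(df["education"])).fit(disp=0)
print(fit.summary())   # a significant education effect contradicts MCAR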
So far diagnostics, now therapy:
• Traditional methods:
• 1) Listwise deletion (LD): deleting every case that has any missing value.
• Advantage: a consistent solution.
• Disadvantage: not efficient, and it often causes a drastic reduction in sample size, especially in studies that involve multiple indicators and sensitive questions such as income.
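• In pandas, listwise deletion is a one-liner; the toy data frame below is purely illustrative.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "education": [12, 16, np.nan, 10, 14],
    "income":    [30000, np.nan, 42000, 25000, 38000],
})

# Listwise deletion: keep only the rows with no missing value at all
complete = df.dropna()
print(f"{len(df)} cases -> {len(complete)} complete cases")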
Therapy (2)
• 2) Pairwise deletion (PD), also called available-case (AC) analysis: calculates each correlation separately, excluding an observation only when it is missing a value needed for that particular correlation.
• Advantage: smaller loss of cases than with LD.
• Disadvantages: not efficient, and it can create estimation problems because the observed correlation matrix may not be positive definite.
• There is no well-defined sample size N, since N depends on the pair being computed.
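• Pairwise deletion is what pandas computes by default in DataFrame.corr(); the sketch below (toy data, illustrative) also checks the eigenvalues, since a pairwise correlation matrix need not be positive definite.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y": [2.1, np.nan, 2.9, 4.2, 5.1],
    "z": [1.0, 1.5, 2.0, np.nan, 3.0],
})

pairwise = df.corr()             # each correlation uses only the cases observed for that pair
listwise = df.dropna().corr()    # for comparison: listwise deletion first

print(pairwise)
print("smallest eigenvalue:", np.linalg.eigvalsh(pairwise.values).min())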
• Advantage of the case-deletion methods: simplicity.
• If a missing-data problem can be resolved by discarding only a small part of the sample, these methods can be quite effective.
• However, even in that situation, one should explore the data to make sure that the discarded cases are not influential.
Reweighting
• In some non-MCAR situations it is possible to reduce biases by applying weights. After incomplete cases are removed, the remaining complete cases are reweighted so that their distribution more closely resembles that of the full sample or population with respect to auxiliary variables (Little and Rubin 1987).
• This requires a model for the response probabilities in order to calculate the weights (a sketch is given below). It works better for univariate and monotone missing patterns and becomes complicated to apply if the missingness follows an arbitrary pattern.
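• A minimal sketch of response-propensity weighting, assuming one fully observed auxiliary variable (age) and a logistic response model; all numbers are illustrative.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 2000
age = rng.normal(45, 12, n)                                      # fully observed auxiliary variable
responded = rng.random(n) < 1 / (1 + np.exp(-(age - 45) / 10))   # older people respond more often

# Model the response probability from the auxiliary variable ...
fit = sm.Logit(responded.astype(int), sm.add_constant(age)).fit(disp=0)
p_hat = fit.predict(sm.add_constant(age))

# ... and give each complete case an inverse-probability-of-response weight
weights = 1.0 / p_hat[responded]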
Older methods of imputation
• Imputation means filling in missing values with plausible values and then continuing with the analysis.
• Advantages: potentially more efficient than discarding the unit; it prevents the loss of power caused by a shrinking sample size.
• Disadvantage: imputation may be difficult to implement well.
• 1) Imputing unconditional means: the average is preserved, but distributional aspects such as the variance are distorted.
• 2) Hot-deck imputation: filling in a nonrespondent's missing value with a value drawn at random from the actual respondents. Advantage: it preserves the variable's distribution. Disadvantage: the method still distorts correlations and other measures of association.
• 3) Imputing conditional means by regression: the model is first fitted to the cases for which Y is known; the estimated regression of Y on X is then used to predict the missing values of Y from the known values of X. It is almost optimal with some corrections for standard errors (Schafer & Schenker 2000), but it is not recommended for analyses of covariances or correlations, since it overstates the relation between Y and X.
• 4) Imputing from a conditional distribution: the distortion of covariances can be eliminated if each missing value of Y is replaced not by a regression prediction but by a random draw from the conditional or predictive distribution of Y given X, i.e. the regression prediction plus a random error term. All four approaches are sketched in the code below.
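• The following sketch (illustrative, not from the slides) implements the four single-imputation schemes on simulated (x, y) data with missing values in y, using pandas and statsmodels.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
df = pd.DataFrame({"x": rng.normal(0, 1, n)})
df["y"] = 2 * df["x"] + rng.normal(0, 1, n)
df.loc[rng.random(n) < 0.3, "y"] = np.nan          # make roughly 30% of y missing
miss = df["y"].isna()

# 1) unconditional mean imputation
y1 = df["y"].fillna(df["y"].mean())

# 2) hot deck: random draws from the observed y values
y2 = df["y"].copy()
y2[miss] = rng.choice(df.loc[~miss, "y"].values, miss.sum())

# 3) conditional mean: regression of y on x fitted to the complete cases
ols = sm.OLS(df.loc[~miss, "y"], sm.add_constant(df.loc[~miss, "x"])).fit()
pred = ols.predict(sm.add_constant(df.loc[miss, "x"]))
y3 = df["y"].copy()
y3[miss] = pred

# 4) predictive distribution: regression prediction plus a random normal error
y4 = df["y"].copy()
y4[miss] = pred + rng.normal(0, np.sqrt(ols.scale), miss.sum())

Comparing the variances and the correlation with x across y1–y4 reproduces the pattern described above: only the fourth scheme keeps both roughly intact.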
Figure (schematic): scatter plots of Y against X showing the imputed values under 1) mean substitution, 2) hot deck, 3) conditional mean, and 4) predictive distribution.
• 1) Mean substitution causes all imputed values to fall on a horizontal line; according to simulation studies it produces biased estimates for any type of missingness.
• 2) Conditional mean substitution causes them to fall on a regression line, which also introduces bias.
• 3) The hot deck produces an elliptical cloud with too little correlation and likewise yields biased estimates for any type of missingness.
• 4) The only method that produces a reasonable point cloud is imputation from the conditional distribution of Y given X; it is unbiased.
• However, with all of these single-imputation methods the confidence-interval coverage is very low (see Schafer and Graham 2002, p. 161, for details).
• Solution: the modern methods of imputation, MI and ML (a small multiple-imputation sketch follows below).
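• As one concrete (and hedged) illustration of multiple imputation, scikit-learn's IterativeImputer with sample_posterior=True can generate several completed data sets, each analyzed separately and then pooled across imputations; the toy example below pools a correlation. Real analyses would also pool standard errors with Rubin's rules.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(4)
x = rng.normal(0, 1, 300)
y = x + rng.normal(0, 1, 300)
y[rng.random(300) < 0.3] = np.nan
X = np.column_stack([x, y])

# m imputations: impute, analyze each completed data set, then pool the estimates
m = 5
corrs = []
for i in range(m):
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    completed = imputer.fit_transform(X)
    corrs.append(np.corrcoef(completed.T)[0, 1])
print("pooled correlation estimate:", np.mean(corrs))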
Therapy (3)
• Modern methods:
• 1) FIML (full information maximum likelihood). The FIML discrepancy function sums N casewise contributions to the likelihood, each measuring the discrepancy between that case's observed data and the current parameter estimates, using all the data available for the case. FIML is a direct method in the sense that model parameters and standard errors are estimated directly from the available data. Missing data (MD) points are not estimated or imputed; they are essentially treated as values that were never intended to be sampled. A minimal sketch of the casewise-likelihood idea follows this list.
• Advantage: the algorithm uses all the available information, and the method is both consistent and efficient under MAR.
• Disadvantage: the method is model dependent, as it uses information only from variables in the model (different variables in the model, different results).
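• A minimal sketch of the casewise-likelihood idea behind FIML, for a bivariate normal model with missing values in y; the parameterization and data are illustrative assumptions, and in practice one would use an SEM program as noted below.

import numpy as np
from scipy import optimize
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)
n = 500
x = rng.normal(0, 1, n)
y = 0.5 * x + rng.normal(0, 1, n)
y[rng.random(n) < 0.3] = np.nan          # some y values are missing
data = np.column_stack([x, y])

def neg_loglik(theta):
    # theta = (mu_x, mu_y, log sd_x, log sd_y, atanh(rho))
    mu = theta[:2]
    sx, sy, rho = np.exp(theta[2]), np.exp(theta[3]), np.tanh(theta[4])
    cov = np.array([[sx**2, rho * sx * sy], [rho * sx * sy, sy**2]])
    ll = 0.0
    for row in data:
        obs = ~np.isnan(row)             # use only the observed part of each case
        ll += multivariate_normal.logpdf(row[obs], mu[obs], cov[np.ix_(obs, obs)])
    return -ll

res = optimize.minimize(neg_loglik, x0=np.zeros(5), method="BFGS")
print("FIML estimates of the two means:", res.x[:2])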
Assumptions of ML estimates
• 1) They assume that the sample is large enough, and the data normally distributed, for the ML estimates to be approximately unbiased.
• 2) They assume some model for the complete data, and MAR.
• 3) However, in many realistic applications, and according to simulation studies, departures from the last two assumptions are not large enough to effectively invalidate the results.
• 4) According to simulations, non-normality is not a crucial problem: Graham and Schafer (1999) reported excellent performance when imputing non-normal missing data, even with small samples.
• 5) For large enough N (over 250), MI and ML estimates are very similar.
• 6) Conclusion: FIML and Bayesian approaches are state of the art and should be used.
• Moreover, the FIML procedure is attractive because it is very easy to implement. FIML is available in most SEM programs (AMOS, LISREL, Mplus). Data imputation is also available in LISREL, Mplus, and AMOS.