Multiple imputation: handling interactions

Michael Spratt
1
Introduction
• Missing data is a considerable problem
• Complete case analysis will generally lead
to systematic bias
• We have to make some assumption about the missingness mechanism
– Most commonly used is Missing at Random
(MAR)
– Missing Not at Random (MNAR) uses different assumptions
• This talk discusses analysis under the MAR
assumption
2
Introduction : MAR
• In MAR, the probability of being missing
does not depend on the missing data
itself, given the observed data and the
model parameters
– Unlike MNAR analysis, we do not have to
explicitly model the missingness mechanism
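The MAR condition can be written compactly. The notation here is assumed for illustration and is not taken from the slides: R is the vector of missingness indicators, X_obs and X_mis are the observed and missing parts of the data, and φ denotes the parameters of the missingness mechanism.

```latex
P(R \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}, \phi) = P(R \mid X_{\mathrm{obs}}, \phi)
```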
3
Introduction : MAR and multiple
imputation
• Most common approach : perform imputation of
missing data and save multiple imputed datasets
– Each imputed dataset differs (slightly) due to
stochastic nature of imputation
• Then carry out substantive analysis on each of
the imputed datasets
• Then combine the individual results (using
Rubin’s Rules) to obtain combined imputation
estimates and standard errors
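The pooling step can be sketched as follows for a single scalar parameter. The function name `pool_rubin` and the input numbers are illustrative assumptions, not results from the talk.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Combine per-dataset estimates and standard errors with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)                    # pooled point estimate
    within = mean(s ** 2 for s in std_errors)  # average within-imputation variance
    between = variance(estimates)              # between-imputation variance (m - 1 divisor)
    total = within + (1 + 1 / m) * between     # total variance
    return q_bar, math.sqrt(total)

# Hypothetical estimates and standard errors from m = 5 imputed datasets
pooled_est, pooled_se = pool_rubin(
    [0.48, 0.52, 0.50, 0.47, 0.53],
    [0.07, 0.07, 0.07, 0.07, 0.07],
)
```

The pooled standard error is larger than any single-dataset standard error because the between-imputation variance carries the extra uncertainty due to the missing data.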
4
MICE/ICE for imputation assuming MAR
• MICE : Multiple Imputation using Chained Equations has
been widely used for imputation
– Sometimes called FCS (fully conditional specification)
– For general missingness patterns (does not have to assume
monotone missingness)
• Implemented in
– MICE package in R (van Buuren et al.)
– ICE command in Stata (Royston)
– IVEWARE (Raghunathan et al.)
– Potentially, task-specific versions can be written in other programs, e.g.
WinBUGS
• Ref : Multiple imputation of missing blood pressure covariates in survival
analysis. Van Buuren, Boshuizen, Knook. Statistics in Medicine 1999; 18(6):
681–94.
5
MICE/ICE for imputation assuming MAR
• X1, X2, X3 …Xn partially observed
• Zobs represents set of fully observed variables
Chained equations are :
• X1 ~ f(X2, X3 , X4 … Xn, Zobs)
• X2 ~ f(X1, X3 , X4 … Xn, Zobs)
• X3 ~ f(X1, X2 , X4 … Xn, Zobs)
etc.
• Comparable to a Gibbs sampler, but with much shorter chains
• Each chain produces one imputed dataset on termination
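The cycling idea can be sketched for the simplest case. This is a deliberately reduced illustration under stated assumptions, not the MICE or ICE implementation: two continuous variables only, each imputed from a simple linear regression on the other, with no fully observed Z and no draw of the regression parameters (a proper multiple imputation would also draw those). The names `fit_line` and `chained_imputation` are hypothetical.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for ys regressed on xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / (len(xs) - 2)) ** 0.5

def chained_imputation(x1, x2, cycles=10, seed=0):
    """Impute None entries by cycling X1 ~ f(X2), then X2 ~ f(X1)."""
    rng = random.Random(seed)
    miss1 = [v is None for v in x1]
    miss2 = [v is None for v in x2]
    # Starting values: fill each gap with the observed mean
    m1 = mean(v for v in x1 if v is not None)
    m2 = mean(v for v in x2 if v is not None)
    x1 = [m1 if v is None else v for v in x1]
    x2 = [m2 if v is None else v for v in x2]
    for _ in range(cycles):
        for target, other, miss in ((x1, x2, miss1), (x2, x1, miss2)):
            # Fit on rows where the target was observed, using current values of the other variable
            obs = [i for i, is_missing in enumerate(miss) if not is_missing]
            a, b, sd = fit_line([other[i] for i in obs], [target[i] for i in obs])
            # Replace each missing value with a stochastic draw from the fitted model
            for i, is_missing in enumerate(miss):
                if is_missing:
                    target[i] = a + b * other[i] + rng.gauss(0, sd)
    return x1, x2
```

Running the function several times with different seeds would give the multiple imputed datasets; observed values are never altered.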
6
Interactions in the Analysis Model
• Useful practical guides to using imputation to
perform analysis in the presence of missing data
are
– Multiple imputation: current perspectives (Kenward and
Carpenter, Statistical Methods in Medical Research 16: 199–
218)
– Multiple imputation for missing data in epidemiological and
clinical research: potential and pitfalls (Sterne, White,
Carpenter et al., BMJ 2009;338:b2393)
• The imputation model should be at least as rich
as the substantive model
– The imputation model should preserve the structure
of the data
7
MICE/ICE for imputation
• For most datasets where distributional
assumptions are met, MICE/ICE has been
shown in practice to work well for MAR data
• More care is needed when models contain
structures such as interactions, multi-level structure or non-linearity.
In particular, the structure of the
substantive model should be reflected in the
imputation model
• This talk focuses on interactions
8
Why omitting interactions in the
imputation may cause problems
• Take as an example 3 binary variables X, Y, Z
• We are interested in a substantive analysis, in
the presence of missing data, of the logistic
regression of Y on X and Z with an interaction
• $\operatorname{logit} P(Y = 1 \mid X, Z) = b_0 + b_x x + b_z z + b_{xz} x z$
• We initially have a full [X,Y,Z] dataset, which then
becomes subject to missingness (MAR
mechanisms)
• We would like the parameter estimates obtained after
MAR missingness followed by imputation to be the
same as the full data estimates
9
Why omitting interactions in the
imputation may cause problems
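• The coefficients are the same as the coefficients
of the corresponding log-linear model
– logistic :
• $\operatorname{logit} P(Y = 1 \mid x, z) = b_0 + b_x x + b_z z + b_{xz} x z$
– log-linear (the $m_{xyz}$ on the left is the expected count in cell (x, y, z)) :
• $\log(m_{xyz}) = m_0 + m_x x + m_y y + m_z z + m_{xy} x y + m_{xz} x z + m_{yz} y z + m_{xyz} x y z$
– Examining the bias of $b_x$ is equivalent to examining
the bias of $m_{xy}$; the same holds for $b_z$ and $m_{yz}$, and for $b_{xz}$ and
$m_{xyz}$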
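The correspondence between the two sets of coefficients can be checked numerically. The sketch below uses made-up log-linear parameter values (they are not from the talk), builds the log expected cell counts, and recovers the logistic coefficients as contrasts of logit P(Y = 1 | x, z) = log m(x, 1, z) - log m(x, 0, z).

```python
# Hypothetical log-linear parameters (illustrative values only)
m = {"0": 1.0, "x": 0.2, "y": -0.3, "z": 0.1,
     "xy": 0.5, "xz": 0.4, "yz": 0.5, "xyz": 0.6}

def log_mu(x, y, z):
    """Saturated log-linear model: log expected count in cell (x, y, z)."""
    return (m["0"] + m["x"] * x + m["y"] * y + m["z"] * z
            + m["xy"] * x * y + m["xz"] * x * z + m["yz"] * y * z
            + m["xyz"] * x * y * z)

def logit_y(x, z):
    """logit P(Y = 1 | x, z) implied by the cell means."""
    return log_mu(x, 1, z) - log_mu(x, 0, z)

# Logistic coefficients recovered as contrasts of the four logits
b0 = logit_y(0, 0)
bx = logit_y(1, 0) - b0
bz = logit_y(0, 1) - b0
bxz = logit_y(1, 1) - logit_y(1, 0) - logit_y(0, 1) + b0
```

Every term not involving y cancels in the difference, which is why bx, bz and bxz equal the xy, yz and xyz log-linear terms respectively (and b0 equals the y term).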
10
Why omitting interactions in the
imputation may cause problems
• Omitting interaction terms in the full conditional models will lead to
interactions in the log-linear model being underestimated and hence P(X,
Y, Z) being incorrectly estimated
• This can also be seen by looking at the number of parameter estimates
needed
– If just X is subject to missingness, we need to be able to estimate
P(X | Y, Z) P(Y, Z)
• 4 parameter estimates are needed for P(X | Y, Z)
• This cannot be done with a chained equation without interaction,
$\operatorname{logit} P(X = 1 \mid y, z) = a + b_y y + b_z z$, as there are only 3 free parameters
– If X and Y are subject to missingness, we need to be able to estimate
P(X, Y | Z) P(Z)
• 8 parameter estimates are needed in general for the two full conditionals P(X | Y, Z) and P(Y | X, Z) (4 each)
• This cannot be done with chained equations without interaction,
$\operatorname{logit} P(X = 1 \mid y, z) = a_x + b_{xy} y + b_{xz} z$ and $\operatorname{logit} P(Y = 1 \mid x, z) = a_y + b_{yx} x + b_{yz} z$, as there are only 6 free parameters
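The parameter-counting argument can be illustrated for the case where only X is missing. The four conditional probabilities below are invented for illustration. The saturated model reproduces all four cells with 4 parameters; forcing the interaction coefficient to zero, as a main-effects imputation model does, cannot reproduce the (y = 1, z = 1) cell.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def expit(v):
    return 1 / (1 + math.exp(-v))

# Hypothetical P(X = 1 | y, z) for the four (y, z) cells (illustrative values)
p_x = {(0, 0): 0.20, (0, 1): 0.35, (1, 0): 0.40, (1, 1): 0.80}

# Saturated model: logit P(X = 1 | y, z) = a + b_y y + b_z z + b_yz y z
a = logit(p_x[0, 0])
b_y = logit(p_x[1, 0]) - a
b_z = logit(p_x[0, 1]) - a
b_yz = logit(p_x[1, 1]) - logit(p_x[1, 0]) - logit(p_x[0, 1]) + a

# With all 4 parameters the (1, 1) cell is reproduced
p11_saturated = expit(a + b_y + b_z + b_yz)

# Keeping a, b_y, b_z but setting b_yz = 0 leaves only 3 parameters,
# and the implied probability for the (1, 1) cell no longer matches
p11_main_effects = expit(a + b_y + b_z)
```

A fitted main-effects model would choose different values of a, b_y and b_z, but with only three parameters it still cannot match four arbitrary cell probabilities unless the true interaction is zero.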
11
Passive Imputation
• Interactions are needed in the imputation models. Both the Stata
program ICE and the R MICE package support
passive imputation
• The interaction term is recalculated from the main effects
after every MICE cycle and can then be used in
the subsequent chained equations in the cycle for the
imputation of the other variable(s)
• Other possible approaches :
– Von Hippel, “How to impute interactions, squares and other transformed
variables”, Sociological Methodology 39:265-291, 2009, is a less
established alternative to passive imputation
– It is also worth noting that, where a categorical variable is fully observed,
an alternative method of imputation is to split the data by values of the fully
observed variable and impute each subset separately
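A minimal sketch of the passive step, under stated assumptions: rows are dictionaries that already hold starting values for missing entries, and `draw_x` and `draw_z` are hypothetical callables standing in for draws from fitted imputation models. This is not the ICE or MICE interface. The sketch refreshes the derived term after each main effect is updated, which is one reasonable way to keep it in step.

```python
import random

def cycle_with_passive_term(rows, draw_x, draw_z):
    """Run one imputation cycle, keeping the derived column xz = x * z in step.

    rows   : list of dicts with keys x, z, xz, x_missing, z_missing
             (missing entries are assumed to hold starting values already)
    draw_x : hypothetical callable(row) -> imputed value for x
    draw_z : hypothetical callable(row) -> imputed value for z
    """
    for name, draw in (("x", draw_x), ("z", draw_z)):
        for row in rows:
            if row[name + "_missing"]:
                row[name] = draw(row)
        # Passive step: xz is never modelled directly, only recomputed from x and z
        for row in rows:
            row["xz"] = row["x"] * row["z"]
    return rows

# Usage with placeholder draws (a real imputer would draw from a fitted model)
rng = random.Random(0)
coin = lambda row: int(rng.random() < 0.5)
rows = [
    {"x": 1, "z": 0, "xz": 0, "x_missing": False, "z_missing": True},
    {"x": 0, "z": 1, "xz": 0, "x_missing": True, "z_missing": False},
    {"x": 1, "z": 1, "xz": 1, "x_missing": False, "z_missing": False},
]
rows = cycle_with_passive_term(rows, coin, coin)
```

Because xz is always recomputed rather than imputed, it stays consistent with the current x and z, and an imputation model for another variable such as Y could use `row["xz"]` as a covariate.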
12
Simulation Structure
1. We created a [X,Z] dataset
2. We created Y stochastically given X and Z
3. We stochastically created missingness (MAR) in 1, 2
or 3 variables
4. Using a number of imputation models we did the
imputation and performed the substantive analysis
Steps 2-4 were repeated 100 times and parameter
estimates and standard errors were recorded
We tabulated the median of the parameter
estimates, the median of the confidence intervals
and the coverage of the original data generation
parameter within the parameter estimate’s
confidence intervals
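The tabulated summaries can be sketched as a small helper. The function name `summarise` and the input layout (one tuple of estimate, lower limit and upper limit per replicate) are assumptions made for illustration.

```python
from statistics import median

def summarise(replicates, true_value):
    """Summarise repeated simulations.

    replicates : list of (estimate, ci_lower, ci_upper), one per simulated dataset
    true_value : the data-generating parameter value
    """
    covered = sum(1 for _, lower, upper in replicates if lower <= true_value <= upper)
    return {
        "median_estimate": median(r[0] for r in replicates),
        "median_ci": (median(r[1] for r in replicates),
                      median(r[2] for r in replicates)),
        "coverage": covered / len(replicates),
    }

# Three hypothetical replicates for a parameter whose true value is 0.6
summary = summarise([(0.50, 0.40, 0.61), (0.70, 0.62, 0.80), (0.55, 0.45, 0.65)], 0.6)
```

Coverage is the share of replicates whose confidence interval contains the data-generating value, so a well-behaved 95% interval should give a value near 0.95.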
13
Simulations
• We examined the effect of interactions on
analysis of imputed data in a series of simulation
scenarios involving 3 variables:
– Regression with outcome Y and covariates X and Z
• The simulation scenarios ranged from all 3
variables being binary, through 2 binary variables and
one normal variable, and one binary variable and 2
normal variables, to 3 normal variables
– In each case we varied which of the outcome and
covariates were complete or incomplete
• We present a subset of the simulation scenarios
14
All variables binary; X and Z
incomplete
• Dataset : 20,000 observations, Y generated stochastically
logit(Y) = 0.5 × X + 0.5 × Z + 0.6 × X × Z
• Data divided into 2 sections with Bernoulli distribution (p = 0.5)
[splitting allows missingness to be MAR]
• Two stratified MAR patterns :
– logit(Z is missing) = -2 + X + Y (In one section of data)
– logit(X is missing) = -2 + 1.3 × Z + 0.8 × Y (Other section of data)
• Imputation then substantive analysis performed
• In a second simulation scenario there were 3 stratified MAR
patterns :
– P(Z missing | X, Y) (In section 1 of data)
– P(X missing | Y, Z) (In section 2 of data)
– P(X and Z jointly missing | Y) (In section 3 of data)
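The first scenario (two stratified MAR patterns) can be sketched as below. The outcome model and the two missingness models are taken from the slide. The distribution of X and Z is not stated there, so independent Bernoulli(0.5) draws are assumed purely for illustration, and the function name `simulate` is hypothetical.

```python
import math
import random

def expit(v):
    return 1 / (1 + math.exp(-v))

def simulate(n=20000, seed=0):
    """Generate binary X, Z, Y and impose the two stratified MAR patterns."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Assumption: X and Z are independent Bernoulli(0.5); the slides do not specify this
        x = int(rng.random() < 0.5)
        z = int(rng.random() < 0.5)
        y = int(rng.random() < expit(0.5 * x + 0.5 * z + 0.6 * x * z))
        section = int(rng.random() < 0.5)  # Bernoulli(0.5) split into two sections
        x_obs, z_obs = x, z
        if section == 0:
            # Z may go missing; the probability depends only on X and Y,
            # which are always observed in this section, so the mechanism is MAR
            if rng.random() < expit(-2 + x + y):
                z_obs = None
        else:
            # X may go missing; the probability depends only on Z and Y
            if rng.random() < expit(-2 + 1.3 * z + 0.8 * y):
                x_obs = None
        rows.append({"x": x_obs, "y": y, "z": z_obs, "section": section})
    return rows
```

Splitting the data into sections means that, in each row, the variables driving missingness are always observed, which is what makes the mechanism MAR rather than MNAR.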
15
All variables binary; X and Z incomplete
Imputation model with no interaction: Z ~ X + Y; X ~ Y + Z
Cells show median OR (median CI); "Cov." is CI coverage

Missingness mechanism      Var  % missing (95% range)  Full data         Complete case     Cov.  Imputed, no interaction  Cov.
2 stratified MAR patterns  X    14.0 (13.6, 14.4)      0.49 (0.37,0.62)  0.32 (0.18,0.46)  0.27  0.58 (0.44,0.71)         0.81
(interaction)              Z    23.0 (22.7, 23.4)      0.49 (0.36,0.63)  0.54 (0.39,0.70)  0.90  0.59 (0.44,0.73)         0.82
                           XZ                          0.61 (0.45,0.76)  0.58 (0.40,0.76)  0.91  0.45 (0.28,0.62)         0.62
3 stratified MAR patterns  X    21.8 (21.4, 22.2)      0.49 (0.37,0.62)  0.45 (0.31,0.59)  0.82  0.59 (0.45,0.72)         0.74
                           Z    23.1 (22.6, 23.5)      0.49 (0.36,0.63)  0.43 (0.28,0.59)  0.86  0.61 (0.46,0.76)         0.69
                           XZ                          0.61 (0.45,0.76)  0.59 (0.41,0.77)  0.90  0.45 (0.28,0.62)         0.58
Imputation models:
XY interaction: Z ~ X + Y + XY; X ~ Y + Z
YZ interaction: Z ~ X + Y; X ~ Y + Z + YZ
XY, YZ interaction: Z ~ X + Y + XY; X ~ Y + Z + YZ
Cells show median OR (median CI); "Cov." is CI coverage

Missingness mechanism      Var  Imputed, XY interaction  Cov.  Imputed, YZ interaction  Cov.  Imputed, XY, YZ interaction  Cov.
2 stratified MAR patterns  X    0.54 (0.41,0.67)         0.90  0.55 (0.42,0.68)         0.85  0.50 (0.36,0.64)             0.93
(interaction)              Z    0.56 (0.41,0.70)         0.87  0.55 (0.40,0.69)         0.89  0.50 (0.35,0.65)             0.95
                           XZ   0.51 (0.33,0.69)         0.89  0.51 (0.34,0.69)         0.88  0.60 (0.42,0.77)             0.93
3 stratified MAR patterns  X    0.53 (0.39,0.67)         0.94  0.56 (0.43,0.70)         0.89  0.50 (0.36,0.64)             0.93
                           Z    0.55 (0.39,0.70)         0.86  0.56 (0.41,0.72)         0.85  0.49 (0.33,0.65)             0.94
                           XZ   0.53 (0.35,0.71)         0.87  0.50 (0.32,0.67)         0.81  0.61 (0.42,0.79)             0.93
16
All variables binary; X, Z and Y
incomplete
• Data generated stochastically
logit(Y) = 0.5 × X + 0.5 × Z + 0.6 × X × Z
• Data divided randomly into 3 sections with equal probability
• 3 stochastic stratified MAR patterns :
– logit(Z is missing) = -2 + X + Y (In section 1 of data)
– logit(X is missing) = -2 + 1.3 × Z + 0.8 × Y (In section 2 of data)
– logit(Y is missing) = -1.5 + 1.9 × Z + 0.6 × X (In section 3 of data)
• In a second simulation scenario there were 6 stratified MAR patterns :
– P(Z missing | X, Y) (In section 1 of data)
– P(X missing | Y, Z) (In section 2 of data)
– P(Y missing | X, Z) (In section 3 of data)
– P(X and Y jointly missing | Z) (In section 4 of data)
– P(X and Z jointly missing | Y) (In section 5 of data)
– P(Y and Z jointly missing | X) (In section 6 of data)
17
All variables binary; X, Z and Y incomplete
Cells show median OR (median CI); "Cov." is CI coverage

Missingness mechanism      Var  % missing (95% range)  Full data         Complete case     Cov.  Imputed, no interaction  Cov.
3 stratified MAR patterns  X    11.9 (11.6, 12.2)      0.49 (0.37,0.62)  0.44 (0.31,0.58)  0.82  0.56 (0.43,0.70)         0.83
for Z, X and Y             Z    13.0 (12.7, 13.3)      0.49 (0.36,0.63)  0.40 (0.25,0.56)  0.76  0.61 (0.46,0.76)         0.67
                           XZ                          0.61 (0.46,0.77)  0.56 (0.37,0.74)  0.91  0.43 (0.25,0.60)         0.50
                           Y    17.2 (16.9, 17.6)
6 stratified MAR patterns  X    15.2 (14.8, 15.6)      0.49 (0.37,0.62)  0.46 (0.32,0.59)  0.84  0.57 (0.44,0.71)         0.82
for Z, X and Y             Z    20.9 (20.5, 21.2)      0.49 (0.36,0.63)  0.45 (0.30,0.60)  0.85  0.60 (0.45,0.75)         0.73
                           XZ                          0.61 (0.45,0.76)  0.58 (0.40,0.76)  0.92  0.44 (0.26,0.61)         0.57
                           Y    21.4 (21.0, 21.7)

Var
X
Z
XZ
X
Z
XZ
Imputed, XY, YZ interactions
Imputed, YZ, XZ interactions
Median OR
(median CI)
CI
coverage
Median OR (median CI
Median OR
CI)
coverage (median CI)
3 stratified MAR 0.53 (0.39,0.66)
patterns for Z, X 0.55 (0.40,0.70)
and Y
0.96
0.53 (0.40,0.66)
0.96
0.87
0.53 (0.38,0.68)
0.91
0.52 (0.34,0.69)
0.87
0.55 (0.37,0.73)
6 stratified MAR 0.54 (0.40,0.67)
patterns for Z, X 0.55 (0.40,0.71)
and Y
0.93
Missingness
mechanism
0.51 (0.33,0.68)
Imputed, XY, XZ interactions
Imputed, XY, YZ, XZ
interactions
CI
coverage
Median OR
(median CI)
CI
coverage
0.52 (0.39,0.66)
0.96
0.49 (0.36,0.63)
0.94
0.89
0.94
0.54 (0.39,0.70)
0.54 (0.35,0.71)
0.90
0.49 (0.33,0.65)
0.61 (0.43,0.79)
0.95
0.93
0.53 (0.40,0.67)
0.94
0.51 (0.38,0.65)
0.93
0.50 (0.36,0.63)
0.89
0.90
0.53 (0.38,0.69)
0.94
0.97
0.85
0.55 (0.37,0.72)
0.92
0.52 (0.37,0.67)
0.56 (0.39,0.74)
0.50 (0.34,0.65)
0.60 (0.42,0.78)
0.96
0.90
0.92
18
Y continuous, X and Z binary; X, Z
and Y incomplete
• Data generated stochastically, this time Y is continuous
Y ~ 0.45 × X + 0.55 × Z + 0.6 × X × Z + N(0, 1)
• Data divided randomly into 3 sections with equal
probability
• 3 stochastic stratified MAR patterns :
– logit(Z is missing) = -2.5 + 1.5 × X + Y (In section 1 of data)
– logit(X is missing) = -3 + 2.5 × Z + 0.8 × Y (In section 2 of data)
– logit(Y is missing) = -4 + 2.5 × Z + 2.0 × X (In section 3 of data)
19
Y continuous, X and Z binary; X, Z
and Y incomplete
Cells show median coef (median CI); "Cov." is CI coverage

Var  % missing (95% range)  Full data         Complete case     Cov.  Imputed, no interaction  Cov.
X    13.0 (12.7, 13.3)      0.46 (0.40,0.52)  0.40 (0.34,0.46)  0.58  0.50 (0.43,0.56)         0.74
Z    14.6 (14.2, 14.9)      0.55 (0.49,0.62)  0.47 (0.39,0.54)  0.34  0.62 (0.55,0.69)         0.49
XZ                          0.59 (0.52,0.67)  0.47 (0.39,0.55)  0.12  0.48 (0.40,0.56)         0.20
Y    11.2 (10.9, 11.5)
Columns give the interactions included in the imputation model
Cells show median coef (median CI); "Cov." is CI coverage

Var  Imputed, XY, YZ   Cov.  Imputed, YZ, ZX   Cov.  Imputed, XY, XZ   Cov.  Imputed, XY, YZ, XZ  Cov.
X    0.46 (0.40,0.53)  0.87  0.48 (0.42,0.55)  0.83  0.47 (0.40,0.53)  0.87  0.46 (0.39,0.52)     0.90
Z    0.57 (0.50,0.64)  0.94  0.58 (0.51,0.65)  0.89  0.59 (0.52,0.67)  0.81  0.56 (0.48,0.63)     0.95
XZ   0.56 (0.48,0.65)  0.83  0.55 (0.47,0.63)  0.80  0.55 (0.46,0.63)  0.77  0.59 (0.51,0.67)     0.94
20
Further simulations
• In further simulations similar results were
obtained, where the distributional assumptions
of the imputation models were adhered to
• In each case omitting an interaction in a chained
equation produced biased results. All 2-way
interactions had to be included
• Starting with a tri-variate normal distribution and
introducing a slight interaction (which results in slight
non-normality) also gave imputed estimates
closest to the full data estimates when the full
interactions were introduced into the imputation
model
21
Conclusions
• In general the imputation models should
reflect the structure of the substantive
analysis, and should be at least as rich as
the analysis model
• In order to reflect the structure of the
substantive model, the imputation model
should not exclude its interactions, and
should also include any corresponding
interactions involving the outcome variable
22
Acknowledgements
• This work was done in collaboration with
Jonathan Sterne, Kate Tilling and James
Carpenter
• Helpful comments and suggestions from
Paul Clarke are gratefully acknowledged
23