Missing data Methods - Dundee University School of Medicine

Missing Data:
Where has my data gone?
Peter T. Donnan
Professor of Epidemiology and Biostatistics
The Tao of Missingness
“The inside and the
outside are one”
Zen philosopher
“Nothing is more real than nothing”
Samuel Beckett
Overview
• Why missing data matters
• Some useful definitions
• Practical issues
• Methods for imputation
Missing data is inevitable!
• Trials or observational studies are set up to obtain complete data from everyone
• Multiple reminders for questionnaire data
• Important to distinguish valid unknown, not applicable, lost to follow-up, etc.
• It’s not missing, it’s unknown!
• Despite investigators’ best efforts missing data is inevitable
• The key is to minimise loss of data in the first place
Why does data go missing?
• Poor trial management, lack of follow-up
• Patients have Adverse Events (AE) and drop-out
• Patients fail to attend clinic / fill in questionnaire
• Migrate with no information available (They don’t write, they don’t call!)
• Leave study for no apparent reason
Some real examples of
reasons for missing data
• “Emergency Christmas shopping” (reason for missed visit, early November)
• “The drugs will interfere with my drinking” (reason for eligible pt saying No to trial)
• “No you can’t come and see me: I’m better” (pt dropping out at V3)
• Changed address and/or phone number rendered pts untraceable (more frequent in the West)
• Two pts co-operated but refused photographs, one on religious grounds (despite giving consent)
Does it matter?
• Missing data can seriously damage
a study’s credibility
• Two main problems;
May introduce bias
Reduces Power
Note that even worse in regression:
• Pairwise comparisons leave out 38%+
• So two-group comparisons not too bad
• Regression or any other multidimensional analysis leaves out 75% of data: COMPLETE-CASE ONLY ANALYSIS
[Table: example dataset for 8 patients (ID, BMI, HbA1c, LDL, Chol, HDL) with values missing in several columns]
Practical Tip 1
• Complete Case analysis is where the missing data problem is ignored
• Patients with missing data are excluded
• This will be obvious from the constructed tables
• The n in the tables reporting the analysis will be less than the N enrolled
• Even worse the dataset used may differ by outcome as n may change
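A minimal sketch of how the n shrinks under complete-case analysis. The records and variable names below are hypothetical illustrations, not the data from the slides.

```python
# Hypothetical records: None marks a missing measurement (illustrative values only).
records = [
    {"id": 1, "bmi": 35.2, "hba1c": 9.1, "ldl": None},
    {"id": 2, "bmi": 26.3, "hba1c": None, "ldl": 4.3},
    {"id": 3, "bmi": None, "hba1c": 7.0, "ldl": 3.9},
    {"id": 4, "bmi": 28.3, "hba1c": 11.3, "ldl": 5.4},
]
variables = ["bmi", "hba1c", "ldl"]


def n_available(rows, names):
    """Count rows in which every variable in `names` is observed."""
    return sum(all(row[name] is not None for name in names) for row in rows)


n_enrolled = len(records)
# n used when each variable is analysed on its own
n_per_variable = {name: n_available(records, [name]) for name in variables}
# n used when all variables enter one model (complete cases only)
n_complete = n_available(records, variables)

print("N enrolled:", n_enrolled)
print("n per variable:", n_per_variable)
print("n complete cases:", n_complete)
```

Each variable on its own loses one patient, but a model using all three keeps only one patient of four, which is why the n reported in each table should be checked against the N enrolled.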
Practical Tip 1
• A useful and informative procedure is to create a table comparing the characteristics of the complete case dataset and those missing e.g.

Factor      Complete Cases   Missing at 8 weeks
Mean Age    32               50
Mean BMI    19               28
% Male      50%              65%
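One way such a comparison table might be produced is sketched below. The patient records and field names (`outcome_8wk` and so on) are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical patients; outcome_8wk is None when the 8-week outcome is missing.
patients = [
    {"age": 30, "bmi": 18.5, "male": True, "outcome_8wk": 12.0},
    {"age": 34, "bmi": 19.5, "male": False, "outcome_8wk": 15.0},
    {"age": 48, "bmi": 27.0, "male": True, "outcome_8wk": None},
    {"age": 52, "bmi": 29.0, "male": True, "outcome_8wk": None},
]


def summarise(group):
    """Baseline characteristics for one group of patients."""
    return {
        "n": len(group),
        "mean_age": mean(p["age"] for p in group),
        "mean_bmi": mean(p["bmi"] for p in group),
        "pct_male": 100.0 * sum(p["male"] for p in group) / len(group),
    }


complete = [p for p in patients if p["outcome_8wk"] is not None]
missing = [p for p in patients if p["outcome_8wk"] is None]
table = {
    "Complete cases": summarise(complete),
    "Missing at 8 weeks": summarise(missing),
}

for label, row in table.items():
    print(f"{label:20s} n={row['n']}  age={row['mean_age']:.1f}  "
          f"BMI={row['mean_bmi']:.1f}  male={row['pct_male']:.0f}%")
```

Large differences between the two columns suggest that those with missing outcomes are not a random subset of those enrolled.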
One Solution? –
Missing-indicator method
• Code all missing as unknown and include unknown category in regression model (Mea culpa!)
• Advantage that no subject excluded
• Difficult to interpret
• Does not deal with main issue of potential BIAS
• In fact, it will add bias…..
• Fudge rather than solution
Example: Unknown stage (n=40/476) in Cox PH model for colorectal cancer
Variables in the Equation

             B       SE      Wald      df   Sig.   Exp(B)
age          .022    .007    11.000    1    .001   1.022
sexnum       .031    .122    .064      1    .800   1.031
dukes                        114.441   4    .000
dukes (1)    .183    .443    .170      1    .680   1.200
dukes (2)    .882    .423    4.344     1    .037   2.415
dukes (3)    1.961   .423    21.461    1    .000   7.106
dukes (4)    1.427   .448    10.144    1    .001   4.166
cscore       .033    .020    2.815     1    .093   1.034
hyperco      .369    .136    7.386     1    .007   1.446

N.b. Effects of known stages are now biased
HR Unknown Stage vs. Stage A
Imputation: Another
Solution
• Impute missing values and then carry
out analysis with complete dataset
• Advantage that no subject excluded
• Many methods of estimating the
missing values
1. LVCF (LOCF) Last Value Carried Forward
2. Mean or median value of measurements
3. Expected value based on regression
4. Expected value based on E-M algorithm
Some notation
• Yobs – observed data
• Ymiss – missing data
• R – missing data indicator: R = 1 indicates data observed, R = 0 missing
• Prob [R = 0 | Yobs ] – prob of missing data given values of observed data
Some very difficult, opaque,
but essential definitions (1)
Missing Completely at Random (MCAR)
• Prob (Missing) is independent of both: 1) observed data and 2) unobserved data
• Essentially observed data is a random sample of full data
• MCAR is what everyone falsely assumes!
• If MCAR is assumed, observed-case or complete-case analysis is valid.
• Observed-case analysis is software default!
Representation of R as a stratification factor for responses

Response Indicators    Response Vector
R1  R2  R3  R4         Y1  Y2  Y3  Y4
1   1   1   1          y   y   y   y
1   0   1   1          y   *   y   y
1   1   0   0          y   y   *   *
For MCAR:
Prob [R = 0 | Yobs, Ymiss, X ] = Prob [ R = 0 | X]
Possible to test for MCAR
Park-Lee* test for MCAR
• Within framework of GEE (Liang and Zeger)
• Define indicator variables for each missing data pattern
• Fit model with indicators as covariates
• Test regression coefficients for indicators and if significant missing data mechanism is not MCAR
*Park T and Lee S-Y. A test of missing completely at random for longitudinal
data with missing observations. Statist Med 1997; 16: 1859-1871
Example of Park-Lee* test for
MCAR
Fit three indicator variables

Missing data pattern   Wave 1   Wave 2   Wave 3
0                      O        O        O
1                      O        M        O
2                      O        O        M
3                      O        M        M

Covariate   Est/SE
I1          0.65
I2          2.03*
I3          3.51*

Ik = 1 if missing pattern k, = 0 otherwise
For overall test p = 0.0023
Examples: MCAR
• Six Cities Air Pollution Study –
children changed schools because of
parents so unrelated to health of
children
• In a trial, a practice changed its computer system, so missing observations were not related to previous observed or future values
Practical Tip 2
• Check data for MCAR (note SPSS carries out Little’s test)
• If assumption seems reasonable analyse using complete-case only with impunity
• If missing data constitutes < 5% probably reasonable to assume MCAR
• If not, complete-case analysis is likely to be biased
• N.b. MCAR not that common
Another essential definition
Missing At Random (MAR)
• Prob (Missing) is: 1) independent of unobserved data but 2) dependent on observed data
• Essentially observed data is a random sample of full data in each stratum
• MAR is a weaker version of the MCAR assumption
• If MAR is assumed, many methods possible to impute data using observed data.
Missing At Random (MAR)
• Prob (missing) depends on Yobs but not on missing Ymiss
• Prob [ R = 0 | Yobs, Ymiss, X] = Prob [ R = 0 | Yobs, X]
• MCAR is a special case of MAR
• Use fact that missing Y for a person with same age, gender, BP, chol, BMI, etc. will be similar to a person with same characteristics who does have outcome
• Allows imputation methods based on observed data e.g. mean, regression
Examples: MAR
• Six Cities Air Pollution Study – children moved out of area because of non-respiratory problems (e.g. type 1 diabetes)
• Men less likely to attend for follow-up visit but not related to values of their likely outcomes
• Repeated measures where missingness is not related to the values that would have been obtained
Single Imputation
• Most common approach is to add mean of values observed to impute missing
• Takes no account of differences related to other factors e.g. HbA1c
• Takes no account of uncertainty in estimating missing value
• Makes clinicians uneasy!
[Table: the same 8-patient dataset (ID, BMI, HbA1c, LDL, Chol, HDL) with missing values replaced by the mean of the observed values in each column, e.g. BMI 31.2]
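A short sketch of mean imputation and of why it takes no account of uncertainty. The BMI column below is an illustrative assumption (six observed values, two missing), not a reconstruction of the slide's table.

```python
from statistics import mean, stdev

# Illustrative BMI column; None marks a missing value.
bmi = [35.2, 26.3, None, 28.3, None, 40.7, 30.5, 26.1]

observed = [v for v in bmi if v is not None]
fill = mean(observed)  # single value used for every missing entry
imputed = [fill if v is None else v for v in bmi]

print("imputed value:", round(fill, 1))
print("SD of observed values:", round(stdev(observed), 2))
print("SD after mean imputation:", round(stdev(imputed), 2))
```

Every missing patient receives the same value, so the spread of the completed column is smaller than the spread of the observed values and any standard error based on it is too small.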
Single Imputation
• Common method in longitudinal data
• Last Value Carried Forward (LVCF or LOCF)
• Common in RCTs
• Some journals and even FDA endorse
• But statistically unsound unless strong and unrealistic assumptions met (see LSHTM website)

ID   Baseline   4 weeks   8 weeks
1    15         13        13
2    29         32        32
3    43
4    32         29        25
5    19         36        26
6    10         10        13
7    31         25        20
8    19
Carried-forward values overlaid on the table: 43, 19, 43, 18
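A minimal sketch of last value carried forward for one patient's sequence of visits. The patient IDs and values are illustrative assumptions.

```python
def locf(visits):
    """Carry the last observed value forward over later missing visits."""
    filled, last = [], None
    for value in visits:
        if value is not None:
            last = value
        filled.append(last)  # stays None if nothing has been observed yet
    return filled


# Each list is (baseline, 4 weeks, 8 weeks); None marks a missed visit.
trial = {
    1: [15, 13, None],
    3: [43, None, None],
    4: [32, 29, 25],
}
completed = {pid: locf(visits) for pid, visits in trial.items()}
print(completed)
```

The filled-in values are treated as if they had been measured, which assumes the patient's outcome stayed exactly constant after the last visit attended.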
Examples: Single
Imputation
• Last Value Carried Forward (LVCF or
LOCF) very common in RCTs
• Adalimumab in severe Crohn’s disease, nearly 50% of patients were lost to follow-up at 52 weeks in one trial and LVCF used (but relapsing-remitting condition!)
• But legitimate use in Bell’s Palsy Trial!
• No disagreement among statisticians
that method is unsound
Solution is Multiple
Imputation!
1. Assumes data MAR
2. Missing data filled in m times
3. The m complete datasets are each analysed by using standard procedures
4. The results for the m complete datasets are combined for inference
[Figure: three overlapping copies of the dataset (ID, Baseline, 4 weeks, 8 weeks), each with different imputed values in the missing cells]
Multiple Imputation
(MI)
• Process derived by Donald Rubin (1987)
• Replace missing values with set of plausible values that also…
• Represents the uncertainty about the correct value
• Requires MAR assumption but NOT MCAR
• Many methods of estimating imputed values 1) regression, 2) propensity score, 3) MCMC
Step 1: Multiple
Imputation Methods
1) Regression – Missing values predicted by regression model of previous values and covariates
• Fit model Xβ using any variables available (previous values and covariates)
• Repeat if further follow-up results missing
• Extract predicted value and save new dataset with predicted value inserted
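The regression approach can be sketched roughly as follows, under simplifying assumptions: one predictor (the baseline value), illustrative data, and a random residual added to each prediction so that the m datasets differ. A full implementation such as PROC MI would also draw the regression coefficients afresh for each imputation, which this sketch does not do.

```python
import random
from statistics import mean

# Illustrative (baseline, 8-week) pairs; None marks a missing 8-week value.
data = [(15, 13), (29, 32), (43, None), (32, 25),
        (19, 26), (10, 13), (31, 20), (19, None)]

# Fit a simple linear regression of the 8-week value on baseline (observed pairs only).
obs = [(x, y) for x, y in data if y is not None]
x_bar = mean(x for x, _ in obs)
y_bar = mean(y for _, y in obs)
sxx = sum((x - x_bar) ** 2 for x, _ in obs)
slope = sum((x - x_bar) * (y - y_bar) for x, y in obs) / sxx
intercept = y_bar - slope * x_bar
# Residual SD, using n - 2 degrees of freedom for the two fitted parameters.
rss = sum((y - (intercept + slope * x)) ** 2 for x, y in obs)
resid_sd = (rss / (len(obs) - 2)) ** 0.5


def impute_once(rng):
    """Return one completed dataset: predicted value plus a random residual."""
    return [
        (x, y if y is not None else intercept + slope * x + rng.gauss(0, resid_sd))
        for x, y in data
    ]


rng = random.Random(2024)  # fixed seed so the sketch is reproducible
m = 5
completed_datasets = [impute_once(rng) for _ in range(m)]
for k, dataset in enumerate(completed_datasets, start=1):
    print(k, [round(y, 1) for _, y in dataset])
```

Observed values are identical across the m datasets; only the imputed cells vary, and that variation is what later carries the uncertainty into the combined result.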
Missingness
Model
• How do I choose what factors to use in predicting imputed values?
• All factors related to outcome (i.e. all Xs)
• Plus importantly the outcome
• Any other factors possibly related to the reason for being missing
• Better to be overly inclusive and statistical significance not important
Multiple Imputation:
A Cautionary tale
• Hippisley-Cox et al, BMJ 2007 developed a risk algorithm for CVD called QRISK
• 70% of Cholesterol values were missing and imputed using MI assuming data MAR
• Found NO association between CVD and cholesterol
• Investigation showed they had not used CVD outcome in the imputation model
• When rectified ‘true’ association found!
Step 1: Multiple
Imputation Methods
2) Propensity score –
• create indicator variable R=0 for missing
• Fit logistic model Xβ of propensity to be missing (R=0).
• Divide observations by quintiles of propensity score
• Allow random draws (~Bayesian bootstrap) of values from observed data in matching quintile to fill in missing data
Step 1: Multiple
Imputation Methods
3) Markov Chain Monte Carlo (MCMC) –
• Imputation step draws from conditional distribution of Ymiss | Yobs
• Posterior step simulates posterior mean and covariance matrix
• New estimates used iteratively in imputation step
• Process converges (hopefully)
• Incorporates EM algorithm
Step 1: Multiple
Imputation
• All available in PROC MI in SAS software and creates m number of datasets
• Now available in SPSS v. 17
• Note SPSS carries out Little’s test for MCAR
• S-plus – some functions
• Stata has full set of programs for MI
How many (m) datasets do I need?
Too many leads to data management problem
Relative efficiency of using finite m imputations is given by
$RE = (1 + \lambda/m)^{-1}$
where λ is fraction of missing information

RE by number of imputations m (rows) and λ (columns):
m     10%    20%    30%    50%    70%
3     0.97   0.94   0.91   0.86   0.81
5     0.98   0.96   0.94   0.91   0.88
10    0.99   0.98   0.97   0.95   0.93
20    0.99   0.99   0.98   0.98   0.97
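The relative efficiency formula is simple enough to evaluate directly. The sketch below tabulates it for the same m and λ values; the function name is an illustrative choice.

```python
def relative_efficiency(lam, m):
    """RE = (1 + lambda/m)^-1 for fraction of missing information lam and m imputations."""
    return 1.0 / (1.0 + lam / m)


fractions = (0.1, 0.2, 0.3, 0.5, 0.7)
print("m    " + "  ".join(f"{lam:>5.0%}" for lam in fractions))
for m in (3, 5, 10, 20):
    row = "  ".join(f"{relative_efficiency(lam, m):5.3f}" for lam in fractions)
    print(f"{m:<4d} {row}")
```

Even with 50% missing information, m = 5 already gives about 91% efficiency, so the gain from a very large number of datasets is small.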
Step 2: Multiple
Imputation
• Analyse the now complete datasets
in standard way
• T-test, Regression, Survival,
Logistic, GLM, Mixed model, etc…
• Creates a set of parameter
estimates for each of m datasets
Step 3: Multiple
Imputation
• Combine results from m datasets
• Standard way is calculate mean and variance of parameter estimate
$\bar{Q} = \frac{1}{m} \sum_{i=1}^{m} Q_i$
Let $\bar{U}$ be within-imputation variance and $B$ the between-imputation variance then the total variance $T$ is
$T = \bar{U} + \left(1 + \frac{1}{m}\right) B$
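A minimal sketch of the combining step, assuming each of the m analyses has returned a point estimate and its variance. The within-imputation variance is taken as the mean of the m variances and the between-imputation variance as the sample variance (divisor m - 1) of the m estimates; the numbers are illustrative.

```python
from statistics import mean, variance


def pool(estimates, within_variances):
    """Combine m estimates: pooled estimate and total variance T = U + (1 + 1/m) B."""
    m = len(estimates)
    q_bar = mean(estimates)          # pooled point estimate
    u_bar = mean(within_variances)   # within-imputation variance
    b = variance(estimates)          # between-imputation variance (divisor m - 1)
    total = u_bar + (1 + 1 / m) * b
    return q_bar, total


# Illustrative results from m = 3 imputed datasets.
estimates = [1.0, 2.0, 3.0]
within_variances = [0.5, 0.5, 0.5]
q_bar, total_var = pool(estimates, within_variances)
print("pooled estimate:", q_bar)
print("total variance:", round(total_var, 4))
print("standard error:", round(total_var ** 0.5, 4))
```

The total variance exceeds the average within-imputation variance whenever the estimates differ across datasets, which is how the uncertainty about the missing values reaches the final standard error.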
Step 3: Multiple
Imputation
• Relatively easy, but fortunately SAS has a procedure to implement this called PROC MIANALYZE
• Good documentation
• SPSS now does this step in v.17!
• MI now considered gold standard methodology for drawing valid inferences in the face of missing data (with MAR)
• Still many people wary
Alternative Solution:
Weighting
• Weight observed data to take account of under-representation of certain response profiles
• Does not involve imputation but assumes MAR
• First proposed in sample survey literature
• Relatively easy as most standard programs allow addition of weighting factor
• Requires weight wi and then complete case analysis weighted by 1/wi
Alternative Solution:
Weighting
• Estimate wi = Pr [R = 1 | Yobs, X], the probability of being observed
• Repeat for multiple time points
• Analyse complete cases weighted by 1/wi
• Example GEE with MAR
• Intuitively good as observed people who resemble those with missing data are given more weight
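A small sketch of the weighting idea, taking wi as the estimated probability of being observed and weighting each complete case by 1/wi. Here wi is estimated as the observed proportion within each sex, a deliberately simple stand-in for a fitted model; the data are hypothetical.

```python
# Hypothetical data: (sex, outcome); None marks a missing outcome.
people = [
    ("M", 10.0), ("M", None), ("M", None), ("M", 12.0),
    ("F", 20.0), ("F", 22.0), ("F", 24.0), ("F", None),
]


def prob_observed(sex):
    """Estimate Pr[R = 1 | sex] as the proportion observed in that group."""
    group = [y for s, y in people if s == sex]
    return sum(y is not None for y in group) / len(group)


w = {sex: prob_observed(sex) for sex in ("M", "F")}

complete = [(s, y) for s, y in people if y is not None]
naive_mean = sum(y for _, y in complete) / len(complete)
# Each complete case counts 1/w times, standing in for similar people who are missing.
weighted_mean = (sum(y / w[s] for s, y in complete)
                 / sum(1 / w[s] for s, _ in complete))

print("Pr(observed):", w)
print("complete-case mean:", naive_mean)
print("weighted mean:", weighted_mean)
```

Men are under-represented among the complete cases, so each observed man is weighted up and the weighted mean moves back towards what the full sample of four men and four women would have given under MAR.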
Practical Tip 3
• If we assume MAR, method of MI
provides means of valid inference
• Comprehensive software in SAS and
now SPSS
• Other software incorporate as
standard (Stata)
• Consider weighting method as
intuitively appealing
Another essential definition
Missing Not At Random (MNAR)
• Prob (Missing) is dependent on both: 1) unobserved data and 2) observed data
• Often referred to as nonignorable missing mechanism or informative missingness
• MNAR is completely unverifiable from the data
• Need to assess the sensitivity of results to different plausible explanations
• All standard methods are NOT valid
• Ongoing area of research in statistical methods
Examples: MNAR
• QOL missing in those with low quality
of life and so missingness related to
what might have been QOL
• Measurement of weight loss more
likely to be missing if weight loss
likely to be low
Missing Not At Random
(MNAR)
One method uses Structural Equation Modelling (SEM)
• Requires specialist software
• Often referred to as nonignorable missing mechanism or informative missingness
• MNAR is completely unverifiable from the data
• Need to assess the sensitivity of results to different plausible explanations
• All standard methods are NOT valid
• Ongoing area of research in statistical methods
Summary
• Consider hierarchy of missing data
• MCAR, MAR, MNAR
• Ideal is to use MI if MAR
• or Weighting methods if MAR
• Tools now in SPSS
• Need to model missingness mechanism jointly with analysis of outcome if MNAR
• Complete case analysis needs to be justified!
• LVCF needs to be justified!
Summary
“…it is time to place CC
analysis and simple imputation
methods, in particular LOCF,
in the Museum of Statistical
Science..”
Geert Molenberghs
Editorial JRSS A, 2007:861-863
References
• LSHTM website on missing data, sponsored by ESRC (www.lshtm.ac.uk/missingdata/start.html)
• Donders AR, van der Heijden GJ, Stijnen T, Moons KG. Review: a gentle introduction to imputation of missing values. J Clin Epidemiol 2006; 59: 1087-91.
• Sterne JAC, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 2009; 338: b2393.
• Hippisley-Cox J, Coupland C, Vinogradova Y, Robson J, May M, Brindle P. Derivation and validation of QRISK, a new cardiovascular disease risk score for the United Kingdom: prospective open cohort study. BMJ 2007; 335: 136.
• Little RJA and Rubin DB. (1987). Statistical Analysis with Missing Data. John Wiley & Sons, New York.
References
• Dempster AP, Laird NM and Rubin DB. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society 1977; Ser. B, 39: 1-38.
• Rubin DB. (1987). Multiple imputation for nonresponse in surveys. John Wiley & Sons, New York.
• Yuan YC. Multiple imputation for missing data: concepts and new development. SAS Institute Inc (P267-25).
• Software Documentation for SAS®, S-PLUS® and SPSS®.
• R Development Core Team (2005). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
The Tao of Missingness
“There are known knowns. These
are things we know that we know.
There are known unknowns. That is
to say, there are things that we know
we don't know.
But there are also unknown unknowns.
There are things we don't know we
don't know.”
Donald Rumsfeld