Analysis of incomplete data regression models in health services studies

Nicholas J. Horton
Department of Mathematics and Statistics
Smith College
nhorton@email.smith.edu
http://www.math.smith.edu/~nhorton
June 25th, 2006
Acknowledgements
Joint work with Ken Kleinman, Department of
Ambulatory Care Policy, Harvard Medical School
funding from NIH MH54693
Plan for talk
• Introduction and motivation
• Example dataset
• Missing data nomenclature
• Missing data methods
• Application
• Concluding thoughts
Motivation
• missing data a common problem
• may be due to design or happenstance
• ignoring missing data may lead to inefficiency
• ignoring missing data may lead to bias
Motivation (cont.)
• particularly salient for health services research
– more opportunity for missingness in larger
studies
– administrative datasets may not have complete
coverage
– some items may be intentionally censored
• software to fit incomplete data regression
models is improving (but not yet entirely there!)
Example Dataset
• Kids’ Inpatient Database (KID)
• developed by Healthcare Cost and Utilization
Project (HCUP)
• sponsored by Agency for Healthcare Research
and Quality (AHRQ)
• Year 2000 dataset contains data from 27 State
Inpatient Databases
Inferential Goal of Analysis
What factors predict
• whether 10-20 year old subjects
• with a primary, secondary or tertiary diagnosis
of mental health or substance abuse issues
• are discharged from a hospitalization in a
routine fashion (i.e. not discharged against medical
advice (AMA), transferred to another facility, or died)
Predictors with Complete Data
• AGE (in years)
• LOS (length of stay, in days)
• NDX (number of medical diagnoses)
• WEEKEND (=1 if admitted on a weekend)
• FEMALE (=1 if female)
Predictors with Missing Data
• RACE (1=Caucasian, 2=Black, 3=Hispanic,
4=Other)
• SEASON (Winter, Spring, Summer, Fall)
• ATYPE (Admission type: 1=emergency, 2=urgent,
3=elective, 4=other)
• TOTCHG (Total charges, in dollars)
• reasons for missingness?
Missing Data Patterns
(Splus missing data library)
10 variables, 135344 observations, 12 patterns
4 vars. (40%) have at least one missing value
55770 obs. (41%) have at least one missing value
Breakdown by variable
   V   O  name    Missing  % missing
   1   8  TOTCHG     5021          4
   2   2  ATYPE     15093         11
   3  10  SEASON    15616         12
   4   7  RACE      21888         16
Missing Data Patterns (cont.)
(columns 1-4: TOTCHG, ATYPE, SEASON, RACE; m = missing)
      1234   count
  1   ....   79574   <- complete cases
  2   ...m   21335   <- missing RACE
  3   ..m.   15354   <- missing SEASON
  4   .m..   13601   <- missing ATYPE
  5   m...    3665   <- missing TOTCHG
  6   ..mm     213   <- missing SEASON + RACE
  7   .m.m     234
 11   mm..    1213
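A pattern table of this kind can be tabulated in most environments. The following is a minimal sketch in Python with pandas, not the S-Plus missing data library used for these slides. The toy data frame and the helper name `missing_patterns` are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def missing_patterns(df, cols):
    """Count rows by missingness pattern ('.' = observed, 'm' = missing)."""
    flags = df[cols].isna().to_numpy()
    # One pattern string per row, e.g. '...m' when only the last column is missing
    patterns = pd.Series(["".join(row) for row in np.where(flags, "m", ".")],
                         index=df.index)
    return patterns.value_counts()

# Toy data (hypothetical values, not the KID dataset)
toy = pd.DataFrame({
    "TOTCHG": [100.0, np.nan, 250.0, 80.0, 120.0],
    "ATYPE":  [1, 2, np.nan, 1, 3],
    "SEASON": ["Winter", "Fall", "Spring", None, "Summer"],
    "RACE":   [1, 2, 3, None, None],
})
cols = ["TOTCHG", "ATYPE", "SEASON", "RACE"]

pattern_counts = missing_patterns(toy, cols)
pct_missing_by_var = toy[cols].isna().mean() * 100   # % missing per variable
pct_incomplete = toy[cols].isna().any(axis=1).mean() * 100

print(pattern_counts)
print(pct_missing_by_var)
print(pct_incomplete)
```

The same three summaries (patterns, percent missing per variable, percent of incomplete observations) are the ones reported for the KID data.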
Missing Data Nomenclature:
monotonicity
• hierarchy exists such that completeness in one
variable determines completeness of another
• monotone patterns simplify analysis
     1234
 1   ....
 2   ...m
 3   ..mm
 4   .mmm
 5   mmmm
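Whether a set of patterns is monotone can be checked mechanically. The sketch below assumes patterns are written as strings of '.' and 'm' as on these slides; the function name `is_monotone` is hypothetical. It uses the fact that patterns are monotone for some ordering of the variables exactly when the sets of missing variables are nested.

```python
def is_monotone(patterns):
    """Return True if the missingness patterns can be nested.

    Each pattern is a string such as '..mm' ('.' observed, 'm' missing).
    """
    # Set of missing column positions for each pattern, smallest sets first
    missing_sets = sorted(
        ({i for i, c in enumerate(p) if c == "m"} for p in patterns),
        key=len,
    )
    # Nested means every set is contained in the next larger one
    return all(a <= b for a, b in zip(missing_sets, missing_sets[1:]))

monotone_example = ["....", "...m", "..mm", ".mmm", "mmmm"]
kid_patterns = ["....", "...m", "..m.", ".m..", "m...", "..mm", ".m.m", "mm.."]

print(is_monotone(monotone_example))  # True
print(is_monotone(kid_patterns))      # False
```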
Missing Data Nomenclature:
monotonicity
• KID dataset is decidedly non-monotone
• To create a monotone pattern would require
arbitrarily dropping some observations
      1234   count
  1   ....   79574   <- complete cases
  2   ...m   21335   <- missing RACE
  6   ..mm     213   <- missing SEASON + RACE
Notation
• Y outcome of regression model
• X predictor in regression model (typically a
vector, X1, X2, . . . , Xp)
• f (Y |X, β) regression model of interest
Missing data nomenclature: mechanisms
• Introduced by Little and Rubin (text, 1987, 2002)
• Let R indicate whether a particular variable
(say Y2) is missing in a longitudinal study (R = 1 if missing)
• What assumptions are we willing to make
regarding the missingness law:
f (R|Y1, Y2, X, γ)?
Missing data nomenclature:
MCAR (Missing Completely at Random)
• f (R|Y1, Y2, X) = f (R)
• Missingness does not depend on observed or
unobserved quantities
• Example: data fell from the truck
Missing data nomenclature:
MAR (Missing at Random)
• f (R|Y1, Y2, X) = f (R|Y1, X)
• Missingness does not depend on unobserved
quantities
• Example: doctor took a subject off a longitudinal
trial because they were too sick (based on
observed Y1)
Missing data nomenclature:
NINR (Nonignorable nonresponse)
• f (R|Y1, Y2, X) = f (R|Y1, Y2, X) (no simplification)
• Missingness depends on unobserved quantities
• Example: subject missed their observation Y2
because they were too sick to get out of bed
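The three mechanisms can be compared in a small simulation. This is a sketch under assumed settings: the bivariate normal model for (Y1, Y2), the logistic missingness models, and all coefficients are invented for illustration. It contrasts the complete-case mean of Y2 with an estimate that uses the observed Y1 through a regression fitted to the complete cases.

```python
import numpy as np

rng = np.random.default_rng(2006)
n = 200_000
y1 = rng.normal(size=n)                          # always observed
y2 = 0.8 * y1 + rng.normal(scale=0.6, size=n)    # true mean 0, sometimes missing

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

# True means Y2 is missing (R = 1)
mechanisms = {
    "MCAR": rng.random(n) < 0.3,                       # f(R)
    "MAR":  rng.random(n) < expit(-1.0 + 1.5 * y1),    # depends on observed Y1
    "NINR": rng.random(n) < expit(-1.0 + 1.5 * y2),    # depends on unobserved Y2
}

def complete_case_mean(r):
    """Mean of Y2 among subjects with Y2 observed."""
    return y2[~r].mean()

def regression_mean(r):
    """Fit Y2 ~ Y1 on complete cases, then average predictions over everyone."""
    slope, intercept = np.polyfit(y1[~r], y2[~r], 1)
    return (intercept + slope * y1).mean()

for name, r in mechanisms.items():
    print(name, round(complete_case_mean(r), 3), round(regression_mean(r), 3))
```

Under these assumptions the complete-case mean is close to zero only for MCAR, the regression-based estimate also recovers it under MAR, and neither does under NINR.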
Missing data nomenclature
• Little and Rubin showed that under MAR,
likelihood-based approaches can ignore the
missing data mechanism and still yield the right
answer
• MAR impossible to verify without auxiliary
information
• NINR models require a lot of work modeling
missingness, best used for sensitivity analyses
(Partial) Taxonomy of methods
• Complete case
• Multiple imputation methods
• Maximum likelihood methods
• Excellent review by Ibrahim and colleagues
(JASA 2005)
Complete case approach
• Simple
• Main drawback: inefficient
• Uses only 59% of the KID dataset (79,574 of 135,344 observations)!
• May yield bias
Multiple imputation
• ‘fill-in’ the missing values with some ‘appropriate’
value to give a completed dataset
• repeat this process multiple times
• combine results from each of these multiple
imputations
• requires a model to ‘fill-in’ the values
• Originally due to Rubin (1978)
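The combining step is usually done with Rubin's rules: average the point estimates, and add the between-imputation variance (inflated by 1 + 1/m) to the average within-imputation variance. A minimal sketch follows; the function name `pool_rubin` and the example numbers are hypothetical, not results from the KID analysis.

```python
import numpy as np

def pool_rubin(estimates, std_errors):
    """Combine m completed-data estimates of one parameter with Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(std_errors, dtype=float) ** 2
    m = len(q)
    q_bar = q.mean()                  # pooled point estimate
    within = u.mean()                 # average within-imputation variance
    between = q.var(ddof=1)           # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, np.sqrt(total)

# Hypothetical log odds ratios and standard errors from m = 5 imputations
est = [0.110, 0.121, 0.117, 0.125, 0.115]
se = [0.017, 0.017, 0.016, 0.017, 0.018]
pooled_est, pooled_se = pool_rubin(est, se)
print(pooled_est, pooled_se)
```

The pooled standard error is never smaller than the average within-imputation standard error, which is how the uncertainty due to the missing values enters the final inference.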
Specifying imputation model
• Most complicated task (since running the
separate analyses is fast and cheap)
• Simpler when the predictors and outcome are
plausibly multivariate normal
• Harder with categorical missing values
• Even harder if non-monotone
Specifying imputation model for
dichotomous variable (cont.)
• Use a normal model (but how to include in
analysis?)
• Use a normal model and round (leads to bias,
Horton et al 2003)
• Use a discriminant model (slightly harder but
feasible)
Models for dichotomous imputation
Correctly specifying the imputation model for
dichotomous variables is feasible (Rubin, 1987,
p.169)
• estimate probability that Yi = 1
• for each imputation, generate uniform (0,1) RV
• set Yi = 1 if the uniform random variable is less
than the estimated prob. (and 0 o.w.)
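The three steps above can be sketched as follows. The data are simulated and the probability model is deliberately simple (the observed proportion within levels of one fully observed grouping variable `g`); both are assumptions for illustration. A proper imputation would also draw the probability parameters from their posterior distribution before each imputation, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(1987)

# Hypothetical data: binary y with missing values (np.nan), fully observed group g
g = rng.integers(0, 2, size=1000)
y = (rng.random(1000) < np.where(g == 1, 0.7, 0.3)).astype(float)
y[rng.random(1000) < 0.25] = np.nan

def impute_binary(y, g, rng):
    """One imputation of a 0/1 variable using the uniform-draw rule."""
    y_imp = y.copy()
    miss = np.isnan(y)
    for level in np.unique(g):
        in_level = g == level
        # Step 1: estimated Pr(Y = 1) from the observed cases in this group
        p_hat = y[in_level & ~miss].mean()
        # Step 2: one uniform(0, 1) draw per missing value
        idx = np.flatnonzero(in_level & miss)
        u = rng.random(idx.size)
        # Step 3: impute 1 when the draw falls below the estimated probability
        y_imp[idx] = (u < p_hat).astype(float)
    return y_imp

m = 5
completed = [impute_binary(y, g, rng) for _ in range(m)]
print([round(c.mean(), 3) for c in completed])
```

Because each imputed value is a random 0/1 draw rather than a rounded normal value, the completed datasets keep the variable dichotomous without the rounding step.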
Specifying imputation model (cont.)
• What if there are multiple categorical variables
with missing values?
• Straightforward if monotone pattern (SAS
PROC MI)
• Use of MICE in R or Stata (Multiple Imputation using
Chained Equations, van Buuren et al 1999, Royston 2005):
impute one value, use that to impute the next,
and repeat
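The "impute one, use it to impute the next, and repeat" cycle can be sketched for two continuous variables with a non-monotone pattern. This is not the MICE software itself: the simulated data, the linear regression models, and the function name `chained_imputation` are assumptions for illustration. Full implementations also draw the regression parameters from their posterior and use model types suited to each variable (for example, logistic or multinomial models for categorical variables).

```python
import numpy as np

rng = np.random.default_rng(1999)

# Hypothetical data: two correlated variables with a non-monotone pattern
n = 2000
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.5, size=n)
data = np.column_stack([x1, x2])
miss = np.zeros_like(data, dtype=bool)
miss[:400, 0] = True        # x1 missing, x2 observed
miss[400:800, 1] = True     # x2 missing, x1 observed
data[miss] = np.nan

def chained_imputation(data, miss, rng, n_cycles=10):
    """Return one completed dataset from a simple chained-equations cycle."""
    filled = data.copy()
    # Starting values: column means of the observed entries
    col_means = np.nanmean(data, axis=0)
    for j in range(data.shape[1]):
        filled[miss[:, j], j] = col_means[j]
    for _ in range(n_cycles):
        for j in range(data.shape[1]):
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(filled)), others])
            obs = ~miss[:, j]
            # Regress variable j on the current values of the other variables
            beta, *_ = np.linalg.lstsq(design[obs], filled[obs, j], rcond=None)
            resid = filled[obs, j] - design[obs] @ beta
            sigma = resid.std(ddof=design.shape[1])
            # Replace missing entries with the prediction plus random noise
            n_mis = int(miss[:, j].sum())
            filled[miss[:, j], j] = (design[~obs] @ beta
                                     + rng.normal(scale=sigma, size=n_mis))
    return filled

completed = chained_imputation(data, miss, rng)
print(np.corrcoef(completed.T)[0, 1])
```

Running the function several times gives the multiple completed datasets, which are then analyzed separately and combined.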
Maximum likelihood
• Typically we are interested in f (Y |X, β) where
the covariates are assumed fixed
• To gain information from partially observed
subjects, posit a distribution for f (X|α)
• Maximize likelihood of f (Y, X|β, α), typically
through use of the EM (Expectation-Maximization)
algorithm
• Originally due to Ibrahim (1990)
Maximum likelihood (via EM)
Alternate between:
• calculating the Expected value of the missing
observations
• Maximizing the complete data log likelihood
given those values
• formalized by Dempster, Laird and Rubin (1977)
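For a categorical covariate with missing values, the two steps can be written out explicitly. The following is a sketch in the notation of these slides of the weighted-likelihood form of EM (the "method of weights" associated with Ibrahim, 1990), assuming MAR. The index j over the possible covariate values x_j and the iteration superscript (t) are notation introduced here.

```latex
% Observed-data likelihood contribution for subject i with X_i missing
L_i(\beta, \alpha) = \sum_{j} f(y_i \mid x_j, \beta)\, f(x_j \mid \alpha)

% E-step: posterior weight of each possible value x_j given the observed y_i
w_{ij}^{(t)} =
  \frac{f(y_i \mid x_j, \beta^{(t)})\, f(x_j \mid \alpha^{(t)})}
       {\sum_{k} f(y_i \mid x_k, \beta^{(t)})\, f(x_k \mid \alpha^{(t)})}

% M-step: maximize the weighted complete-data log likelihood
(\beta^{(t+1)}, \alpha^{(t+1)}) = \arg\max_{\beta, \alpha}
  \sum_{i} \sum_{j} w_{ij}^{(t)}
  \left[ \log f(y_i \mid x_j, \beta) + \log f(x_j \mid \alpha) \right]
```

Subjects with X_i observed contribute a single term with weight 1, so the M-step is a weighted fit of the usual regression model to an augmented dataset.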
Maximum likelihood implementations in
LogXact
• Supports up to 10 categorical covariates with
missing values, allows covariates to take up to
5 values (e.g. 0,1,2,3,4)
• Uses a simplifying approach due to Lipsitz and
Ibrahim (1996) to partition the joint distribution
of the missing values
Maximum likelihood implementations in
LogXact (cont.)
f (X1, X2, X3, X4) =
f (X1)f (X2|X1)f (X3|X1, X2)f (X4|X1, X2, X3)
• Support for continuous missing values feasible
in future versions (but requires use of MCEM
and further modeling assumptions)
Maximum likelihood implementations in
S-Plus
• Supports multivariate normal for continuous
random variables conditional on categorical
ones (conditional Gaussian) (Schafer, 1997)
• Requires full specification of a log-linear model
for the categorical random variables
Results for KID (descriptive statistics)
variable   percentage
ROUTINE    86%
WEEKEND    20%
FEMALE     54%
WHITE      57%
Results for KID (descriptive statistics)
variable   mean (SD)
AGE        16.3 (2.7)
LOS        6.4 (12.7)
TOTCHG     $9,230 ($17,371)
NDX        3.5 (2.0)
Results for KID (complete case model)
parameter   OR      p-value
AGE         0.96    <0.001
WEEKEND     0.94    0.025
FEMALE      1.09    <0.001
LOS         0.997   <0.001
TOTCHG      0.999   <0.001
NDX         0.90    <0.001
Results for KID (CC, cont.)
parameter   df   p-value
SEASON      3    0.006
RACE        3    <0.001
ATYPE       3    <0.001
SEASON: winter and fall least likely non-routine
RACE: white more likely routine
ATYPE: non-emergency most likely routine
Results for KID (comparisons)
How does accounting for missingness (in this
case using SAS PROC MI and LogXact) affect our
estimates (log OR)?
param             CC                PROC MI           LogXact
AGE est (se)      -0.040 (0.0040)   -0.036 (0.0032)   -0.039 (0.0031)
FEMALE est (se)   0.088 (0.0210)    0.118 (0.0170)    0.106 (0.0161)
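The table entries can be put on the odds ratio scale and the standard errors compared directly. The sketch below copies the estimates from the table; the Wald 95% intervals and the function name `odds_ratio_ci` are additions for illustration and were not reported on the slides.

```python
import math

# Log odds ratio estimates and standard errors copied from the comparison table
results = {
    "AGE":    {"CC": (-0.040, 0.0040), "PROC MI": (-0.036, 0.0032),
               "LogXact": (-0.039, 0.0031)},
    "FEMALE": {"CC": (0.088, 0.0210), "PROC MI": (0.118, 0.0170),
               "LogXact": (0.106, 0.0161)},
}

def odds_ratio_ci(est, se, z=1.96):
    """Odds ratio with a Wald 95% interval from a log odds ratio and its SE."""
    return math.exp(est), math.exp(est - z * se), math.exp(est + z * se)

se_ratios = {}
for param, fits in results.items():
    cc_se = fits["CC"][1]
    for method, (est, se) in fits.items():
        odds_ratio, lower, upper = odds_ratio_ci(est, se)
        # Ratio above 1 means a smaller standard error than complete case
        se_ratios[(param, method)] = cc_se / se
        print(f"{param:7s} {method:8s} OR={odds_ratio:.3f} "
              f"({lower:.3f}, {upper:.3f}) SE ratio={cc_se / se:.2f}")
```

For both parameters the complete-case standard error is roughly 1.2 to 1.3 times the standard error from PROC MI or LogXact.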
Discussion
• Complete case estimator simple, but may
be inefficient and biased (particularly when
missingness depends on Y )
• Missing data methods are available, require
imposition of assumptions (MAR) and additional
effort, but yield efficiency gains (of approximately
25% in our example)
Discussion (cont.)
• a variety of models have been proposed in
the statistical literature, many of these make
simplifying assumptions or have been coded
specifically for a given situation
• developing general methods to handle missingness in this
setting remains difficult (requires compromises)
Future work
• further work is needed to assess sensitivity to
assumptions and areas where these methods
have greatest potential
• use of NINR models in this setting
• accounting for clustering and survey design
(straightforward in Stata)
References
Dempster AP et al (1977) Maximum likelihood from incomplete data via
the EM algorithm, JRSS-B, 39:1-22.
Horton NJ and Laird NM (1999) Maximum likelihood analysis of
generalized linear models with missing covariates, SMIMR, 8:37-50.
Horton NJ and Lipsitz SR (2001) Multiple imputation in practice:
comparison of software packages for regression models with missing
variables, TAS, 55:244-254.
Ibrahim JG (1990) Incomplete data in generalized linear models, JASA,
85:765-769.
Ibrahim JG et al (2005) Missing-data methods for generalized linear
models: a comparative review, JASA, 100:332-346.
Lipsitz SR and Ibrahim JG (1996) A conditional model for incomplete
covariates in parametric regression, Biometrika, 83:916-922.
Little RJA (1992) Regression with missing X’s: a review, JASA, 87:1227-1237.
Little RJA and Rubin DB (2002) Statistical analysis with missing data,
2nd edition, Wiley.
Royston P (2005) Multiple imputation of missing values, Stata Technical
Journal, 5(4):527-536.
Rubin DB (1987) Multiple imputation for nonresponse in surveys, Wiley.
Rubin DB (1996) Multiple imputation after 18+ years, JASA, 91:473-489.
Schafer, JL (1997) Analysis of incomplete multivariate data, Chapman
and Hall.
van Buuren S et al (1999) Multiple imputation of missing blood pressure
covariates in survival analysis, Statistics in Medicine, 18:681-694.