Testing the Incremental Predictive Accuracy of New Markers
Presenter: Colin Begg
Co-authors: Mithat Gonen, Venkatraman Seshan
Department of Epidemiology and Biostatistics
Memorial Sloan-Kettering Cancer Center
5th Annual UPenn Conference on Statistical Issues
in Clinical Trials
April 2012
The Issue
• In evaluating the incremental predictive or diagnostic accuracy of a new marker, it is commonplace to fit nested models and then test using a comparison of the AUCs derived from the baseline model and from the model augmented with the new marker
• Often two tests are performed:
– A Wald test (or equivalent) from the regression
– A DeLong test comparing the AUCs
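For concreteness, a minimal sketch of this two-test workflow (Python, assuming NumPy, SciPy and statsmodels; the function names are illustrative, not from the talk):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

def delong_test(y, s1, s2):
    """DeLong et al. (1988) test comparing two correlated AUCs
    measured on the same subjects (y = 1 for cases, 0 for controls)."""
    y, s1, s2 = map(np.asarray, (y, s1, s2))
    v10, v01 = [], []
    for s in (s1, s2):
        cases, ctrls = s[y == 1], s[y == 0]
        # psi = 1 if case score > control score, 0.5 for ties, 0 otherwise
        psi = (cases[:, None] > ctrls[None, :]) + 0.5 * (cases[:, None] == ctrls[None, :])
        v10.append(psi.mean(axis=1))   # structural components over cases
        v01.append(psi.mean(axis=0))   # structural components over controls
    auc1, auc2 = v10[0].mean(), v10[1].mean()
    c = np.array([1.0, -1.0])
    var = c @ np.cov(v10) @ c / len(v10[0]) + c @ np.cov(v01) @ c / len(v01[0])
    z = (auc1 - auc2) / np.sqrt(var)
    return auc1 - auc2, 2 * stats.norm.sf(abs(z))

def two_test_workflow(y, m1, m2):
    """The commonplace strategy: a Wald test for the new marker from the
    augmented logistic model, plus a DeLong test comparing the AUCs of
    the derived predictors of the nested models."""
    fit1 = sm.Logit(y, sm.add_constant(m1)).fit(disp=0)
    fit2 = sm.Logit(y, sm.add_constant(np.column_stack([m1, m2]))).fit(disp=0)
    wald_p = fit2.pvalues[-1]          # Wald p-value for the new-marker coefficient
    _, auc_p = delong_test(y, fit1.fittedvalues, fit2.fittedvalues)
    return wald_p, auc_p
```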
[Figure] Example from Kwon et al., Radiology 2011: detection of coronary artery disease.
General Observations
• The Wald test and the ROC AUC test would appear to be testing the
same hypothesis
• It doesn’t feel right to require a marker to survive two significance tests
• Anecdotal evidence suggests the ROC test is frequently non-significant for markers that are significant in the regression test
• The use of one test (or testing strategy) with clear statistical
properties seems advisable
– But which test should be used?
Simulations
• Standard bivariate normal markers generated with mean
(μ1,μ2) and correlation ρ
• μ2 = effect of new marker
• μ1 = effect (collective) of existing predictors
• Generate datasets and analyze the incremental effect of μ2 using the Wald test and the DeLong et al. AUC test
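A sketch of one replicate of this design (assuming the usual binormal setup, in which cases are shifted in mean by (μ1, μ2) relative to controls with unit variances; the equal case/control split is an illustrative assumption):

```python
import numpy as np

def simulate(n=500, mu1=0.3, mu2=0.2, rho=0.0, seed=None):
    """One simulated dataset: (m1, m2) bivariate normal with unit
    variances and correlation rho; cases shifted in mean by (mu1, mu2)."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n // 2)                 # controls, then cases
    cov = [[1.0, rho], [rho, 1.0]]
    m = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    m[y == 1] += [mu1, mu2]                       # mean shift for cases only
    return y, m[:, 0], m[:, 1]

# e.g., one null-marker replicate analyzed with the workflow sketched earlier:
# y, m1, m2 = simulate(mu2=0.0, seed=1); two_test_workflow(y, m1, m2)
```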
From Vickers et al. BMC Med Res Meth 2011
Marker Effect   Base Effect         Proportion Significant
μ2              μ1           ρ      Wald     AUC
0.0 (null)      0.1          0.0    .06      .00
0.0 (null)      0.1          0.3    .05      .00
0.0 (null)      0.3          0.0    .04      .00
0.0 (null)      0.3          0.3    .05      .00
0.2             0.1          0.0    .61      .17
0.2             0.1          0.3    .64      .20
0.2             0.3          0.0    .59      .10
0.2             0.3          0.3    .64      .11
AUC test is exceptionally conservative with very low power.
The Delong et al. (1988) AUC Test
Simulations from Venkatraman and Begg (Biometrika 1996)
AUC of 1st   AUC of 2nd   Sample         Test Size
marker       marker       Size      ρ = 0.0   ρ = 0.5
0.6          0.6           80       .06       .06
0.6          0.6          160       .05       .04
0.8          0.8           80       .04       .04
0.8          0.8          160       .06       .04
Key assumption: observations (pairs) are i.i.d.
Why is the AUC Test Invalid in this Context?
• Notation
– Baseline predictor – m1i (could be multivariate)
– New marker – m2i
– Outcome (binary) – yi
– Model: $\mathrm{logit}\{E(Y_i)\} = \beta_0 + \beta_1 m_{1i} + \beta_2 m_{2i}$
• Derived predictors $w_{1i}$ and $w_{2i}$:
$w_{1i} = \hat\beta_0 + \hat\beta_1 m_{1i}$ (from the baseline model)
$w_{2i} = \tilde\beta_0 + \tilde\beta_1 m_{1i} + \tilde\beta_2 m_{2i}$ (from the augmented model)
• The AUC approach derives the AUC of $(y_i, w_{1i})$ and the AUC of $(y_i, w_{2i})$ and compares them using the DeLong et al. (1988) test
• This test accommodates the fact that w1i and w2i are correlated.
• What’s wrong with this approach?
Problems with AUC Test in Nested Models
Two fundamental problems
• (w1i, w2i) are not independent between patients
– corr(w1i, w1j), corr(w2i, w2j) and corr(w1i, w2j) are all
typically strongly positive
– In a later example (n = 55, μ1 = 0.3, μ2 = 0.0) these are, respectively, 0.50, 0.41 and 0.35 (see the sketch after this list)
• Derived AUCs influenced strongly by the concept of
known directionality
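A self-contained Monte Carlo sketch of the first point: regenerate the dataset many times and correlate the derived predictors of two fixed "patients" across replicates (the indices and replicate count are illustrative; rare non-converged fits are not handled):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, mu1, reps = 55, 0.3, 2000                     # new marker is null (mu2 = 0)
y = np.repeat([0, 1], [n - n // 2, n // 2])
W1, W2 = [], []
for _ in range(reps):
    m1 = rng.normal(mu1 * y, 1.0)                # baseline marker, shifted in cases
    m2 = rng.normal(0.0, 1.0, n)                 # null new marker
    W1.append(sm.Logit(y, sm.add_constant(m1)).fit(disp=0).fittedvalues)
    W2.append(sm.Logit(y, sm.add_constant(np.column_stack([m1, m2]))).fit(disp=0).fittedvalues)
W1, W2 = np.array(W1), np.array(W2)
i, j = 0, 1                                      # two fixed patients
print(np.corrcoef(W1[:, i], W1[:, j])[0, 1],     # corr(w1i, w1j)
      np.corrcoef(W2[:, i], W2[:, j])[0, 1],     # corr(w2i, w2j)
      np.corrcoef(W1[:, i], W2[:, j])[0, 1])     # corr(w1i, w2j)
```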
What is Known Directionality?
• Predictors from regression models inherently strive to improve
predictability
• If a new marker is truly null, it is equally likely to produce a positive
or a negative regression parameter estimate
• Either way, the model interprets a non-zero parameter estimate as a contribution of predictive information, and this generally pushes the AUC upwards regardless of the sign of the effect
• In contrast, if we are comparing distinct markers as in a conventional
ROC comparison of diagnostic tests, the AUC of the new marker is
equally likely to be smaller or larger than the control AUC under the
null
Consequences for AUC Test
[Figure] Solid curve: observed test statistic from 5000 simulations; dashed curve: asymptotic null distribution. (μ1 = 0.3, μ2 = 0.0, ρ = 0, n = 500)
Although the test statistic is biased upwards, its greatly reduced
variance (relative to the asymptotic variance) makes it unlikely for
the statistic to be in the critical region.
Impact of Known Directionality
[Figure]
A Valid Area Test
• Define w1 = {m1i} and w2 = {m2i}
• Construct an orthogonal decomposition of w2
– w2 = Pw2 + (I – P)w2
– $P = X(X'X)^{-1}X'$ where X = (1 w1 z) and z represents other covariates
• w2c = (I – P)w2 forms an exchangeable sequence under the null
hypothesis
• To create a valid reference distribution we can permute w2c, regenerate w2 = Pw2 + permuted w2c, and re-run the regressions and the AUC test (a sketch follows below)
• Simulations show that this has size and power similar to (slightly
lower than) the Wald test.
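A sketch of the permutation scheme (assuming NumPy, statsmodels and scikit-learn; for brevity the raw AUC difference is used as the test statistic rather than the DeLong statistic, an illustrative simplification):

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

def projection_auc_test(y, m1, m2, n_perm=1000, seed=None):
    """Permutation reference distribution for the AUC increment in a
    nested model, built by permuting the component of m2 orthogonal
    to the baseline design."""
    rng = np.random.default_rng(seed)
    m1, m2 = np.asarray(m1, float), np.asarray(m2, float)
    X = sm.add_constant(m1)                       # baseline design (1, w1)
    P = X @ np.linalg.solve(X.T @ X, X.T)         # projection X(X'X)^{-1}X'
    fitted, resid = P @ m2, m2 - P @ m2           # P w2 and (I - P) w2

    auc1 = roc_auc_score(y, sm.Logit(y, X).fit(disp=0).fittedvalues)

    def auc2(m2_star):                            # AUC of the augmented model
        Z = sm.add_constant(np.column_stack([m1, m2_star]))
        return roc_auc_score(y, sm.Logit(y, Z).fit(disp=0).fittedvalues)

    observed = auc2(m2) - auc1
    perm = np.array([auc2(fitted + rng.permutation(resid)) - auc1
                     for _ in range(n_perm)])
    return observed, (1 + np.sum(perm >= observed)) / (1 + n_perm)
```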
Power of the “Projection AUC” Test
(n=500; 5000 simulations)
Marker Effect   Base Effect         Proportion Significant
μ2              μ1           ρ      Wald    AUC     Proj. AUC
0.0 (null)      0.0          0.0    .05     .00     .05
0.0 (null)      0.0          0.5    .05     .00     .05
0.0 (null)      0.3          0.0    .06     .00     .06
0.0 (null)      0.3          0.5    .05     .00     .05
0.2             0.0          0.0    .59     .18     .57
0.2             0.0          0.5    .75     .29     .72
0.2             0.3          0.0    .60     .11     .53
0.2             0.3          0.5    .70     .17     .68
The Projection AUC test has power approaching that of the Wald test.
Conclusions
• The asymptotic reference distribution of the DeLong et al. AUC test
is grossly invalid when the markers being compared are derived
predictors from nested regression models.
• The test statistic (difference in AUCs) is fine – with a valid reference distribution the power approaches that of the Wald test
Two Natural Follow-Up Questions
• Independent validation samples
– These are widely used to correct for over-confidence in risk
prediction models
– Is the strategy of comparing AUCs from “validation” samples
appropriate and valid?
• Non-nested models
– Is the AUC test valid in this context?
Independent Validation Samples
• The samples being compared are new cases to which the predictors (and parameter estimates) from the previously fitted models are applied
– One cannot perform a Wald-type test on these predictors, though one could repeat the Wald analysis on the entire validation dataset
• One could take these predictors and perform an AUC test
• But this is problematic
– Since the AUC test is a rank test, we are in effect comparing {w*1i} versus {w*2i}, where
$w^*_{1i} = m_{1i}$
$w^*_{2i} = m_{1i} + (\tilde\beta_2/\tilde\beta_1)\, m_{2i}$
– Since $E(\tilde\beta_2) = 0$ under the null, we are comparing {w*1i} with the same marker plus additional noise
– This does not feel entirely satisfactory
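A sketch of the comparison just described (illustrative helper: coefficients estimated on the training set are frozen and applied to the validation set, whose derived predictors could then be compared, e.g. with the DeLong test sketched earlier):

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

def validation_aucs(y_tr, m1_tr, m2_tr, y_va, m1_va, m2_va):
    """Fit both nested models on the training data, apply the frozen
    coefficients to the validation data, return the validation AUCs."""
    b1 = sm.Logit(y_tr, sm.add_constant(m1_tr)).fit(disp=0).params
    b2 = sm.Logit(y_tr, sm.add_constant(np.column_stack([m1_tr, m2_tr]))).fit(disp=0).params
    w1 = sm.add_constant(m1_va) @ b1                              # w*_{1i}
    w2 = sm.add_constant(np.column_stack([m1_va, m2_va])) @ b2    # w*_{2i}
    return roc_auc_score(y_va, w1), roc_auc_score(y_va, w2)
```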
Size of AUC Test in Validation Samples
n (training)   n (test)   μ1 (baseline)   μ2 (new marker)        Test Size
                                                            ρ = 0.0   ρ = 0.5
250             250       0.0             0.0               .04       .05
500             250       0.0             0.0               .05       .05
500             250       0.3             0.0               .07       .06
500             500       0.3             0.0               .09       .07
250            10000      0.3             0.0               .03       .03
500            10000      0.3             0.0               .03       .02
Power of AUC Test in Validation Samples
n (training)   n (test)   μ1 (baseline)   μ2 (new marker)          Power
                                                            ρ      AUC    Wald
250            250        0.0             0.2               0.0    .24    .36
250            250        0.0             0.2               0.5    .29    .44
250            500        0.0             0.2               0.0    .42    .59
250            500        0.0             0.2               0.5    .50    .75
250            250        0.3             0.2               0.0    .14    .36
250            250        0.3             0.2               0.5    .17    .43
250            500        0.3             0.2               0.0    .20    .60
250            500        0.3             0.2               0.5    .25    .70
Comment: Testing the AUC increment using the AUC test does not
appear to be a sensitive approach, especially as the baseline
predictiveness increases.
Is the Test Valid for Non-Nested Models? (n=500)
Marker Effects             Base Effect   Corr(m2, m3)   Proportion Significant
μ2           μ3            μ1            ρ              AUC Test
0.0 (null)   0.0 (null)    0.0           0.0            .00
0.0 (null)   0.0 (null)    0.0           0.5            .00
0.0 (null)   0.0 (null)    0.3           0.0            .00
0.0 (null)   0.0 (null)    0.3           0.5            .00
0.2 (null)   0.2 (null)    0.0           0.0            .05
0.2 (null)   0.2 (null)    0.0           0.5            .04
0.2 (null)   0.2 (null)    0.3           0.0            .03
0.2 (null)   0.2 (null)    0.3           0.5            .04
0.0          0.2           0.0           0.0            .11
0.0          0.2           0.0           0.5            .17
0.0          0.2           0.3           0.0            .06
0.0          0.2           0.3           0.5            .11
General Conclusions Regarding Testing
• The ROC curve (and the AUC measure) is one among many measures that have been proposed for characterizing predictive accuracy
• However, ROC curves derived from predictors from regression
models must be used with caution
– They are affected by optimism bias
– Derived predictors cannot be treated as original data when performing conventional tests of incremental predictive accuracy
• Use of validation samples
– AUC difference is still not strictly unbiased
– Confirming a positive result is much more powerful using the Wald test
• Non-nested models
– The AUC test is invalid, just as in the nested setting
Measuring Improvements in Prediction
• ROC curves have been promoted for use as descriptive
tools for characterizing the degree of improved predictive
power
• Several other measures have also been promoted in recent years, notably:
– NRI – net reclassification improvement
– IDI – integrated discrimination improvement
• Are these tools affected by similar biases?
Net Reclassification Improvement (NRI)
• The general goal of this measure is to determine the extent to which
the new predictive rule improves the classification of patients into
clinically distinct categories
• There are various reclassification indices that depend on predefined classification points
• Here we use the “continuous” version (Pencina Stat Med 2008)
– NRI = the number of times the new predictor improves upon the old predictor, less the number of times the new predictor is inferior; the contribution of cases is weighted by (prevalence)⁻¹ and that of controls by (1 – prevalence)⁻¹. An improvement is defined by
$\hat p_{\mathrm{new}} > \hat p_{\mathrm{old}}$ if the subject is a case
$\hat p_{\mathrm{new}} < \hat p_{\mathrm{old}}$ if the subject is a control
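A minimal sketch of this continuous NRI (the prevalence-weighted sign count above reduces to a difference in the mean sign of the risk change between cases and controls):

```python
import numpy as np

def continuous_nri(y, p_old, p_new):
    """Continuous NRI: net proportion of cases whose risk moves up,
    minus the net proportion of controls whose risk moves up."""
    y = np.asarray(y)
    up = np.sign(np.asarray(p_new) - np.asarray(p_old))   # +1 up, -1 down
    return up[y == 1].mean() - up[y == 0].mean()
```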
Plot of Wald Statistic Versus Standardized NRI (Null Effect)
[Figure: standardized NRI versus Wald statistic]
Model fitting ensures that there is a strong positive bias in the NRI.
Simulations: n = 250; μ1 = 0.3; μ2 = 0.0; ρ = 0.
Corresponding Densities
[Figure] Red curve: Wald statistic; black curve: NRI statistic
Integrated Discrimination Improvement (IDI)
• This measure is an average of the improvements in
sensitivity and specificity due to the additional marker
• IDI = mean over subjects of $\delta_i(\hat p_{\mathrm{new},i} - \hat p_{\mathrm{old},i})$
• where $\delta_i$ = (prevalence)⁻¹ for cases and –(1 – prevalence)⁻¹ for controls
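Equivalently, a minimal sketch (the weighted mean reduces to the difference in mean risk change between cases and controls, i.e. the difference in discrimination slopes):

```python
import numpy as np

def idi(y, p_old, p_new):
    """IDI: mean risk improvement among cases minus mean risk
    change among controls."""
    y = np.asarray(y)
    d = np.asarray(p_new) - np.asarray(p_old)
    return d[y == 1].mean() - d[y == 0].mean()
```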
Plot of Wald Statistic Versus Standardized IDI (Null Effect)
[Figure: standardized IDI versus Wald statistic]
Clearly the two statistics are closely related.
Corresponding Densities
[Figure] Red curve: Wald statistic; black curve: IDI statistic
IDI and |Wald| Are Closely Related
[Figure] Red curve: Wald statistic; black curve: IDI statistic; green curve: |Wald| statistic
Conclusions
• Biases induced by known directionality affect all measures of
discrimination performance
• The use of risk predictors derived from regression models as data
inputs for subsequent tests or measures of discrimination leads to
profound bias, especially when the effects of new markers are close
to the null
• Tests of new markers (in a nested framework) should be carried out using traditional Wald or likelihood ratio tests from the regression model.
• Use of independent validation samples is essential for measuring
the magnitude of the impact of new markers
References
• The work in this talk is mostly based on material in the following two
articles
– Vickers AJ, Cronin AM, Begg CB. One statistical test is sufficient for assessing new predictive markers. BMC Med Res Methodol 2011;11:13.
– Seshan VE, Gonen M, Begg CB. Comparing ROC curves derived from regression models. Manuscript under review. Available at http://www.bepress.com/mskccbiostat/paper20/