Testing the Incremental Predictive Accuracy of New Markers

Presenter: Colin Begg
Co-Authors: Mithat Gonen, Venkatraman Seshan
Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center
5th Annual UPenn Conference on Statistical Issues in Clinical Trials, April 2012

The Issue
• In evaluating the incremental predictive or diagnostic accuracy of a new marker, it is commonplace to fit nested models and then test using a comparison of AUCs derived from the baseline model and from the model augmented with the new marker
• Often two tests are performed:
– A Wald test (or equivalent) from the regression
– A DeLong test comparing the two AUCs

[Figure: example from Kwon et al., Radiology 2011 – detection of coronary artery disease]

General Observations
• The Wald test and the ROC AUC test would appear to be testing the same hypothesis
• It does not seem right to require a marker to survive two significance tests
• There is anecdotal evidence that the ROC test is frequently non-significant for markers that are significant in the regression test
• The use of one test (or testing strategy) with clear statistical properties seems advisable
– But which test should be used?

Simulations
• Standard bivariate normal markers generated with means (μ1, μ2) and correlation ρ
• μ2 = effect of the new marker
• μ1 = (collective) effect of the existing predictors
• Generate datasets and analyze the incremental effect of μ2 using the Wald test and the DeLong et al. AUC test

From Vickers et al., BMC Med Res Methodol 2011

  Marker effect   Base effect          Proportion significant
  μ2              μ1             ρ     Wald    AUC
  0.0 (null)      0.1            0.0   .06     .00
                                 0.3   .05     .00
                  0.3            0.0   .04     .00
                                 0.3   .05     .00
  0.2             0.1            0.0   .61     .17
                                 0.3   .64     .20
                  0.3            0.0   .59     .10
                                 0.3   .64     .11

The AUC test is exceptionally conservative, with very low power.

The DeLong et al. (1988) AUC Test
Simulations from Venkatraman and Begg (Biometrika 1996)

  AUC of        AUC of                      Test size
  1st marker    2nd marker   Sample size    ρ = 0.0   ρ = 0.5
  0.6           0.6          80             .06       .06
                             160            .05       .04
  0.8           0.8          80             .04       .04
                             160            .06       .04

Key assumption: observations (pairs) are i.i.d.

Why is the AUC Test Invalid in this Context?
• Notation
– Baseline predictor: m1i (could be multivariate)
– New marker: m2i
– Outcome (binary): yi
– Model: logit{E(yi)} = β0 + β1·m1i + β2·m2i
• Derived predictors w1i and w2i:
– w1i = β̂0 + β̂1·m1i (from the baseline model)
– w2i = β̃0 + β̃1·m1i + β̃2·m2i (from the augmented model)
• The AUC approach derives the AUC of (yi, w1i) and the AUC of (yi, w2i) and compares them using the DeLong et al. (1988) test
• This test accommodates the fact that w1i and w2i are correlated
• What's wrong with this approach?

Problems with AUC Test in Nested Models
Two fundamental problems:
• (w1i, w2i) are not independent between patients
– corr(w1i, w1j), corr(w2i, w2j) and corr(w1i, w2j) are all typically strongly positive
– In a later example (n = 55, μ1 = 0.3, μ2 = 0.0) these are, respectively, 0.50, 0.41 and 0.35
• Derived AUCs are influenced strongly by the concept of known directionality
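To make the simulation design above concrete, here is a minimal sketch of the Wald-versus-DeLong comparison. This is our own illustration, not the authors' code: the function names (delong_test, one_replicate) are invented for this sketch, statsmodels and scipy are assumed to be available, and the DeLong (1988) variance is implemented directly from its placement-value ("structural component") form.

import numpy as np
import statsmodels.api as sm
from scipy import stats

def delong_test(y, s1, s2):
    """Two-sided paired DeLong (1988) test for AUC(s1) versus AUC(s2)."""
    cases, ctrls = y == 1, y == 0
    m, n = cases.sum(), ctrls.sum()
    v10, v01, aucs = [], [], []
    for s in (s1, s2):
        x, yv = s[cases], s[ctrls]
        # psi(Xi, Yj) = 1 / 0.5 / 0 according as Xi > / = / < Yj
        psi = (x[:, None] > yv[None, :]) + 0.5 * (x[:, None] == yv[None, :])
        v10.append(psi.mean(axis=1))     # placement values of the cases
        v01.append(psi.mean(axis=0))     # placement values of the controls
        aucs.append(psi.mean())
    s10, s01 = np.cov(v10), np.cov(v01)  # 2x2 structural-component covariances
    var = (s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m \
        + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n
    z = (aucs[1] - aucs[0]) / np.sqrt(var)
    return 2 * stats.norm.sf(abs(z))

def one_replicate(nobs, mu1, mu2, rho, rng):
    """Wald and DeLong p-values for one simulated dataset."""
    y = rng.binomial(1, 0.5, nobs)
    m = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], nobs) \
        + np.outer(y, [mu1, mu2])        # markers shifted by (mu1, mu2) for cases
    base = sm.Logit(y, sm.add_constant(m[:, 0])).fit(disp=0)
    full = sm.Logit(y, sm.add_constant(m)).fit(disp=0)
    w1, w2 = base.fittedvalues, full.fittedvalues   # derived linear predictors
    return full.pvalues[-1], delong_test(y, w1, w2)

rng = np.random.default_rng(1)
pvals = np.array([one_replicate(500, 0.3, 0.0, 0.0, rng) for _ in range(1000)])
print("Wald rejection rate:  ", (pvals[:, 0] < 0.05).mean())
print("DeLong rejection rate:", (pvals[:, 1] < 0.05).mean())

With μ2 = 0 (a null new marker) the Wald rejection rate should sit near .05 while the DeLong rejection rate collapses toward zero, mirroring the tables above.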
What is Known Directionality?
• Predictors from regression models inherently strive to improve predictability
• If a new marker is truly null, it is equally likely to produce a positive or a negative regression parameter estimate
• Either way, the model interprets a non-zero parameter estimate as a contribution of predictive information, and this will generally increment the AUC upwards regardless of the sign of the effect
• In contrast, if we are comparing distinct markers, as in a conventional ROC comparison of diagnostic tests, the AUC of the new marker is equally likely to be smaller or larger than the control AUC under the null

Consequences for AUC Test
[Figure: solid curve – observed test statistic from 5000 simulations; dashed curve – asymptotic null distribution; μ1 = 0.3, μ2 = 0.0, ρ = 0, n = 500]
Although the test statistic is biased upwards, its greatly reduced variance (relative to the asymptotic variance) makes it unlikely for the statistic to fall in the critical region.

[Figures: Consequences; Impact of Known Directionality]

A Valid Area Test
• Define w1 = {m1i} and w2 = {m2i}
• Construct an orthogonal decomposition of w2:
– w2 = P·w2 + (I – P)·w2
– P = X(X'X)⁻¹X', where X = (1 w1 z) and z represents other covariates
• w2c = (I – P)·w2 forms an exchangeable sequence under the null hypothesis
• To create a valid reference distribution we can permute w2c, regenerate w2 = P·w2 + permuted w2c, rerun the regressions, and recompute the AUC test statistic (see the sketch after the table below)
• Simulations show that this has size and power similar to (slightly lower than) the Wald test

Power of the "Projection AUC" Test (n = 500; 5000 simulations)

  Marker effect   Base effect          Proportion significant
  μ2              μ1             ρ     Wald    AUC    Proj. AUC
  0.0 (null)      0.0            0.0   .05     .00    .05
                                 0.5   .05     .00    .05
                  0.3            0.0   .06     .00    .06
                                 0.5   .05     .00    .05
  0.2             0.0            0.0   .59     .18    .57
                                 0.5   .75     .29    .72
                  0.3            0.0   .60     .11    .53
                                 0.5   .70     .17    .68

The Projection AUC Test has power approaching that of the Wald test.
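The projection test described above is straightforward to prototype. The following is a minimal sketch under stated assumptions (our own function names, statsmodels for the logistic fits, no additional covariates z): project the new marker onto the baseline design matrix, permute the orthogonal residuals, and rebuild the reference distribution of the AUC increment.

import numpy as np
import statsmodels.api as sm

def auc(y, s):
    """Mann-Whitney AUC of score s for binary outcome y."""
    x, yv = s[y == 1], s[y == 0]
    return ((x[:, None] > yv[None, :]) + 0.5 * (x[:, None] == yv[None, :])).mean()

def projection_auc_test(y, m1, m2, nperm=500, rng=None):
    """Permutation ("projection AUC") test for a new marker in a nested logistic model.

    Decomposes m2 into its projection onto the baseline design plus
    orthogonal residuals, which are exchangeable under the null, and
    compares the observed AUC increment to its permutation distribution.
    """
    rng = rng or np.random.default_rng()
    X = sm.add_constant(m1)              # columns (1, w1); append covariates z here
    base = sm.Logit(y, X).fit(disp=0)
    auc_base = auc(y, base.fittedvalues) # baseline AUC is fixed across permutations

    def increment(m2_star):
        full = sm.Logit(y, np.column_stack([X, m2_star])).fit(disp=0)
        return auc(y, full.fittedvalues) - auc_base

    beta = np.linalg.lstsq(X, m2, rcond=None)[0]
    fitted = X @ beta                    # P w2
    resid = m2 - fitted                  # w2c = (I - P) w2
    observed = increment(m2)
    null = [increment(fitted + rng.permutation(resid)) for _ in range(nperm)]
    return (np.sum(np.array(null) >= observed) + 1) / (nperm + 1)

Because known directionality biases the AUC increment upwards, the comparison is one-sided against the permutation distribution; with nperm = 500 the p-value has a resolution of roughly 0.002.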
Conclusions
• The asymptotic reference distribution of the DeLong et al. AUC test is grossly invalid when the markers being compared are derived predictors from nested regression models
• The test statistic (difference in AUCs) is fine – with a valid reference distribution, the power approaches that of the Wald test

Two natural follow-up questions:
• Independent validation samples
– These are widely used to correct for over-confidence in risk prediction models
– Is the strategy of comparing AUCs from "validation" samples appropriate and valid?
• Non-nested models
– Is the AUC test valid in this context?

Independent Validation Samples
• The samples being compared are new cases to which the predictors (and parameter estimates) from the former models are applied
– One cannot perform a Wald-type test on the predictors, though one could repeat the Wald analysis on the entire validation dataset
• One could take these predictors and perform an AUC test
• But this is problematic
– Since the AUC test is a rank test, we are in effect comparing {w1i*} versus {w2i*}, where
    w1i* = m1i
    w2i* = m1i + (β̃2/β̃1)·m2i
– Since E(β̃2) = 0 under the null, we are comparing {w1i*} with the same marker plus additional noise
– This does not feel entirely satisfactory

Size of AUC Test in Validation Samples

  n (training)   n (test)   μ1 (baseline)   μ2 (new marker)   ρ     Test size
  250            250        0.0             0.0               0.0   .04
                                                              0.5   .05
  500            500        0.0             0.0               0.0   .05
                                                              0.5   .05
  250            250        0.3             0.0               0.0   .07
                                                              0.5   .06
  500            500        0.3             0.0               0.0   .09
                                                              0.5   .07
  250            10000      0.3             0.0               0.0   .03
                                                              0.5   .03
  500            10000      0.3             0.0               0.0   .03
                                                              0.5   .02

Power of AUC Test in Validation Samples

  n (training)   n (test)   μ1 (baseline)   μ2 (new marker)   ρ     AUC    Wald
  250            250        0.0             0.2               0.0   .24    .36
                                                              0.5   .29    .44
  500            500        0.0             0.2               0.0   .42    .59
                                                              0.5   .50    .75
  250            250        0.3             0.2               0.0   .14    .36
                                                              0.5   .17    .43
  500            500        0.3             0.2               0.0   .20    .60
                                                              0.5   .25    .70

Comment: Testing the AUC increment using the AUC test does not appear to be a sensitive approach, especially as the baseline predictiveness increases.

Is the Test Valid for Non-Nested Models? (n = 500)

  Marker effects            Base effect   Corr(m2, m3)   Proportion significant
  μ2           μ3           μ1            ρ              AUC test
  0.0 (null)   0.0 (null)   0.0           0.0            .00
                                          0.5            .00
                            0.3           0.0            .00
                                          0.5            .00
  0.2 (null)   0.2 (null)   0.0           0.0            .05
                                          0.5            .04
                            0.3           0.0            .03
                                          0.5            .04
  0.0          0.2          0.0           0.0            .11
                                          0.5            .17
                            0.3           0.0            .06
                                          0.5            .11

General Conclusions Regarding Testing
• The ROC curve (and the AUC measure) is one among many measures that have been proposed for characterizing predictive accuracy
• However, ROC curves derived from predictors from regression models must be used with caution
– They are affected by optimism bias
– Predictors cannot be used as original data to perform conventional tests of incremental predictive accuracy
• Use of validation samples
– The AUC difference is still not strictly unbiased
– Confirming a positive result is much more powerful using the Wald test
• Non-nested models
– The AUC test is invalid, as in the nested setting

Measuring Improvements in Prediction
• ROC curves have been promoted for use as descriptive tools for characterizing the degree of improved predictive power
• Several other measures have also been promoted in recent years, notably:
– NRI – net reclassification improvement
– IDI – integrated discrimination improvement
• Are these tools affected by similar biases?

Net Reclassification Improvement (NRI)
• The general goal of this measure is to determine the extent to which the new predictive rule improves the classification of patients into clinically distinct categories
• There are various reclassification indices that depend on predefined classification points
• Here we use the "continuous" version (Pencina, Stat Med 2008)
– NRI = the number of times the new predictor improves upon the old predictor, less the number of times the new predictor is inferior. The contribution of cases is weighted by 1/prevalence and that of controls by 1/(1 – prevalence). An improvement is defined by
    p̂new > p̂old if the subject is a case
    p̂new < p̂old if the subject is a control

Plot of Wald Statistic Versus Standardized NRI – Null Effect
[Figure: scatter plot of NRI versus Wald statistic]
Model fitting ensures that there is a strong positive bias in the NRI.
Simulations: n = 250; μ1 = 0.3; μ2 = 0.0; ρ = 0.

Corresponding Densities
[Figure: red curve – Wald statistic; black curve – NRI statistic]

Integrated Discrimination Improvement (IDI)
• This measure is an average of the improvements in sensitivity and specificity due to the additional marker
• IDI = mean of λi·(p̂new,i – p̂old,i), where λi = 1/prevalence for cases and –1/(1 – prevalence) for controls

Plot of Wald Statistic Versus Standardized IDI – Null Effect
[Figure: scatter plot of IDI versus Wald statistic]
Clearly the two statistics are closely related.
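Both indices are simple to compute from the old and new risk estimates, and the positive bias under the null is easy to reproduce. A minimal sketch follows (our own code and function names, not the authors'; statsmodels assumed; same simulation design as the NRI scatter plot above):

import numpy as np
import statsmodels.api as sm

def continuous_nri_idi(y, p_old, p_new):
    """Continuous NRI (Pencina) and IDI from old and new risk estimates."""
    up, down = p_new > p_old, p_new < p_old
    # cases contribute with weight 1/prevalence, controls with 1/(1 - prevalence)
    nri = (up[y == 1].mean() - down[y == 1].mean()) \
        + (down[y == 0].mean() - up[y == 0].mean())
    idi = (p_new - p_old)[y == 1].mean() - (p_new - p_old)[y == 0].mean()
    return nri, idi

# Null new marker (mu2 = 0) with a predictive baseline, as in the plot above
rng = np.random.default_rng(2)
n, mu1, mu2, rho = 250, 0.3, 0.0, 0.0
nris = []
for _ in range(500):
    y = rng.binomial(1, 0.5, n)
    m = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], n) \
        + np.outer(y, [mu1, mu2])
    p_old = sm.Logit(y, sm.add_constant(m[:, 0])).fit(disp=0).predict()
    p_new = sm.Logit(y, sm.add_constant(m)).fit(disp=0).predict()
    nris.append(continuous_nri_idi(y, p_old, p_new)[0])
print("mean continuous NRI under the null:", np.mean(nris))  # noticeably above zero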
Corresponding Densities
[Figure: red curve – Wald statistic; black curve – IDI statistic]

IDI and |Wald| Are Closely Related
[Figure: red curve – Wald statistic; black curve – IDI statistic; green curve – |Wald| statistic]

Conclusions
• Biases induced by known directionality affect all measures of discrimination performance
• The use of risk predictors derived from regression models as data inputs for subsequent tests or measures of discrimination leads to profound bias, especially when the effects of new markers are close to the null
• Tests of new markers (in a nested framework) should be accomplished using traditional Wald or likelihood ratio tests from the regression model
• Use of independent validation samples is essential for measuring the magnitude of the impact of new markers

References
• The work in this talk is mostly based on material in the following two articles:
– Vickers AJ, Cronin AM, Begg CB. One statistical test is sufficient for assessing new predictive markers. BMC Med Res Methodol. 2011 Jan 28;11:13.
– Seshan VE, Gonen M, Begg CB. Comparing ROC curves derived from regression models. Manuscript under review. Available at http://www.bepress.com/mskccbiostat/paper20/