Imperfect Gold Standards for Biomarker Evaluation
Rebecca A. Betensky
Conference on Statistical Issues in Clinical Trials
University of Pennsylvania
April 18, 2012

Outline
• Motivation: need for kidney injury biomarkers for diagnosis of acute kidney injury (AKI)
• Impact of an imperfect gold standard on the apparent sensitivity and specificity of a perfect biomarker
• Examine conditional independence assumption: implicit restrictions
• Bounds on true sensitivity and specificity

Serum creatinine for AKI
• Clinicians have used serum creatinine (SCr) to diagnose AKI for decades.
• Acknowledged as an inadequate gold standard:
  – Poor specificity in some settings that are not associated with kidney injury
  – Poor sensitivity in the setting of adequate renal reserve
  – Relatively slow kinetics after injury
• Considerable interest in identifying better biomarkers of tubular injury: potentially more accurate and earlier diagnosis.

How to evaluate new biomarkers?
• Studies have used changes in SCr as the gold standard against which to test novel tubular injury biomarkers.
• Aside from problems of specificity and sensitivity,
  – SCr does not directly reflect tubular function or injury
  – SCr-based diagnosis depends on a cutoff, which affects its true specificity and sensitivity, and thus the apparent specificity and sensitivity of the novel marker.

Conceptual framework
• The actual disease that is the target of the diagnostic test (AKI) is not synonymous with the clinical conditions identified by the imperfect gold standard (SCr).
• AKI is difficult to establish without invasive and risky histopathological assessment.
• Using an imperfect gold standard (i.e., an imperfect reference test) may distort the apparent diagnostic performance of a novel biomarker.

Idealized example of perfect novel biomarker
• Disease prevalence = 20%
• Imperfect gold standard: sensitivity = 80%, specificity = 80%
• Relative to the imperfect gold standard, a perfect novel biomarker will have apparent sensitivity of 16/32 = 50% and apparent specificity of 64/68 = 94%.
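The arithmetic behind the idealized example can be checked with a short sketch. The function name and arguments below are illustrative and do not come from the talk; the sketch assumes the novel biomarker agrees exactly with true disease status, so every disagreement with the gold standard is the gold standard's error.

```python
from fractions import Fraction


def apparent_performance_of_perfect_biomarker(prevalence, sens_g, spec_g):
    """Apparent sensitivity and specificity of a perfect biomarker (B = D)
    when it is scored against an imperfect gold standard G.

    Hypothetical helper for illustration; all arguments are probabilities.
    """
    # Joint probabilities of (true disease status, gold standard result).
    d1_g1 = prevalence * sens_g              # diseased, G positive
    d1_g0 = prevalence * (1 - sens_g)        # diseased, G negative
    d0_g1 = (1 - prevalence) * (1 - spec_g)  # not diseased, G positive
    d0_g0 = (1 - prevalence) * spec_g        # not diseased, G negative

    # A perfect biomarker is positive exactly when disease is present.
    apparent_sens = d1_g1 / (d1_g1 + d0_g1)  # P(B=1 | G=1)
    apparent_spec = d0_g0 / (d0_g0 + d1_g0)  # P(B=0 | G=0)
    return apparent_sens, apparent_spec


# Idealized example: prevalence 20%, gold standard sens 80%, spec 80%.
sens, spec = apparent_performance_of_perfect_biomarker(
    Fraction(20, 100), Fraction(80, 100), Fraction(80, 100)
)
print(sens, spec)  # 1/2 16/17
```

Exact fractions are used so that the result can be compared directly with the counts per 100 patients: 16/17 is the same ratio as 64/68.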
At lower prevalence, the dominant effect of an imperfect gold standard is on a perfect biomarker's apparent sensitivity:

$$\text{apparent sens} = \frac{1}{1 + \dfrac{1-p}{p}\cdot\dfrac{1-\text{spec}_G}{\text{sens}_G}}, \qquad \text{apparent spec} = \frac{1}{1 + \dfrac{p}{1-p}\cdot\dfrac{1-\text{sens}_G}{\text{spec}_G}}$$

where p is the disease prevalence and sens_G and spec_G are the true sensitivity and specificity of the gold standard.

• A gold standard with no false positives is similar to using "need for dialysis" as the imperfect gold standard.
• At prevalence of 20%, the apparent sensitivity of a perfect biomarker is then 100% and its apparent specificity is 84%. The bounds of the apparent AUC are 0.84-1.00.
• Even rare false positives (imperfect gold standard spec = 99%) lead to apparent sensitivity of 86% and bounds of apparent AUC of 0.72-0.98.

Cut-offs for SCr
• Recent clinical studies of novel AKI biomarkers have used a variety of SCr criteria to define AKI.
• These examples illustrate that different choices of cut-off can lead to hugely different apparent properties of a novel biomarker.

What if new biomarker is not perfect?
• Need assumptions on the relationship between the new biomarker, the imperfect gold standard, and disease in order to evaluate the new biomarker.
• Conditional independence is convenient; it allows for latent class models.
• However, it introduces implicit restrictions.

What can we learn for imperfect novel biomarker?
• Previous illustration assumes a perfect novel biomarker.
• Notation: B = new biomarker, G = imperfect gold standard, D = true disease status, p = P(D=1).
• Common assumption is conditional independence: P(B=b | G=g, D=d) = P(B=b | D=d)
• Apparent sensitivity of B relative to G:

$$\frac{p\,Se_G\,Se_B + (1-p)(1-Sp_G)(1-Sp_B)}{p\,Se_G + (1-p)(1-Sp_G)}$$

• Apparent specificity of B relative to G:

$$\frac{p\,(1-Se_G)(1-Se_B) + (1-p)\,Sp_G\,Sp_B}{p\,(1-Se_G) + (1-p)\,Sp_G}$$

• Use these to solve for the "true" sensitivity and specificity of B relative to D.
• Bounds on apparent AUC:
  – Apparent AUC ≥ apparent sens × apparent spec
  – Apparent AUC ≤ apparent sens + (1 − apparent sens) × apparent spec

Problems with conditional independence
• May not be plausible from a mechanistic or physiological perspective; the two tests measure related phenomena.
• There may be an association between disease severity and test results; two tests may be conditionally independent given disease severity, but not conditionally independent given presence or absence of disease.
• The assumption of conditional independence constrains the disease prevalence; the implied prevalence may not be plausible.

Conditional Independence: disease severity
• Independence given disease severity X:
  P(G=1, B=1 | D=1, X) = P(G=1 | D=1, X) × P(B=1 | D=1, X)
  does not imply independence given disease:
  P(G=1, B=1 | D=1) = P(G=1 | D=1) × P(B=1 | D=1)

Conditional Independence: disease prevalence
• Conditional independence may not be possible at a given disease prevalence.

Bounds on prevalence under conditional independence
Observed cross-classification of B and G (cell proportions):

         G=1   G=0
  B=1     a     b
  B=0     c     d

Under conditional independence, split into two tables, with some constraints (B and G must be independent within each table):

  D=1                        D=0
         G=1   G=0                  G=1      G=0
  B=1    αa    βb            B=1   (1-α)a   (1-β)b
  B=0    γc    δd            B=0   (1-γ)c   (1-δ)d

where α, β, γ, δ ∈ [0, 1] denote the fraction of each observed cell with D=1, and

p = P(D=1) = αa + βb + γc + δd

Example
Observed cell proportions:

         G=1   G=0
  B=1    30%    5%
  B=0    15%   50%

• Ignoring sampling variability, for p ∉ (0.285, 0.715), conditional independence is not possible.

Other dependence assumptions
• With more tests, some methods model relationships between some of the tests. This is arbitrary, and cannot be tested without a rich enough study.
• Discrepant resolution method; disfavored due to bias.
• Composite reference method; success depends on reliability of the reference tests.

Bounds on true sensitivity and specificity of a new biomarker
• Explore the information available from the comparison of B and G when no assumptions are made regarding their dependence.
• Assume the operating characteristics of G are known.
• Derive bounds for the operating characteristics of B.

Idea
• Simply by bounding the cells in the cross tabulation of G and (B, D) to be between 0 and 1, we derive bounds for
  – P(D=1, B=1 | G=1)
  – P(D=0, B=0 | G=0)
• True sensitivity and specificity of B are maximized at the maxima of these and minimized at the minima of these.
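The talk does not spell out how the bounds are computed. One way to sketch the idea, under stated assumptions, is with Fréchet-type bounds: (i) the observed P(G=1), together with the known sensitivity and specificity of G, determines the prevalence, and (ii) within the G=1 and G=0 strata each unobserved (B, D) cell is limited only by its two margins. The function and argument names are hypothetical, and the counts are taken from the first worked example on the slide that follows.

```python
def frechet_bounds(margin_1, margin_2, total):
    """Smallest and largest value of a cell given its two margins
    inside a stratum with the given total probability."""
    lower = max(0.0, margin_1 + margin_2 - total)
    upper = min(margin_1, margin_2)
    return lower, upper


def bounds_on_true_performance(n_g1_b1, n_g1_b0, n_g0_b1, n_g0_b0,
                               sens_g, spec_g):
    """Bounds on the true sens/spec of B with no dependence assumption.

    Hypothetical helper: assumes sens_g and spec_g are known exactly
    and ignores sampling variability in the observed counts.
    """
    n = n_g1_b1 + n_g1_b0 + n_g0_b1 + n_g0_b0
    g1_b1, g1_b0, g0_b1, g0_b0 = (
        x / n for x in (n_g1_b1, n_g1_b0, n_g0_b1, n_g0_b0)
    )
    p_g1, p_g0 = g1_b1 + g1_b0, g0_b1 + g0_b0

    # P(G=1) = p*sens_g + (1-p)*(1-spec_g), solved for the prevalence p.
    prevalence = (p_g1 - (1 - spec_g)) / (sens_g + spec_g - 1)
    if not 0 < prevalence < 1:
        raise ValueError("table is incompatible with the stated sens/spec of G")

    # Joint probabilities of (D, G) implied by the known properties of G.
    d1_g1 = prevalence * sens_g
    d1_g0 = prevalence * (1 - sens_g)
    d0_g1 = (1 - prevalence) * (1 - spec_g)
    d0_g0 = (1 - prevalence) * spec_g

    # P(B=1, D=1) is the sum of its parts in the G=1 and G=0 strata.
    lo_a, hi_a = frechet_bounds(g1_b1, d1_g1, p_g1)
    lo_b, hi_b = frechet_bounds(g0_b1, d1_g0, p_g0)
    sens_bounds = ((lo_a + lo_b) / prevalence, (hi_a + hi_b) / prevalence)

    # P(B=0, D=0) is bounded in the same way.
    lo_a, hi_a = frechet_bounds(g1_b0, d0_g1, p_g1)
    lo_b, hi_b = frechet_bounds(g0_b0, d0_g0, p_g0)
    spec_bounds = ((lo_a + lo_b) / (1 - prevalence),
                   (hi_a + hi_b) / (1 - prevalence))
    return prevalence, sens_bounds, spec_bounds


# Counts from the first worked example: G=1: (B=1) 25, (B=0) 10;
# G=0: (B=1) 5, (B=0) 60; sens of G 90%, spec of G 95%.
prevalence, sens_bounds, spec_bounds = bounds_on_true_performance(
    25, 10, 5, 60, sens_g=0.90, spec_g=0.95
)
print(f"implied prevalence: {prevalence:.3f}")
print(f"true sens of B in ({sens_bounds[0]:.3f}, {sens_bounds[1]:.3f})")
print(f"true spec of B in ({spec_bounds[0]:.3f}, {spec_bounds[1]:.3f})")
```

For these counts the sketch gives intervals within about one percentage point of the ones quoted in the talk, which suggests, but does not confirm, that the reported bounds were obtained along these lines.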
Example

         B=1   B=0
  G=1     25    10
  G=0      5    60

• Apparent sens = 25/35 = 71%
• Apparent spec = 60/65 = 92%
• Suppose sens of G is 90% and spec of G is 95%
• True sens of B is in (61%, 81%)
• True spec of B is in (87%, 98%)
• These bounds are reasonably narrow.

Example

         B=1   B=0
  G=1     10    10
  G=0     20    60

• Apparent sens = 50%
• Apparent spec = 75%
• Suppose sens of G is 90% and spec of G is 95%
• True sens of B is in (33%, 67%)
• True spec of B is in (71%, 78%)
• Bound for sens is quite wide, ranging from a poor test to a possibly adequate one; bound for spec is narrow.

Conclusions
• Low sensitivity of a promising kidney injury biomarker when the expected prevalence of disease is low (e.g., contrast nephropathy – NGAL sensitivity = 78%) raises the question of imperfect specificity of the "gold standard".
• Likewise, low specificity when the expected prevalence is high (e.g., ICU with hypotension and sepsis – NGAL spec = 76% when applied to critically ill patients) raises the question of imperfect sensitivity of the gold standard.

Conclusions (continued)
• Need "hard" clinical endpoints for use as gold standard, but even these have potential problems (e.g., long latency, confounding by other risk factors).
• Could use exposure status, such as exposure to a nephrotoxic drug, to avoid SCr.
• The amount of information in comparing a new biomarker to an imperfect gold standard may not be very high, even if the imperfect gold standard is a good test itself.
• Conditional independence is problematic, both physiologically and technically.
• Nonparametric bounds may or may not be useful, but they certainly reflect the true information content.
• Ultimate validation of a biomarker's utility is demonstration in a randomized clinical trial that it alters clinical management and improves clinical outcomes.

Acknowledgments
• Sarah Emerson, PhD
• Sushrut Waikar, MD
• Joseph Bonventre, MD

Waikar SS, Betensky RA, Emerson SC, Bonventre JV (2012). Imperfect gold standards for kidney injury biomarker evaluation. J Am Soc Nephrol 23: 13-21.
Emerson SC, Waikar SS, Bonventre JV, Betensky RA (2012).
Biomarker validation with an imperfect reference: issues and bounds. Unpublished manuscript.

With low prevalence, maintaining high specificity is more important than high sensitivity.