Imperfect Gold Standards for Biomarker Evaluation

Rebecca A. Betensky
Conference on Statistical Issues in Clinical Trials
University of Pennsylvania
April 18, 2012
Outline
• Motivation: need for kidney injury biomarkers for
diagnosis of acute kidney injury (AKI)
• Impact of imperfect gold standard on apparent
sensitivity and specificity of perfect biomarker
• Examine conditional independence assumption:
implicit restrictions
• Bounds on true sensitivity and specificity
Serum creatinine for AKI
• Clinicians have used serum creatinine (SCr) to diagnose AKI for
decades.
• Acknowledged as inadequate gold standard:
– Poor specificity in some settings that are not
associated with kidney injury
– Poor sensitivity in setting of adequate renal reserve
– Relatively slow kinetics after injury
• Considerable interest in identifying better
biomarkers of tubular injury: potentially more
accurate and earlier diagnosis.
How to evaluate new biomarkers?
• Studies have used changes in SCr as the
gold standard against which to test novel
tubular injury biomarkers.
• Aside from problems of specificity and
sensitivity,
– SCr does not directly reflect tubular function
or injury
– SCr-based diagnosis depends on a cutoff, which affects its true
spec and sens, and thus the apparent spec and sens of the novel marker.
Conceptual framework
• Actual disease that is the target of the diagnostic
test (AKI) is not synonymous with clinical
conditions identified by imperfect gold standard
(SCr).
• AKI is difficult to establish without invasive and
risky histopathological assessment.
• Using imperfect gold standard (i.e., imperfect
reference test) may distort apparent diagnostic
performance of novel biomarker.
Idealized example of perfect novel biomarker
disease prevalence=20%
imperfect gold standard sensitivity=80%, specificity=80%
Relative to imperfect gold standard, a perfect novel biomarker will have apparent
sensitivity of 50% and apparent specificity of 64/68=94%.
At lower prevalence, dominant effect of imperfect gold standard is on perfect
biomarker’s apparent sensitivity:
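\[ \text{apparent sens} = \frac{1}{1 + \dfrac{(1-p)\,(1-\text{spec}_G)}{p\,\text{sens}_G}} \]
\[ \text{apparent spec} = \frac{1}{1 + \dfrac{p\,(1-\text{sens}_G)}{(1-p)\,\text{spec}_G}} \]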
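The idealized example can be checked numerically. The sketch below assumes the novel biomarker equals true disease status exactly, so its apparent sensitivity and specificity are the positive and negative predictive values of the gold standard. The function name `apparent_perfect` is illustrative only.

```python
def apparent_perfect(p, sens_g, spec_g):
    """Apparent (sens, spec) of a perfect biomarker judged against gold standard G.

    p      : disease prevalence
    sens_g : true sensitivity of the imperfect gold standard
    spec_g : true specificity of the imperfect gold standard
    """
    true_pos = p * sens_g                # diseased, G positive
    false_pos = (1 - p) * (1 - spec_g)   # not diseased, G positive
    false_neg = p * (1 - sens_g)         # diseased, G negative
    true_neg = (1 - p) * spec_g          # not diseased, G negative
    app_sens = true_pos / (true_pos + false_pos)
    app_spec = true_neg / (true_neg + false_neg)
    return app_sens, app_spec


# Sweep prevalence with the gold standard fixed at sens = spec = 80%.
for prev in (0.05, 0.10, 0.20, 0.50):
    s, c = apparent_perfect(prev, 0.80, 0.80)
    print(f"prevalence={prev:.2f}  apparent sens={s:.3f}  apparent spec={c:.3f}")
```

At prevalence 20% this gives apparent sensitivity 0.50 and apparent specificity 64/68, matching the idealized example. Lowering the prevalence mainly reduces the apparent sensitivity.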
This is similar to imperfect gold standard=“need for dialysis”.
At prevalence of 20%, apparent sensitivity of perfect biomarker is
100% and apparent specificity is 84%. The bounds of the
apparent AUC are 0.84-1.00.
Even rare false positives (imperfect gold standard spec=99%)
lead to apparent sensitivity of 86% and bounds of apparent AUC
of 0.72-0.98.
Cut-offs for SCr
• Recent clinical studies of novel AKI
biomarkers have used a variety of SCr
criteria to define AKI.
• These examples illustrate that different
choices of cut-offs can lead to hugely
different apparent properties of a novel
biomarker.
What if new biomarker is not perfect?
• Need assumptions on relationship
between new biomarker and imperfect
gold standard and disease to evaluate
new biomarker.
• Conditional independence is convenient;
allows for latent class models.
• However, it introduces implicit restrictions.
What can we learn for imperfect novel biomarker?
• Previous illustration assumes perfect novel biomarker.
• Common assumption is conditional independence: P(B=b|G=g,D=d)=P(B=b|D=d)
• Apparent sensitivity of B relative to G:
\[ \frac{p \cdot Se_G \cdot Se_B + (1-p)(1-Sp_G)(1-Sp_B)}{p \cdot Se_G + (1-p)(1-Sp_G)} \]
• Apparent specificity of B relative to G:
\[ \frac{p\,(1-Se_G)(1-Se_B) + (1-p) \cdot Sp_G \cdot Sp_B}{p\,(1-Se_G) + (1-p) \cdot Sp_G} \]
• Use these to solve for “true sensitivity” and specificity of B relative to D
• Bounds on apparent AUC:
– Apparent AUC ≥ apparent sens × apparent spec
– Apparent AUC ≤ apparent sens + (1 − apparent sens) × apparent spec
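Under conditional independence the two apparent quantities are linear in the true sensitivity and specificity of B, so they can be inverted when p and the operating characteristics of G are treated as known. The sketch below rests on that assumption, and its function names are illustrative. The AUC helper treats the two bounds as the areas under the lowest and highest nondecreasing ROC curves through the single apparent operating point. That reading is an assumption, although it reproduces the 0.84-1.00 range quoted for apparent sensitivity 100% and apparent specificity 84%.

```python
def apparent_under_ci(p, se_g, sp_g, se_b, sp_b):
    """Apparent (sens, spec) of B relative to G, assuming B and G are
    conditionally independent given disease status D."""
    g_pos = p * se_g + (1 - p) * (1 - sp_g)      # P(G=1)
    g_neg = p * (1 - se_g) + (1 - p) * sp_g      # P(G=0)
    app_sens = (p * se_g * se_b + (1 - p) * (1 - sp_g) * (1 - sp_b)) / g_pos
    app_spec = (p * (1 - se_g) * (1 - se_b) + (1 - p) * sp_g * sp_b) / g_neg
    return app_sens, app_spec


def true_from_apparent(p, se_g, sp_g, app_sens, app_spec):
    """Invert the two formulas above for (Se_B, Sp_B) by Cramer's rule."""
    g_pos = p * se_g + (1 - p) * (1 - sp_g)
    g_neg = p * (1 - se_g) + (1 - p) * sp_g
    # Linear system: m11*SeB + m12*SpB = r1 ; m21*SeB + m22*SpB = r2
    m11, m12 = p * se_g, -(1 - p) * (1 - sp_g)
    m21, m22 = -p * (1 - se_g), (1 - p) * sp_g
    r1 = app_sens * g_pos - (1 - p) * (1 - sp_g)
    r2 = app_spec * g_neg - p * (1 - se_g)
    det = m11 * m22 - m12 * m21                  # equals p(1-p)(SeG + SpG - 1)
    if abs(det) < 1e-12:
        raise ValueError("G is uninformative (SeG + SpG = 1) or p is 0 or 1")
    se_b = (r1 * m22 - m12 * r2) / det
    sp_b = (m11 * r2 - m21 * r1) / det
    return se_b, sp_b


def auc_bounds(app_sens, app_spec):
    """Lower and upper bounds on the AUC from one (sens, spec) point."""
    lower = app_sens * app_spec
    upper = app_sens + (1 - app_sens) * app_spec
    return lower, upper


# Round trip with hypothetical values for B (Se_B = 0.90, Sp_B = 0.85).
a_sens, a_spec = apparent_under_ci(0.2, 0.8, 0.8, 0.90, 0.85)
print(true_from_apparent(0.2, 0.8, 0.8, a_sens, a_spec))
print(auc_bounds(1.0, 0.84))
```

The solved values need not fall in [0, 1] for every input. When they do not, that is one sign that conditional independence is incompatible with the assumed prevalence.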
Problems with conditional independence
• May not be plausible from mechanistic or
physiological perspective; the two tests measure
related phenomena.
• May be association between disease severity
and test results; two tests may be conditionally
independent given disease severity, but not
conditionally independent given presence or
absence of disease.
• Assumption of conditional independence
constrains the disease prevalence; may not be
plausible.
Conditional Independence:
disease severity
• Independence given disease severity:
P(G=1, B=1|D=1,X)=P(G=1|D=1,X)×P(B=1|D=1,X)
does not imply independence given disease:
P(G=1,B=1|D=1)=P(G=1|D=1)×P(B=1|D=1)
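A small numerical illustration of this point, using entirely hypothetical numbers: among diseased patients, severity X is mild or severe with equal probability, and each test is positive with probability 0.5 in mild disease and 0.9 in severe disease. The tests are independent within each severity level. After averaging over severity they are no longer independent given D=1.

```python
# Hypothetical values, chosen only for illustration.
severity_probs = {"mild": 0.5, "severe": 0.5}   # P(X=x | D=1)
p_g_pos = {"mild": 0.5, "severe": 0.9}          # P(G=1 | D=1, X=x)
p_b_pos = {"mild": 0.5, "severe": 0.9}          # P(B=1 | D=1, X=x)

# Joint probability given D=1, using independence within each severity level.
joint = sum(w * p_g_pos[x] * p_b_pos[x] for x, w in severity_probs.items())

# Marginal probabilities given D=1.
marg_g = sum(w * p_g_pos[x] for x, w in severity_probs.items())
marg_b = sum(w * p_b_pos[x] for x, w in severity_probs.items())
product = marg_g * marg_b

print(f"P(G=1,B=1|D=1)         = {joint:.2f}")
print(f"P(G=1|D=1) x P(B=1|D=1) = {product:.2f}")
```

Here the joint probability is 0.53 and the product of the marginals is 0.49, so the two tests are positively associated among the diseased even though they are independent at each severity level.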
Conditional Independence:
disease prevalence
Conditional independence may not be possible
at a given disease prevalence.
Bounds on prevalence under
conditional independence
       G=1   G=0
B=1    a     b
B=0    c     d
Under conditional independence, split into two tables, with some constraints:
D=1:
       G=1   G=0
B=1    αa    βb
B=0    γc    δd
D=0:
       G=1      G=0
B=1    (1-α)a   (1-β)b
B=0    (1-γ)c   (1-δ)d
p=P(D=1)=αa+βb+γc+δd
Example
Ignoring sampling variability, for p ∉ (0.285, 0.715),
conditional independence is not possible.
G=1 G=0
B=1 30% 5%
B=0 15% 50%
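The range of prevalences compatible with conditional independence can be explored with a grid search. The sketch below uses a reparametrization that is not taken from the slides. Write u = P(G=1|D=1) − P(G=1|D=0) and v = P(B=1|D=1) − P(B=1|D=0). Under conditional independence Cov(G,B) = p(1−p)·u·v, and the observed margins limit how large u and v can be while all four conditional probabilities stay in [0, 1]. Function names are illustrative.

```python
def ci_feasible(p, a, b, c, d, tol=1e-9):
    """True if the table (a=P(G=1,B=1), b=P(G=0,B=1), c=P(G=1,B=0), d=P(G=0,B=0))
    can arise with G and B conditionally independent given D and P(D=1)=p."""
    g = a + c                        # P(G=1)
    m = a + b                        # P(B=1)
    cov = a - g * m                  # Cov(G, B)
    target = cov / (p * (1 - p))     # value that u*v must take
    # P(G=1|D=1) = g + (1-p)u and P(G=1|D=0) = g - p*u must lie in [0, 1].
    u_hi = min((1 - g) / (1 - p), g / p)
    u_lo = -min(g / (1 - p), (1 - g) / p)
    v_hi = min((1 - m) / (1 - p), m / p)
    v_lo = -min(m / (1 - p), (1 - m) / p)
    # u*v ranges over the interval spanned by the corner products.
    corners = (u_lo * v_lo, u_lo * v_hi, u_hi * v_lo, u_hi * v_hi)
    return min(corners) - tol <= target <= max(corners) + tol


def ci_prevalence_range(a, b, c, d, steps=9999):
    """Smallest and largest grid prevalence at which the table is feasible."""
    grid = [k / (steps + 1) for k in range(1, steps + 1)]
    feasible = [p for p in grid if ci_feasible(p, a, b, c, d)]
    return (min(feasible), max(feasible)) if feasible else None


# Example table: G=1,B=1 30%; G=0,B=1 5%; G=1,B=0 15%; G=0,B=0 50%.
print(ci_prevalence_range(0.30, 0.05, 0.15, 0.50))
```

For the example table the search finds feasible prevalences only between about 0.285 and 0.715. The check allows either latent class to be labelled as diseased. Requiring both tests to be positively associated with disease would narrow the range further.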
Other dependence assumptions
• With more tests, some methods model
relationships between some tests. This is
arbitrary, and cannot be tested without a rich
enough study.
• Discrepant resolution method; disfavored due
to bias.
• Composite reference method; success
depends on reliability of reference tests.
Bounds on true sensitivity and
specificity of a new biomarker
• Explore information available from the
comparison of B and G, when no
assumptions are made regarding their
dependence.
• Assume operating characteristics of G are
known.
• Derive bounds for operating characteristics
of B.
Idea
• Simply by bounding cells in cross
tabulation of G and (B,D) to be between 0
and 1 we derive bounds for
– P(D=1, B=1|G=1)
– P(D=0, B=0|G=0)
• True sensitivity and specificity of B are
maximized at the maxima of these and
minimized at the minima of these.
Example
       B=1   B=0
G=1    25    10
G=0    5     60
• Apparent sens=25/35=71%
• Apparent spec=60/65=92%
• Suppose sens of G is 90% and spec of G is 95%
• True sens of B is (61%,81%)
• True spec of B is (87%,98%)
• These bounds are reasonably narrow.
Example
       B=1   B=0
G=1    10    10
G=0    20    60
• Apparent sens=50%
• Apparent spec=75%
• Suppose sens of G is 90% and spec of G is 95%
• The true sens of B is (33%,67%)
• True spec of B is (71%,78%)
• Bound for sens is quite wide, ranging from poor test to
possibly adequate; bound for spec is narrow.
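The bounds in the two examples above can be reproduced with a short calculation. This is a sketch of one way to obtain such bounds, not necessarily the derivation used for the slides. It assumes the sensitivity and specificity of G are known, recovers the prevalence from P(G=1) = p·SeG + (1−p)·(1−SpG), and then bounds the unobserved overlap of B and D within each stratum of G so that no cell probability is negative. Function names are illustrative.

```python
def overlap_bounds(x, y, total):
    """Bounds on the mass shared by two events of mass x and y that both
    lie inside a stratum of mass `total` (no cell may be negative)."""
    return max(0.0, x + y - total), min(x, y)


def bounds_for_b(n_g1b1, n_g1b0, n_g0b1, n_g0b0, se_g, sp_g):
    """Bounds on the true (sens, spec) of B with no dependence assumption."""
    n = n_g1b1 + n_g1b0 + n_g0b1 + n_g0b0
    g1b1, g1b0, g0b1, g0b0 = (x / n for x in (n_g1b1, n_g1b0, n_g0b1, n_g0b0))
    g1, g0 = g1b1 + g1b0, g0b1 + g0b0
    # Prevalence implied by the observed P(G=1) and the known SeG, SpG.
    p = (g1 - (1 - sp_g)) / (se_g + sp_g - 1)
    if not 0 < p < 1:
        raise ValueError("assumed SeG and SpG are incompatible with P(G=1)")
    d1g1, d1g0 = p * se_g, p * (1 - se_g)               # P(D=1,G=1), P(D=1,G=0)
    d0g1, d0g0 = (1 - p) * (1 - sp_g), (1 - p) * sp_g   # P(D=0,G=1), P(D=0,G=0)
    lo1, hi1 = overlap_bounds(g1b1, d1g1, g1)   # P(B=1, D=1, G=1)
    lo2, hi2 = overlap_bounds(g0b1, d1g0, g0)   # P(B=1, D=1, G=0)
    lo3, hi3 = overlap_bounds(g1b0, d0g1, g1)   # P(B=0, D=0, G=1)
    lo4, hi4 = overlap_bounds(g0b0, d0g0, g0)   # P(B=0, D=0, G=0)
    sens = ((lo1 + lo2) / p, (hi1 + hi2) / p)
    spec = ((lo3 + lo4) / (1 - p), (hi3 + hi4) / (1 - p))
    return sens, spec


# The two example tables, with SeG = 90% and SpG = 95%.
for table in ((25, 10, 5, 60), (10, 10, 20, 60)):
    sens, spec = bounds_for_b(*table, se_g=0.90, sp_g=0.95)
    print(f"sens in ({sens[0]:.3f}, {sens[1]:.3f})  "
          f"spec in ({spec[0]:.3f}, {spec[1]:.3f})")
```

For the first table this gives roughly 0.617 to 0.808 for sensitivity and 0.873 to 0.977 for specificity. For the second it gives roughly 0.333 to 0.667 and 0.707 to 0.779. Both agree with the quoted bounds up to rounding.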
Conclusions
• Low sensitivity of a promising kidney injury
biomarker when expected prevalence of disease
is low (e.g., contrast nephropathy – NGAL
sensitivity=78%) raises the question of imperfect
specificity of the “gold standard”.
• Likewise, low specificity when expected
prevalence is high (e.g., ICU with hypotension
and sepsis – NGAL spec=76% when applied to
critically ill patients) raises question of imperfect
sensitivity of gold standard.
Conclusions
• Need “hard” clinical endpoints for use as gold standard, but even
these have potential problems (e.g., long latency, confounding by
other risk factors).
• Could use exposure status, such as to nephrotoxic drug, to avoid
SCr.
• Amount of information in comparing new biomarker to imperfect gold
standard may not be very high, even if imperfect gold standard is a
good test itself.
• Conditional independence is problematic – physiologically and
technically.
• Nonparametric bounds may or may not be useful, but they certainly
reflect the true information content.
• Ultimate validation of a biomarker’s utility is demonstration in a
randomized clinical trial that it alters clinical management and
improves clinical outcomes.
Acknowledgments
• Sarah Emerson, PhD
• Sushrut Waikar, MD
• Joseph Bonventre, MD
Waikar SS, Betensky RA, Emerson SC, Bonventre JV (2012).
Imperfect gold standards for kidney injury biomarker evaluation. J
Am Soc Nephrol 23: 13-21.
Emerson SC, Waikar SS, Bonventre JV, Betensky RA (2012).
Biomarker validation with an imperfect reference: issues and
bounds. Unpublished manuscript.
With low prevalence, maintaining high specificity is more important than
high sensitivity.