EP550

Verification Bias and Tarnished Gold Standards
Verification Bias

In 1997, a meta-analysis of 147 published reports summarized the operating characteristics of the exercise-ECG test for diagnosing coronary artery disease (CAD) as follows:

                 CAD Present   CAD Absent    Total
ECG positive           7,830        2,896   10,726
ECG negative           3,686        9,662   13,348
Total                 11,516       12,558   24,074

Sensitivity = 7,830 / 11,516 = .68
Specificity = 9,662 / 12,558 = .77

(Journal of the American College of Cardiology. 1997;30:260-311)
The assigned reading for this material, which is optional, is Greenes and Begg, "Assessment of Diagnostic Technologies: Methodology for Unbiased Estimation from Samples of Selectively Verified Patients." (Note that the article uses a different, but equally valid, definition of negative predictive value.)
One year later, a study in which all patients had both exercise-ECG testing and coronary angiography described the operating characteristics of the exercise-ECG test as follows:

                 CAD Present   CAD Absent    Total
ECG positive             185           60      245
ECG negative             226          343      569
Total                    411          403      814

Sensitivity = 185 / 411 = .45
Specificity = 343 / 403 = .85

(Ann Intern Med. 1998;128:965-974)
Both studies enrolled similar patients, performed the exercise-ECG test the same way, and used coronary angiography as the gold standard procedure for deciding whether coronary artery disease was present or not.

Why did the two studies report different sensitivities and specificities?
In the meta-analysis, most patients who had a positive result after exercise-ECG testing had coronary angiography. In contrast, many patients who had a negative result after exercise-ECG testing did not have coronary angiography. Therefore, most patients with a positive test result were included in the calculations, and many patients with a negative test result were excluded from the calculations. For example, a larger proportion of people had positive test results in the meta-analysis (45%) than in the second study (30%).
If all the missing people with negative test results were added to the table, the number of people with false-negative test results would increase, the total number of people with disease would increase, and the sensitivity would go down. Also, the number of people with true-negative test results would increase, and the specificity would go up. Therefore, the measured sensitivity is higher than the actual sensitivity, and the measured specificity is lower than the actual specificity.
Adjusting for Verification Bias
It is possible to adjust for verification bias. Making this adjustment requires two things. One, you must know the total numbers of people with each type of test result, including the people who did not have the gold standard procedure. Two, you must assume that the meaning of a test result in the people who had the gold standard procedure is the same as the meaning of a test result in people who did not have the gold standard procedure. Specifically, you must assume that the positive and negative predictive values in people who had the gold standard procedure are the same as the positive and negative predictive values in people who did not have the gold standard procedure.
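In symbols, with n+ and n- standing for the total numbers of positive and negative test results in everyone tested, verified or not (this compact restatement is ours, not from the article), the adjustment amounts to:

sensitivity = (PPV x n+) / ((PPV x n+) + ((1 - NPV) x n-))
specificity = (NPV x n-) / ((NPV x n-) + ((1 - PPV) x n+))

The numerators are the adjusted true-positive and true-negative counts; the second terms in the denominators are the adjusted false-negative and false-positive counts.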
This problem was avoided in the second study because every person who had an exercise-ECG test also had coronary angiography, and thus every person was included in the calculations.
When the results of a diagnostic test affect whether the gold standard procedure is used to verify the test result, verification bias is introduced. This problem is also called work-up bias or referral bias.
Verification bias is common because many gold standard procedures, such as biopsy, surgery, and angiography, are invasive, risky, and expensive. Under these conditions, physicians are reluctant to refer patients for the gold standard procedure, and patients are reluctant to undergo the gold standard procedure, unless preliminary diagnostic tests yield positive results. Other recent examples are described in NEJM 2003;349:335-342; Arch Int Med 2007;167:161-165; and Am J Rad 2009;193:1596-1602.
An Example

Suppose that 80% of people with a positive exercise-ECG test result in the meta-analysis had coronary angiography -- the remaining 20% skipped angiography and were treated medically, either because they had low-risk treadmill scores or because their comorbidities precluded mechanical revascularization. Also suppose that 25% of people with negative exercise-ECG test results had angiography, because their clinical story was convincing and follow-up exercise imaging with radionuclides or echocardiography was positive.

To adjust for verification bias, start with the original table.

                 CAD Present   CAD Absent    Total
ECG positive           7,830        2,896   10,726
ECG negative           3,686        9,662   13,348
Total                 11,516       12,558   24,074

Sensitivity = 7,830 / 11,516 = .68
Specificity = 9,662 / 12,558 = .77
PPV = 7,830 / 10,726 = .73
NPV = 9,662 / 13,348 = .72

Then calculate new row totals that correspond to the original sample of people who underwent testing. We know that 80% of people with a positive test result were included in the analyses, so there were 10,726/.8, or 13,408 total people with a positive test result. Also, we know that 25% of people with a negative test result were included in the analyses, so there were 13,348/.25, or 53,392 total people with a negative test result.

Revised Table

                 CAD Present   CAD Absent    Total
ECG positive              --           --   13,408
ECG negative              --           --   53,392

Assume that a test result means the same thing whether it occurred in someone included in or excluded from the original table. Therefore, the PPV and the NPV are unchanged in the revised table. The next step is to multiply the PPV times the row total for a positive test to get the number of true-positive tests (.73 * 13,408 = 9,788). Similarly, multiply the NPV times the row total for a negative test to get the number of true-negative tests (.72 * 53,392 = 38,648). By subtraction you can fill in the other cells. Then calculate the column totals. Finally, calculate sensitivity and specificity.

                 CAD Present   CAD Absent    Total
ECG positive           9,788        3,620   13,408
ECG negative          14,744       38,648   53,392
Total                 24,532       42,268   66,800

Sensitivity = 9,788 / 24,532 = .40
Specificity = 38,648 / 42,268 = .91
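The whole adjustment is easy to script. Here is a minimal sketch in Python (the function and its argument names are ours, not part of the assigned reading); given the verified counts and the two verification fractions, it reproduces the adjusted values:

def adjust_for_verification_bias(tp, fp, fn, tn,
                                 frac_pos_verified, frac_neg_verified):
    # Predictive values among the verified patients.
    ppv = tp / (tp + fp)
    npv = tn / (fn + tn)
    # Row totals for everyone tested, verified or not.
    n_pos = (tp + fp) / frac_pos_verified
    n_neg = (fn + tn) / frac_neg_verified
    # Assume PPV and NPV are the same in unverified patients.
    tp_adj = ppv * n_pos
    tn_adj = npv * n_neg
    sens = tp_adj / (tp_adj + (n_neg - tn_adj))   # TP / (TP + FN)
    spec = tn_adj / (tn_adj + (n_pos - tp_adj))   # TN / (TN + FP)
    return sens, spec

sens, spec = adjust_for_verification_bias(7830, 2896, 3686, 9662, 0.80, 0.25)
print(round(sens, 2), round(spec, 2))             # 0.4 0.91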
Compared with the values reported in the meta-analysis, the adjusted sensitivity is lower, and the adjusted specificity is higher.

Because we have adjusted for the effects of verification bias, the new sensitivity and specificity should be the same as the values reported in the optimal study that had every subject undergo both an exercise-ECG test and coronary angiography.

                                        Sensitivity   Specificity
Original meta-analysis                          .68           .77
Optimal study                                   .45           .85
Meta-analysis adjusted with PPV and NPV         .40           .91

Why are the adjusted values for the meta-analysis different from the values in the optimal study?
The correction assumed that PPV and NPV were the same for people who were and were not verified with coronary angiography (a/(a+b) = A/(A+B) and d/(c+d) = D/(C+D)).

         Verified                     Not Verified
         CAD+    CAD-                 CAD+    CAD-
ECG +    a       b           ECG +    A       B
ECG -    c       d           ECG -    C       D

Therefore, the correction assumed that the ratios of people with and without disease who had a positive test result were the same (a/b = A/B) and that the ratios of people with and without disease who had a negative test result were the same (c/d = C/D). This assumption means that people with disease are as likely to be verified as people without disease.
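(The equivalence in the last paragraph is simple algebra: a/(a+b) = A/(A+B) cross-multiplies to a(A+B) = A(a+b), which simplifies to aB = Ab, that is, a/b = A/B; the same steps applied to the NPV equality give c/d = C/D.)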
Consider two people with suspected CAD, both with negative test results. One person is a 50-year-old woman with atypical chest pain and no risk factors for CAD. The other person is a 65-year-old man who smokes cigarettes and has typical angina and diabetes mellitus. The woman is less likely to have CAD, and she is less likely to have her negative exercise-ECG result "verified" with angiography. In contrast, the man is more likely to have CAD, and he is more likely to have his negative exercise-ECG result "verified" with angiography. A similar, but perhaps less powerful, effect likely occurs when the test result is positive. Therefore, in situations like this one, people with disease are more likely to have their test result verified with the gold standard procedure than people without disease.
The process that we call verification bias and that produces the numbers in the tables we usually see has two effects. One effect is related to the relative absence from the typical table of people with negative test results, which means the tables we see are skewed in the following direction:

[Schematic 2 x 2 table: the ECG+ row is overrepresented relative to the ECG- row.]

Another effect is related to the relative excess in the typical table of people with disease, which means the tables we usually see also are skewed in a second direction:

[Schematic 2 x 2 table: the CAD+ column is overrepresented relative to the CAD- column.]

The table we actually see has numbers that are a combination of these two effects.

The combination of these two effects means that the adjustment used a PPV that was too big, because the number of true-positive test results was too big relative to the number of false-positive test results, and an NPV that was too small, because the number of true-negative test results was too small relative to the number of false-negative test results. Therefore, we should have used a PPV lower than .73, for example .60, and an NPV higher than .72, for example .85.

Adjustment with the original PPV (.73) and NPV (.72)

                 CAD Present   CAD Absent    Total
ECG positive           9,788        3,620   13,408
ECG negative          14,744       38,648   53,392
Total                 24,532       42,268   66,800

Sensitivity = 9,788 / 24,532 = .40
Specificity = 38,648 / 42,268 = .91

Adjustment with the new PPV (.60) and NPV (.85)

                 CAD Present   CAD Absent    Total
ECG positive           8,044        5,364   13,408
ECG negative           8,009       45,383   53,392
Total                 16,053       50,747   66,800

Sensitivity = 8,044 / 16,053 = .50
Specificity = 45,383 / 50,747 = .89
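The same arithmetic can be rerun with the guessed-at predictive values plugged in directly (again a sketch with our own names, not the handout's program):

def adjust_with_ppv_npv(ppv, npv, n_pos, n_neg):
    tp, tn = ppv * n_pos, npv * n_neg    # adjusted TP and TN counts
    sens = tp / (tp + (n_neg - tn))      # TP / (TP + FN)
    spec = tn / (tn + (n_pos - tp))      # TN / (TN + FP)
    return sens, spec

print(adjust_with_ppv_npv(0.60, 0.85, 13408, 53392))   # about (0.50, 0.89)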
A lower PPV decreases the number of true-positive test results, which alone would lower the sensitivity, and it increases the number of false-positive test results, which alone would lower the specificity. A higher NPV increases the number of true-negative test results, which alone would raise the specificity, and it decreases the number of false-negative test results, which alone would raise the sensitivity. If the changes in PPV and NPV are about the same size, as they are in this example, the effects on sensitivity and specificity are dominated by the higher NPV because there are more total negative than total positive test results to start with.
Therefore, the new, adjusted sensitivity is higher than the old, adjusted sensitivity, and the new, adjusted specificity is lower than the old, adjusted specificity. The new, adjusted sensitivity is between the original sensitivity and the old, adjusted sensitivity, and the new, adjusted specificity is between the original specificity and the old, adjusted specificity.

In most situations affected by verification bias, the true values for sensitivity and specificity are between the values reported in the original article and the values that are calculated using the adjustment method described by Greenes and Begg. The original and adjusted values, however, may be useful because they define ranges surrounding the true values for sensitivity and specificity.

                                                 Sensitivity   Specificity
Original meta-analysis                                   .68           .77
Meta-analysis adjusted with new PPV and NPV              .50           .89
Optimal study                                            .45           .85
Meta-analysis adjusted with original PPV and NPV         .40           .91

Tarnished Gold Standards

Not all gold standards are perfectly accurate. Many, maybe most, make errors when they are used to assign patients to disease or disease-free states, and these errors complicate how we interpret the results of other diagnostic tests. (Biesheuvel C, Irwig L, Bossuyt P. Observed differences in diagnostic test accuracy between patient subgroups: Is it real or due to reference standard misclassification? Clinical Chemistry 2007;53(10):1725-1729.) Consider the following example:

Perfect Gold Standard

                      Disease Present   Disease Absent    Total
Index test positive                80              270      350
Index test negative                20              630      650
Total                             100              900     1000

True prevalence = 100/1000 = 0.10
True sensitivity = 80/100 = 0.80
True specificity = 630/900 = 0.70
Tarnished Gold Standard

                      Disease Present   Disease Absent    Total
Index test positive                99              251      350
Index test negative                81              569      650
Total                             180              820     1000

Observed prevalence = 180/1000 = 0.18
Observed sensitivity = 99/180 = 0.55
Observed specificity = 569/820 = 0.69
Note that the row totals (total numbers of positive and negative results for the index test) do not change when an imperfect standard replaces a perfect standard. The numbers in all four cells and the column totals do change, however, because the inaccurate gold standard shifts some results from left cells to right cells and other results from right cells to left cells. Because there are more results in right cells to start with (the prevalence of disease is relatively low), more are shifted left than are shifted right. The net effect is for the numbers in the left column to go up and the numbers in the right column to go down, which causes the observed prevalence to increase. Because there are more results in the bottom row than in the top row to start with, more results are shifted from right to left in the bottom row. This change increases the number of false negatives (bottom-left cell) and thus the column total more than it increases the number of true positives (top-left cell), which causes the observed sensitivity to go down. In the right column the big decreases that occur in the number of true negatives (bottom-right cell) also occur in the column total, which causes the observed specificity to change less. The net result is a big decrease in sensitivity and not much change in specificity.
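This shifting process is easy to simulate. The sketch below (function and variable names are ours) assumes the tarnished gold standard's errors are independent of the index test result; with a gold-standard sensitivity and specificity of 0.9, values consistent with the numbers in this example, it reproduces the observed table above:

def observe_with_tarnished_gold(true_table, gs_sens, gs_spec):
    # true_table maps an index-test result to (n diseased, n not diseased).
    observed = {}
    for row, (dis, nodis) in true_table.items():
        # People the tarnished standard labels "disease present" in this row.
        called_dis = dis * gs_sens + nodis * (1 - gs_spec)
        observed[row] = (round(called_dis), round(dis + nodis - called_dis))
    return observed

print(observe_with_tarnished_gold({"index +": (80, 270),
                                   "index -": (20, 630)}, 0.9, 0.9))
# -> {'index +': (99, 251), 'index -': (81, 569)}, matching the table above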
Consider a second example, which has a high prevalence of disease:

Perfect Gold Standard

                      Disease Present   Disease Absent    Total
Index test positive               640               60      700
Index test negative               160              140      300
Total                             800              200     1000

True prevalence = 800/1000 = 0.80
True sensitivity = 640/800 = 0.80
True specificity = 140/200 = 0.70
Note that the sensitivity and specificity don't change when the prevalence changes if the gold standard is perfect (compare the true values in this example with the true values in the previous example).
Tarnished Gold Standard

                      Disease Present   Disease Absent    Total
Index test positive               582              118      700
Index test negative               158              142      300
Total                             740              260     1000

Observed prevalence = 740/1000 = 0.74
Observed sensitivity = 582/740 = 0.79
Observed specificity = 142/260 = 0.55
As before, the row totals do not change when an imperfect standard replaces a perfect standard. The numbers in all four cells and the column totals do change, however, because the inaccurate gold standard shifts some results from left cells to right cells and other results from right cells to left cells. Because there are more results in left cells to start with (the prevalence of disease is relatively high), more of them are shifted right than are shifted left, which causes the observed prevalence to go down. Because there are more results in the top row to start with, more results are shifted from left to right in the top row than in the bottom row. This change increases the number of false positives (top-right cell) and thus the column total more than it increases the number of true negatives (bottom-right cell), which causes the observed specificity to go down. In the left column the big decreases that occur in the number of true positives (top-left cell) also occur in the column total, which causes the observed sensitivity to change less. The net result is a big decrease in specificity and not much change in sensitivity.
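The same sketch, with the same assumed gold-standard error rates, reproduces the high-prevalence example:

print(observe_with_tarnished_gold({"index +": (640, 60),
                                   "index -": (160, 140)}, 0.9, 0.9))
# -> {'index +': (582, 118), 'index -': (158, 142)}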
This property can be used to determine how good the gold standard is. Apply the index test and gold standard to subpopulations of people who have different probabilities of disease, for example, because they have more or fewer risk factors, because they have more or fewer symptoms, or because they have more severe or less severe symptoms. Look for changes in sensitivity and specificity for the index test.
In summary, when the gold standard is perfect, the sensitivity and specificity of other diagnostic tests do not change when the prevalence of disease changes. When the gold standard is imperfect, however, the sensitivity and specificity of other diagnostic tests do change when the prevalence of disease changes. In most real-world situations, when prevalence changes from low to high, sensitivity increases and specificity decreases.
Caveats

The changes cited in these examples occur under many conditions, but they do not occur under all conditions. The shifts in numbers among cells that occur as one moves from a perfect to a tarnished gold standard depend on the relationships among the prevalence of disease (p) and the tarnished gold standard's sensitivity (sens_TGS) and specificity (spec_TGS). When

p = (1 - spec_TGS) / (2 - (sens_TGS + spec_TGS))

the shifts from right to left and left to right in the 2 by 2 table as one moves from a perfect to a tarnished gold standard are equal and thus cancel each other out. When p is lower, the net shifts in cell numbers are from right to left. When p is higher, the net shifts in cell numbers are from left to right. For example, if sens_TGS = 0.9 and spec_TGS = 0.9, then p = 0.5. At any p < 0.5 the net shifts in cell numbers as one moves from a perfect to a tarnished gold standard are from right to left. At any p > 0.5 the net shifts in cell numbers as one moves from a perfect to a tarnished gold standard are from left to right.
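To see where this break-even prevalence comes from (a sketch, under the same assumption used in the examples that the gold standard's errors do not depend on the index test result): in a population of size N, the diseased people relabeled as disease-free number p x N x (1 - sens_TGS), and the disease-free people relabeled as diseased number (1 - p) x N x (1 - spec_TGS). Setting the two shifts equal and solving for p gives

p x (1 - sens_TGS) = (1 - p) x (1 - spec_TGS)
p = (1 - spec_TGS) / ((1 - sens_TGS) + (1 - spec_TGS)) = (1 - spec_TGS) / (2 - (sens_TGS + spec_TGS))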
The changes cited in these examples are larger when the sum of the index test's sensitivity and specificity is larger, and they approach zero when the sum of the index test's sensitivity and specificity decreases toward 1.0. (Note that any combination of the index test's sensitivity and specificity that gives a sum of 1.0 is on the 45-degree diagonal of the ROC curve, which indicates "no information.")

Note that the p values referred to here are prevalence values determined in a cohort study. They are not the arbitrary p values in case-control studies.

No Gold Standard

In other circumstances there is no gold standard procedure to verify other diagnostic test results. In these circumstances, there are ways to estimate test sensitivity and specificity using maximum likelihood estimation methods applied to latent class models. These methods are reviewed in the following paper: Reitsma JB, Rutjes AW, Khan KS, Coomarasamy A, Bossuyt PM. A review of solutions for diagnostic accuracy studies with an imperfect or missing reference standard. Journal of Clinical Epidemiology. 2009;62(8):797-806. Software that calculates sensitivity and specificity when there is no gold standard is available at http://www.epi.ucdavis.edu/diagnostictests/tags.html. Additional information is available at Statistics in Medicine 2009;28:3108-3123.

Sample Input

Number of Tests: 3
Number of Cases: 1692
1 = Test Result Negative
2 = Test Result Positive
Result of Test
        1 2 3
Code: 1 1 1   Frequency: 1513
Code: 2 1 1   Frequency:   23
Code: 1 2 1   Frequency:   59
Code: 2 2 1   Frequency:   12
Code: 1 1 2   Frequency:   21
Code: 2 1 2   Frequency:   19
Code: 1 2 2   Frequency:   11
Code: 2 2 2   Frequency:   34
Sample Output
Theta = prevalence of disease
Alpha = false positive rate (1 - alpha = specificity)
Beta = false negative rate (1 - beta = sensitivity)

Likelihood = -891.1428

Estimated Theta value = .0545    95% C.I. = ( .040, .069)

Test   Estimated Beta    S.E.    95% C.I.
1      .2350             .067    ( .105, .365)
2      .3565             .067    ( .226, .487)
3      .2509             .067    ( .119, .383)

Test   Estimated Alpha   S.E.    95% C.I.
1      .0109             .004    ( .004, .018)
2      .0354             .005    ( .026, .045)
3      .0100             .003    ( .003, .017)

NOTE: CI's are based on approximate normal theory. SE's and CI's may be inaccurate with small sample sizes, or if parameter estimates are at or near their maximum or minimum values (1 or 0).

Test Results   Probability of Being a Case
1 1 1          .0013
2 1 1          .2743
1 2 1          .0593
2 2 1          .9489
1 1 2          .2754
2 1 2          .9912
1 2 2          .9492
2 2 2          .9998
Programs written by:

S.D. Walter, Ph.D., Professor
McMaster University
Department of Clinical Epidemiology and Biostatistics
1200 Main Street West, Room HSC 2C16
Hamilton, Ontario L8N 3Z5 Canada
E-Mail WALTER@FHS.MCMASTER.CA

These programs estimate the error rates of diagnostic tests or measurements when there is no gold standard. Maximum likelihood estimation methods are applied to latent class models representing the observed data.

1. LATENT1 (Version 3) - used when all the observations are subject to error, i.e. there are no gold standard measurements. There must be 3 or more observations per patient.

2. LATENT2 - used when there are 2 diagnostic measurements, and there are definitive gold standard assessments available in follow-up for patients with one or two positive results. Patients with both initial tests negative have no further observations made, and so may be true disease cases or true non-cases.

3. LATENT3 - similar to LATENT2, but there are three initial tests. Patients with 3 negative results have no further follow-up; other patients have a gold standard diagnosis available.

These programs are available through Blackboard in the section on Assignments.
Issues

• More than two tests are required, although some methods allow the use of multiple determinations using the same test, in different populations or in different time periods
• Some, but not all, methods require the assumption of independence between tests conditional on true disease status
• Specialized software needs to be developed for each method
• Some methods require intensive computing
Return to Verification Bias

If we can estimate sensitivity and specificity when no patient gets the gold standard procedure (because there is no such procedure), we should be able to estimate sensitivity and specificity when some patients get the gold standard procedure and others do not. A description of how this can be done is in the following articles: 1. Zhou XH. Correcting for verification bias in studies of a diagnostic test's accuracy. Statistical Methods in Medical Research. 1998;7:337-353. 2. Zhou XH, Higgs RE. Assessing the relative accuracies of two screening tests in the presence of verification bias. Statistics in Medicine. 2000;19(11-12):1697-1705.
Issues

• Some, but not all, methods assume that the probability of verifying a patient does not depend on the unobserved disease status
• Current methods focus on diagnostic tests with binary or ordinal results, and do not address diagnostic tests with continuous results. (Note that most diagnostic test analysis divides continuously scaled results into categories that then can be treated as ordinal results, for example, with ROC curves.)
• For the analysis of diagnostic tests with continuously scaled results using ROC data, see Zhou XH, Castelluccio P, and Zhou C. Nonparametric estimation of ROC curves in the absence of the gold standard. Biometrics. 2005;61:600-609.