EP550
Verification Bias and Tarnished Gold Standards
Wednesday, February 2, 2011

Verification Bias

In 1997, a meta-analysis of 147 published reports summarized the operating characteristics of the exercise-ECG test for diagnosing coronary artery disease (CAD) as follows (Journal of the American College of Cardiology 1997;30:260-311):

              CAD Present   CAD Absent
ECG positive       7,830        2,896     10,726
ECG negative       3,686        9,662     13,348
                  11,516       12,558     24,074

Sensitivity = 7,830 / 11,516 = .68
Specificity = 9,662 / 12,558 = .77

The assigned reading for this material, which is optional, is Greenes and Begg, "Assessment of Diagnostic Technologies: Methodology for Unbiased Estimation from Samples of Selectively Verified Patients." (Note that the article uses a different, but equally valid, definition of negative predictive value.)

One year later, a study in which all patients had both exercise-ECG testing and coronary angiography described the operating characteristics of the exercise-ECG test as follows (Ann Intern Med 1998;128:965-974):

              CAD Present   CAD Absent
ECG positive         185           60        245
ECG negative         226          343        569
                     411          403        814

Sensitivity = 185 / 411 = .45
Specificity = 343 / 403 = .85

Both studies enrolled similar patients, performed the exercise-ECG test the same way, and used coronary angiography as the gold standard procedure for deciding whether coronary artery disease was present or not. Why did the two studies report different sensitivities and specificities?

In the meta-analysis, most patients who had a positive result after exercise-ECG testing had coronary angiography. In contrast, many patients who had a negative result after exercise-ECG testing did not have coronary angiography. Therefore, most patients with a positive test result were included in the calculations, and many patients with a negative test result were excluded from the calculations.
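The operating characteristics above can be computed from the 2x2 table with a short helper. This is an illustrative sketch (the function and variable names are mine, not part of the handout):

```python
# Operating characteristics of a diagnostic test from a 2x2 table
# (illustrative helper; names are my own).
def operating_characteristics(tp, fp, fn, tn):
    """Return sensitivity, specificity, PPV, and NPV for a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives / all with disease
        "specificity": tn / (tn + fp),  # true negatives / all without disease
        "ppv": tp / (tp + fp),          # true positives / all positive tests
        "npv": tn / (tn + fn),          # true negatives / all negative tests
    }

# Meta-analysis table (JACC 1997): CAD present vs absent by ECG result.
meta = operating_characteristics(tp=7830, fp=2896, fn=3686, tn=9662)
print({k: round(v, 2) for k, v in meta.items()})
# {'sensitivity': 0.68, 'specificity': 0.77, 'ppv': 0.73, 'npv': 0.72}
```

The PPV and NPV printed here reappear below, where they drive the adjustment for verification bias.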
For example, more people in the meta-analysis (45%) had positive test results than in the second study (30%). If all the missing people with negative test results were added to the table, the number of people with false-negative test results would increase, the total number of people with disease would increase, and the sensitivity would go down. Also, the number of people with true-negative test results would increase, and the specificity would go up. Therefore, the measured sensitivity is higher than the actual sensitivity, and the measured specificity is lower than the actual specificity.

Adjusting for Verification Bias

It is possible to adjust for verification bias. Making this adjustment requires two things. One, you must know the total numbers of people with each type of test result, including the people who did not have the gold standard procedure. Two, you must assume that the meaning of a test result in the people who had the gold standard procedure is the same as the meaning of a test result in the people who did not have the gold standard procedure. Specifically, you must assume that the positive and negative predictive values in people who had the gold standard procedure are the same as the positive and negative predictive values in people who did not have the gold standard procedure.

This problem was avoided in the second study because every person who had an exercise-ECG test also had coronary angiography, and thus every person was included in the calculations.

When the results of a diagnostic test affect whether the gold standard procedure is used to verify the test result, verification bias is introduced. This problem is also called work-up bias or referral bias. Verification bias is common because many gold standard procedures, such as biopsy, surgery, and angiography, are invasive, risky, and expensive.
Under these conditions, physicians are reluctant to refer patients for the gold standard procedure, and patients are reluctant to undergo the gold standard procedure, unless preliminary diagnostic tests yield positive results. Other recent examples are described in NEJM 2003;349:335-342; Arch Int Med 2007;167:161-165; and Am J Rad 2009;193:1596-1602.

An Example

Suppose that 80% of people with a positive exercise-ECG test result in the meta-analysis had coronary angiography -- the remaining 20% skipped angiography and were treated medically, either because they had low-risk treadmill scores or because their comorbidities precluded mechanical revascularization. Also, suppose that 25% of people with negative exercise-ECG test results had angiography, because their clinical story was convincing and follow-up exercise imaging with radionuclides or echocardiography was positive.

To adjust for verification bias, start with the original table:

              CAD Present   CAD Absent
ECG positive       7,830        2,896     10,726
ECG negative       3,686        9,662     13,348
                  11,516       12,558     24,074

Sensitivity = 7,830 / 11,516 = .68
Specificity = 9,662 / 12,558 = .77
PPV = 7,830 / 10,726 = .73
NPV = 9,662 / 13,348 = .72

Then calculate new row totals that correspond to the original sample of people who underwent testing. We know that 80% of people with a positive test result were included in the analyses, so there were 10,726/.8, or 13,408, total people with a positive test result. Also, we know that 25% of people with a negative test result were included in the analyses, so there were 13,348/.25, or 53,392, total people with a negative test result.

Revised Table

Assume that a test result means the same thing whether it occurred in someone included in or excluded from the original table. Therefore, the PPV and the NPV are unchanged in the revised table. The next step is to multiply the PPV times the row total for a positive test to get the number of true-positive tests (.73 x 13,408 = 9,788). Similarly, multiply the NPV times the row total for a negative test to get the number of true-negative tests (.72 x 53,392 = 38,648). By subtraction you can fill in the other cells. Then calculate the column totals. Finally, calculate sensitivity and specificity.

              CAD Present   CAD Absent
ECG positive       9,788        3,620     13,408
ECG negative      14,744       38,648     53,392
                  24,532       42,268     66,800

Sensitivity = 9,788 / 24,532 = .40
Specificity = 38,648 / 42,268 = .91

Compared with the values reported in the meta-analysis, the adjusted sensitivity is lower, and the adjusted specificity is higher. Because we have adjusted for the effects of verification bias, the new sensitivity and specificity should be the same as the values reported in the optimal study that had every subject undergo both an exercise-ECG test and coronary angiography.

                                            Sensitivity   Specificity
Original meta-analysis                          .68           .77
Optimal study                                   .45           .85
Meta-analysis adjusted with PPV and NPV         .40           .91

Why are the adjusted values for the meta-analysis different from the values in the optimal study?

The correction assumed that PPV and NPV were the same for people who were and were not verified with coronary angiography (a/(a+b) = A/(A+B) and d/(c+d) = D/(C+D)):

            Verified              Not Verified
         CAD+    CAD-            CAD+    CAD-
ECG +      a       b     ECG +     A       B
ECG -      c       d     ECG -     C       D

Therefore, the correction assumed that the ratios of people with and without disease who had a positive test result were the same (a/b = A/B) and that the ratios of people with and without disease who had a negative test result were the same (c/d = C/D). This assumption means that people with disease are as likely to be verified as people without disease.

Consider two people with suspected CAD, both with negative test results. One person is a 50-year-old woman with atypical chest pain and no risk factors for CAD. The other person is a 65-year-old man who smokes cigarettes and has typical angina and diabetes mellitus.
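The adjustment walked through above can be scripted as a Begg-Greenes-style calculation. A minimal sketch (function and variable names are mine), using the meta-analysis table and the assumed 80%/25% verification fractions:

```python
# Begg-Greenes-style adjustment for verification bias (illustrative sketch).
# Inputs: the verified 2x2 table plus the fraction of positive and negative
# index-test results that were verified with the gold standard.

def adjust_for_verification_bias(tp, fp, fn, tn, frac_pos_verified, frac_neg_verified):
    """Return adjusted sensitivity and specificity, assuming PPV and NPV
    are the same in verified and unverified patients."""
    ppv = tp / (tp + fp)                       # predictive values from verified patients
    npv = tn / (tn + fn)
    total_pos = (tp + fp) / frac_pos_verified  # row totals for everyone tested
    total_neg = (tn + fn) / frac_neg_verified
    TP = ppv * total_pos                       # redistribute row totals using PPV/NPV
    TN = npv * total_neg
    FP = total_pos - TP
    FN = total_neg - TN
    return TP / (TP + FN), TN / (TN + FP)      # sensitivity, specificity

sens, spec = adjust_for_verification_bias(7830, 2896, 3686, 9662, 0.80, 0.25)
print(round(sens, 2), round(spec, 2))  # 0.4 0.91
```

This reproduces the revised table's sensitivity (.40) and specificity (.91); tiny differences from the hand calculation come from carrying PPV and NPV at full precision instead of rounding them to .73 and .72.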
The woman is less likely to have CAD, and she is less likely to have her negative exercise-ECG result "verified" with angiography. In contrast, the man is more likely to have CAD, and he is more likely to have his negative exercise-ECG result "verified" with angiography. A similar, but perhaps less powerful, effect likely occurs when the test result is positive. Therefore, in situations like this one, people with disease are more likely to have their test result verified with the gold standard procedure than people without disease.

The process that we call verification bias, and that produces the numbers in the tables we usually see, has two effects. One effect is related to the relative absence from the typical table of people with negative test results: the tables we see under-represent the bottom (ECG negative) row. Another effect is related to the relative excess in the typical table of people with disease: the tables we see over-represent the left (CAD present) column. The table we actually see has numbers that are a combination of these two effects.

The combination of these two effects means that the adjustment used a PPV that was too big, because the number of true-positive test results was too big relative to the number of false-positive test results, and an NPV that was too small, because the number of true-negative test results was too small relative to the number of false-negative test results. Therefore, we should have used a PPV lower than .73, for example .60, and an NPV higher than .72, for example .85.

Adjustment with the original PPV (.73) and NPV (.72):

              CAD Present   CAD Absent
ECG positive       9,788        3,620     13,408
ECG negative      14,744       38,648     53,392
                  24,532       42,268     66,800

Sensitivity = 9,788 / 24,532 = .40
Specificity = 38,648 / 42,268 = .91

Adjustment with the new PPV (.60) and NPV (.85):

              CAD Present   CAD Absent
ECG positive       8,044        5,364     13,408
ECG negative       8,009       45,383     53,392
                  16,053       50,747     66,800

Sensitivity = 8,044 / 16,053 = .50
Specificity = 45,383 / 50,747 = .89

A lower PPV decreases the number of true-positive test results, which alone would lower the sensitivity, and it increases the number of false-positive test results, which alone would lower the specificity. A higher NPV increases the number of true-negative test results, which alone would raise the specificity, and it decreases the number of false-negative test results, which alone would raise the sensitivity. If the changes in PPV and NPV are about the same size, as they are in this example, the effects on sensitivity and specificity are dominated by the higher NPV because there are more total negative than total positive test results to start with. Therefore, the new, adjusted sensitivity is higher than the old, adjusted sensitivity, and the new, adjusted specificity is lower than the old, adjusted specificity. The new, adjusted sensitivity is between the original sensitivity and the old, adjusted sensitivity, and the new, adjusted specificity is between the original specificity and the old, adjusted specificity.

                                                     Sensitivity   Specificity
Original meta-analysis                                   .68           .77
Meta-analysis adjusted with new PPV and NPV              .50           .89
Optimal study                                            .45           .85
Meta-analysis adjusted with original PPV and NPV         .40           .91

Tarnished Gold Standards

Not all gold standards are perfectly accurate. Many, maybe most, make errors when they are used to assign patients to disease or disease-free states, and these errors complicate how we interpret the results of other diagnostic tests. Consider the following example, taken from Biesheuvel C, Irwig L, Bossuyt P. Observed differences in diagnostic test accuracy between patient subgroups: Is it real or due to reference standard misclassification? Clinical Chemistry 2007;53(10):1725-1729.
(A final point about verification bias: in most situations affected by verification bias, the true values for sensitivity and specificity lie between the values reported in the original article and the values calculated using the adjustment method described by Greenes and Begg. The original and adjusted values, however, may be useful because they define ranges surrounding the true values for sensitivity and specificity.)

Perfect Gold Standard

                       Disease Present   Disease Absent
Index test positive           80               270          350
Index test negative           20               630          650
                             100               900        1,000

True prevalence = 100/1000 = 0.10
True sensitivity = 80/100 = 0.80
True specificity = 630/900 = 0.70

Tarnished Gold Standard

                       Disease Present   Disease Absent
Index test positive           99               251          350
Index test negative           81               569          650
                             180               820        1,000

Observed prevalence = 180/1000 = 0.18
Observed sensitivity = 99/180 = 0.55
Observed specificity = 569/820 = 0.69

Note that the row totals (the total numbers of positive and negative results for the index test) do not change when an imperfect standard replaces a perfect standard. The numbers in all four cells and the column totals do change, however, because the inaccurate gold standard shifts some results from left cells to right cells and other results from right cells to left cells. Because there are more results in right cells to start with (the prevalence of disease is relatively low), more are shifted left than are shifted right. The net effect is for the numbers in the left column to go up and the numbers in the right column to go down, which causes the observed prevalence to increase. Because there are more results in the bottom row than in the top row to start with, more results are shifted from right to left in the bottom row. This change increases the number of false negatives (bottom-left cell), and thus the column total, more than it increases the number of true positives (top-left cell), which causes the observed sensitivity to go down.
In the right column, the big decrease that occurs in the number of true negatives (bottom-right cell) also occurs in the column total, which causes the observed specificity to change less. The net result is a big decrease in sensitivity and not much change in specificity.

Consider a second example, which has a high prevalence of disease:

Perfect Gold Standard

                       Disease Present   Disease Absent
Index test positive          640                60          700
Index test negative          160               140          300
                             800               200        1,000

True prevalence = 800/1000 = 0.80
True sensitivity = 640/800 = 0.80
True specificity = 140/200 = 0.70

Note that the sensitivity and specificity do not change when the prevalence changes if the gold standard is perfect (compare the true values in this example with the true values in the previous example).

Tarnished Gold Standard

                       Disease Present   Disease Absent
Index test positive          582               118          700
Index test negative          158               142          300
                             740               260        1,000

Observed prevalence = 740/1000 = 0.74
Observed sensitivity = 582/740 = 0.79
Observed specificity = 142/260 = 0.55

As before, the row totals do not change when an imperfect standard replaces a perfect standard. The numbers in all four cells and the column totals do change, however, because the inaccurate gold standard shifts some results from left cells to right cells and other results from right cells to left cells. Because there are more results in left cells to start with (the prevalence of disease is relatively high), more of them are shifted right than are shifted left, which causes the observed prevalence to go down. Because there are more results in the top row to start with, more results are shifted from left to right in the top row than in the bottom row. This change increases the number of false positives (top-right cell), and thus the column total, more than it increases the number of true negatives (bottom-right cell), which causes the observed specificity to go down.
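The cell shifts in these two examples can be simulated directly. The sketch below assumes (my assumption; the handout does not state the tarnished standard's error rates) that the reference standard misclassifies independently of the index test result, with sensitivity and specificity of 0.90 each:

```python
# Simulate what an imperfect ("tarnished") gold standard does to an index
# test's observed 2x2 table. Assumption (mine, not stated in the handout):
# the reference standard errs independently of the index test result,
# with reference sensitivity = specificity = 0.90.

def observe_with_tarnished_standard(true_table, ref_sens=0.90, ref_spec=0.90):
    """true_table maps (index_result, disease) -> count; returns the
    table as classified by the imperfect reference standard."""
    observed = {("+", "D"): 0.0, ("+", "ND"): 0.0, ("-", "D"): 0.0, ("-", "ND"): 0.0}
    for (index, disease), n in true_table.items():
        if disease == "D":   # truly diseased: labeled "D" with prob ref_sens
            observed[(index, "D")] += n * ref_sens
            observed[(index, "ND")] += n * (1 - ref_sens)
        else:                # truly non-diseased: labeled "ND" with prob ref_spec
            observed[(index, "ND")] += n * ref_spec
            observed[(index, "D")] += n * (1 - ref_spec)
    return {k: round(v) for k, v in observed.items()}

# First example: low prevalence (0.10).
true_low = {("+", "D"): 80, ("+", "ND"): 270, ("-", "D"): 20, ("-", "ND"): 630}
print(observe_with_tarnished_standard(true_low))
# {('+', 'D'): 99, ('+', 'ND'): 251, ('-', 'D'): 81, ('-', 'ND'): 569}
```

Run on the second example's true table (640, 60, 160, 140), the same 0.90/0.90 assumption yields the observed 582, 118, 158, 142, so these error rates are consistent with both of the handout's tarnished tables.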
In the left column, the big decrease that occurs in the number of true positives (top-left cell) also occurs in the column total, which causes the observed sensitivity to change less. The net result is a big decrease in specificity and not much change in sensitivity.

This property can be used to determine how good the gold standard is. Apply the index test and the gold standard to subpopulations of people who have different probabilities of disease, for example, because they have more or fewer risk factors, because they have more or fewer symptoms, or because they have more severe or less severe symptoms. Then look for changes in the sensitivity and specificity of the index test.

In summary, when the gold standard is perfect, the sensitivity and specificity of other diagnostic tests do not change when the prevalence of disease changes. When the gold standard is imperfect, however, the sensitivity and specificity of other diagnostic tests do change when the prevalence of disease changes. In most real-world situations, when prevalence changes from low to high, sensitivity increases and specificity decreases.

Caveats

The changes cited in these examples occur under many conditions, but they do not occur under all conditions. The shifts in numbers among cells that occur as one moves from a perfect to a tarnished gold standard depend on the relationships among the prevalence of disease (p) and the tarnished gold standard's sensitivity (sensTGS) and specificity (specTGS). When

p = (1 - specTGS) / (2 - (sensTGS + specTGS))

the shifts from right to left and from left to right in the 2-by-2 table as one moves from a perfect to a tarnished gold standard are equal and thus cancel each other out. When p is lower, the net shifts in cell numbers are from right to left. When p is higher, the net shifts in cell numbers are from left to right. For example, if sensTGS = 0.9 and specTGS = 0.9, then p = 0.5. At any p < 0.5, the net shifts in cell numbers as one moves from a perfect to a tarnished gold standard are from right to left. At any p > 0.5, the net shifts are from left to right.

The changes cited in these examples are larger when the sum of the index test's sensitivity and specificity is larger, and they approach zero as the sum of the index test's sensitivity and specificity decreases toward 1.0. (Note that any combination of the index test's sensitivity and specificity that sums to 1.0 lies on the 45-degree diagonal of the ROC curve, which indicates "no information.")

Note that the p values referred to here are prevalence values determined in a cohort study. They are not the arbitrary p values in case-control studies.

No Gold Standard

In other circumstances there is no gold standard procedure to verify other diagnostic test results. In these circumstances, there are ways to estimate test sensitivity and specificity using maximum likelihood estimation methods applied to latent class models. These methods are reviewed in the following paper: Reitsma JB, Rutjes AW, Khan KS, Coomarasamy A, Bossuyt PM. A review of solutions for diagnostic accuracy studies with an imperfect or missing reference standard. Journal of Clinical Epidemiology 2009;62(8):797-806. Software that calculates sensitivity and specificity when there is no gold standard is available at http://www.epi.ucdavis.edu/diagnostictests/tags.html. Additional information is available at Statistics in Medicine 2009;28:3108-3123.

Sample Input

Number of Tests: 3
Number of Cases: 1692
1 = Test Result Negative
2 = Test Result Positive

Results of Tests 1, 2, 3:

Code: 1 1 1   Frequency: 1513
Code: 2 1 1   Frequency: 23
Code: 1 2 1   Frequency: 59
Code: 2 2 1   Frequency: 12
Code: 1 1 2   Frequency: 21
Code: 2 1 2   Frequency: 19
Code: 1 2 2   Frequency: 11
Code: 2 2 2   Frequency: 34

Sample Output

Theta = prevalence of disease
Alpha = false positive rate (1 - alpha = specificity)
Beta = false negative rate (1 - beta = sensitivity)

Likelihood = -891.1428

Estimated Theta value = .0545   95% C.I. = (.040, .069)

Test   Estimated Beta    S.E.    95% C.I.
1          .2350         .067    (.105, .365)
2          .3565         .067    (.226, .487)
3          .2509         .067    (.119, .383)

Test   Estimated Alpha   S.E.    95% C.I.
1          .0109         .004    (.004, .018)
2          .0354         .005    (.026, .045)
3          .0100         .003    (.003, .017)

NOTE: CI's are based on approximate normal theory. SE's and CI's may be inaccurate with small sample sizes, or if parameter estimates are at or near their maximum or minimum values (1 or 0).

Test Results   Probability of Being a Case
1 1 1               .0013
2 1 1               .2743
1 2 1               .0593
2 2 1               .9489
1 1 2               .2754
2 1 2               .9912
1 2 2               .9492
2 2 2               .9998

Programs written by:

S.D. Walter, Ph.D., Professor
McMaster University
Department of Clinical Epidemiology and Biostatistics
1200 Main Street West, Room HSC 2C16
Hamilton, Ontario L8N 3Z5 Canada
E-Mail: WALTER@FHS.MCMASTER.CA

These programs estimate the error rates of diagnostic tests or measurements when there is no gold standard. Maximum likelihood estimation methods are applied to latent class models representing the observed data.

1. LATENT1 (Version 3) - used when all the observations are subject to error, i.e. there are no gold standard measurements. There must be 3 or more observations per patient.
2. LATENT2 - used when there are 2 diagnostic measurements, and there are definitive gold standard assessments available in follow-up for patients with one or two positive results. Patients with both initial tests negative have no further observations made, and so may be true disease cases or true non-cases.
3. LATENT3 - similar to LATENT2, but there are three initial tests. Patients with 3 negative results have no further follow-up; other patients have a gold standard diagnosis available.

These programs are available through Blackboard in the section on Assignments.

Issues

• More than two tests are required, although some methods allow the use of multiple determinations using the same test, in different populations or in different time periods
• Some, but not all, methods require the assumption of independence between tests conditional on true disease status
• Specialized software needs to be developed for each method
• Some methods require intensive computing

Return to Verification Bias

If we can estimate sensitivity and specificity when no patient gets the gold standard procedure (because there is no such procedure), we should be able to estimate sensitivity and specificity when some patients get the gold standard procedure and others do not. A description of how this can be done is in the following articles:

1. Zhou XH. Correcting for verification bias in studies of a diagnostic test's accuracy. Statistical Methods in Medical Research 1998;7:337-353.
2. Zhou XH, Higgs RE. Assessing the relative accuracies of two screening tests in the presence of verification bias. Statistics in Medicine 2000;19(11-12):1697-1705.

Issues

• Some, but not all, methods assume that the probability of verifying a patient does not depend on the unobserved disease status
• Current methods focus on diagnostic tests with binary or ordinal results, and do not address diagnostic tests with continuous results.
(Note that most diagnostic test analyses divide continuously scaled results into categories that then can be treated as ordinal results, for example, with ROC curves.)
• For the analysis of diagnostic tests with continuously scaled results using ROC data, see Zhou XH, Castelluccio P, Zhou C. Nonparametric estimation of ROC curves in the absence of the gold standard. Biometrics 2005;61:600-609.
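The LATENT1-style latent class estimation illustrated by the sample input and output above can be sketched with a short EM algorithm. This is my own minimal reimplementation under the conditional-independence assumption, not Walter's program; run on the sample input, the estimates should settle near the sample output (theta near .0545):

```python
# Minimal EM for a two-class latent class model with three conditionally
# independent binary tests (illustrative reimplementation in the spirit of
# LATENT1; not Walter's actual program).

# Sample input from the handout, recoded 0 = negative, 1 = positive.
patterns = {(0, 0, 0): 1513, (1, 0, 0): 23, (0, 1, 0): 59, (1, 1, 0): 12,
            (0, 0, 1): 21, (1, 0, 1): 19, (0, 1, 1): 11, (1, 1, 1): 34}
N = sum(patterns.values())    # 1692 cases

theta = 0.10                  # starting prevalence (arbitrary)
sens = [0.8, 0.8, 0.8]        # starting sensitivities (1 - beta)
fpr = [0.05, 0.05, 0.05]      # starting false-positive rates (alpha)

for _ in range(2000):
    # E-step: posterior probability of disease for each response pattern.
    post = {}
    for y, n in patterns.items():
        p_d, p_nd = theta, 1 - theta
        for j in range(3):
            p_d *= sens[j] if y[j] else (1 - sens[j])
            p_nd *= fpr[j] if y[j] else (1 - fpr[j])
        post[y] = p_d / (p_d + p_nd)
    # M-step: update prevalence and per-test error rates.
    n_dis = sum(n * post[y] for y, n in patterns.items())  # expected diseased
    theta = n_dis / N
    for j in range(3):
        sens[j] = sum(n * post[y] * y[j] for y, n in patterns.items()) / n_dis
        fpr[j] = sum(n * (1 - post[y]) * y[j] for y, n in patterns.items()) / (N - n_dis)

print(round(theta, 4),
      [round(1 - s, 3) for s in sens],   # betas (false negative rates)
      [round(a, 3) for a in fpr])        # alphas (false positive rates)
```

The posterior probabilities computed in the E-step correspond to the "Probability of Being a Case" column in the sample output; starting values matter, because a label-swapped solution (disease and non-disease classes exchanged) fits equally well.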