Chapter 6-3. Imperfect Reference Tests

*** This chapter is under construction ***

Pepe (2003, pp. 194-196) states the following about imperfect reference standards, or imperfect reference tests:

"The best available reference tests may themselves be subject to error. When used as a reference against which new tests are compared, error in the reference test can influence the perceived accuracy of new tests. …sensitivity and specificity can be biased in either direction, depending on the error in the reference. The terms relative sensitivity and relative specificity are sometimes used for sensitivity and specificity when the reference test is subject to error."

The FDA (2007, p.24) suggests the following when using an imperfect reference test. In place of "gold standard" or "reference test," label the imperfect reference test a "non-reference standard." Make it explicitly clear which of the two tests being compared is acting as the non-reference standard. Then, compute sensitivity in the usual way, using the non-reference standard as if it were the gold standard, but call this statistic "positive percent agreement." Likewise, compute specificity in the usual way, but call it "negative percent agreement."

Article Suggestion

Here is an example of what you could state in your Statistical Methods:

When the reference test, or gold standard, is subject to error, the perceived accuracy of the new test can be affected, with sensitivity and specificity biased in either direction, depending on the error in the reference test (Pepe, 2003). The simplest approach to this test comparison situation is to compute sensitivity and specificity using the usual formulas, but then call the test characteristics something else to inform the reader that an imperfect reference standard has been used. The terms relative sensitivity and relative specificity, in place of sensitivity and specificity, have been suggested (Pepe, 2003). A similar suggestion is positive percent agreement and negative percent agreement in place of sensitivity and specificity (FDA, 2007).

Considering Direct Fluorescent Antibody as a reference standard subject to error, or imperfect gold standard, the terms positive % agreement and negative % agreement are used to report the test characteristics of the new test, FilmArray RP. Positive % agreement (sensitivity) is the percent of the time that FilmArray RP detected a virus when Direct Fluorescent Antibody detected it. Similarly, negative % agreement (specificity) is the percent of the time that FilmArray RP did not detect a virus when Direct Fluorescent Antibody did not detect it.

When the two tests disagree, this shows up in the off-diagonal cells of the paired 2 x 2 table. Such tables are reported in Table 2. If the disagreements are random, that is, just as likely to occur in either direction, the two off-diagonal cell counts should be approximately equal. To test this, the McNemar test was used. If the two discordant cell counts were small, an exact McNemar test was used (Siegel and Castellan, 1988). It was hypothesized that FilmArray RP would detect viruses more frequently than Direct Fluorescent Antibody, so the McNemar test is also reported in Table 2.

Of course, if this is so, the argument that the excess detection is due to greater sensitivity of FilmArray RP, rather than simply more misclassifications in the positive direction, requires a justification beyond the test characteristics computed in this study. This justification could take the form of a separate study in which both tests are compared against a true gold standard, or an indisputable advantage of the new test that predicts this direction of the discordance.
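To make the computations concrete, here is a minimal Stata sketch. The four cell counts (40, 9, 2, 55) are hypothetical, not data from the FilmArray RP study. Stata's mcci command is the immediate form of the McNemar test: given the four cell counts of the paired 2 x 2 table, it reports both McNemar's chi-squared test and the exact (binomial) McNemar significance probability, so it covers both versions of the test described above.

    * Hypothetical paired 2 x 2 table (rows: FilmArray RP, columns: DFA)
    *
    *                        DFA +    DFA -
    *    FilmArray RP +         40        9
    *    FilmArray RP -          2       55
    *
    * mcci takes the cell counts a b c d (row by row) and reports
    * McNemar's chi-squared test plus the exact (binomial) McNemar
    * significance probability for the discordant cells (9 and 2)
    mcci 40 9 2 55

    * Positive % agreement = a/(a+c), negative % agreement = d/(b+d),
    * treating the non-reference standard (DFA) as if it were the
    * gold standard, per the FDA (2007) terminology
    display "positive % agreement = " %5.1f 100*40/(40+2)
    display "negative % agreement = " %5.1f 100*55/(9+55)

With these made-up counts, positive % agreement is 40/42 = 95.2% and negative % agreement is 55/64 = 85.9%, and the McNemar test assesses whether the 9 FilmArray RP-only detections exceed the 2 Direct Fluorescent Antibody-only detections beyond chance. If the two tests were recorded as binary variables in the data, the variable form mcc would be used instead of the immediate form mcci.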
References

FDA. (2007). Statistical guidance on reporting results from studies evaluating diagnostic tests. Center for Devices and Radiological Health, March 13, 2007. http://www.fda.gov/cdrh/osb/guidance/1620.pdf.

Pepe MS. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. New York, Oxford University Press.

Rutjes AW, Reitsma JB, Coomarasamy A, et al. (2007). Evaluation of diagnostic tests when there is no gold standard. A review of methods. Health Technol Assess 11(50):iii, ix-51.

Siegel S, Castellan NJ Jr. (1988). Nonparametric Statistics for the Behavioral Sciences, 2nd ed. New York, McGraw-Hill.