Users' Guide to the Urological Literature

How to Use an Article About a Diagnostic Test
Charles D. Scales, Jr.,* Philipp Dahm, Shahnaz Sultan, Denise Campbell-Scherer
and P. J. Devereaux
From the Division of Urology, Department of Surgery, Duke University (CDS), Durham, North Carolina, Departments of Urology (PD)
and Medicine (SS), College of Medicine, University of Florida, Gainesville, Florida, Department of Family Medicine, University of
Michigan (DCS), Ann Arbor, Michigan, and Departments of Medicine and Clinical Epidemiology and Biostatistics, McMaster University
(PJD), Hamilton, Ontario, Canada
Purpose: Urologists frequently confront diagnostic dilemmas, prompting them to select, perform and interpret additional
diagnostic tests. Before applying a given diagnostic test the user should ascertain that the chosen test would indeed help
decide whether the patient has a particular target condition. In this article in the Users’ Guide to the Urological Literature
series we illustrate the guiding principles of how to critically appraise a diagnostic test, interpret its results and apply its
findings to the care of an individual patient.
Materials and Methods: The guiding principles of how to evaluate a diagnostic test are introduced in the setting of a clinical
scenario. We propose a stepwise approach that addresses the question of whether the study results are likely to be valid, what
the results are and whether these results would help urologists with the treatment of their individual patients.
Results: Some of the issues urologists should consider when assessing the validity of a diagnostic test study are how the
authors assembled the study population, whether they used blinding to minimize bias and whether they used an appropriate
reference standard in all patients to determine the presence or absence of the target disorder. Urologists should next evaluate
the properties of the diagnostic test that indicate the direction and magnitude of change in the probability of disease for a
particular test result. Finally, urologists should ask a series of questions to understand how the diagnostic test may impact
the care of their patients.
Conclusions: Application of the guides presented in this article will allow urologists to critically appraise studies of
diagnostic tests. Determining the study validity, understanding the study results and assessing the applicability to patient
care are 3 fundamental steps toward an evidence-based approach to choosing and interpreting diagnostic tests.
Key Words: urology, evidence-based medicine, likelihood functions, diagnostic techniques and procedures,
sensitivity and specificity
Submitted for publication January 25, 2008.
* Correspondence: Department of Surgery, Division of Urology,
Box 2922, Duke University Medical Center, Durham, North Carolina 27710 (telephone: 919-668-4605; FAX: 919-681-7423; e-mail:
chuck.scales@duke.edu).
0022-5347/08/1802-0469/0
THE JOURNAL OF UROLOGY®
Copyright © 2008 by AMERICAN UROLOGICAL ASSOCIATION
Vol. 180, 469-476, August 2008
Printed in U.S.A.
DOI:10.1016/j.juro.2008.04.026
Urologists frequently face diagnostic dilemmas, prompting them to select, perform and interpret additional
diagnostic tests. Ideally the selection and interpretation of diagnostic tests should reflect an evidence-based
practice with the intent of helping the urologist decide
whether a patient has a particular disease. We did not aim
to provide a comprehensive evaluation of a specific diagnostic test, which would require a critical appraisal of the entire
relevant body of literature. Instead, in this third article in
the Users’ Guide to the Urological Literature series we used
a clinical scenario to illustrate the guiding principles of how
to critically appraise an individual study of a diagnostic test.
CLINICAL SCENARIO
You are a busy urologist and frequently see patients referred
for hematuria. While recently attending the annual meeting
of the American Urological Association, you hear about a
point of care urine testing system to enhance the detection of
bladder cancer. Your current diagnostic algorithm includes
cystoscopy and cytology, in addition to appropriate imaging,
but cytology requires several days for results. A point of care
system could be a convenient adjunct to your current practice. You decide to perform a literature search to find a study
of the diagnostic accuracy of the point of care urine testing
system that you heard about at the meeting, ie NMP22®.
SEARCH OF THE MEDICAL LITERATURE
Applying skills gained from the article How to Perform a
Literature Search1 in the Users’ Guide to the Urological
Literature series, you use the PICOT (population, intervention, comparison, outcome and type) framework to formulate
a focused clinical question to guide your search: “In patients
at risk for bladder cancer (population), how does NMP22
(intervention) compare to cystoscopy with urinary cytology
(comparison) in making the diagnosis of bladder cancer (outcome)?” This study question of diagnosis would best be answered in a prospective trial (study type) or cohort study.
Ideally you would hope to find a systematic review of
several high quality studies addressing this specific topic or
an evidence-based synopsis on this topic. Therefore, you
direct your first search to the Evidence-Based Medicine Reviews function in Ovid®, which is available to you through
your local institution, and use NMP22 as a search term. This
query returns 3 abstracts, of which none is timely and specifically addresses your clinical question.
You next use the Clinical Queries feature in PubMed®.
You separately search the terms bladder cancer, NMP22
and cytology, which you then combine with the AND function. This yields 114 studies (as accessed November 23,
2007). You then paste the combined search into the Clinical
Queries function for diagnosis, which limits the search to 78
studies. To further minimize the number of studies you limit
your search to the English literature of the last 5 years,
which yields 25 articles. When scanning these titles, you
notice a study in JAMA®, the Journal of the American Medical Association, which you recall as a highly publicized
multicenter study of NMP22.2
The abstract confirms that this study examines the diagnostic accuracy of the NMP22 assay in a prospective, multicenter cohort study that enrolled 1,331 consecutive patients “at elevated risk for bladder cancer.”2 Based on
information that you find in the abstract all patients provided a voided urine sample for NMP22 testing and urinary
cytology before cystoscopy. The investigators concluded that
the NMP22 assay can increase the diagnostic accuracy of
cystoscopy. Intrigued by these results, you print a copy of the
full text article for further review.
HOW TO EVALUATE A STUDY
OF DIAGNOSTIC ACCURACY
The Users’ Guide to the Medical Literature recommends
that the reader of a given study consecutively ask 3 questions.3 1) Are the results of a study valid? 2) What are the
results? 3) How can I apply the results to the care of my
patient? These aspects are interdependent and each is
equally important. For example, we would not care if a new
diagnostic test for bladder cancer was accurate if we were
not convinced that the study was valid. Alternatively a well
designed study of a new test that demonstrates clinically
unimportant results would be of little use in clinical practice, although negative studies are helpful to the extent that
they support unbiased evaluation of diagnostic tests. Finally, study results must be applicable to our patients, and
the likely treatment benefits must be worth the potential
costs, inconveniences and harms. If this third criterion is not
met, study validity and the magnitude of the effect matter
little. Appendix 1 shows the application of these guidelines
to appraise a study of diagnostic accuracy.4,5
Are the Results Valid?
STARD (Standards for Reporting of Diagnostic Accuracy)
criteria serve as a comprehensive guide for the transparent
reporting of diagnostic accuracy studies.6,7 However, for the
purpose of the clinical reader, a more rapid assessment of
study validity relies on a few key questions (Appendix 1).4,5
Primary Guides
Was there an independent, blind comparison to a reference standard? To assess the performance of a diagnostic test a reference standard, sometimes referred to as a gold
standard, must exist to establish the presence or absence of
the target condition. As a reader, you must decide whether
the reference standard used is acceptable, understanding
that no test is perfect. Grossman et al used cystoscopy as the
reference standard.2 Patients were considered to have bladder cancer if a mass was visualized on initial cystoscopy (70)
or on repeat cystoscopy (9), which was only performed for
continued suspicion, within 3 months of initial cystoscopy
and the biopsy, when taken, confirmed malignancy except in
select patients who were deemed poor surgical candidates.
When considering the appropriateness of the reference standard, you recall that cystoscopy is most valuable for diagnosing papillary bladder tumors. However, cystoscopy is not
accurate for making the diagnosis of CIS, which is why you
complement cystoscopy with urinary cytology in your practice. This draws into question the role of cystoscopy as a
single reference standard for all types of bladder cancer. In
an ideal study you might have performed random bladder
biopsies in every patient, although at the cost of significant
feasibility challenges.
The second criterion to consider is blinding. Was the
person performing and interpreting the test under investigation (the index test) unaware of the results of the reference standard and vice versa? Lack of blinding can have a
substantial impact on test results and it may make the test
appear more accurate or useful than it actually is.8,9 Blinding is particularly important when interpreting an index
test or reference standard involves some degree of subjectivity. When a diagnostic test relies on subjective interpretation, ie a cytology specimen or radiographic image, 2 observers may interpret the test results differently (interobserver
variability). Diagnostic tests with low agreement among observers may not be reliable in practice.
Grossman et al explicitly stated that staff members performing the NMP22 test were blinded to the results of cystoscopy.2 In addition, physicians performing cystoscopy were
blinded to the results of the NMP22 assay. Whereas the readout of the NMP22 assay leaves little room for subjectivity, one
could easily envision that a positive index test, eg NMP22,
might lead the urologist to perform a more thorough cystoscopic examination or have a lower threshold for biopsying an
abnormal appearing bladder lesion. However, potential
sources of bias appear to have been appropriately addressed, ie
there was blinding of the index test and reference standard. In
addition, the investigators reported that “cytological examinations. . .[were conducted]. . .physically distant from cystoscopy
and NMP22 evaluations”2 and reportedly at a later time,
thereby implying that urinary cytology results did not affect
and were not affected by the other tests.
Did the patient sample include an appropriate spectrum of patients to whom the diagnostic test would be
applied in clinical practice? A diagnostic test for bladder cancer is useful to the extent that it is able to distinguish
patients with or without bladder cancer. To demonstrate
this ability you would like to see a diagnostic test studied in
a group of patients similar to yours, in which diagnostic
uncertainty exists. The study should include a broad spectrum of patients with disease of different severity, ie from
small low grade tumors to large high grade tumors, and
patients without the disease with symptoms commonly associated with bladder cancer, ie hematuria.10–17 This point
is particularly important because diagnostic tests perform
differently, depending on the disease severity in a given
patient population, which is a phenomenon known as spectrum bias.18
Therefore, you review the study population in which the
study was performed. Grossman et al reported that patients
eligible for the NMP22 study had no history of bladder
malignancy but all had “bladder cancer risk factors or symptoms, such as smoking, hematuria or dysuria.”2 Patients
were drawn from various practice settings, including private
practice, academic centers and veteran facilities. A detailed
summary of the patient indications for cystoscopy would be
helpful but this was not provided. Nevertheless, these patients appear similar to those whom you evaluate and they
appear to represent a reasonable spectrum in terms of bladder cancer risk.
Secondary Guides
Did the results of the test being evaluated influence the
decision to perform the reference standard? An assessment of diagnostic accuracy is susceptible to bias if the index
test results influence the decision to perform the reference
test (verification bias).19–25 Verification bias is more common when the reference standard is an invasive procedure.24 Flow diagrams are encouraged in the reporting of
diagnostic accuracy studies because they provide a transparent method to communicate the presence or absence of verification bias to the reader.6,7 The evaluation of the NMP22
point of care test includes a flow diagram of the study, which
demonstrates that all 1,331 patients in whom the index test
(NMP22) was done underwent the reference test (cystoscopy) regardless of the index test result.2 A total of 49
patients did not undergo cystoscopy and they were excluded
for protocol violation. While the lack of cystoscopy results in
these patients could be a potential source of verification bias
if their NMP22 test was negative, we have no reason to
believe that this was the case. We conclude that the decision
to perform cystoscopy as the reference standard was indeed
independent of the NMP22 test result.
Were the methods for performing the test described in
sufficient detail to permit replication? As clinicians, we
like to use new diagnostic technologies if we deem them
worthwhile. Therefore, it is important for studies of diagnostic accuracy to describe in sufficient detail the methodology
for performing a diagnostic test. If a test requires equipment
or specially trained personnel not available in your practice, the validity and results of the study do not really matter
since it is not a technique that you can use. Thus, we must
understand the diagnostic test to assess the feasibility of
performing it in our care setting. In addition, differences in
test protocols have been shown to produce differences in
diagnostic accuracy for some tests.9,26 Therefore, including
these details is recommended by STARD criteria.6,7 Grossman
et al provided a detailed explanation with references to the
supporting literature in their methods section.2 It appears
that NMP22 is a diagnostic test that clinic personnel with
basic training could perform and, therefore, it is likely applicable to the practice of most urologists.
Having completed our study validity assessment for the
NMP22 urine assay diagnostic accuracy report, we conclude
that by and large the study by Grossman et al2 meets the
proposed primary and secondary validity criteria. The main
issue that we found was whether cystoscopy should be considered an adequate gold standard. Since cystoscopy is not
an effective means of making the diagnosis of CIS, which is
why we perform cytology and random bladder biopsy in
addition, when appropriate, some cases of bladder cancer
may have been missed. In addition, we would have liked to
have seen a more detailed description of the study population. Apart from these issues the investigators used important methodological safeguards to avoid bias.27 In addition,
information that relates to the feasibility of reproducing
these results in your clinical practice is provided. Therefore,
you decide that the study merits further review and proceed
to the results of the study.
What are the Results?
For any patient needing a diagnostic test clinicians have a
sense of the probability of disease in that patient, which is
known as the pretest probability. For example, consider a
70-year-old male with a 100 pack-year history of smoking
and a 42-year-old nonsmoking female who present with microscopic hematuria. The pretest probability, ie the probability of disease before any specific testing, of cancer in the
70-year-old male smoker is much higher than the probability of bladder cancer in the younger woman without risk
factors. Information about pretest probabilities can be derived from various sources, including the published literature on patients with similar symptom presentation, the
institutional registry or personal clinical experience and
intuition. Another source of information to estimate pretest
probabilities can come from the same studies that provide
data on diagnostic tests. For example, in the study by Grossman et al approximately 6% of all patients (79 of 1,331) were
ultimately found to have bladder cancer.2
The next step is to decide how the results of the NMP22
test change this pretest probability estimate. In other words,
urologists should be interested in the characteristic of the
test that indicates the direction and magnitude of change in
the probability of disease. This test characteristic is best
captured by a measure called the likelihood ratio (Appendix
2). LR is the characteristic of the test that links the pretest
probability to the probability of the target condition after
obtaining the test result, also called the posttest probability.
What are the LRs Associated With the Test Results?
LR+ is defined as the likelihood of a positive test in individuals with the disease compared to the likelihood of a positive
test in those without the disease. As an example, consider
the results of the study by Grossman et al (table 1).2 There
were 79 patients who were found to have bladder cancer and
1,252 in whom bladder cancer was ruled out. How likely is a
positive NMP22 test in patients with bladder cancer? Table
1 shows that 44 of 79 patients (55.7%) with bladder cancer
had a positive NMP22 test. Conversely how likely is a positive NMP22 in patients without bladder cancer? The investigators found a positive NMP22 test in 179 of 1,252 patients
(14.3%) without bladder cancer. The ratio of these 2 proportions equals 3.9, which is the LR for a positive NMP22 test.
In other words, a positive NMP22 test is 3.9 times more
likely to occur in patients with than without bladder cancer.
Note that the first part of the ratio (the probability of a
positive test in patients with the condition) is sensitivity.
Similarly LR− is defined as the likelihood of a negative test
in patients with the condition compared to the likelihood of
a negative test in patients without the condition. In the case
of the NMP22 test LR− = (35/79)/(1,073/1,252) or 0.52.
TABLE 1. Evaluation of diagnostic test for bladder cancer with
point of care NMP22 proteomic assay2
                     No. Cystoscopy (reference test)
NMP22 Result         Bladder Ca        No Bladder Ca
Pos                      44                 179
Neg                      35               1,073
Totals                   79               1,252
Sensitivity = 44/79 = 55.7%, specificity = 1,073/1,252 = 85.7%, LR+ =
(44/79)/(179/1,252) = 3.90 and LR− = (35/79)/(1,073/1,252) = 0.52.
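The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration using the counts reported in table 1. The function name likelihood_ratios and its argument names (tp, fn, fp, tn) are our own labels and do not come from the study.

```python
def likelihood_ratios(tp, fn, fp, tn):
    """Return sensitivity, specificity, LR+ and LR- from 2 x 2 table counts.

    tp/fn: positive/negative index tests among patients WITH the disease.
    fp/tn: positive/negative index tests among patients WITHOUT the disease.
    """
    diseased = tp + fn
    not_diseased = fp + tn
    sensitivity = tp / diseased
    specificity = tn / not_diseased
    # LR+ = P(positive test | disease) / P(positive test | no disease)
    lr_pos = (tp / diseased) / (fp / not_diseased)
    # LR- = P(negative test | disease) / P(negative test | no disease)
    lr_neg = (fn / diseased) / (tn / not_diseased)
    return sensitivity, specificity, lr_pos, lr_neg


# Counts taken from table 1 (NMP22 vs cystoscopy).
sens, spec, lr_pos, lr_neg = likelihood_ratios(tp=44, fn=35, fp=179, tn=1073)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

Writing each LR as a ratio of 2 proportions, rather than through sensitivity and specificity, keeps the code close to the verbal definitions given above.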
In this study the LRs for the NMP22 test are not presented but Grossman et al provided the data necessary to
calculate them.2 This emphasizes the importance of understanding the definitions of LR+ and LR−, so that these
measures can be derived from the information provided in a
2 × 2 table. Alternatively if not enough information is provided to construct a 2 × 2 table for the test results, as is the
case for urine cytology in this article, LRs can be calculated
from sensitivity and specificity values (Appendix 2). LR+ is
defined as (sensitivity)/(1 − specificity) and for urine cytology it is calculated as 0.158/(1 − 0.992) or 19.8. LR− is (1 −
sensitivity)/(specificity) and for cytology it is calculated as
(1 − 0.158)/0.992 or 0.85.
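When only sensitivity and specificity are reported, the same conversion can be sketched as follows. The helper name lr_from_sens_spec is hypothetical. The inputs are the cytology values quoted above.

```python
def lr_from_sens_spec(sensitivity, specificity):
    """Derive LR+ and LR- from sensitivity and specificity (both as fractions)."""
    if not (0 <= sensitivity <= 1 and 0 < specificity < 1):
        # specificity of exactly 0 or 1 would force a division by zero below
        raise ValueError("sensitivity must be in [0, 1] and specificity in (0, 1)")
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg


# Urine cytology as reported in the study: 15.8% sensitivity, 99.2% specificity.
lr_pos, lr_neg = lr_from_sens_spec(0.158, 0.992)
print(f"cytology LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

The exact quotient for LR+ is 19.75, which corresponds to the 19.8 quoted in the text. Because LR+ divides by (1 − specificity), small changes in a specificity close to 100% produce large swings in LR+.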
Now that we have learned how to calculate LR, how do we
apply LRs in clinical practice? LR indicates by what factor
the result of a given diagnostic test would increase or decrease the pretest probability of a condition. If a test has an
LR of 1, the posttest probability of the target condition is
exactly the same as the pretest probability, ie the test is
useless. If LR is greater than 1, this results in a posttest
probability that is higher than the pretest probability. Conversely if LR is less than 1, this results in a posttest probability that is lower than the pretest probability. The size of
this effect is related to the size of the LR. 1) LRs greater than
10 or less than 0.1 create large and often conclusive changes
from pretest to posttest probability. 2) LRs of 5 to 10 and 0.1
to 0.2 create moderate changes in pretest to posttest probability. 3) LRs of 2 to 5 and 0.2 to 0.5 generate small changes
in probability. 4) LRs of 1 to 2 and 0.5 to 1 are unlikely to
alter probability to a clinically significant degree.4
The importance of the effect of the LR, ie whether the test
results in a clinically meaningful change, is not simply determined by how large or small an LR is. It is also determined by the pretest probability.
Having determined the LRs, how do we use them to link
the pretest probability to the posttest probability? One approach, although tedious, is to convert the pretest probability
to the pretest odds (odds = probability/[1 − probability]), which
is then multiplied by the LR to obtain the posttest odds.4 We
then need to convert the posttest odds back to a posttest probability (probability = odds/[1 + odds]). An easier approach is to
use a Fagan nomogram, which allows you to read off the
posttest probability by aligning pretest probability and LR for
a given clinical scenario (see figure).28 An interactive version of
the Fagan nomogram28 is available at the CEBM (Centre for
Evidence-Based Medicine) website (http://www.cebm.net).
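For readers who prefer to see the "tedious" route written out, the 3 conversions can be sketched in Python. The function name posttest_probability is our own. As an illustrative assumption only, the example uses the overall study prevalence of about 6% as the pretest probability, together with the LR+ of 3.9 calculated above.

```python
def posttest_probability(pretest_probability, likelihood_ratio):
    """Convert a pretest probability to a posttest probability using an LR."""
    if not 0 <= pretest_probability < 1:
        raise ValueError("pretest probability must be in [0, 1)")
    # Step 1: probability -> odds
    pretest_odds = pretest_probability / (1 - pretest_probability)
    # Step 2: multiply the pretest odds by the likelihood ratio
    posttest_odds = pretest_odds * likelihood_ratio
    # Step 3: odds -> probability
    return posttest_odds / (1 + posttest_odds)


# Assumed pretest probability of 6% (study prevalence) and a positive NMP22 test.
p = posttest_probability(0.06, 3.9)
print(f"posttest probability = {p:.1%}")
```

An LR of 1 leaves the probability unchanged in this calculation, which matches the statement that such a test is useless.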
Figure. Nomogram for converting pretest to posttest probability using LR.28
Instructions for use: identify pretest probability of disease on left
axis, then identify LR of diagnostic test on middle axis. Connect
these 2 points with straight line and continue this line to right
(posttest) axis. Point where line crosses posttest axis is estimated
posttest probability. Reprinted with permission from Massachusetts
Medical Society. Copyright © 1975 Massachusetts Medical Society.
All rights reserved.
Let us return to the clinical example of the 70-year-old
male with a smoking history and gross painless hematuria.
While we may not know the exact pretest probability of
bladder cancer in this patient, we can make a plausible
estimate of 30% based on clinical experience. Alternatively if
data or clinical experience does not permit a specific estimate, we can identify an upper and lower limit to our plausible pretest probability, for example 10% to 50%, and then
use these values to generate upper and lower limits of posttest probability. Using our pretest probability of 30% and
given a positive NMP22 test with an LR+ of 3.9, the Fagan
nomogram yields a posttest probability of approximately
65%. In comparison, if cytology was positive (LR+ = 19.8) in
the same patient, we would arrive at a posttest probability of
approximately 90%. It is noteworthy in this context that the
high LR+ for cytology derives from the high specificity of the
test, rather than from its sensitivity. This example demonstrates that cytology is a better diagnostic test than NMP22
for increasing the probability of bladder cancer, as reflected
by its higher LR+.
Alternatively we could ask whether a negative NMP22 or
cytology result would help us lower the probability of bladder
cancer. Again, assuming a pretest probability of 30%, and an
LR− of 0.52 and 0.85 for NMP22 and cytology we arrive at
posttest probabilities of approximately 18% and 27%, respectively. Comparing pretest and posttest probabilities, we note
that a negative NMP22 test resulted in a greater change than
cytology, thereby indicating that it is a better test for decreasing the suspicion of bladder cancer. However, neither of the 2
tests has altered the probability of bladder cancer dramatically
enough that we are likely to be comfortable assuring the patient that he does not have bladder cancer.
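The positive and negative test scenarios, including the suggested lower and upper limits of pretest probability (10% and 50%), can be tabulated with a short script. This is a sketch under the assumptions stated in the text (pretest probability of 30% with limits of 10% and 50%, and the LRs calculated earlier). The dictionary keys and variable names are illustrative labels only.

```python
def posttest_probability(pretest_probability, likelihood_ratio):
    """Probability -> odds, multiply by LR, then odds -> probability."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)


# LRs calculated in the text for each test result.
likelihood_ratio_by_result = {
    "NMP22 positive": 3.9,
    "cytology positive": 19.8,
    "NMP22 negative": 0.52,
    "cytology negative": 0.85,
}
# Lower limit, point estimate and upper limit of pretest probability.
pretest_probabilities = (0.10, 0.30, 0.50)

results = {}
for result, lr in likelihood_ratio_by_result.items():
    for pretest in pretest_probabilities:
        results[(result, pretest)] = posttest_probability(pretest, lr)
    row = ", ".join(
        f"{pretest:.0%} -> {results[(result, pretest)]:.0%}"
        for pretest in pretest_probabilities
    )
    print(f"{result:18s} {row}")
```

Exact arithmetic gives values slightly different from those read off the nomogram, for example about 63% rather than approximately 65% for a positive NMP22 test at a pretest probability of 30%. The difference reflects the limited precision of reading a printed scale.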
There are other measures in addition to LR to characterize the properties of a diagnostic test. Many urologists are
familiar with the terms sensitivity and specificity. Sensitivity is defined as the proportion of individuals with disease in
whom the test is positive and specificity is defined as the
proportion of individuals without the disease in whom the test
is negative. These values are often presented in the context of
a 2 × 2 table (Appendix 2). Grossman et al did not present a 2
× 2 table of the results of the NMP22 test compared to the
reference standard, but rather provided sufficient data to construct
one (table 1).2 For the NMP22 test sensitivity is calculated
as 44/79 or 55.7%. Specificity for the NMP22 point of care
assay is 1,073/1,252 or 85.7%. The investigators also provided these measures for cytology (15.8% sensitivity and
99.2% specificity) but they reported insufficient data to reconstruct a 2 × 2 table for cytology.
While the sensitivity/specificity framework provides a
method for comparing diagnostic tests, it is not helpful to
clinicians who are most interested in the extent to which a positive or negative test alters the patient probability of disease. The reason is that going
from pretest probability to posttest probability, ie the
probability of disease in our patient after the test, using
sensitivity and specificity requires a calculation so long and complicated that
most physicians must guess at the posttest probability, and unfortunately the guess may frequently be wrong. LR provides
a more clinically relevant framework for understanding the
implications of a diagnostic test result in our patient. With
the Fagan nomogram physicians can accurately know the
posttest probability within seconds without having to guess.
TABLE 2. Effect of disease prevalence on diagnostic
test properties
                                    No. Disease
                                Present     Absent     Total No.
Disease prevalence = 5%:*
  Pos test                         40          95          135
  Neg test                         10         855          865
  Totals                           50         950        1,000
Disease prevalence = 20%:†
  Pos test                        160          80          240
  Neg test                         40         720          760
  Totals                          200         800        1,000
* LR+ = (40/50)/(95/950) = 8, sensitivity = 40/50 = 80% and PPV = 40/135 =
29.6%, and LR− = (10/50)/(855/950) = 0.22, specificity = 855/950 = 90%
and NPV = 855/865 = 98.8%.
† LR+ = (160/200)/(80/800) = 8, sensitivity = 160/200 = 80% and PPV =
160/240 = 67%, and LR− = (40/200)/(720/800) = 0.22, specificity = 720/800 =
90% and NPV = 720/760 = 94.7%.
Of note, neither LR, sensitivity nor specificity varies with
the prevalence of disease in the population (table 2). PPV
and NPV change with the prevalence of disease and, therefore, they are considered unreliable measures of test performance. For example, if a test is derived in the setting of a
urology clinic in a population with a higher pretest probability of bladder cancer and one wishes to apply it in another
clinical setting such as primary care, NPV and PPV cannot
be used. LRs are independent of prevalence and, therefore,
they are conserved across clinical settings. Table 2 shows
the hypothetical example of a diagnostic test for bladder cancer with 80% sensitivity and 90% specificity that is applied
in patient populations with different bladder cancer prevalence. Whereas LRs, sensitivity and specificity are constant
regardless of whether the disease prevalence is 5% or 20%,
PPV and NPV change with disease prevalence. The implication of this observation is that reports of PPV and NPV in
the evaluation of a diagnostic test have little value since
there is no assurance that the disease prevalence in your
setting would be similar to that in the study population.
Again, LRs are clinically more useful because they do not
vary with disease prevalence and, therefore, they may be
applied by clinicians with different patient populations, ie in
the evaluation of hematuria by primary care physicians vs
urologists. Another strength of LRs is that, unlike sensitivity and specificity, they may be applied to measures with
nondichotomous outcomes, as discussed in greater detail by
Jaeschke et al.4
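The dependence of PPV and NPV on prevalence can be checked numerically. The sketch below assumes the hypothetical test from table 2 (80% sensitivity and 90% specificity). The function name predictive_values is our own.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    # Expected fractions of the whole population in each cell of the 2 x 2 table.
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical test from table 2, applied at 5% and 20% prevalence.
for prevalence in (0.05, 0.20):
    ppv, npv = predictive_values(0.80, 0.90, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV = {ppv:.1%}, NPV = {npv:.1%}")
```

Sensitivity and specificity are inputs here and stay fixed, so the LRs derived from them are also fixed. Only the predictive values move when prevalence changes.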
These examples demonstrate the usefulness of LR for
comparing 2 diagnostic tests and for assisting our clinical
evaluation of patients by modifying pretest probabilities.
When evaluating our article about a diagnostic test by
Grossman et al,2 we first assessed the validity of the study
and now we have ascertained how to generate and use the
LR using the study results. The last step in evaluating our
article is to understand how the test is applicable to patient
care.
Will the Results Help
Me in Caring for My Patient?
Will the reproducibility of the test result and its interpretation be satisfactory in my setting? The value of a
diagnostic test often depends on its reproducibility when
applied to a patient, which refers to whether the test yields the
same result when the patient's disease state has not changed. Test
reproducibility can be affected by various factors, including
features of the test itself. For example, if degradation of the
monoclonal antibody used for the NMP22 test occurs in
storage, the test may not perform reproducibly with time.
Other tests may lack reproducibility due to varying interpretations by subjective observers. For example, different
pathologists viewing cytology samples may vary in their
interpretations of abnormal-appearing cells. If a study describes a test as being highly reproducible, this may be
because the test is simple and easy to use, or because the
investigators are highly skilled at applying the test. In the
latter case the test may not perform as well when you apply
it in your practice setting. In the case of the NMP22 test the
detailed test description suggests that it may be used with
minimal training and it does not rely on subjective interpretation. Thus, it is likely to be reproducible in your setting.
Are the results applicable to my patient? Another important issue to consider is how similar the patients in your
practice are to the patients in the study population used to
evaluate the diagnostic test. As reported previously, spectrum bias may cause a diagnostic test to perform differently
in patient populations with different degrees of disease severity.18 Therefore, reviewing the practice setting and the
types of patients studied is important. In our example patients were recruited from academic, veteran and private
practice facilities but detailed eligibility criteria were not
reported.2 Indications for testing were urinary symptoms,
such as frequency or hematuria, or risk factors for bladder
cancer, such as smoking. While Grossman et al reported
patient demographics, we have little information on the
clinical characteristics of these patients, such as the proportion with a smoking history or with urinary frequency. No
distinction was made between patients presenting with
gross and microscopic hematuria. Therefore, our ability to
generalize these study findings is hindered by the lack of a
detailed description of the study population.
Will the results change my management? After we have
decided that the results are applicable to our patient, we
must decide whether they would change our management.
Before ordering any type of diagnostic test it is helpful to
know what probabilities we are willing to accept to confirm
or refute the target diagnosis. For example, consider our
evaluation of a patient with hematuria. In the presence of
negative office cystoscopy and negative cytology the probability of bladder cancer is low and no further testing is
indicated unless prompted by new symptoms. In other words
the probability of disease is below the testing threshold. On
the other hand, if a patient has negative office cystoscopy
but positive cytology, the probability of bladder cancer is
much higher and these patients typically undergo further
testing, eg bladder biopsy. In this case the probability of
disease is between the testing and treatment thresholds.
Finally, assume that our patient has a history of bladder CIS
treated with BCG. With negative office cystoscopy but positive cytology one of several treatment alternatives would be to proceed with a second course of BCG therapy
without further testing because the probability of disease is above
the treatment threshold. These treatment
and test thresholds may vary among patients and disease
states because they are a matter of clinical judgment and
patient values. Considerations that play into the decision
making process include the danger of an undiagnosed or
untreated condition, the potential risks of therapy, eg BCG
vs radical cystectomy, and the overall medical condition of
an individual patient.
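The three situations described above amount to comparing the estimated probability of disease with 2 thresholds. The following minimal Python sketch illustrates that comparison. The function name `next_step` and the numerical thresholds (5% and 80%) are hypothetical and chosen only for illustration, since no numerical values are assigned to either threshold here and both depend on clinical judgment and patient values.

```python
def next_step(probability: float,
              testing_threshold: float,
              treatment_threshold: float) -> str:
    """Place a probability of disease relative to the two thresholds.

    All arguments are proportions between 0 and 1. The threshold values
    are supplied by the caller; they are a matter of clinical judgment.
    """
    if not 0.0 <= testing_threshold <= treatment_threshold <= 1.0:
        raise ValueError("need 0 <= testing <= treatment threshold <= 1")
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")

    if probability < testing_threshold:
        # Disease is unlikely enough that no further testing is indicated.
        return "no further testing"
    if probability > treatment_threshold:
        # Disease is likely enough to treat without further testing.
        return "treat without further testing"
    # Between the thresholds: more diagnostic information is needed.
    return "further testing"


# Hypothetical thresholds, for illustration only.
TESTING_THRESHOLD = 0.05
TREATMENT_THRESHOLD = 0.80

for p in (0.02, 0.40, 0.90):
    print(f"{p:.0%}: {next_step(p, TESTING_THRESHOLD, TREATMENT_THRESHOLD)}")
```

In practice the 2 thresholds would be set for each patient and disease state rather than fixed as constants.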
Ideally what we hope to achieve when performing a diagnostic test is to increase the probability of disease in that
patient beyond the treatment threshold or lower it to a level
at which we are comfortable assuming that the patient does
not have the target condition, eg bladder cancer. In our
example recall that LR+ for the NMP22 test was 3.9, which
will have a relatively small effect on the conversion from
pretest to posttest probability (table 3). On the other hand,
LR+ for cytology was 19.8, which causes large changes from
pretest to posttest probability. Table 3 provides an overview
of the posttest probabilities that result from applying the calculated LR+ and LR− for NMP22 and cytology to different
pretest probabilities. This comparison shows that the NMP22
test produces relatively small changes from pretest to posttest
probability, indicating that it may have little impact
on how we would treat our patient. Finally, just as we prefer to base assessments of the magnitude of benefit of clinical treatments on more than 1 well done study,
diagnostic test characteristics would ideally be determined
based on more than a single investigation.
TABLE 3. LR effect on pretest probability
               % Pretest Probability    % Posttest Probability
LR = 19.8:
               80 (high)                99
               50                       95
               30                       89
               10 (low)                 69
LR = 3.9:
               80 (high)                94
               50                       80
               30                       63
               10 (low)                 30
LR = 0.85:
               80 (high)                77
               50                       46
               30                       27
               10 (low)                  9
LR = 0.52:
               80 (high)                68
               50                       34
               30                       18
               10 (low)                  5
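The conversion from pretest to posttest probability summarized in table 3 can be sketched in a few lines of Python. The sketch uses the standard odds form of Bayes' theorem (posttest odds = pretest odds × LR), and the function name `posttest_probability` is our own choice for illustration. The LRs and pretest probabilities are those listed in table 3.

```python
def posttest_probability(pretest_probability: float,
                         likelihood_ratio: float) -> float:
    """Return the posttest probability (0 to 1) for a given LR.

    Steps: probability -> odds, multiply by the LR, odds -> probability.
    """
    if not 0.0 < pretest_probability < 1.0:
        raise ValueError("pretest probability must be strictly between 0 and 1")
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood ratio cannot be negative")

    pretest_odds = pretest_probability / (1.0 - pretest_probability)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1.0 + posttest_odds)


# LRs and pretest probabilities as listed in table 3.
for lr in (19.8, 3.9, 0.85, 0.52):
    for pretest in (0.80, 0.50, 0.30, 0.10):
        posttest = posttest_probability(pretest, lr)
        print(f"LR = {lr}: pretest {pretest:.0%} -> posttest {posttest:.0%}")
```

Rounded to whole percentages, these calculations reproduce the posttest probabilities in table 3. The further an LR lies from 1, the larger the shift from pretest to posttest probability.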
Will patients be better off as a result of the test? A
diagnostic test ultimately should be judged by whether it
adds information beyond what is otherwise available and
whether this information leads to a change in management
that is beneficial to the patient.5 When a diagnostic test
carries a relatively low risk, the target condition is serious
and treatment is effective, the value of the test is clear.
Other tests may be accurate but may not have a significant
impact on patient outcome.
RESOLUTION OF CLINICAL SCENARIO
Having critically appraised an original research study that
you found on the role of the NMP22 test in the diagnostic
evaluation of patients at increased risk for bladder cancer, you
must now decide whether to apply it in the care of future
patients. You conclude that the study was well done, yielding
results that are likely valid. However, uncertainty remains as
to whether NMP22 testing would change the management of
your cases. Therefore, based on the understanding of the results that you gained through the critical appraisal process you
decide not to change your established practice at this time. You
further recognize that the decision to adopt a new diagnostic
test should ideally be based on several well designed studies in
different patient populations that also consider the usefulness
of the test and the practicalities of its implementation, including its costs and ease of use.
CONCLUSIONS
We have outlined a simple approach to an evidence-based
practice for using diagnostic tests. To evaluate a study of
diagnostic test accuracy we apply 3 simple questions. 1) Are
the results of the study valid? 2) What are the results?
3) Will the results help me in caring for my patients? Applying this framework will further our understanding of the
literature and enhance patient care through an evidence-based approach to diagnosis.
ACKNOWLEDGMENTS
The concepts presented were taken in part from Users’
Guide to the Medical Literature.3
Abbreviations and Acronyms
BCG = bacillus Calmette-Guerin
CIS = carcinoma in situ
LR = likelihood ratio
LR− = LR of a negative test
LR+ = LR of a positive test
NPV = negative predictive value
PPV = positive predictive value
APPENDIX 1
Guidelines for Evaluating an Article About a Diagnostic Test.4,5
Are the Results of the Study Valid?
Primary Guides
  Was there an independent, blind comparison with a reference standard?
  Did the patient sample include an appropriate spectrum of patients to whom the diagnostic test will be applied in clinical practice?
Secondary Guides
  Did the results of the test being evaluated influence the decision to perform the reference standard?
  Were the methods for performing the test described in sufficient detail to permit replication?
What are the Results?
  Are LRs for the test results presented or data necessary for their calculation provided?
Will the Results Help Me in Caring for My Patients?
  Will the reproducibility of the test results and its interpretation be satisfactory in my setting?
  Are the results applicable to my patient?
  Will the results change my management?
  Will patients be better off as a result of the test?
APPENDIX 2
Diagnostic Test Properties
                    Reference Standard
                    Reference Positive    Reference Negative    Total
Test is positive    True positive (a)     False positive (b)    a + b
Test is negative    False negative (c)    True negative (d)     c + d
Total               a + c                 b + d                 a + b + c + d
Terms and Calculations
Sensitivity = a/(a + c)
  Question answered: What proportion of patients with disease has a positive test?
Specificity = d/(b + d)
  Question answered: What proportion of patients without disease has a negative test?
PPV = a/(a + b)
  Question answered: What proportion of positive tests is correct?
NPV = d/(c + d)
  Question answered: What proportion of negative tests is correct?
LR+ = [a/(a + c)]/[b/(b + d)]
  Or LR+ = sensitivity/(1 − specificity) = probability of a positive test in diseased people/probability of a positive test in nondiseased people
  Question answered: How much will a positive test change my pretest probability of disease?
LR− = [c/(a + c)]/[d/(b + d)]
  Or LR− = (1 − sensitivity)/specificity = probability of a negative test in diseased people/probability of a negative test in nondiseased people
  Question answered: How much will a negative test change my pretest probability of disease?
PPV and NPV vary with disease prevalence in the study population. Therefore, these properties should not be applied to compare test performance between studies or for individual patients (table 2).
REFERENCES
1. Krupski TL, Dahm P, Fesperman SF and Schardt CM: User’s guide to the urological literature: how to perform a literature search. J Urol 2008; 179: 1264.
2. Grossman HB, Messing E, Soloway M, Tomera K, Katz G, Berger Y et al: Detection of bladder cancer using a point-of-care proteomic assay. JAMA 2005; 293: 810.
3. Guyatt G and Rennie D: Users’ Guide to the Medical Literature, 4th ed. Chicago: AMA Press 2002.
4. Jaeschke R, Guyatt GH and Sackett DL: Users’ guides to the medical literature. III. How to use an article about a diagnostic test. B. What are the results and will they help me in caring for my patients? The Evidence-Based Medicine Working Group. JAMA 1994; 271: 703.
5. Guyatt GH, Tugwell PX, Feeny DH, Haynes RB and Drummond M: A framework for clinical evaluation of diagnostic technologies. CMAJ 1986; 134: 587.
6. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM et al: Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD Initiative. Ann Intern Med 2003; 138: 40.
7. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM et al: The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Ann Intern Med 2003; 138: W1.
8. Lijmer JG, Mol BW, Heisterkamp S, Bonsel GJ, Prins MH, van der Meulen JH et al: Empirical evidence of design-related bias in studies of diagnostic tests. JAMA 1999; 282: 1061.
9. Detrano R, Gianrossi R and Froelicher V: The diagnostic accuracy of the exercise electrocardiogram: a meta-analysis of 22 years of research. Prog Cardiovasc Dis 1989; 32: 173.
10. Stein PD, Gottschalk A, Henry JW and Shivkumar K: Stratification of patients according to prior cardiopulmonary disease and probability assessment based on the number of mismatched segmental equivalent perfusion defects. Approaches to strengthen the diagnostic value of ventilation/perfusion lung scans in acute pulmonary embolism. Chest 1993; 104: 1461.
11. Fletcher RH: Carcinoembryonic antigen. Ann Intern Med 1986; 104: 66.
12. Harris JM Jr: The hazards of bedside Bayes. JAMA 1981; 246: 2602.
13. Hlatky MA, Pryor DB, Harrell FE Jr, Califf RM, Mark DB and Rosati RA: Factors affecting sensitivity and specificity of exercise electrocardiography. Multivariable analysis. Am J Med 1984; 77: 64.
14. Lachs MS, Nachamkin I, Edelstein PH, Goldman J, Feinstein AR and Schwartz JS: Spectrum bias in the evaluation of diagnostic tests: lessons from the rapid dipstick test for urinary tract infection. Ann Intern Med 1992; 117: 135.
15. Moons KG, van Es GA, Deckers JW, Habbema JD and Grobbee DE: Limitations of sensitivity, specificity, likelihood ratio, and Bayes’ theorem in assessing diagnostic probabilities: a clinical example. Epidemiology 1997; 8: 12.
16. O’Connor PW, Tansay CM, Detsky AS, Mushlin AI and Kucharczyk W: The effect of spectrum bias on the utility of magnetic resonance imaging and evoked potentials in the diagnosis of suspected multiple sclerosis. Neurology 1996; 47: 140.
17. Philbrick JT, Horwitz RI, Feinstein AR, Langou RA and Chandler JP: The limited spectrum of patients studied in exercise test research. Analyzing the tip of the iceberg. JAMA 1982; 248: 2467.
18. Ransohoff DF and Feinstein AR: Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med 1978; 299: 926.
19. Begg CB and Greenes RA: Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics 1983; 39: 207.
20. Cecil MP, Kosinski AS, Jones MT, Taylor A, Alazraki NP, Pettigrew RI et al: The importance of work-up (verification) bias correction in assessing the accuracy of SPECT thallium-201 testing for the diagnosis of coronary artery disease. J Clin Epidemiol 1996; 49: 735.
21. Choi BC: Sensitivity and specificity of a single diagnostic test in the presence of work-up bias. J Clin Epidemiol 1992; 45: 581.
22. Diamond GA: Off Bayes: effect of verification bias on posterior probabilities calculated using Bayes’ theorem. Med Decis Making 1992; 12: 22.
23. Diamond GA, Rozanski A, Forrester JS, Morris D, Pollock BH, Staniloff HM et al: A model for assessing the sensitivity and specificity of tests subject to selection bias. Application to exercise radionuclide ventriculography for diagnosis of coronary artery disease. J Chronic Dis 1986; 39: 343.
24. Greenes RA and Begg CB: Assessment of diagnostic technologies. Methodology for unbiased estimation from samples of selectively verified patients. Invest Radiol 1985; 20: 751.
25. Zhou XH: Effect of verification bias on positive and negative predictive values. Stat Med 1994; 13: 1737.
26. Philbrick JT, Horwitz RI and Feinstein AR: Methodologic problems of exercise testing for coronary artery disease: groups, analysis and bias. Am J Cardiol 1980; 46: 807.
27. Jaeschke R, Guyatt G and Sackett DL: Users’ guides to the medical literature. III. How to use an article about a diagnostic test. A. Are the results of the study valid? Evidence-Based Medicine Working Group. JAMA 1994; 271: 389.
28. Fagan TJ: Letter to the editor: a nomogram for applying likelihood ratios. N Engl J Med 1975; 293: 257.