Construction and Interpretation of ‘Predictiveness’ Diagrams
Statistical methods for diagnostic test evaluation are catching up with those for therapeutic
trials. Margaret Pepe and her group have contributed the predictiveness diagram (1). In what
follows, the construction and use of this diagram are explained in terms as simple as possible. This
includes the special steps that must be taken when the empirical data come from separate samples of
the diseased and the non-diseased (‘two-gate’ studies). Towards the end some more advanced topics
will be mentioned.
The clinical situation envisaged is that of a stream of cases (or screenees) that may or
may not be suffering from disease D. One wants to characterize the diagnostic power of a
quantitative diagnostic test or a diagnostic score calculated from several clinical variables. The
variable is taken to be continuous; if it is not, a few, fairly obvious, minor modifications are needed
in the present text to allow the user to handle tied observations.
The square region in the plane within which both coordinates (x, y) are between zero
and 1.00 is called the unit square. The predictiveness diagram uses the unit square to represent the
entire population of persons that undergo testing (its area = 1 = 100%). The predictiveness curve is
based on sorting the members of the population by their ‘post-test’ disease risks: given a point on
the curve with ordinate y = a posttest risk, the associated x shows the fraction or percentage of the
population whose risks are less than that; so x is a cumulative fraction, based on increasing risks.
One may note that the diagram thus shows the cumulative distribution function (cdf)
of the posttest risk with the axes interchanged relative to usual plotting.
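By way of illustration, here is a minimal sketch (in Python, with made-up posttest risks) of the construction just described: sort the subjects by posttest risk and plot each at its cumulative fraction.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up posttest risks, one per tested subject.
risks = np.array([0.02, 0.05, 0.08, 0.10, 0.15, 0.30, 0.45, 0.60, 0.80])

risks_sorted = np.sort(risks)            # sort subjects by posttest risk
n = len(risks_sorted)
x = (np.arange(n) + 0.5) / n             # cumulative fraction (mid-slot positions)

plt.step(x, risks_sorted, where="mid")   # ordinate y = posttest risk
plt.xlabel("cumulative fraction of population, sorted by risk")
plt.ylabel("posttest risk of disease")
plt.ylim(0, 1)
plt.show()
```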
[Figure: predictiveness curve in the unit square; the curve and the vertical line through the cut point divide the square into four regions: true negatives 70.4%, false positives 9.6%, true positives 18.4%, false negatives 1.6%.]
Figure. Interpretation of predictiveness diagrams.
The pretest risk of disease here equals 0.2. First ignore the dashed lines.
The mark on the vertical dividing line shows that when those who have posttest {posterior} risks > 0.15 are
selected for treatment (regarded as ‘positive’), then 72% will be negative, 28% positive. The areas indicate
the resulting percentage true-positive, etc. Sundry calculations: The sensitivity becomes 0.184/0.2 = 92%,
specificity = 0.704/0.8 = 88%. The predictive value of a positive result (PVpos) is always > 0.15 (on the
average it amounts to 0.184/0.28 = 66%). And PVneg is always > 85% (average PVneg = 0.704/0.72 =
97.8%).
Dashed reference curves: a completely uninformative test invariably reproduces the pretest {prior} risk
(horizontal line); a perfectly discriminating test (all subjects diagnosable without error) would produce the
step curve.
Suppose the curve passes through (x, y) = (0.72, 0.15), as in the figure. If you choose a
posttest risk threshold of 0.15, then the diagram tells you that 72% of those tested have a risk below
0.15; the remaining 28% will have higher risks. Conversely, if you can afford to treat only 28% of
the population, the diagram shows that you should treat those with a risk above 0.15; the 72% you
cannot afford to treat will have smaller risks – though some may clearly have risk levels pretty close
to 0.15.
The predictiveness curve runs across the unit square from a left-most point (x = 0, y =
smallest risk in population) and reaches the right-hand wall at (x = 1, y = largest risk). It thereby
divides the unit square into an upper and a lower region, the areas of which have a simple
interpretation: The area of the region below the predictiveness curve is the mean ordinate value and
thus portrays the mean posttest probability of disease. The area therefore necessarily equals the
corresponding pretest probability (by virtue of the general theorem: mean of conditional mean =
unconditional mean). Similarly, the area of the region above the curve is the pretest probability of
non-disease (which is 80% in the figure). In fact, by virtue of the way the abscissa, x, is defined, it
is ensured that each member of the population contributes equally to the total area; so the area of
any subregion represents a fraction of the population – and hence a probability statement
concerning a randomly chosen member thereof. In particular, the probabilities that the next case
will be true (false) positive (negative) are visualized: the four regions demarcated by the curve and
the vertical dividing line through the cut point portray these four probabilities; see figure. E.g., the
region below the curve and right of the cut has an area of 0.184, which is the probability that the
next case will be true positive. (As just mentioned, the region below the curve represents the
diseased, and the vertical dividing line splits that region into a false negative contingent and a true
positive contingent, the size of each contingent being the area of the region concerned.)
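Assuming the unadjusted case in which each of n subjects occupies a slot of width 1/n, the four areas reduce to simple averages of the posttest risks; a sketch with hypothetical risks:

```python
import numpy as np

risks = np.array([0.02, 0.04, 0.07, 0.10, 0.14, 0.20, 0.33, 0.48, 0.62, 0.80])
c = 0.15                                   # posttest risk cut point
n = len(risks)

tp = risks[risks > c].sum() / n            # below curve, right of cut: true positives
fn = risks[risks <= c].sum() / n           # below curve, left of cut: false negatives
fp = (1 - risks[risks > c]).sum() / n      # above curve, right of cut: false positives
tn = (1 - risks[risks <= c]).sum() / n     # above curve, left of cut: true negatives

assert np.isclose(tp + fn, risks.mean())   # area below curve = mean posttest risk
assert np.isclose(tp + fn + fp + tn, 1.0)  # the four regions tile the unit square
```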
The legend to the figure gives the resulting sensitivity and specificity. As regards
predictive values, note that the posttest risk is itself the pertinent predictive value, conditional as it
is on the exact test result (or diagnostic score) obtained in any given case. However, for what they
are worth, averaged predictive values based on the chosen dichotomization can also be calculated as
indicated in the legend.
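For concreteness, the legend's sundry calculations can be transcribed directly from the four areas in the figure:

```python
# The four areas from the figure.
tp, fp, tn, fn = 0.184, 0.096, 0.704, 0.016

pretest     = tp + fn            # 0.20: area below the curve
sensitivity = tp / (tp + fn)     # 0.184/0.20  = 0.92
specificity = tn / (tn + fp)     # 0.704/0.80  = 0.88
pv_pos      = tp / (tp + fp)     # 0.184/0.28 ≈ 0.66  (average PVpos)
pv_neg      = tn / (tn + fn)     # 0.704/0.72 ≈ 0.978 (average PVneg)
```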
Included in the figure are, for comparison, the predictiveness curves of an
uninformative test and a perfect test, both based on the pretest risk of 0.2. (Theoretically, any
non-decreasing curve with area below it equal to 0.2 might arise.)
For other aspects of predictiveness diagrams, see the original article by Pepe et al.
Predictiveness curves based on specified pretest prevalences
Suppose your dataset comes from a ‘two-gate’ study (separate-sample design), and the
disease prevalence in real clinical clienteles is different from the artificial (investigator-chosen)
composition of the dataset. It is obvious what happens when one applies a logistic model or other
statistical models to the given dataset. They produce, for each subject, a predicted posttest
probability, P, obtained by fitting the dataset as it is. The estimate therefore reflects an artificial
situation. To remedy that, one may proceed as follows. The critical assumption is, of course, that
the diseased and the non-diseased samples are each representative of the real populations of those
with and without the disease.
For a moment let us stick to the logistic model. Here, P is defined via the
corresponding log odds value, S:
P = exp(S) / (1 + exp(S)) , S = ln(P / (1 – P)) , (1)
P/(1 – P) = exp(S) being the posttest odds of disease. (Software normally allows the user to get hold
of each individual subject’s P or S, so that he does not have to write a separate program for that
purpose.)
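Equation (1) transcribed into code (the function names are illustrative only):

```python
import math

def p_from_s(s):
    """P = exp(S) / (1 + exp(S))."""
    return math.exp(s) / (1.0 + math.exp(s))

def s_from_p(p):
    """S = ln(P / (1 - P)), the log odds."""
    return math.log(p / (1.0 - p))

assert abs(p_from_s(s_from_p(0.15)) - 0.15) < 1e-12
```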
Suppose the dataset contains n patients with disease D and m non-D cases, so that the
(artificial) pretest odds equal n/m. Consider a new subject being tested. By the rule that (posttest
odds) = (pretest odds)∙L, where L stands for the likelihood ratio occasioned by the subject’s test
results, we know that L must satisfy the following equation:
P/(1 – P) = (n/m) ∙ L . (2a)
In passing, one may note that, if L > 1 (< 1), the test result speaks for D (against D). If L = 1, the
data have made D neither more nor less probable than it was a priori; incidentally, this corresponds
to the shoulder of the ROC where the local slope is 1.00 (45° angle).
Now suppose you know from separate sources that the pretest prevalence is p'. That
changes the subject’s posttest risk to P', the relation being now:
P'/(1 – P') = (p'/(1 – p')) ∙ L . (2b)
To determine what P' level the subject’s test result would imply were the pretest prevalence p', we
may therefore isolate the likelihood ratio L implicit in (2a) and insert it in (2b). The first step in
constructing a p'-based predictiveness diagram is therefore this: For each of the m+n records a
likelihood ratio L is calculated using the score S from the logistic model:
L = exp(S) / (n/m) . (2c)
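A direct transcription of (2c), again with illustrative names:

```python
import math

def likelihood_ratio(s, n_diseased, m_nondiseased):
    """L = exp(S) / (n/m): posttest odds divided by the artificial pretest odds."""
    return math.exp(s) / (n_diseased / m_nondiseased)

# S = 0 (posttest odds 1:1) in a balanced two-gate sample (n = m) gives L = 1,
# i.e. a result that speaks neither for nor against D.
assert likelihood_ratio(0.0, 100, 100) == 1.0
```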
Let q' = (1 – p') denote the pretest probability of a non-D. From (2bc) the adjusted posttest odds can
be calculated for each of the m+n records:
P'/(1 – P') = (p'/q') ∙ L = (p'/q') ∙ exp(S) / (n/m) , (3)
or, when solved for the adjusted risk of disease:
P' = p'L / (p'L + q') . (4)
This defines the subject’s ordinate in the prevalence-adjusted diagram we are constructing.
Incidentally, it all boils down to adding a fixed adjustment, ln((p'/q')/(n/m)), to the logistic S, as can
be seen by comparing the logarithm of (3) with (1); in other words, the ‘intercept’ term of the model
output changes by that amount.
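The following sketch implements (4) and checks the intercept-shift remark numerically; all inputs are hypothetical:

```python
import math

def adjusted_risk(L, p_prime):
    """P' = p'L / (p'L + q'), with q' = 1 - p' (equation 4)."""
    q_prime = 1.0 - p_prime
    return p_prime * L / (p_prime * L + q_prime)

# Check: adding ln((p'/q')/(n/m)) to S reproduces P' via equation (1).
n, m, p_prime = 50, 50, 0.2           # hypothetical two-gate sample, real prevalence 0.2
s = 1.0                               # some subject's logistic score
L = math.exp(s) / (n / m)             # equation (2c)
shift = math.log((p_prime / (1 - p_prime)) / (n / m))
p_via_shift = math.exp(s + shift) / (1 + math.exp(s + shift))
assert abs(adjusted_risk(L, p_prime) - p_via_shift) < 1e-12
```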
The same steps (2-4) are applicable with models other than the logistic; just ignore the
formulae that involve S.
The corresponding horizontal position in the adjusted diagram (call it x') is calculated
as the cumulative fraction with (a lower S value and hence) a lower P' value, adjusting individual
contributions so as to produce the chosen pretest prevalence. This simple operation looks
complicated when formally expressed: Along the horizontal axis, each non-D or D subject is
allotted a slot of width [q'/m] or [p'/n], respectively. Thus, the total allotments become m∙[q'/m] +
n∙[p'/n] = q' + p' = 1, as intended. Let A be the number of non-D cases with a lower value than the
current subject, and B the analogous number of cases of D. When the current subject’s diagnosis is
scored as D = 0 or 1, respectively,
x' = (A + (1 – D)/2)∙[q' / m] + (B + (D)/2)∙[p' / n] . (5)
The prevalence adjustment is brought about by the factors in square brackets (in an unadjusted,
hence probably not so useful, predictiveness diagram these factors would all be [1/(m+n)]). The
small shifts, (1 – D)/2 and (D)/2, simply serve to position a subject’s plotting mark midway
(centrally) in his/her slot.
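Expression (5) in code, together with the total-allotment check just mentioned (inputs hypothetical):

```python
def x_prime(A, B, D, m, n, p_prime):
    """Mid-slot abscissa per (5): (A + (1-D)/2)*[q'/m] + (B + D/2)*[p'/n]."""
    q_prime = 1.0 - p_prime
    return (A + (1 - D) / 2) * (q_prime / m) + (B + D / 2) * (p_prime / n)

# Total allotment check from the text: m*[q'/m] + n*[p'/n] = q' + p' = 1.
m, n, p_prime = 80, 20, 0.2
assert abs(m * ((1 - p_prime) / m) + n * (p_prime / n) - 1.0) < 1e-12

# A diseased subject (D = 1) with 10 lower-valued non-D and 3 lower-valued D cases:
print(x_prime(A=10, B=3, D=1, m=m, n=n, p_prime=p_prime))  # 0.135
```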
When m or n is small, an academically more correct step curve should be used. In
particular, the area rules above apply to the step-curve version. The current subject’s slot extends
from
(A)∙[q' / m] + (B)∙[p' / n] to (A + (1 – D))∙[q' / m] + (B + (D))∙[p' / n] ; (6)
cf. expression (5), which represents the mid-slot abscissa. At the left edge of the current subject’s
slot the ordinate jumps from the preceding P' to the current subject’s P', and at the right edge from
that P' to the next higher P' in the dataset. A step curve results.
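A sketch of the slot bookkeeping per expression (6); it assumes the adjusted risks P' and the 0/1 diagnoses are available as arrays:

```python
import numpy as np

def step_curve(p_adj, diagnosis, m, n, p_prime):
    """Slot edges and heights for records sorted by adjusted risk P'.

    p_adj    : array of adjusted posttest risks P', one per record
    diagnosis: array of 0/1 diagnoses (1 = D)
    """
    q_prime = 1.0 - p_prime
    order = np.argsort(p_adj)
    widths = np.where(diagnosis[order] == 1, p_prime / n, q_prime / m)
    x_right = np.cumsum(widths)   # right edge of each slot; the last edge is 1.0
    x_left = x_right - widths     # left edge of each slot, per expression (6)
    return x_left, x_right, p_adj[order]

# Quick check: with m = 2 non-D and n = 2 D records, the slot edges tile [0, 1].
xl, xr, y = step_curve(np.array([0.1, 0.3, 0.5, 0.8]),
                       np.array([0, 0, 1, 1]), m=2, n=2, p_prime=0.2)
assert np.isclose(xr[-1], 1.0)
```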
Tests and confidence limits
Much logistic software allows confidence limits for the individual S and hence for P
or P' to be computed. These limits may, of course, be added to the individual points. In principle, a
distinction should be made here between limits that take p' as known (random uncertainty in the
estimation of L, only) and the kind of limits that include estimation error in p'; in practice, the
difference will hardly be noticeable. Supplying confidence limits for a point on the curve as such
(say, around the posttest risk of 0.15 given abscissa 0.72 as in the figure, and vice versa) is a much
tougher question, possibly addressable by means of a bootstrapping procedure.
Returning to the individual, e.g., upper, confidence limits for estimated P' values, their
distribution may be inspected. In particular, when they are plotted the way the variable P' was
plotted, one gets what may be called a benefit-of-doubt predictiveness diagram. The interpretation
of a point (x', y) on this curve reads as follows: In consideration of the statistical uncertainty
inherent in concluding from a dataset of the limited size available today, one will have to treat an
estimated fraction (1 – x') of new cases if one wants to treat all patients who have a disease risk that
cannot at present confidently be declared to be < y. This will offer a benefit of doubt to those
patients whose test results are atypical and whose P' estimation is therefore less certain than in other
cases. (For the fraction (1 – x') itself no confidence warranty is given: we are inspecting the raw
distribution of individual upper confidence limits and cannot provide any ‘second-order’ confidence
statements, except possibly via some bootstrap route.)
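A minimal sketch of such a benefit-of-doubt plot, using made-up upper confidence limits and, for simplicity, unadjusted slots (a prevalence-adjusted version would reuse the slot widths of expression (5)):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical individual upper confidence limits for the estimated P' values.
upper_cl = np.array([0.04, 0.09, 0.12, 0.18, 0.22, 0.35, 0.55, 0.70, 0.90])

u = np.sort(upper_cl)                      # plot them the way P' itself was plotted
x = (np.arange(len(u)) + 0.5) / len(u)     # unadjusted slots, for illustration only

plt.step(x, u, where="mid")
plt.xlabel("cumulative fraction x'")
plt.ylabel("upper confidence limit for posttest risk P'")
plt.show()
# Reading off a point (x', y): treating everyone whose risk cannot confidently be
# declared < y means treating an estimated fraction (1 - x') of new cases.
```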
(1) Pepe MS, Feng Z, Huang Y, Longton G, Prentice R, Thompson IM et al. Integrating the
predictiveness of a marker with its performance as a classifier. Am J Epidemiol 2008;167:362-8.