Transcript

Slide 1
Sensitivity and Specificity Part II - Computations and Examples
Slide 2
This video is designed to accompany pages 81-94 of the workbook “Making Sense of Uncertainty:
Activities for Teaching Statistical Reasoning,” a publication of the Van-Griner Publishing Company
Slide 3
In this video we will practice the computation of false positive and false negative rates using real data.
Let’s start with the ImPACT test. ImPACT stands for “immediate Post-Concussion Assessment and
Cognitive Testing” and is designed to be a screening test for concussion that is somewhere between an
on-field diagnosis and an MRI. It takes about twenty-five minutes to complete and it measures
attention span, working memory, sustained and selective attention time, response variability, nonverbal problem solving, and reaction time. Subjects are given a total score from each of these
categories.
Just how well does ImPACT work as a screening test?
Slide 4
To answer this question we need two things. We need data on how the test categorized athletes. That
is, we need a study population so we can know how many athletes ImPACT said were concussed and
how many ImPACT said were not concussed. Secondly, we need some gold standard results (such as
from an MRI) so that we can know whether the athlete really was concussed or not.
Researchers publishing in the Archives of Clinical Neuropsychology had just that information for 138
athletes. Notice that of those 138 athletes, 66 really were not concussed and 72 were. In a somewhat
confusing coincidence notice that ImPACT said that 72 were not concussed and 66 were.
Of the 66 who were truly not concussed, ImPACT said 7 were concussed. So ImPACT had 7 false
positives, and hence the estimated false positive rate is 7 out of 66, or 0.11. That is 11% if you prefer
percentages.
Similarly, of the 72 who were truly concussed according to an MRI, ImPACT said 13 were not. So the
estimated false negative rate is 13 out of 72, or 0.18, or 18%.
It follows that the estimated specificity for ImPACT is 100% minus 11% or 89%. And the estimated
sensitivity is 100% minus 18% or 82%.
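If you want to double-check these fractions, the arithmetic is easy to script. Here is a minimal Python sketch (my own illustration; it is not part of the workbook or the study) that reproduces the ImPACT rates from the four counts above:

# Estimate screening-test rates from a 2x2 table of counts.
# The counts below are the ImPACT study figures quoted above.
def screening_rates(false_pos, truly_negative, false_neg, truly_positive):
    """Return (FPR, FNR, specificity, sensitivity) as proportions."""
    fpr = false_pos / truly_negative  # false positives among the truly negative
    fnr = false_neg / truly_positive  # false negatives among the truly positive
    return fpr, fnr, 1 - fpr, 1 - fnr

fpr, fnr, spec, sens = screening_rates(false_pos=7, truly_negative=66,
                                       false_neg=13, truly_positive=72)
print(f"FPR {fpr:.0%}, FNR {fnr:.0%}")                    # FPR 11%, FNR 18%
print(f"specificity {spec:.0%}, sensitivity {sens:.0%}")  # 89% and 82%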
Slide 5
Let’s look at another example.
The Beck Depression Inventory (BDI) is still commonly used, especially with students, to screen for
depression. The Inventory consists of 21 questions about how the subject has been feeling in the last
week. You see a typical question here where you have to assess how sad you feel, ranging from “I do
not feel sad” (which gets scored a 0) to “I am so sad or unhappy that I can’t stand it” (which gets scored
a 3).
All questions are scored like this one so the subject’s total score can range from 0 to 3 times 21 or 63.
High scores are indications of depression and different cutoffs are used for different purposes.
Slide 6
Let’s look at the results for 95 patients who took the BDI. The gold standard, or actual status of each
patient (depressed or not depressed), was determined by a long clinical interview based on the
Diagnostic and Statistical Manual of Mental Disorders IV. A cutoff of 10 was used in conjunction
with the BDI for determining whether a screened subject was, in fact, likely to be clinically depressed.
Notice that of the 95 patients participating, 78 were not truly clinically depressed, but 17 were. The
Beck Inventory said that 71 of those 95 were not depressed and 24 were.
More importantly, of the 78 who really were not depressed, the BDI said 12 were depressed. So the BDI
has an estimated false positive rate of 12 over 78, or 15%.
Similarly, of the 17 who really were depressed, the BDI said 5 were not. So the BDI has an estimated
false negative rate of 5 out of 17, or 29%.
It follows that the sensitivity is estimated at 71% and the specificity at 85%. Both are good, but not
spectacular.
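The same simple fractions can be checked in a couple of lines of Python (again my own sketch, not part of the source study):

# The 2x2 arithmetic applied to the BDI counts quoted above.
fpr = 12 / 78  # false positives among the 78 truly not depressed
fnr = 5 / 17   # false negatives among the 17 truly depressed
print(f"FPR {fpr:.0%}, FNR {fnr:.0%}")                          # FPR 15%, FNR 29%
print(f"specificity {1 - fpr:.0%}, sensitivity {1 - fnr:.0%}")  # 85% and 71%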
Slide 7
As you can see, the computation of false positive and false negative rates is really very straightforward,
especially when the data are already arrayed in a table for us. Let’s look at one more example where
we have to create the table.
There are three common testing procedures that are implemented during a field sobriety test (FST) at a
sobriety checkpoint: a test for horizontal gaze nystagmus (HGN), and two tests of balance and agility,
the one-legged stand (OLS) and the walk and turn (WAT).
A well-known national database archives data from 296 subjects who participated in the National
Highway Traffic Safety Administration’s 1998 San Diego field sobriety test reliability study. We can
use those data to assess how well the FST performs. What do we need at our disposal? We need to
know how the FST categorized subjects (drunk or not drunk). This will depend on what cutoff or
criterion is used. And we also need to know which subjects really were drunk and which weren’t. This
information was provided by the results of a blood test and is available in the data set as well.
Slide 8
Let’s start by assuming that getting a 4 or more on the FST will tag a participant as drunk.
For these data we also have to decide what will be taken to mean a participant is really drunk. Let us
assume that a blood alcohol content (BAC) of 0.04% or greater means a participant was legally drunk.
Of the 296 subjects from the San Diego study, 267 had a BAC of 0.04% or greater, so the data set has
267 who were, in truth, drunk, and 29 who were not.
Slide 9
Have a look at the table shown. Since we know 29 participants in the study really were not legally
drunk, and 267 were, we can fill out the three cells shown here.
The task ahead of us is to fill out the final cells.
Slide 10
A portion of the actual data set is shown on the right of your screen. The data are ordered by Total Field
Sobriety Test score and all those subjects who had a score less than or equal to 4 are shown.
Recall that a score of 3 or less would be categorized as “sober” by the rules we adopted for this
illustration. How many of those are in the entire data set? Just count the number of rows in the portion
of the data set shown that correspond to a 3 or below on the Total FST.
There are 20 of those and the “20” goes in the cell shown. 20 subjects, total, were judged to be sober
from the FST.
Now, of those 20 that the FST judged to be sober, how many really were sober? That is, of those 20 how
many had an Actual BAC (Blood Alcohol Content) less than 0.04? Once again, count.
There were 9 of those 20 who were really sober. Nine of the subjects that the FST said were sober,
really were sober. This number goes in the cell shown.
With the 9 and the 20 in the first row, this next cell is easily computed by subtraction. 20 – 9 is 11, so of
the 20 subjects the FST said were sober, 11 were not.
In fact, the remaining three cells can all be computed by subtraction as well.
And the table is complete. Now we can compute the estimated false negative and false positive rates
for these important data.
Of the 29 subjects who really were sober, the FST said 20 were drunk. So the false positive rate is 20 out
of 29, or 69%. Hence, the specificity is 100% minus 69%, or 31%.
Likewise, of the 267 subjects who really were drunk, 11 were classified as sober by the FST. Hence, the
false negative rate is 11 out of 267, or 4%. It follows that the sensitivity is 96%.
Typical of a field sobriety test, the sensitivity is very good and the specificity is not so good. The test is
excellent at catching drunk drivers, but the price paid is that it also falsely accuses a rather large
percentage of sober drivers.
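If you would like to replicate the table completion and the rate calculations, here is a short Python sketch (my own; the variable names are not from the study) that starts from the counts quoted above, fills the remaining cells by subtraction, and recomputes both rates:

# Known counts quoted above.
truly_sober, truly_drunk = 29, 267   # from the blood test, BAC cutoff 0.04%
said_sober = 20                      # FST total score of 3 or less
said_sober_and_sober = 9             # counted directly from the data

# The remaining cells follow by subtraction.
said_sober_and_drunk = said_sober - said_sober_and_sober   # 11 false negatives
said_drunk_and_sober = truly_sober - said_sober_and_sober  # 20 false positives
said_drunk_and_drunk = truly_drunk - said_sober_and_drunk  # 256 true positives

fpr = said_drunk_and_sober / truly_sober   # 20 out of 29
fnr = said_sober_and_drunk / truly_drunk   # 11 out of 267
print(f"FPR {fpr:.0%}, specificity {1 - fpr:.0%}")  # FPR 69%, specificity 31%
print(f"FNR {fnr:.0%}, sensitivity {1 - fnr:.0%}")  # FNR 4%, sensitivity 96%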
Slide 11
So what if we change the rule that triggers the FST to categorize a participant as drunk? For example,
what if instead of a 4 or above being taken to mean “intoxicated” one took 2 or above?
This will make it a lot easier to catch all the drunks. It will also flag even more people who are truly
sober. So the false positive rate (FPR) will surely go up and the false negative rate (FNR) will go down.
So, no surprise, how well a screening test performs with respect to sensitivity and specificity will be
directly related to the cutoff that is used to identify a “positive” outcome.
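To see that trade-off concretely, you can sweep the cutoff over per-subject scores. The sketch below uses a small invented set of (score, truly drunk) pairs purely for illustration; it is not the San Diego data:

# Hypothetical (FST score, truly drunk?) pairs -- invented for illustration.
subjects = [(0, False), (1, False), (2, True), (3, False), (3, True),
            (4, True), (5, True), (6, True), (7, False), (8, True)]

def rates(cutoff):
    """FPR and FNR when a total score of `cutoff` or more is called drunk."""
    n_sober = sum(1 for _, drunk in subjects if not drunk)
    n_drunk = len(subjects) - n_sober
    fp = sum(1 for score, drunk in subjects if score >= cutoff and not drunk)
    fn = sum(1 for score, drunk in subjects if score < cutoff and drunk)
    return fp / n_sober, fn / n_drunk

for cutoff in (4, 2):
    fpr, fnr = rates(cutoff)
    print(f"cutoff {cutoff}: FPR {fpr:.0%}, FNR {fnr:.0%}")
# Moving the cutoff from 4 down to 2 raises the FPR and lowers the FNR.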
Slide 12
This concludes our video on the computations associated with sensitivity and specificity. Remember,
simple fractions are used to compute sensitivity and specificity when both test results and the truth are
arrayed in a simple 2x2 table.