Notes on Signal Detection Theory and the Output of the Program for the Calculation of the Relative Operating Characteristic
1. Introduction
Signal Detection Theory (SDT; Swets and Pickett, 1982) is a verification method which seeks to determine
the ability of a weather forecasting system to distinguish between situations that precede an event of interest
and those that do not. SDT is concerned with the evaluation of forecasts of two-state or dichotomous
variables, where the forecasts are expressed as probabilities of the event of interest.
Typically a verification dataset consists of paired forecasts and observations, where each
forecast-observation pair consists of a forecast probability and a binary indicator of the occurrence or
non-occurrence of the event of interest. For entry into the SDT program, the probability values are
restricted to a discrete set {p(i), i = 1, ..., k}. Verification data may be converted to a discrete set of k
probability values by "binning" the cases of the original sample, as is usually done for the preparation of
reliability tables. Normally, the bins are deciles of probability, 0-<10%, 10-<20%, ..., 90-100%, or they may
be centered on each decile with half-size bins at the extremes, 0-<5%, 5-<15%, ..., 85-<95%, 95-100%.
Either end of the bin can be the inclusive one; it will make little difference to the results.
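As a concrete illustration, here is a minimal sketch (in Python with numpy; the array names are illustrative, not part of the program described in these notes) of binning a verification sample of paired probabilities and binary observations into deciles:

```python
import numpy as np

# Hypothetical verification sample: forecast probabilities in [0, 1]
# and binary observations (1 = event occurred, 0 = it did not).
probs = np.array([0.07, 0.12, 0.55, 0.83, 0.91, 0.38, 0.64, 0.97])
obs   = np.array([0,    0,    1,    1,    1,    0,    1,    1   ])

# Decile bins 0-<10%, 10-<20%, ..., 90-100% (upper bin closed).
edges = np.linspace(0.0, 1.0, 11)
idx = np.clip(np.digitize(probs, edges) - 1, 0, 9)  # bin index 0..9

# Count non-occurrences and occurrences in each bin, as in Table 1 below.
n_no  = np.bincount(idx[obs == 0], minlength=10)
n_yes = np.bincount(idx[obs == 1], minlength=10)
p_central = edges[:-1] + 0.05  # central value of each decile bin
```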
2. Calculation of the relative operating characteristic (ROC)
The data are entered into the program by means of a k by 3 array whose columns are the probability (usually
the central value of each probability bin), the number of non-occurrences in the bin, and the number of
occurrences in the bin.
Table 1. Sample input data for the calculation of the ROC using the
normal-normal model. The data are from the development of a
discriminant analysis-based 6 h forecast of "cloudy" conditions, where
"cloudy" is defined as more than 4 oktas of sky cover.

PROB    # NO    # YES
0.05      32        6
0.15       8        7
0.25       8        2
0.35       9        7
0.45       9        4
0.55       5       15
0.65      10       10
0.75       3       12
0.85       8       16
0.95      14      146
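For use in the sketches that follow, the data of Table 1 can be held in three arrays (a minimal sketch in Python with numpy; the variable names are illustrative):

```python
import numpy as np

# Table 1: central bin probability, non-occurrences, occurrences.
prob  = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])
n_no  = np.array([  32,    8,    8,    9,    9,    5,   10,    3,    8,   14])
n_yes = np.array([   6,    7,    2,    7,    4,   15,   10,   12,   16,  146])
```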
From these data we can plot two relative frequency distributions, one for the non-occurrences and one for
the occurrences (Fig. 1).
[Figure 1 here: two bar charts of the binned counts of Table 1, g0(p) (the # NO column, above) and g1(p) (the # YES column, below), plotted against forecast probability p.]
Figure 1. Conditional distributions of forecast probabilities of cloudy weather for not-cloudy cases (above)
and cloudy cases (below).
Decision-making involves selecting a threshold probability, say p*, such that one decision (e.g. to protect
against adverse weather) is taken whenever p ≥ p*, and a different decision (e.g. to do nothing) is made
when p < p*. We are interested in knowing how the likelihood of correct and incorrect decisions varies as
p* varies in a given set of forecasts. To estimate this, one can obtain cumulative frequencies from the
distributions g0(p) and g1(p),
$$h(p^*) = \sum_{p \ge p^*}^{p_K} g_1(p) = \Pr\{\,p \ge p^* \mid \text{event}\,\}$$

$$f(p^*) = \sum_{p \ge p^*}^{p_K} g_0(p) = \Pr\{\,p \ge p^* \mid \text{nonevent}\,\}$$
where p_K is the probability value of the Kth (highest) bin. These inverse cumulative frequencies can then be
plotted as functions of p* (Fig. 2).
Figure 2. Plot of the empirical hit rate h(p*) and the empirical false alarm rate f(p*) for the data of Fig.1.
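A minimal sketch of the computation behind Fig. 2, continuing from the arrays defined after Table 1: h(p*) and f(p*) are reversed cumulative sums of the binned counts, normalized by the column totals.

```python
# Cumulative sums from the highest bin downward give, for each
# threshold p* (taken at the bin values), the counts with p >= p*.
hits_above  = np.cumsum(n_yes[::-1])[::-1]  # occurrences with p >= p*
false_above = np.cumsum(n_no[::-1])[::-1]   # non-occurrences with p >= p*

h = hits_above / n_yes.sum()   # empirical hit rate h(p*)
f = false_above / n_no.sum()   # empirical false alarm rate f(p*)
```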
h(p*) is the empirical hit rate, equal to the Probability of Detection (POD), and f(p*) is the empirical false
alarm rate, which is NOT the same as the False Alarm Ratio (FAR). The false alarm rate is determined by
computing the proportion of observations of the non-event which correspond to forecasts of the event,
while the False Alarm Ratio is determined by computing the proportion of forecasts of the event which
correspond to observations of the non-event.
To visualize the calculation of the hit rate h(p*) and the false alarm rate f(p*) from Table 1, it is useful to
consider a 2x2 contingency table,
Table 2. Forecast-observed contingency table for a dichotomous variable.

                      FORECAST
OBSERVED          YES      NO
YES                X        Y       X+Y
NO                 Z        W       Z+W
                  X+Z      Y+W     TOTAL
The entries in the table are the numbers of cases falling into each of the categories of the joint distribution.
The hit rate is given by X/(X+Y), the proportion of occurrences of the event that were correctly forecast.
The false alarm rate is Z/(Z+W), the proportion of non-occurrences for which the event was forecast.
Consider, for example, p* = 0.3, which falls between the 0.25 and 0.35 bins of Table 1. Then X is the sum of
the "# YES" column below that point, Y is the sum of the "# YES" column above it, Z is the sum of the
"# NO" column below it, and W is the sum of the "# NO" column above it. The sums X+Y and Z+W are the
totals of the "# YES" and "# NO" columns respectively. Computation of the empirical ROC involves moving
the threshold p* down through the table and calculating h and f from the contingency table generated at
each step. In this way, the ROC assesses the probability forecasts as they might be used in decision-making.
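As a worked check (a sketch, using the arrays defined earlier), the 2x2 table for p* = 0.3 splits the columns of Table 1 between the 0.25 and 0.35 bins:

```python
p_star = 0.3
above = prob >= p_star          # bins "below the line" in Table 1

X = n_yes[above].sum()          # hits: observed yes, forecast yes -> 210
Y = n_yes[~above].sum()         # misses: observed yes, forecast no -> 15
Z = n_no[above].sum()           # false alarms -> 58
W = n_no[~above].sum()          # correct rejections -> 48

hit_rate = X / (X + Y)          # 210/225, about 0.93
fa_rate  = Z / (Z + W)          # 58/106,  about 0.55
```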
Figure 2 shows the empirical hit and false alarm rates for the cloudiness example represented by the data
of Table 1 and Fig. 1. Each plotted point represents the frequency of hits or false alarms that would have
occurred if the decision criterion p* were set to the corresponding value on the abscissa. For example,
p* = 0.3 incurs a hit rate of about 93% and a false alarm rate of about 55%. The more skillful the forecasts,
the greater the separation between the two distributions g0(p) and g1(p): really excellent forecasts would
have g1 concentrated near the high end of the probability range and g0 concentrated near the low end.
The relative operating characteristic (ROC) curve is essentially a graphical representation of the difference
between the two distributions: it is a graph of h(p*) against f(p*) as p* varies.
[Figure 3 here: plot of hit rate against false alarm rate, each axis from 0.0 to 1.0, with the empirical points labelled by their threshold probabilities p* = 0.1, ..., 0.9.]
Figure 3. Empirical ROC for the data of Table 1.
Fig. 3 is the ROC for the data in Table 1. The major diagonal is the line on which h(p*) = f(p*) for all
thresholds p*; in that case g0 and g1 would be identical and the forecast set would show no ability to
distinguish between occurrence and non-occurrence of the event. Perfect performance would be represented
by the ROC rising along the ordinate axis from (0,0) to (0,1), then running along the top of the diagram to
(1,1). ROCs below the major diagonal represent potentially useful forecasts that would need recalibration
to move them up into the upper half of the diagram.
A useful index of performance, suggested by the fact that the best ROCs lie nearest the upper left-hand
corner of the plot, is the total area under the empirical curve, which may be denoted P(A) or PA. This
varies from 0.0 (totally perverse performance) through 0.5 (useless) to 1.0 (perfect performance). PA is
sensitive to the overall location of the ROC; it is therefore an overall indicator of performance, over all
thresholds. Reduction of performance to a single value inevitably involves loss of information: ROCs for
competing forecast systems may sometimes cross, indicating that one system is better at low thresholds
and the other at high ones. It is advisable to plot the whole ROC whenever possible.
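PA can be estimated from the empirical points by the trapezoidal rule, with the curve anchored at (0,0) and (1,1); a minimal sketch continuing from the h and f arrays above:

```python
# Order the points by increasing false alarm rate and anchor the ends.
order = np.argsort(f)
f_pts = np.concatenate(([0.0], f[order], [1.0]))
h_pts = np.concatenate(([0.0], h[order], [1.0]))

PA = np.trapz(h_pts, f_pts)   # area under the empirical ROC
```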
The ROC can be linearized by plotting the standard normal deviates corresponding to h and f, as shown in
Fig. 4 for the data of Table 1. In fact, it has been shown that, when transformed this way, empirical ROCs
almost invariably lie very close to a straight line for a wide variety of decision-making processes, not only
in weather forecasting but also in experimental psychology, medical imaging, aptitude testing and
information retrieval (Swets, 1986).
[Figure 4 here: the z-plot of Z[h(p*)] against Z[f(p*)], with the fitted straight line of slope s = s0/s1, its intercept m marked on the Z[f(p*)] axis, the distance z(A) from the origin, and the empirical points labelled by their thresholds p* = 0.1, ..., 0.9.]
Figure 4. The ROC plotted in terms of the standard normal deviates corresponding to h and f. The
straight line is the least-squares fit to the data points of the empirical ROC.
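A sketch of the fit behind Fig. 4 (Python with scipy, continuing from the h and f arrays above; note that the point with h = f = 1, whose normal deviates are infinite, must be dropped, and that least squares is a convenience here, not the only possible fitting method):

```python
from scipy.stats import norm

zf = norm.ppf(f)    # standard normal deviates Z[f(p*)]
zh = norm.ppf(h)    # and Z[h(p*)]

ok = np.isfinite(zf) & np.isfinite(zh)      # drop h or f equal to 0 or 1
slope, intercept = np.polyfit(zf[ok], zh[ok], 1)

s = slope                 # = s0/s1, so s1 = 1.0/s with s0 = 1
m = intercept / slope     # separation of the means; the fitted line
                          # crosses the Z[f(p*)] axis at -m
```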
Transferring the fitted line back to linear probability coordinates leads to the curve shown in Fig. 5. The
area under this curve is denoted Az; as an index of performance, it should be less subject to the effects of
sampling scatter than the empirical curve and PA.
3. The Signal Detection model
ROCs linear on double-probability axes can be generated by moving a threshold through a pair of
gaussian distributions. The linearity of empirical ROCs indicates that forecasters behave as if they select
their forecast probability on the basis of an underlying decision variable, presumably representing their
judgment based on the information available to them, which has gaussian distributions prior to the
occurrence and non-occurrence of the forecast event in question. In fact, the linearity of empirical ROCs
implies only that the hypothetical distributions can be transformed to gaussian by means of a monotonic
transformation (Swets, 1986). This is because the performance of the decision variable depends on the
differences between the two underlying distributions rather than on the distributions themselves. The
assumption does mean, however, that the model would probably not work well for multi-modal
distributions. We now denote the continuous form of the decision variable as x, which has a distribution
f0(x) prior to non-occurrences and f1(x) prior to occurrences of the event; f0(x) is N(m0, s0) and f1(x) is
N(m1, s1). A specific threshold x* is related to p* through Bayes' formula,
$$p^* = \frac{w_0\, l(x^*)}{1.0 + w_0\, l(x^*)}, \qquad w_0 = \frac{p_0}{1.0 - p_0},$$

p0 being the climatological probability of the event, and

$$l(x^*) = \frac{f_1(x^*)}{f_0(x^*)}$$
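As a worked sketch of this relation (Python with scipy; the parameter values are purely illustrative, not fitted ones):

```python
from scipy.stats import norm

def p_star_from_x_star(x_star, p0, m1, s1):
    """Threshold probability corresponding to decision threshold x*,
    via the likelihood ratio l(x*) and the climatological odds w0.
    Assumes the scaling m0 = 0, s0 = 1 used later in these notes."""
    w0 = p0 / (1.0 - p0)
    lr = norm.pdf(x_star, loc=m1, scale=s1) / norm.pdf(x_star, loc=0.0, scale=1.0)
    return w0 * lr / (1.0 + w0 * lr)

# Example with illustrative parameters:
p_star_from_x_star(x_star=1.0, p0=0.3, m1=1.5, s1=1.2)
```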
In terms of x, hit rate is defined by,
$$h(x^*) = \Pr\{X \ge x^* \mid \text{event}\} = \int_{x^*}^{\infty} f_1(x)\,dx$$

$$f(x^*) = \Pr\{X \ge x^* \mid \text{nonevent}\} = \int_{x^*}^{\infty} f_0(x)\,dx$$
so that h(x*) is the area under f1(x) to the right of x* and similarly, the false alarm rate is the area under
f0(x) to the right of x*.
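Under the gaussian assumptions these integrals have closed forms in the standard normal distribution function $\Phi$:

$$h(x^*) = \Phi\!\left(\frac{m_1 - x^*}{s_1}\right), \qquad f(x^*) = \Phi\!\left(\frac{m_0 - x^*}{s_0}\right)$$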
[Figure 5 here: hit rate against false alarm rate, showing the no-skill line, the fitted ROC, and the points of the empirical ROC.]
Figure 5. The fitted ROC in linear probability space for the data of Table 1. Points on the graph represent
the points of the empirical ROC, as in Fig. 3.
In signal detection theory, f0(x) can be seen as the distribution of the output of a filter when noise alone is
present in the input, and f1(x) as this distribution when a signal is present in addition to the noise. The
threshold x* is set so as to minimize the probabilities of the two types of decision error, misses and false
alarms. The SDT model also has similarities to the Neyman-Pearson approach to statistical hypothesis
testing, with x as a test statistic (e.g. Student's t, chi-squared, etc.), f0(x) its distribution under the null
hypothesis, and f1(x) its distribution under the alternative hypothesis. The value of x* is usually selected
to keep the probability of a Type I error (a "false alarm") at 5% or 1%. The hit rate is equivalent to the
power of the test, one minus the probability of a Type II error.
The parameters of the SDT model can be found by fitting a straight line to the standard normal deviates of
h and f; the mathematical basis of this is given in Green and Swets (1966). The parameters required are
the separation of the means of f1 and f0, the ratio of their standard deviations, and the values of x*
corresponding to the threshold probabilities used for the actual forecasts. It is convenient to scale the x
axis so that m0 = 0.0 and s0 = 1.0. When this is done, m1 = m is equal to the intercept of the fitted line on
the Z[f(p*)] axis, and s1 = 1.0/s, where s is the slope of the fitted line. These parameters are shown on
Fig. 4; the units are standard deviations of f0(x). If the line is fitted by least squares, then the x* values can
be estimated from the Z[f(p)] values at the feet of the perpendiculars from the data points to the fitted line
(Fig. 4), as sketched below.
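A sketch of that estimate, continuing from the fit above: the foot of the perpendicular from each data point to the line zh = slope·zf + intercept has a simple closed form, and its Z[f] coordinate gives the x* estimate (here using the sign convention Z[f(p*)] = -x*, which follows from f(x*) = Φ(-x*) when m0 = 0 and s0 = 1; this convention is an assumption, not stated explicitly in the original notes).

```python
# Foot of the perpendicular from each (zf, zh) point to the fitted line.
zf_foot = (zf[ok] + s * (zh[ok] - intercept)) / (1.0 + s**2)

# With f(x*) = Phi(-x*), Z[f] = -x*, so the threshold estimates are:
x_star_est = -zf_foot
```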
Common indices of skill based on the ROC are:
1. z(A). This is the perpendicular distance from the origin of the Z[f(p)], Z[h(p)] axes to the ROC, in units
of the standard deviation of f0. It is calculated by,
$$z(A) = \frac{m\,s}{\sqrt{1 + s^2}}$$
2. Az. This is the area under the fitted ROC on axes linear in probability. It is the same as PA except
that Az uses the fitted curve, which should be less sensitive to sampling scatter than the empirical
curve. It is found as the area under a standard normal probability distribution up to the normal deviate
value equal to z(A).
3. DA is z(A) multiplied by the square root of 2. It is equal to m when s=1.0.
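Given m and s from the fit sketched earlier, all three indices follow directly (a minimal continuation of that sketch):

```python
from math import sqrt
from scipy.stats import norm

zA = m * s / sqrt(1.0 + s**2)   # perpendicular distance from the origin
Az = norm.cdf(zA)               # area under the fitted ROC
DA = sqrt(2.0) * zA             # equals m when s = 1
```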
4. References
Green, D. M. and J. A. Swets, 1966: Signal Detection Theory and Psychophysics. New York: Wiley.
(Reprinted, 1974, Huntington, NY: Krieger.)
Swets, J. A., 1986: Form of empirical ROCs in discrimination and diagnostic tasks: Implications for theory
and measurement of performance. Psychological Bulletin, 99, 181-198.
Swets, J. A. and R. M. Pickett, 1982: Evaluation of diagnostic systems: Methods from signal detection
theory. New York, Academic Press, 253 pp.