Notes on Signal Detection Theory and the Output of the Program for the Calculation of the Relative Operating Characteristic

1. Introduction

Signal Detection Theory (SDT; Swets and Pickett, 1982) is a verification method which seeks to determine the ability of a weather forecasting system to distinguish between situations preceding an event of interest and those that do not precede it. SDT is concerned with the evaluation of forecasts of two-state or dichotomous variables, where the forecasts are expressed as probabilities of the event of interest. Typically, a verification dataset consists of forecast-observation pairs, each comprising a forecast probability and a binary indicator of the occurrence or non-occurrence of the event of interest.

For entry into the SDT program, the probability values are restricted to a discrete set {p(i), i = 1, ..., k}. Verification data may be converted to a discrete set of k probability values by "binning" the cases of the original sample, as is usually done for the preparation of reliability tables. Normally, the bins are deciles of probability, 0-<10%, 10-<20%, ..., 90-100%, or they may be centred on each decile with half-size bins at the extremes, 0-<5%, 5-<15%, ..., 85-<95%, 95-100%. Either end of the bins can be made the inclusive one; it makes little difference to the results.

2. Calculation of the relative operating characteristic (ROC)

The data are entered into the program by means of a k by 3 array whose columns contain the probability (usually the central value of each probability bin), the number of non-occurrences in the bin, and the number of occurrences in the bin.

Table 1. Sample input data for the calculation of the ROC using the normal-normal model. The data are from the development of a discriminant-analysis-based 6 h forecast of "cloudy" conditions, where "cloudy" is defined as more than 4 oktas of sky cover.

    PROB    # NO    # YES
    0.05     32       6
    0.15      8       7
    0.25      8       2
    -------------------------  (heavy line: p* = 0.3)
    0.35      9       7
    0.45      9       4
    0.55      5      15
    0.65     10      10
    0.75      3      12
    0.85      8      16
    0.95     14     146

From these data we can plot two relative frequency distributions, one for the non-occurrences and one for the occurrences.

[Figure 1: two bar charts of the Table 1 counts, g0(p) (# NO, above) and g1(p) (# YES, below), plotted against forecast probability p.]

Figure 1. Conditional distributions of forecast probabilities of cloudy weather for not-cloudy cases (above) and cloudy cases (below).

Decision-making involves selecting a threshold probability, say p*, such that one decision (e.g. to protect against adverse weather) is taken whenever p >= p*, and a different decision (e.g. to do nothing) is made when p < p*. We are interested in knowing how the likelihood of correct and incorrect decisions varies as p* varies in a given set of forecasts. To estimate this, one can obtain cumulative frequencies from the distributions g0(p) and g1(p),

    h(p^*) = \sum_{p \ge p^*}^{p_K} g_1(p) = \Pr\{p \ge p^* \mid \text{event}\}

    f(p^*) = \sum_{p \ge p^*}^{p_K} g_0(p) = \Pr\{p \ge p^* \mid \text{nonevent}\}

where p_K is the probability value of the Kth (highest) bin. Then these inverse cumulative frequencies can be plotted as functions of p*.

[Figure 2: f(p*) and h(p*) plotted against threshold probability p*, both axes from 0 to 1.]

Figure 2. Plot of the empirical hit rate h(p*) and the empirical false alarm rate f(p*) for the data of Fig. 1.
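These inverse cumulative sums are straightforward to compute. The following minimal Python sketch (it is not the SDT program these notes describe, and the array names are ours) reproduces h(p*) and f(p*) at each bin threshold from the counts of Table 1, assuming NumPy is available:

    import numpy as np

    # The three columns of Table 1: bin centres, non-occurrence counts,
    # occurrence counts.
    prob  = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])
    n_no  = np.array([32, 8, 8, 9, 9, 5, 10, 3, 8, 14])
    n_yes = np.array([6, 7, 2, 7, 4, 15, 10, 12, 16, 146])

    # Relative frequency distributions g0(p) (non-occurrences) and
    # g1(p) (occurrences), as in Fig. 1.
    g0 = n_no / n_no.sum()
    g1 = n_yes / n_yes.sum()

    # Inverse cumulative sums over the bins: element i gives
    # h = Pr{p >= prob[i] | event} and f = Pr{p >= prob[i] | nonevent}.
    h = g1[::-1].cumsum()[::-1]
    f = g0[::-1].cumsum()[::-1]

    for p_star, h_i, f_i in zip(prob, h, f):
        print(f"p* = {p_star:.2f}:  h = {h_i:.3f}  f = {f_i:.3f}")

Placing the threshold between the 0.25 and 0.35 bins (p* = 0.3, the row printed for p >= 0.35) gives h = 0.933 and f = 0.547, the values discussed with Fig. 2 below.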
h(p*) is the empirical hit rate, equal to the Probability of Detection (POD), and f(p*) is the empirical false alarm rate, which is NOT the same as the False Alarm Ratio (FAR). The false alarm rate is the frequency with which the event was forecast when it did not occur. The False Alarm Ratio is the relative frequency of forecasts of the event which correspond to non-events. The false alarm rate is determined by computing the proportion of observations of the non-event which correspond to forecasts of the event, while the False Alarm Ratio is determined by calculating the proportion of forecasts of the event which correspond to observations of the non-event.

To visualize the calculation of the hit rate h(p*) and the false alarm rate f(p*) from Table 1, it is useful to consider a 2x2 contingency table.

Table 2. Forecast-observed contingency table for a dichotomous variable.

                    FORECAST
    OBSERVED      YES      NO
    YES            X        Y       X+Y
    NO             Z        W       Z+W
                  X+Z      Y+W     TOTAL

The entries in the table are the number of cases falling into each of the categories of the joint distribution. The hit rate is given by X/(X+Y), the proportion of correct forecasts of the event given that it was observed. The false alarm rate is Z/(Z+W), the proportion of forecasts of the event given that the event did not occur. Consider, for example, p* = 0.3, indicated by the heavy line in Table 1. Then X is the sum of the "YES" column below the line, Y is the sum of the "YES" column above the line, Z is the sum of the "NO" column below the line and W is the sum of the "NO" column above the line. The sums X+Y and Z+W are the totals of the "YES" and "NO" columns respectively.

Computation of the empirical ROC involves moving the threshold p* down through the table and calculating h and f from the contingency table generated at each step. In this way, the ROC assesses the probability forecasts as they might be used in decision-making. Figure 2 shows the empirical hit and false alarm rates for the cloudiness example represented by the data of Table 1 and Fig. 1. Each plotted point represents the frequency of hits or false alarms that would have occurred if the decision criterion p* were set to the corresponding value on the abscissa. For example, p* = 0.3 incurs a hit rate of about 93% and a false alarm rate of about 55%.

The more skillful the forecasts, the greater the separation between the two distributions g0(p) and g1(p). Really excellent forecasts would have g1 concentrated near the high end of the probability range and g0 concentrated near the low end. The relative operating characteristic (ROC) curve is essentially a graphical representation of the difference between the two distributions: it is a graph of h(p*) against f(p*) as p* varies.

[Figure 3: the empirical ROC, hit rate against false alarm rate on 0-1 axes; the plotted points are labelled with their threshold probabilities .1, .2, ..., .9.]

Figure 3. Empirical ROC for the data of Table 1.

Fig. 3 is the ROC for the data in Table 1. The major diagonal is the line for which h(p*) = f(p*) for all thresholds p*. In that case g0 and g1 would be identical, and the forecast set would show no ability to distinguish between occurrence and non-occurrence of the event. Perfect performance would be represented by the ROC rising along the ordinate axis from (0,0) to (0,1), then running along the top of the diagram to (1,1). ROCs below the major diagonal represent potentially useful forecasts that would need recalibration to move them up into the upper half of the diagram.
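To make the threshold sweep concrete, here is a small Python sketch in the same spirit (again with our own names, not the program's actual code). It forms the Table 2 counts for each placement of p* and collects the empirical ROC points of Fig. 3; the closing lines estimate the area under the curve (the index called PA in the next section) by the trapezoidal rule.

    import numpy as np

    n_no  = np.array([32, 8, 8, 9, 9, 5, 10, 3, 8, 14])    # Table 1, "# NO"
    n_yes = np.array([6, 7, 2, 7, 4, 15, 10, 12, 16, 146])  # Table 1, "# YES"

    def contingency(k):
        """Table 2 counts when p* is placed just below bin k, so that
        bins k and above count as forecasts of the event."""
        X = n_yes[k:].sum()   # hits: forecast yes, observed yes
        Y = n_yes[:k].sum()   # misses: forecast no, observed yes
        Z = n_no[k:].sum()    # false alarms: forecast yes, observed no
        W = n_no[:k].sum()    # correct rejections: forecast no, observed no
        return X, Y, Z, W

    # Move the threshold down through the table: k = K never forecasts the
    # event and gives the point (0,0); k = 0 always forecasts it, giving (1,1).
    roc = []
    for k in range(len(n_yes), -1, -1):
        X, Y, Z, W = contingency(k)
        roc.append((Z / (Z + W), X / (X + Y)))  # (false alarm rate, hit rate)

    # Trapezoidal estimate of the area under the empirical curve.
    PA = sum(0.5 * (h0 + h1) * (fa1 - fa0)
             for (fa0, h0), (fa1, h1) in zip(roc[:-1], roc[1:]))
    print(f"PA = {PA:.3f}")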
A useful index of performance, suggested by the fact that the best ROCs lie nearer to the upper left-hand corner of the plot, is the total area under the empirical curve, which may be denoted P(A) or PA. It varies from 0.0 (totally perverse performance) through 0.5 (useless) to 1.0 (perfect performance). PA is sensitive to the overall location of the ROC; it is therefore an overall indicator of performance, over all thresholds. Reducing the performance indicator to a single value inevitably involves loss of information about the performance. In fact, ROCs for competing forecast systems sometimes cross, indicating that one system is better at low thresholds and the other at high thresholds. It is advisable to plot the whole ROC whenever possible.

The ROC can be linearized by plotting the standard normal deviates corresponding to h and f, as shown in Fig. 4 for the data of Table 1. It has been shown that, when transformed this way, empirical ROCs almost invariably lie very close to a straight line for a wide variety of decision-making processes, not only in weather forecasting but also in experimental psychology, medical imaging, aptitude testing and information retrieval (Swets, 1986).

[Figure 4: the Z-plot, the empirical ROC points on axes Z[f(p*)] (abscissa) and Z[h(p*)] (ordinate) running from -2 to 2, with the fitted straight line of slope s = s0/s1; the intercept m and the perpendicular distance z(A) from the origin are marked.]

Figure 4. The ROC plotted in terms of the standard normal deviates corresponding to h and f. The straight line is the least-squares fit to the data points of the empirical ROC.

Transferring the fitted line back to linear probability coordinates leads to the curve shown in Fig. 5. The area under this curve is denoted Az, which as an index of performance should be less subject to the effects of sampling scatter than the empirical curve and PA.

3. The Signal Detection model

ROCs linear on double-probability axes can be generated by moving a threshold through a pair of gaussian distributions. The linearity of empirical ROCs indicates that forecasters behave as if they select their forecast probability on the basis of an underlying decision variable, presumably representing their judgment based on the information available to them, which has gaussian distributions prior to the occurrence and non-occurrence of the forecast event in question. In fact, the linearity of empirical ROCs implies only that the hypothetical distributions can be transformed to gaussian by means of a monotonic transformation (Swets, 1986). This is because the performance of the decision variable is related to the differences between the two underlying distributions rather than to the distributions themselves. The assumption does mean, however, that the model would probably not work well for multi-modal distributions.

We now denote the continuous form of the decision variable as x, which has a distribution f0(x) prior to non-occurrences and f1(x) prior to occurrences of the event; f0(x) is N(m0, s0) and f1(x) is N(m1, s1). A specific threshold x* is related to p* through Bayes' formula,

    p^* = \frac{w_0\, l(x^*)}{1.0 + w_0\, l(x^*)}, \qquad w_0 = \frac{p_0}{1.0 - p_0},

p0 being the climatological probability of the event, and

    l(x^*) = f_1(x^*) / f_0(x^*)

In terms of x, the hit rate and false alarm rate are defined by

    h(x^*) = \Pr\{X \ge x^* \mid \text{event}\} = \int_{x^*}^{\infty} f_1(x)\,dx

    f(x^*) = \Pr\{X \ge x^* \mid \text{nonevent}\} = \int_{x^*}^{\infty} f_0(x)\,dx

so that h(x*) is the area under f1(x) to the right of x* and, similarly, the false alarm rate is the area under f0(x) to the right of x*.
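Before turning to the fitted curve of Fig. 5, here is a hedged Python sketch of the linearized fit just described, again using the Table 1 counts. It assumes SciPy's norm.ppf and norm.cdf for the normal deviate transform Z and its inverse, and it regresses Z[h] on Z[f] with numpy.polyfit, a simplification of the perpendicular least-squares fit discussed in the next section; the index formulas it uses are the ones defined below.

    import numpy as np
    from scipy.stats import norm

    n_no  = np.array([32, 8, 8, 9, 9, 5, 10, 3, 8, 14])
    n_yes = np.array([6, 7, 2, 7, 4, 15, 10, 12, 16, 146])

    # Interior ROC points only: the end points (0,0) and (1,1) have
    # infinite normal deviates and cannot enter the fit.
    ks = np.arange(1, len(n_yes))
    h = np.array([n_yes[k:].sum() for k in ks]) / n_yes.sum()
    f = np.array([n_no[k:].sum() for k in ks]) / n_no.sum()

    # Transform to standard normal deviates and fit the straight line
    # Z[h] = a + s * Z[f].
    zf, zh = norm.ppf(f), norm.ppf(h)
    s, a = np.polyfit(zf, zh, 1)

    # Model parameters on the scale m0 = 0, s0 = 1: the line meets the
    # Z[f] axis at -a/s, so m1 = m = a/s and s1 = 1/s.
    m = a / s

    # Indices of skill (formulas given in the next section).
    zA = m * s / np.sqrt(1.0 + s**2)  # perpendicular distance from origin
    Az = norm.cdf(zA)                 # area under the fitted ROC
    DA = np.sqrt(2.0) * zA

    print(f"s = {s:.3f}  m = {m:.3f}  z(A) = {zA:.3f}  Az = {Az:.3f}  DA = {DA:.3f}")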
[Figure 5: the fitted ROC and the empirical ROC points in linear probability coordinates, hit rate against false alarm rate, with the no-skill diagonal shown.]

Figure 5. The fitted ROC in linear probability space for the data of Table 1. Points on the graph represent the points of the empirical ROC as in Fig. 3.

In signal detection theory, f0(x) can be seen as the distribution of the output of a filter when noise alone is present in the input, and f1(x) as this distribution when a signal is present in addition to the noise. The threshold x* is set so as to minimize the probability of either type of decision error, misses or false alarms. The SDT model also has similarities to the Neyman-Pearson approach to statistical hypothesis testing, with x as a test statistic (e.g. Student's t, chi-squared), f0(x) its distribution under the null hypothesis and f1(x) its distribution under the alternative hypothesis. The value of x* is usually selected to keep the probability of a type I error (a "false alarm") at 5% or 1%. The hit rate is equivalent to the power of the test, one minus the probability of a type II error.

The parameters of the SDT model can be found by fitting a straight line to the standard normal deviates of h and f. The mathematical basis of this is found in Green and Swets (1966). The parameters required are the separation of the means of f1 and f0, the ratio of their standard deviations, and the values of x* corresponding to the threshold probabilities used for the actual forecasts. It is convenient to scale the x axis so that m0 = 0.0 and s0 = 1.0. When this is done, m1 = m is given by the intercept of the fitted line on the Z[f(p*)] axis, and s1 = 1.0/s, where s is the slope of the fitted line. These parameters are shown on Fig. 4; the units are the standard deviation of f0(x). If the line is fitted by least squares, then the x* values can be estimated by the Z[f(p)] values at the feet of the perpendiculars from the data points to the fitted line (Fig. 4).

Common indices of skill based on the ROC are:

1. z(A). This is the perpendicular distance from the origin of the Z[f(p)], Z[h(p)] axes to the fitted ROC, in units of the standard deviation of f0. It is calculated by

    z(A) = \frac{m s}{\sqrt{1 + s^2}}

2. Az. This is the area under the fitted ROC on axes linear in probability. It is the same as PA except that Az uses the fitted curve, which should be less sensitive to sampling scatter than the empirical curve. It is found as the area under a standard normal probability distribution up to the normal deviate value equal to z(A).

3. DA. This is z(A) multiplied by the square root of 2. It is equal to m when s = 1.0.

4. References

Green, D. M. and J. A. Swets, 1966: Signal Detection Theory and Psychophysics. New York: Wiley. (Reprinted 1974, Huntington, NY: Krieger.)

Swets, J. A., 1986: Form of empirical ROCs in discrimination and diagnostic tasks: Implications for theory and measurement of performance. Psychological Bulletin, 99, 181-198.

Swets, J. A. and R. M. Pickett, 1982: Evaluation of Diagnostic Systems: Methods from Signal Detection Theory. New York: Academic Press, 253 pp.