Appendix A. Skill of Threshold Forecasts
The determination of the skill of a threshold forecast begins with the creation of a
contingency table of the form:
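Contingency Table for Threshold Forecasts

                         Observed
                       Yes      No
   Forecast   Yes       a        b
              No        c        d

For example, if Code Orange O3 concentrations are both observed and forecast (“hit”), then one unit is added to “a”. A “false alarm”, Code Orange forecast but not observed, is added to “b”. A “miss”, Code Orange observed but not forecast, is added to “c”. A “correct null”, Code Orange neither forecast nor observed, is added to “d”.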
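The tallying step can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the procedure used for this report. It uses the convention a = hits, b = false alarms, c = misses and d = correct nulls, which is the convention the hit-rate and PSS formulas below are written in. The names `Contingency` and `tally` are hypothetical. A day is counted as a Code Orange event when its 8-hour value is at or above the threshold (assumed here to be inclusive, with a default of 76 ppbv).

```python
from typing import NamedTuple, Sequence


class Contingency(NamedTuple):
    """2x2 counts: a = hits, b = false alarms, c = misses, d = correct nulls."""
    a: int
    b: int
    c: int
    d: int


def tally(forecast: Sequence[float], observed: Sequence[float],
          threshold: float = 76.0) -> Contingency:
    """Tally paired daily forecast/observed 8-h O3 values (ppbv).

    Assumption (hypothetical): a day is an event when its value is >= threshold.
    """
    if len(forecast) != len(observed):
        raise ValueError("forecast and observed must have the same length")
    a = b = c = d = 0
    for f, o in zip(forecast, observed):
        f_yes, o_yes = f >= threshold, o >= threshold
        if f_yes and o_yes:
            a += 1  # hit
        elif f_yes:
            b += 1  # false alarm: forecast but not observed
        elif o_yes:
            c += 1  # miss: observed but not forecast
        else:
            d += 1  # correct null
    return Contingency(a, b, c, d)


# Five hypothetical days of forecast and observed 8-h maxima (ppbv).
print(tally([80, 78, 60, 70, 90], [82, 65, 79, 55, 88]))
# Contingency(a=2, b=1, c=1, d=1)
```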
A number of statistical measures of forecast skill can be obtained from the contingency table above. For air quality forecasting, the Code Orange threshold is of most interest because forecasts above this threshold initiate Air Quality Action Day alerts.
Skill statistics for 2009 and other recent seasons are given below, followed by an explanation of each measure.
Philadelphia Forecast Area
Skill Score Measures for O3 Code Orange Threshold
(76 ppbv for 8-hour Average)

Season              Bias   False Alarm   Hit    Miss   Correct Null   Accuracy   Heidke   CSI (Threat)   PSS
2009                1.29   0.56          0.57   0.43   0.96           0.94       0.47     0.33           0.53
2007-2008           1.04   0.28          0.76   0.24   0.91           0.87       0.65     0.59           0.66
2003-2006           1.05   0.27          0.76   0.24   0.91           0.87       0.66     0.59           0.67

Table A1. Skill score measures of the public forecast for recent forecast seasons.
Philadelphia Forecast Area
Skill Score Measures for O3 Code Orange Threshold
(76 ppbv for 8-hour Average)

Method              Bias   False Alarm   Hit    Miss   Correct Null   Accuracy   Heidke   CSI (Threat)   PSS
NOAA-EPA            2.71   0.74          0.71   0.29   0.88           0.88       0.33     0.24           0.60
Statistical (new)   3.29   0.70          1.00   0.00   0.87           0.88       0.42     0.30           0.87
Statistical (old)   4.14   0.79          0.86   0.14   0.81           0.81       0.27     0.20           0.67
Table A2. Skill score measures for forecast guidance techniques in 2009. “NOAA-EPA” refers to the numerical forecast model made available by NOAA and EPA (http://www.weather.gov/aq). The new statistical model was trained on post-NOx SIP Rule data, and the old statistical model on 1993-2001 data.
Explanation of Basic Skill Score Measures:
Bias (B) = $\frac{a+b}{a+c}$
Range: [0, +∞)
Bias compares the number of events forecast (a + b) with the number of events observed (a + c). If B = 1, the forecast is unbiased. If B < 1, there is a tendency to under-predict; if B > 1, there is a tendency to over-predict.
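The bias calculation can be sketched in Python. The function name `bias` and all counts are hypothetical illustrations, with a, b and c taken as hits, false alarms and misses. The second call shows that B = 1 only balances the number of forecasts against the number of observations. It does not mean the individual forecasts were correct.

```python
def bias(a: int, b: int, c: int) -> float:
    """Frequency bias B = (a + b) / (a + c): events forecast over events observed."""
    observed_events = a + c
    if observed_events == 0:
        raise ValueError("bias is undefined when no events were observed")
    return (a + b) / observed_events


# Hypothetical counts: 4 hits, 5 false alarms, 3 misses.
print(round(bias(4, 5, 3), 2))  # 1.29, a tendency to over-predict

# Hypothetical counts: 3 hits, 2 false alarms, 2 misses.
print(bias(3, 2, 2))            # 1.0: unbiased despite 2 false alarms and 2 misses
```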
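False Alarm Rate (F) = $\frac{b}{a+b}$
Range: [0, 1]
F is the fraction of high-O3 forecasts that turn out to be false alarms (high O3 forecast but not observed). In more recent literature this quantity is often called the false alarm ratio.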
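F can be computed as shown below. This is a minimal sketch; the function name and counts are hypothetical, and a and b are hits and false alarms. With H = a/(a + c) and B = (a + b)/(a + c), algebra gives F = 1 − H/B. The last line applies that identity to the published NOAA-EPA row of Table A2 as a consistency check.

```python
def false_alarm_rate(a: int, b: int) -> float:
    """F = b / (a + b): share of event forecasts that were false alarms."""
    forecasts = a + b
    if forecasts == 0:
        raise ValueError("F is undefined when no events were forecast")
    return b / forecasts


# Hypothetical counts: 4 hits and 5 false alarms.
print(round(false_alarm_rate(4, 5), 2))  # 0.56

# Identity F = 1 - H / B on the NOAA-EPA row of Table A2 (H = 0.71, B = 2.71).
print(round(1 - 0.71 / 2.71, 2))  # 0.74, matching the tabulated F
```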
Hit Rate (H) = $\frac{a}{a+c}$
Range: [0, 1]
The hit rate is often called the “probability of detection” and measures the fraction of observed high-O3 cases that were forecast.
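The hit rate is sketched below, with a as hits and c as misses. The function name and counts are hypothetical. The complementary miss rate, 1 − H, is printed alongside it.

```python
def hit_rate(a: int, c: int) -> float:
    """H = a / (a + c): probability of detection of observed high-O3 days."""
    observed_events = a + c
    if observed_events == 0:
        raise ValueError("H is undefined when no events were observed")
    return a / observed_events


h = hit_rate(4, 3)      # hypothetical: 4 hits, 3 misses
print(round(h, 2))      # 0.57
print(round(1 - h, 2))  # 0.43, the complementary miss rate 1 - H
```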
Miss Rate = 1 – H
Range: [0 ,1]
The miss rate measures the fraction of observed high O3 cases that are not forecast.
Correct Null (CNull) = $\frac{d}{b+d}$
Range: [0, 1]
CNull measures the fraction of observed good or moderate air quality days that are forecast as good or moderate. Because Code Orange days are relatively infrequent, this measure is typically close to unity.
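The sketch below shows why CNull stays near unity. The function name and counts are hypothetical, with b as false alarms and d as correct nulls among the observed non-event days. Because d dominates the denominator, doubling the false alarms moves CNull only slightly.

```python
def correct_null(b: int, d: int) -> float:
    """CNull = d / (b + d): share of observed non-event days forecast as non-events."""
    non_events = b + d
    if non_events == 0:
        raise ValueError("CNull is undefined when every day was an event")
    return d / non_events


# Hypothetical: 125 observed non-event days.
print(round(correct_null(5, 120), 2))   # 0.96 with 5 false alarms
print(round(correct_null(10, 115), 2))  # 0.92 with 10 false alarms
```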
Accuracy (A) = $\frac{a+d}{a+b+c+d}$
Range: [0, 1]
Accuracy measures the fraction of all cases that were correctly forecast, whether above or below the threshold. As with CNull, the infrequency of Code Orange cases leads to very high values of Accuracy.
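The effect of rare events on Accuracy can be seen in a short Python sketch. The function name and counts are hypothetical, with a, b, c and d as hits, false alarms, misses and correct nulls. The second call is a hypothetical forecaster who never issues a Code Orange forecast over the same 132 days. That forecaster scores slightly higher Accuracy while detecting no events at all.

```python
def accuracy(a: int, b: int, c: int, d: int) -> float:
    """Fraction of all days forecast correctly: (a + d) / (a + b + c + d)."""
    n = a + b + c + d
    if n == 0:
        raise ValueError("no cases to verify")
    return (a + d) / n


print(round(accuracy(4, 5, 3, 120), 2))  # 0.94
# Never forecasting Code Orange: the 4 hits become misses and
# the 5 false alarms become correct nulls.
print(round(accuracy(0, 0, 7, 125), 2))  # 0.95
```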
Other Measures:
Generalized skill scores (SSref) measure the improvement of forecasts over a given reference forecast. Typically the reference is persistence (current conditions used as the forecast for tomorrow) or climatology (historical average conditions).
Skill Score (SSref) = $\left(1 - \frac{\mathrm{MSE}}{\mathrm{MSE}_{\mathrm{ref}}}\right) \times 100\%$
Here MSE is the mean squared error of the forecasts and MSE_ref is that of the reference forecast. The skill score is typically reported as a percent improvement over the reference forecast.
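A minimal Python sketch of an MSE-based skill score against a persistence reference follows. The ozone series, the forecasts and the helper names `mse` and `skill_score` are all hypothetical. Persistence is taken to mean using the previous day's observed value as the forecast. A positive score means the forecast beats the reference, and a negative score means it does worse.

```python
from typing import Sequence


def mse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Mean squared error of paired predictions."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need equal-length, non-empty sequences")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


def skill_score(forecast: Sequence[float], reference: Sequence[float],
                observed: Sequence[float]) -> float:
    """SS_ref = (1 - MSE / MSE_ref) * 100%: percent improvement over the reference."""
    mse_ref = mse(reference, observed)
    if mse_ref == 0:
        raise ValueError("reference forecast is perfect; skill score undefined")
    return (1 - mse(forecast, observed) / mse_ref) * 100.0


# Hypothetical daily 8-h maximum O3 (ppbv) for five consecutive days.
observed = [60.0, 70.0, 80.0, 65.0, 90.0]
forecast = [72.0, 78.0, 68.0, 85.0]  # forecasts issued for days 2-5
persistence = observed[:-1]          # yesterday's value as today's forecast
print(round(skill_score(forecast, persistence, observed[1:]), 1))  # 96.0
```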
Additional measures of skill can be determined. The Heidke skill score (HSS) compares the proportion of correct forecasts with that of a no-skill random forecast. In the random forecast, each event is forecast randomly, but the forecasts are constrained so that the marginal totals (a + c) and (a + b) equal those in the original verification table.
HSS = $\frac{2(ad - bc)}{(a+c)(c+d) + (a+b)(b+d)}$
Range: [-1, 1]
A perfect forecast scores 1 on this measure, and a random forecast scores zero. The Code Orange forecasts in Tables A1 and A2 all show skill by this measure (HSS > 0).
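The two readings of the Heidke score agree, as the Python sketch below shows. The closed-form expression and the "correct forecasts beyond chance" form, (correct − chance)/(n − chance), give the same value. In the second form, chance is the number of correct forecasts expected from a random forecast with the same marginal totals. The function names and counts are hypothetical, with a, b, c and d as hits, false alarms, misses and correct nulls.

```python
def heidke(a: int, b: int, c: int, d: int) -> float:
    """HSS = 2(ad - bc) / [(a + c)(c + d) + (a + b)(b + d)]."""
    return 2 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))


def heidke_vs_chance(a: int, b: int, c: int, d: int) -> float:
    """The same score written as (correct - chance) / (n - chance)."""
    n = a + b + c + d
    # Correct forecasts expected from a random forecast with the same marginals.
    chance = ((a + b) * (a + c) + (c + d) * (b + d)) / n
    return (a + d - chance) / (n - chance)


print(round(heidke(4, 5, 3, 120), 2))            # 0.47
print(round(heidke_vs_chance(4, 5, 3, 120), 2))  # 0.47
```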
Another alternative is the critical success index (CSI), also called the “threat” score. It is distinct from the Gilbert Skill Score (GSS), or equitable threat score, which adjusts the CSI for hits expected by chance.
CSI = $\frac{a}{a+b+c} = \frac{H}{1 + B - H}$
Range: [0, 1]
Because correct nulls (d) are excluded, this type of measure is useful for situations such as tornado forecasting, where occurrence is difficult to determine because of observing bias (tornadoes may occur but not be observed). This can also be the case for air quality forecasting when the monitoring network is sparse. Note, however, that a random forecast will have a nonzero CSI.
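The Python sketch below computes the CSI and shows why a random forecast still scores above zero. The function names and counts are hypothetical, with a, b and c as hits, false alarms and misses. The chance value is computed from the hits a random forecast with the same marginal totals would be expected to score, a_r = (a + b)(a + c)/n. It is a ratio of expected counts, not a formal expectation of the CSI.

```python
def csi(a: int, b: int, c: int) -> float:
    """Critical success index a / (a + b + c); correct nulls (d) are ignored."""
    return a / (a + b + c)


def csi_by_chance(a: int, b: int, c: int, d: int) -> float:
    """CSI computed from the hits expected of a random forecast with the same marginals."""
    n = a + b + c + d
    hits_r = (a + b) * (a + c) / n                # random hits a_r
    return hits_r / ((a + b) + (a + c) - hits_r)  # a_r / (a_r + b_r + c_r)


print(round(csi(4, 5, 3), 2))                 # 0.33
print(round(csi_by_chance(4, 5, 3, 120), 2))  # 0.03, small but not zero
```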
The Peirce skill score (PSS), also known as the “true skill statistic”, is the difference between the hit rate and the fraction of observed non-events that were forecast as events, b/(b + d):
PSS = $\frac{ad - bc}{(a+c)(b+d)} = H - \frac{b}{b+d}$
Range: [-1, 1]
Note that b/(b + d) is not the false alarm rate F = b/(a + b) defined above. Because b/(b + d) = 1 − d/(b + d), the PSS can also be written as H + d/(b + d) − 1. If the PSS is greater than zero, the hit rate exceeds the rate of false detections and the forecast has some skill. Note, however, that when d is large, as it is in this case, b/(b + d) is small and the PSS is dominated by the hit rate. The advantage of the PSS is that its standard error is relatively easy to determine.
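The equivalent forms of the Peirce score can be checked in Python. The counts and the function name `peirce` are hypothetical, with a, b, c and d as hits, false alarms, misses and correct nulls. The last line checks the H + CNull − 1 form against the published 2009 row of Table A1.

```python
def peirce(a: int, b: int, c: int, d: int) -> float:
    """PSS = (ad - bc) / [(a + c)(b + d)]."""
    return (a * d - b * c) / ((a + c) * (b + d))


a, b, c, d = 4, 5, 3, 120        # hypothetical counts
hit = a / (a + c)                # hit rate H
false_detection = b / (b + d)    # b / (b + d), not F = b / (a + b)
print(round(peirce(a, b, c, d), 2))     # 0.53
print(round(hit - false_detection, 2))  # 0.53, the same value

# Table A1, 2009 row: H + CNull - 1 = 0.57 + 0.96 - 1
print(round(0.57 + 0.96 - 1, 2))  # 0.53, matching the published PSS
```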
References
Stephenson, D. B., Use of the “odds ratio” for diagnosing forecast skill, Wea. Forecasting, 15,
221-232, 2000.
Wilks, D. S., Statistical Methods in the Atmospheric Sciences, Academic Press, 467pp., 1995.