Appendix A. Skill of Threshold Forecasts

The determination of the skill of a threshold forecast begins with the creation of a contingency table of the form:

Contingency Table for Threshold Forecasts

| | Observed Yes | Observed No |
|---|---|---|
| Forecast Yes | a | b |
| Forecast No | c | d |

For example, if Code Orange O3 concentrations are both observed and forecast (“hit”), then one unit is added to “a”. A “false alarm”, Code Orange forecast but not observed, is added to “b”. A “miss”, Code Orange observed but not forecast, is added to “c”, and a case in which Code Orange is neither forecast nor observed is added to “d”. A number of statistical measures of forecast skill can be obtained from the contingency table above. For air quality forecasting, the Code Orange threshold is of most interest because forecasts above this threshold initiate Air Quality Action Day alerts. The full set of skill statistics for 2009 and other recent seasons is given below, with explanations of each measure following.

Philadelphia Forecast Area Skill Score Measures for O3 Code Orange Threshold (76 ppbv for 8-hour Average)

| | Bias | False Alarm | Hit | Miss | Correct Null | Accuracy | Heidke | CSI (Threat) | PSS |
|---|---|---|---|---|---|---|---|---|---|
| 2009 | 1.29 | 0.56 | 0.57 | 0.43 | 0.96 | 0.94 | 0.47 | 0.33 | 0.53 |
| 2007-2008 | 1.04 | 0.28 | 0.76 | 0.24 | 0.91 | 0.87 | 0.65 | 0.59 | 0.66 |
| 2003-2006 | 1.05 | 0.27 | 0.76 | 0.24 | 0.91 | 0.87 | 0.66 | 0.59 | 0.67 |

Table A1. Skill score measures of the public forecast for recent forecast seasons.

Philadelphia Forecast Area Skill Score Measures for O3 Code Orange Threshold (76 ppbv for 8-hour Average)

| | Bias | False Alarm | Hit | Miss | Correct Null | Accuracy | Heidke | CSI (Threat) | PSS |
|---|---|---|---|---|---|---|---|---|---|
| NOAA-EPA | 2.71 | 0.74 | 0.71 | 0.29 | 0.88 | 0.88 | 0.33 | 0.24 | 0.60 |
| Statistical (new) | 3.29 | 0.70 | 1.00 | 0.00 | 0.87 | 0.88 | 0.42 | 0.30 | 0.87 |
| Statistical (old) | 4.14 | 0.79 | 0.86 | 0.14 | 0.81 | 0.81 | 0.27 | 0.20 | 0.67 |

Table A2. Skill score measures for forecast guidance techniques in 2009. “NOAA-EPA” refers to the numerical forecast model made available by NOAA and EPA (http://www.weather.gov/aq). The new statistical model was trained on post-NOx SIP Rule data and the old statistical model on 1993-2001 data.
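The tallying step described at the start of this appendix can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `tally`, the sample concentrations, and the convention that a value at or above 76 ppbv counts as a Code Orange case are all hypothetical choices, not taken from the forecast program's own procedures.

```python
from collections import namedtuple

# Cells of the 2x2 table: a = hit, b = false alarm, c = miss, d = correct null.
Contingency = namedtuple("Contingency", "a b c d")


def tally(forecast_ppbv, observed_ppbv, threshold=76.0):
    """Count contingency-table cells for paired forecast/observed values.

    Assumption: a value >= threshold is treated as a Code Orange case.
    """
    a = b = c = d = 0
    for fcst, obs in zip(forecast_ppbv, observed_ppbv):
        forecast_yes = fcst >= threshold
        observed_yes = obs >= threshold
        if forecast_yes and observed_yes:
            a += 1  # hit
        elif forecast_yes:
            b += 1  # false alarm: forecast but not observed
        elif observed_yes:
            c += 1  # miss: observed but not forecast
        else:
            d += 1  # correct null
    return Contingency(a, b, c, d)


# Hypothetical 8-hour average O3 values (ppbv) for five days.
forecast = [80, 60, 78, 70, 50]
observed = [82, 55, 70, 77, 48]
print(tally(forecast, observed))  # Contingency(a=1, b=1, c=1, d=2)
```

Keeping the four counts in a named tuple makes it harder to confuse the false alarm and miss cells when the counts are passed on to the skill measures.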
Explanation of Basic Skill Score Measures:

Bias (B) = $\frac{a + b}{a + c}$   Range: [0, +∞)
Bias is the ratio of the number of events forecast to the number of events observed. If B = 1, then the forecast is unbiased. If B < 1 there is a tendency to under-predict and if B > 1 there is a tendency to over-predict.

False Alarm Rate (F) = $\frac{b}{a + b}$   Range: [0, 1]
This is the fraction of high O3 forecasts that were false alarms (high O3 forecast but not observed).

Hit Rate (H) = $\frac{a}{a + c}$   Range: [0, 1]
The hit rate is often called the “probability of detection” and measures how well observed high O3 cases are forecast.

Miss Rate = $1 - H$   Range: [0, 1]
The miss rate measures the fraction of observed high O3 cases that are not forecast.

Correct Null (CNull) = $\frac{d}{b + d}$   Range: [0, 1]
CNull measures the fraction of observed good or moderate air quality days that are forecast as good or moderate. Because Code Orange days are relatively infrequent, this measure is typically close to unity.

Accuracy (A) = $\frac{a + d}{a + b + c + d}$   Range: [0, 1]
Accuracy measures the fraction of all cases that are correctly forecast, whether above or below the threshold. As with CNull, the infrequency of Code Orange cases leads to very high values for Accuracy.

Other Measures:

Generalized skill scores (SSref) measure the improvement of forecasts over some given reference measure. Typically the reference is persistence (current conditions used as forecast for tomorrow) or climatology (historical average conditions).

MSE Skill Score (SSref) = $\left(1 - \frac{\mathrm{MSE}}{\mathrm{MSE}_{ref}}\right) \times 100\%$

The skill score is typically reported as a percent improvement over the reference forecast.

Additional measures of skill can be determined. The Heidke skill score (HSS) compares the proportion of correct forecasts to that of a no-skill random forecast. That is, each event is forecast randomly, subject to the constraint that the marginal totals (a + c) and (a + b) equal those in the original verification table.
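As a rough illustration of the measures defined so far, the following Python sketch computes the basic scores, the MSE skill score, and the Heidke score written as the proportion correct relative to the proportion expected from a random forecast with the same marginal totals. The function names and the example counts are hypothetical. The counts are not listed in this report; they were chosen so that the rounded scores agree with the 2009 row of Table A1. The cell labels follow the contingency table layout (b is forecast yes and observed no, c is forecast no and observed yes), and Correct Null is computed as d / (b + d), following its verbal definition as the fraction of observed non-exceedance days that were forecast as such.

```python
def basic_scores(a, b, c, d):
    """Basic threshold-forecast measures from a 2x2 contingency table.

    a = hits, b = false alarms, c = misses, d = correct nulls.
    No guard against empty margins (division by zero) is included.
    """
    n = a + b + c + d
    return {
        "bias": (a + b) / (a + c),      # events forecast / events observed
        "false_alarm": b / (a + b),     # fraction of "yes" forecasts not observed
        "hit": a / (a + c),             # fraction of observed events forecast
        "miss": c / (a + c),            # equals 1 - hit
        "correct_null": d / (b + d),    # fraction of observed non-events forecast
        "accuracy": (a + d) / n,        # fraction of all cases forecast correctly
    }


def heidke(a, b, c, d):
    """Heidke skill score as (correct - chance) / (1 - chance)."""
    n = a + b + c + d
    correct = (a + d) / n
    # Proportion correct expected from random forecasts that keep the
    # same marginal totals as the verification table.
    chance = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (correct - chance) / (1 - chance)


def mse_skill_score(mse, mse_ref):
    """Percent improvement in mean squared error over a reference forecast."""
    return (1 - mse / mse_ref) * 100


# Hypothetical counts (not from the report's data files).
scores = basic_scores(a=4, b=5, c=3, d=118)
print({name: round(value, 2) for name, value in scores.items()})
print(round(heidke(4, 5, 3, 118), 2))
```

Written this way, the Heidke score is algebraically equivalent to the usual closed form 2(ad - bc) / [(a + c)(c + d) + (a + b)(b + d)].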
HSS = $\frac{2(ad - bc)}{(a + c)(c + d) + (a + b)(b + d)}$   Range: [-1, 1]
A random forecast scores zero on this measure. The Code Red forecasts show skill by this measure.

Another alternative is the critical success index (CSI), also called the “threat” score. (The related Gilbert Skill Score (GSS) additionally corrects the CSI for hits expected by chance.)

CSI = $\frac{a}{a + b + c} = \frac{H}{1 + B - H}$   Range: [0, 1]
Since the correct null forecast is excluded, this type of measure is effective for situations like tornado forecasting where the occurrence is difficult to determine due to observing bias, i.e., tornadoes may occur but not be observed. This can also be the case for air quality forecasting when the monitor network is less dense. Note, however, that the random forecast will have a non-zero skill.

The Peirce skill score (PSS), also known as the “true skill statistic”, is the difference between the hit rate and the false alarm rate:

PSS = $\frac{ad - bc}{(a + c)(b + d)} = H - F$   Range: [-1, 1]
Here F is the false alarm rate taken relative to observed non-events, $F = \frac{b}{b + d}$, and not the ratio $\frac{b}{a + b}$ listed with the basic measures. If the PSS is greater than zero, then the hit rate exceeds the false alarm rate and the forecast has some skill. Note, however, that if d is large, as it is in this case, the false alarm count (b) is overwhelmed, F is small, and the PSS is close to the hit rate. The advantage of the PSS is that determining the standard error is relatively easy.

References

Stephenson, D. B., Use of the “odds ratio” for diagnosing forecast skill, Wea. Forecasting, 15, 221-232, 2000.

Wilks, D. S., Statistical Methods in the Atmospheric Sciences, Academic Press, 467 pp., 1995.