William F. Ryan
Tiffany Wisniewski
Amy Huff
Department of Meteorology
The Pennsylvania State University
University Park, PA 16802
October 9, 2007
1. Daily public ozone (O3) forecasts for the Philadelphia non-attainment area, including southern New Jersey, northern Delaware and extreme northeastern Maryland, were issued for the period May 1, 2007 to September 8, 2007. Five Code Red and 16 Code Orange O3 cases occurred during 2007. This is slightly higher than the average for the 2003-2006 period (18 days ≥ Code Orange) but well below the average for the earlier 1994-2002 period (36 days ≥ Code Orange).
2. The summer of 2007 was slightly warmer and drier than average in the Philadelphia metropolitan area. The number of hot (≥ 90°F) days, generally conducive to high O3, was near average compared to the preceding decade. Locations south and southwest of Philadelphia, however, were significantly hotter and drier than normal.
3. The correct AQI forecast color code for O3 was issued in 68% of all cases. Of the missed forecasts, the majority (60%) were errors between Code Green and Code Yellow. This performance was below the 2003-2006 average correct-code percentage of 74%.
4. Forecasts were accurate for the high O3 cases (Code Orange or Code Red). Code Orange or Code Red was observed on 21 days, with 16 of these cases (76%) carrying forecasts of Code Orange and Air Quality Action Day warnings. All of the 10 most severe O3 days carried Air Quality Action Day warnings.
5. As in previous forecast seasons, Code Orange or higher forecasts (Ozone Action Days) were also accurate. In 2007, 23 Ozone Action Days were forecast, with 16 (70%) observing Code Orange or higher.
6. The median absolute error (MAE) of the numeric O3 forecasts was 7.0 parts per billion by volume (ppbv), or within 10.6% of observed O3. This was slightly worse than the average MAE of 6.2 ppbv for the previous four seasons. The forecasts showed a large increase in skill over standard benchmark measures, including a 39% skill improvement over persistence (the previous day's observations used as the current day's forecast).
7. As part of a cooperative agreement, the National Weather Service and the U.S. EPA have developed a coupled chemistry-weather model to predict O3. This model has been continuously updated and, in 2007, performed well for the Philadelphia area. Research undertaken during 2007 suggests that a combination of numerical model forecasts and stand-alone statistical models can provide increased skill in high O3 cases.
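The error measures used throughout this report (bias, mean and median absolute error, rms error) and the persistence skill score can be computed from paired daily forecast and observed peak O3 values. A minimal sketch in Python; the arrays below are illustrative values, not the 2007 Philadelphia series:

```python
import numpy as np

def forecast_errors(obs, fcst):
    """Bias, mean/median absolute error and rms error (all in ppbv)."""
    err = np.asarray(fcst, dtype=float) - np.asarray(obs, dtype=float)
    return {
        "bias": err.mean(),
        "mean_ae": np.abs(err).mean(),
        "median_ae": np.median(np.abs(err)),  # the report's "MAE"
        "rms": np.sqrt((err ** 2).mean()),
    }

def skill_vs_persistence(obs, fcst):
    """Percent improvement in mean squared error over a persistence
    forecast (the previous day's observation used as today's forecast)."""
    obs = np.asarray(obs, dtype=float)
    fcst = np.asarray(fcst, dtype=float)
    mse = ((fcst[1:] - obs[1:]) ** 2).mean()
    mse_ref = ((obs[:-1] - obs[1:]) ** 2).mean()  # persistence errors
    return (1.0 - mse / mse_ref) * 100.0

# Illustrative daily peak O3 values (ppbv), not the 2007 series
obs = [62, 75, 88, 70, 55, 91, 83]
fcst = [65, 78, 84, 72, 60, 87, 85]
stats = forecast_errors(obs, fcst)  # e.g. stats["median_ae"] -> 3.0
```

Because daily O3 swings strongly with day-to-day weather, persistence is a demanding benchmark, which is why the 39% improvement quoted above is a meaningful result.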
Daily ozone forecasts for the Philadelphia non-attainment area, including southern New Jersey, northern Delaware and extreme northeastern Maryland, were provided for the period May 1, 2007 to September 8, 2007. A time series of daily peak O3 and forecast O3 for this period (Figure 1) shows the usual large oscillations in daily O3 along with intermittent multi-day periods of persistent high and low concentrations. These multi-day events are driven by synoptic scale (2-5 day) weather patterns. Five Code Red and 16 Code Orange cases occurred during 2007 (Figure 2).
The summer of 2007 was slightly warmer and drier than average. At Philadelphia
International Airport (PHL), the summer season (June-August) average temperature was
0.6°F above normal and precipitation for the same period was 1.1” below normal.
Differences from normal in temperature and precipitation across the region are shown in
Figure 3 and Figure 4 . Extremely dry and hot weather prevailed over most of the region
to our south and southwest.
As noted in earlier reports, hot weather is necessary but not sufficient for high O3. In 2007, hot (≥ 90°F) days were frequent (23 days) but not significantly different from the longer term average frequency (1997-2006) of 25 days. Overall, for the period 1997-2006, 75% of all days with maximum temperature ≥ 90°F also reached the Code Orange (≥ 85 ppbv for an 8-hour average) threshold, although a much lower percentage (26%) reached Code Red. This time period straddles the onset of regional NOx emissions controls as part of the so-called "NOx SIP Call". These controls, primarily on large stationary power generation units, were phased in over the period 2003-2005. As seen in Table 1, there is evidence that these regional emissions reductions may influence the O3-temperature relationship, particularly in the high temperature extreme, with fewer hot cases also observing high O3. Theoretically, regional NOx controls serve to reduce the "background", or regional scale, O3 concentrations on which the urban emissions build. For the period 1997-2002, 83% of hot days also observed high O3, while for the 2003-2007 period the frequency fell to only 58%. A longer time series is necessary, however, to reach a firm conclusion on this matter.
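The Table 1 style tallies above (the fraction of hot days that also reach an O3 threshold) amount to a simple conditional frequency count. A sketch, using made-up daily records rather than the actual monitor data:

```python
# Conditional frequency tally in the style of Table 1: of the days with
# Tmax >= 90 F, what fraction also reached a given peak O3 threshold?
# The records below are made-up illustrative values, not monitor data.
records = [  # (Tmax in deg F, peak 8-hour O3 in ppbv)
    (94, 92), (91, 88), (95, 110), (90, 78), (88, 85),
    (92, 86), (97, 107), (90, 81), (93, 95), (91, 70),
]

def pct_hot_days_above(records, o3_threshold, tmax_threshold=90):
    """Percent of hot days whose peak O3 meets or exceeds the threshold."""
    hot = [o3 for tmax, o3 in records if tmax >= tmax_threshold]
    hits = sum(1 for o3 in hot if o3 >= o3_threshold)
    return 100.0 * hits / len(hot)
```

With these made-up records, about two-thirds of the hot days reach 85 ppbv and roughly a fifth reach 105 ppbv.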
Observed O3

Daily mean peak O3 for the summer of 2007 was 66.0 ppbv. This is above average for the most recent period (2003-2006) of 63.6 ppbv but is below the longer term average (1994-2002) of 69.9 ppbv (Figure 5). There were 21 days above the Code Orange standard. As is typically the case, nearly three-quarters (71%) of the high O3 cases were organized into multi-day episodes: May 25-27, May 30-June 1, June 26-27, August 2-4, August 7-8 and August 15-17. Analyses of these multi-day episodes are ongoing, and preliminary data and discussions can be found at: http://www.meteo.psu.edu/~wfryan/reports/2007-episodes.htm
The frequency of occurrence of the various color codes is shown in Figure 6. Conditions in 2007 were similar in most respects to the recent past (2003-2006), as can be seen by comparison to Figure 7. The most frequent color code was green (good air quality). While the total frequency of all cases above the Action Day threshold (Code Orange) was consistent with past years (~15%), there was a larger fraction of Code Red cases in 2007 (4% compared to 2%). Compared to earlier years (1994-2002), however, the number of Code Orange or higher cases was lower (23% in 1994-2002) and the number of Code Green cases higher (46% in 1994-2002) (Figure 8).
Color Code Forecasts
During the 2007 O3 season, forecasts were issued daily from May to September. The forecasts were issued as a color code to the public, with numerical (ppbv) forecasts prepared internally as a tool for quantitative determination of forecast skill. The correct color code was issued in 68% of all cases (Table 2). Of the missed code forecasts, the majority (60%) were errors between Code Green and Code Yellow. This performance was below the 2003-2006 correct-code percentage of 74%.
For the high O3 cases (Code Orange or Code Red), the forecasts were accurate. Code Orange or Code Red was observed on 21 days, with 16 of these cases (76%) carrying forecasts of Code Orange or higher. This represents a slight improvement over the 2003-2006 average of 71%. As has been the case in previous forecast seasons, Code Orange or higher forecasts (Ozone Action Days) were accurate. In 2007, 23 Ozone Action Days were forecast, with 16 days (70%) observing Code Orange or higher. However, this was less than the recent (2003-2006) average of 78%. Thus, while the bad O3 days were more likely to be identified, this came at the cost of a higher than normal number of "false alarms". This trade-off of more hits for more false alarms is typical for categorical forecasts, but further off-season analyses will be undertaken to determine if the false alarm forecasts followed any particular pattern.

In addition to an increased frequency of false alarms of Code Orange conditions, another important shortcoming of the 2007 color code forecasts was the inability to forecast Code Red conditions. All of the observed Code Red cases were covered with Air Quality Action Day warnings, but all at the Code Orange range. Again, further analysis will be undertaken to determine the causes of under-prediction in these cases.
ppbv Forecasts
For the 2007 season, the numerical forecasts had a slight bias toward over-prediction (+2.9 ppbv) (Table 3). The median absolute error (MAE) for the consensus (public) forecasts was 7.0 ppbv. This was consistent with the previous four-season average. Forecasts prior to 2003 were issued for peak 1-hour O3 and so are not directly comparable. The root mean square (rms) error, a measure of forecast consistency, was 11.2 ppbv, which is higher than the 2003-2006 average of 10.5 ppbv. The forecasts showed a large increase in skill over standard benchmark measures, including a 39% skill improvement over persistence (the previous day's observations used as the current day's forecast). Forecast results for 2007 were roughly consistent with preceding years (Table 4).

Statistical Model Performance
The forecasts issued to the public are based on statistical forecast models that have been in use for several years. The statistical models are strongly influenced by forecast temperature and cannot completely account for other meteorological effects (e.g., thunderstorms) that greatly affect O3 concentrations.
Currently two statistical models are in use in Philadelphia (R0302 and LN2001). Both models use multiple linear regression methods, with the key predictors typically including temperature, relative humidity, wind at the surface and aloft, along with previous day peak O3 and climatology. The R0302 model performs best overall and, in 2007, forecast the correct color code in 64% of all cases. As has been the case historically, this is slightly less than the public forecast skill. As with the public forecasts, statistical model skill in forecasting color codes was lower in 2007 than in earlier years (70% correct for 2003-2006).
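A minimal sketch of the kind of multiple linear regression model described above, fit by ordinary least squares; the predictor set mirrors the report's description, but the data and coefficients are synthetic and do not reproduce R0302 or LN2001:

```python
import numpy as np

# Sketch of a multiple linear regression O3 model in the spirit of
# R0302/LN2001. All data below are synthetic and illustrative.
rng = np.random.default_rng(0)
n = 200
tmax = rng.uniform(70, 100, n)      # forecast max temperature (deg F)
rh = rng.uniform(30, 90, n)         # relative humidity (%)
wind = rng.uniform(2, 15, n)        # surface wind speed (kt)
prev_o3 = rng.uniform(40, 100, n)   # previous-day peak O3 (ppbv)

# Synthetic "truth": O3 rises with temperature and persistence,
# falls with humidity and ventilation, plus noise.
o3 = (1.5 * tmax - 0.2 * rh - 1.0 * wind
      + 0.3 * prev_o3 - 60.0 + rng.normal(0.0, 5.0, n))

# Ordinary least squares fit: [intercept, tmax, rh, wind, prev_o3]
X = np.column_stack([np.ones(n), tmax, rh, wind, prev_o3])
coef, _, _, _ = np.linalg.lstsq(X, o3, rcond=None)

def predict(tmax_f, rh_f, wind_f, prev_o3_f):
    """Peak O3 (ppbv) from the fitted regression."""
    return float(coef @ np.array([1.0, tmax_f, rh_f, wind_f, prev_o3_f]))
```

The dominant temperature coefficient in a fit like this illustrates why such models over-predict in hot weather when other factors (clouds, thunderstorms, ventilation) hold O3 down.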
Quantitative statistics for the consensus (public) forecast and statistical guidance are given in Table 3. The statistical models have a higher bias toward over-prediction than the public forecasts. The mean absolute error and rms error were also higher for the statistical guidance, suggesting inconsistent skill. The bias in the statistical models has been growing over the past several years (Table 5), and this may also reflect changes in the O3-temperature relationship discussed above.
Forecast Performance in High O3 Cases

The most important measure of forecast skill is performance in the higher end of the O3 distribution. These cases, involving observations and forecasts of O3 in the Code Orange (Air Quality Action Day) range, have the greatest implications for pollution control strategies and public health. In Table 6, a summary of forecast performance in observed cases of Code Orange or higher O3 is given. The statistical guidance showed little bias, while the NOAA-EPA forecast model (discussed in more detail below) and the public forecast showed a low bias. For all other quantitative error measures (MAE, mean absolute error and rms error), the statistical model guidance performed as well as or better than the public forecast. This is an unusual outcome, as the public forecasts typically perform better than the statistical models. The NOAA-EPA model also performed well in terms of overall statistics but did a poorer job identifying the Code Orange cases.
The complementary measure of forecast skill in high O3 cases is performance in cases of forecast high O3. These results are shown in Table 7. While the statistical models performed well in observed high O3 cases, they did so at the cost of an increased number of false alarms, with a rate of success of 56% compared to 70% for the public forecast. The public forecast added value in 2007 primarily by reducing the number of possible false alarms.
A variety of forecast skill measures for observed and forecast Code Orange cases are given in Table 8. These scores measure the skill of the forecasts at accurately predicting cases above a threshold concentration. In this case, the threshold is 85 ppbv for an 8-hour maximum (Code Orange, the Air Quality Action Day threshold). The skill scores are described in more detail in Appendix A. Overall, the public forecast was less conservative than in previous years. That is, more false alarms were issued (FAR = 0.30 compared to 0.19 in 2006) but fewer Code Orange cases were missed (POD = 0.76 compared to 0.68 in 2006). The bias measure is > 1.00 (1.10), which suggests a tendency to over-predict at this threshold. Overall skill is best measured by the Critical Success Index (CSI) and Heidke Skill Score (HSS) measures. As Code Orange O3 is a relatively rare event, the most frequent category is the "correct null" forecast, that is, low O3 both forecast and observed. These forecasts are typically easy to make and not of critical interest. The CSI calculation removes the influence of the correct null forecast. For 2007, the public forecast CSI of 0.57 (CSI has a range of [0,1]) is good. The HSS compares the forecasts to a randomly generated set of forecasts that are constrained to have the same number of cases in each color code as observed. On its range of [-1,1], the HSS of 0.67 is a good result.
Numerical Forecast Guidance

As part of a cooperative agreement, the National Weather Service and the U.S. EPA have developed a coupled chemistry-weather model to predict O3 (see http://www.nws.noaa.gov/ost/air_quality/index.htm). This model was tested in 2003 and 2004 and became fully operational in 2005. The NOAA/EPA model (NAQFS) has shown reasonable skill overall and was able to predict some of the high O3 cases in 2005. In 2006, we undertook a preliminary assessment of the numerical model forecast performance in the Philadelphia area and found that it showed some skill for local forecasting. In 2007, we took a closer look at model performance in the Philadelphia area.
Selected forecast results for the Philadelphia area are shown in Table 9. For the overall forecasts, the NAQFS performs slightly better than the statistical forecasts. This is the first time we have seen this result. In the hot weather cases, the NAQFS also performs better than the statistical forecasts, which tend to over-predict O3 due to the strong influence of maximum temperature as a predictor. The NAQFS still has problems resolving the high O3 cases. For observed cases of Code Orange or higher O3, the NAQFS under-predicts strongly and, while the MAE is consistent with the statistical forecasts, the mean and rms error are higher. This suggests a lack of consistency in the forecast or, in other words, a tendency to have occasional very bad forecasts. As with the statistical models, the NAQFS tends to over-predict the frequency of Code Orange cases, but at a slightly lower rate. In terms of the key skill score measures (Table 8), the NAQFS still lags the statistical models by ~14%. However, the skill of the NAQFS appears to be increasing as the model is updated and refined, and we will continue to use its output as part of the forecast analysis process.
One forecast approach that showed promise in 2007 was a combination of statistical and numerical forecast guidance. This "hybrid" model simply takes a weighted average of the statistical models and the NAQFS numerical model output. For this analysis, each of the two statistical models was given a 25% weight and the NAQFS a 50% weight. Selected results for the "hybrid" model are given in Table 10. The result of most interest is that the MAE is the same for both observed and forecast high O3 cases. The MAE for high forecast O3 is also much lower than for any of the models applied alone. This result is shown graphically in Figure 9. Skill scores for the "hybrid" model in categorical forecasts of Code Orange or higher are shown in Table 11. Again, the "hybrid" model performs better on nearly all measures than its constituents and rivals the public forecast skill. This result is of interest to the forecasters and will be tested operationally during the 2008 season.
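The "hybrid" combination described above is a plain weighted mean of the three guidance values. A sketch, with illustrative input values:

```python
def hybrid_forecast(r0302, ln2001, naqfs, w_stat=0.25, w_naqfs=0.50):
    """Weighted-average "hybrid" guidance: 25% weight on each of the
    two statistical models and 50% on the NAQFS numerical model,
    as described in the report."""
    return w_stat * r0302 + w_stat * ln2001 + w_naqfs * naqfs

# Illustrative guidance values (ppbv), not actual 2007 forecasts
blend = hybrid_forecast(92.0, 96.0, 84.0)  # -> 89.0
```

Weighting the NAQFS at 50% lets the numerical model temper the statistical models' warm-day over-prediction while the statistical members retain the persistence and climatology information.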
Conclusions
Daily public ozone (O3) forecasts for the Philadelphia non-attainment area, including southern New Jersey, northern Delaware and extreme northeastern Maryland, were issued for the period May 1, 2007 to September 8, 2007. Five Code Red and 16 Code Orange O3 cases occurred during 2007. This is slightly higher than the average for the 2003-2006 period (18 days ≥ Code Orange) but well below the average for the earlier 1994-2002 period (36 days ≥ Code Orange).
The summer of 2007 was slightly warmer and drier than average in the Philadelphia metropolitan area. The number of hot (≥ 90°F) days, generally conducive to high O3, was near average compared to the preceding decade. Locations south and southwest of Philadelphia, however, were significantly hotter and drier than normal.
The correct AQI forecast color code for O3 was issued in 68% of all cases. Of the missed forecasts, the majority (60%) were errors between Code Green and Code Yellow. This performance was below the 2003-2006 average correct-code percentage of 74%. Forecasts were accurate for the high O3 cases (Code Orange or Code Red). Code Orange or Code Red was observed on 21 days, with 16 of these cases (76%) carrying forecasts of Code Orange (Air Quality Action Day). All of the 10 most severe O3 days carried Air Quality Action Day warnings. As in previous forecast seasons, Code Orange or higher forecasts (Ozone Action Days) were also accurate. In 2007, 23 Ozone Action Days were forecast, with 16 (70%) observing Code Orange or higher.
The median absolute error (MAE) of the numeric O3 forecasts was 7.0 parts per billion by volume (ppbv), or within 10.6% of observed O3. This was slightly worse than the average MAE of 6.2 ppbv for the previous four seasons. The forecasts showed a large increase in skill over standard benchmark measures, including a 39% skill improvement over persistence (the previous day's observations used as the current day's forecast).
As part of a cooperative agreement, the National Weather Service and the U.S. EPA have developed a coupled chemistry-weather model to predict O3. This model has been continuously updated and, in 2007, performed well for the Philadelphia area. Research undertaken during 2007 suggests that a combination of numerical model forecasts and stand-alone statistical models can provide increased skill in high O3 cases.
Philadelphia Forecast Area
Days ≥ 90°F and O3 Concentration Threshold

Period      Threshold     Number of Cases   Percent Above Threshold
1997-2002   ≥ 85 ppbv     134               83%
            ≥ 105 ppbv    57                35%
2003-2006   ≥ 85 ppbv     53                60%
            ≥ 105 ppbv    9                 10%
2007        ≥ 85 ppbv     12                52%
            ≥ 105 ppbv    5                 22%

Table 1. Frequency of 8-hour peak O3 above the Code Orange and Code Red thresholds in the Philadelphia metropolitan area during hot days.
Color Code Forecasts
Philadelphia Forecast - 2007

                       Observed
Forecast    Green   Yellow   Orange   Red
Green        49       6        0       0
Yellow       19      29        5       0
Orange        1       6       11       5
Red           0       0        0       0

Table 2. Color code forecasts and observations for the Philadelphia forecast area - 2007.
Summary Statistics
Forecasts and Statistical Guidance
Philadelphia Forecast - 2007
(all measures in ppbv)

             Obs    Forecast   R0302   LN2001   Persistence
Mean         66.0   68.9       72.4    73.0     65.9
Bias                +2.9       +6.4    +7.0     +0.3
Mean AE             8.7        10.4    10.9     14.5
Median AE           7.0        8.8     8.6      11.5
rms error           11.2       13.3    13.8     18.7

Table 3. Summary statistics for the public forecast, two statistical models (R0302 and LN2001), and the reference persistence forecast.
Forecast Statistics
Philadelphia Metropolitan Forecast Area (2003-2007)

                    2007    2006    2005    2004    2003
Bias (ppbv)         +2.9    +2.6    +1.6    +1.6    +0.4
Mean AE (ppbv)      8.7     6.9     8.8     7.0     7.3
Median AE (ppbv)    7.0     5.0     7.0     6.0     7.0
rms error (ppbv)    11.2    11.3    12.1    8.8     9.0
Days ≥ 90°F (JJA)   22      23      28      9       23
Days ≥ 85 ppbv      21      20      20      8       15

Table 4. Public forecast results for the 2003-2007 seasons.
Statistical Forecast Model Results
Philadelphia Metropolitan Forecast Area (2003-2007)

                    2007    2006    2005    2004    2003
Bias (ppbv)         +6.4    +3.8    +3.9    +1.7    +1.3
Mean AE (ppbv)      10.4    9.1     9.5     7.6     7.6
Median AE (ppbv)    8.8     7.4     7.0     6.0     6.4
rms error (ppbv)    13.3    11.6    13.0    10.1    9.7

Table 5. As in Table 4 but for statistical forecast model guidance.
Philadelphia Forecast Area - 2007
Observed High O3 Cases (n = 21)

                   Observed   Forecast   R0302   LN2001   NAQ    Persistence
Median (ppbv)      94.0       90.0       94.3    93.1     88.0   84.0
Mean (ppbv)        97.4       89.6       91.7    91.8     89.1   84.2
Bias (ppbv)                   -4.0       +0.3    -0.9     -8.3   -9.8
Median AE (ppbv)              9.0        8.0     4.6      6.0    14.0
Mean AE (ppbv)                10.1       9.0     8.0      10.1   18.3
rms error (ppbv)              12.9       12.3    11.4     13.6   22.6
Orange Fcst/Obs               76%        76%     81%      67%    38%

Table 6. Summary forecast statistics for public forecasts, statistical model guidance, numerical model guidance and the reference persistence forecast for cases in which observed peak O3 in the Philadelphia area reached or exceeded the Code Orange AQI threshold (85 ppbv for an 8-hour average). NAQ refers to the NOAA-EPA forecast model.
Philadelphia Forecast Area - 2007
Forecast High O3 Cases

                   Forecast   R0302   LN2001   NAQ    Persistence
Number of Days     23         28      30       27     21
Median (ppbv)      94.0       94.4    93.0     92.0   90.0
Mean (ppbv)        93.6       94.2    93.6     93.9   94.4
Bias (ppbv)        +0.6       +5.6    +5.0     +7.8   +14.1
Median AE (ppbv)   10.0       9.2     9.7      8.0    13.0
Mean AE (ppbv)     11.6       11.5    11.8     13.6   16.8
rms error (ppbv)   14.6       14.7    15.0     18.0   20.8
Correct Code       70%        57%     57%      52%    38%

Table 7. Summary forecast statistics for consensus and statistical model guidance for cases in which peak O3 in the Philadelphia area was forecast to reach or exceed the Code Orange AQI threshold (85 ppbv for an 8-hour average).
Skill Scores
Philadelphia Forecast Area - 2007

            Fcst    R0302   LN2001   Persistence   NAQ
POD         0.76    0.76    0.81     0.38          0.67
Miss        0.24    0.24    0.19     0.62          0.33
FAR         0.30    0.43    0.43     0.62          0.48
Accuracy    0.91    0.87    0.87     0.80          0.85
CNull       0.95    0.95    0.96     0.88          0.93
Bias        1.10    1.33    1.43     1.00          1.29
CSI         0.57    0.48    0.50     0.24          0.41
TSS         0.70    0.65    0.69     0.26          0.55
HSS         0.67    0.58    0.59     0.26          0.49

Table 8. Skill score measures for the consensus forecast, statistical guidance, reference (persistence) forecast and NOAA-EPA numerical model. All skill measures are detailed in Appendix A. POD = Probability of Detection; Miss = Miss Rate; FAR = False Alarm Ratio; CNull = Correct Null Forecast; CSI = Critical Success Index, or Threat Score; TSS = True Skill Statistic, also known as the Peirce score or Hanssen-Kuipers discriminant; HSS = Heidke Skill Score.
Forecast Skill of NOAA-EPA Forecast Model
Philadelphia Metropolitan Area - 2007
(error measures in ppbv)

                     N     Bias    Mean AE   Median AE   rms error
All Cases            131   +2.4    10.1      8.0         13.2
Observed ≥ 85 ppbv   21    -8.3    10.1      6.0         13.6
Forecast ≥ 85 ppbv   27    +7.8    13.6      8.0         18.0
Tmax ≥ 90°F          23    -0.5    12.0      9.0         14.3

Table 9. Forecast skill measures for the NOAA-EPA forecast model.
Forecast Skill of "Hybrid" Forecast Model
Philadelphia Metropolitan Area - 2007
(error measures in ppbv)

                     N     Bias    Mean AE   Median AE   rms error
All Cases            131   +4.6    9.1       6.2         11.8
Observed ≥ 85 ppbv   21    -7.0    8.6       5.6         11.4
Forecast ≥ 85 ppbv   25    +1.9    11.1      5.6         14.9
Tmax ≥ 90°F          23    +1.2    10.8      8.9         13.6

Table 10. Forecast skill measures of the "hybrid" model combining a weighted average of statistical forecast models and the NOAA-EPA operational numerical forecast model.
Skill Scores
Philadelphia Forecast Area - 2007

            Fcst    R0302   LN2001   Hybrid   NAQ
POD         0.76    0.76    0.81     0.81     0.67
Miss        0.24    0.24    0.19     0.19     0.33
FAR         0.30    0.43    0.43     0.32     0.48
Accuracy    0.91    0.87    0.87     0.91     0.85
CNull       0.95    0.95    0.96     0.96     0.93
Bias        1.10    1.33    1.43     1.19     1.29
CSI         0.57    0.48    0.50     0.59     0.41
TSS         0.70    0.65    0.69     0.74     0.55
HSS         0.67    0.58    0.59     0.68     0.49

Table 11. Skill score measures, as in Table 8, but now including results from the "hybrid" forecast model (Hybrid column).
[Figure 1 chart omitted: daily time series, May 1 - September 4, 2007, with observed and forecast peak O3 (ppbv).]

Figure 1. Time series of daily peak 8-hour O3 in the Philadelphia metropolitan area and forecast peak O3.
[Figure 2 chart omitted: annual counts, 1994-2007.]

Figure 2. Number of days in Philadelphia during the summer forecast season exceeding the Code Orange threshold (85 ppbv, 8-hour average). For the period 1994-2002, the average number of days above this threshold is 36; for the more recent period, 2003-2007, the average is 18.
Figure 3 . Mean temperature for June-August, 2007 expressed as a percentile compared to the 1971-2000 period. Figure courtesy of NOAA Climate Diagnostics Center.
Figure 4 . As in Figure 3 but for precipitation.
[Figure 5 chart omitted: annual mean peak daily O3 (ppbv), 1994-2007.]

Figure 5. Mean peak daily O3 for the Philadelphia forecast area. For the period 1994-2002, the average is 70.0 ppbv, and for the period 2003-2007 the average is 63.8 ppbv.
[Figure 6 chart omitted: Green 53%, Yellow 31%, Orange 12%, Red 4%.]

Figure 6. Frequency of occurrence of O3 concentrations in the Philadelphia forecast area by color code for 2007.
[Figure 7 chart omitted: Green 53%, Yellow 32%, Orange 13%, Red 2%.]

Figure 7. As in Figure 6, but for the period 2003-2006.
[Figure 8 chart omitted: Green 46%, Yellow 31%, Orange 17%, Red 6%.]

Figure 8. As in Figure 6, but for the period 1994-2002.
[Figure 9 chart omitted: median absolute error (ppbv) for all cases, observed ≥ 85 ppbv cases, and forecast ≥ 85 ppbv cases.]

Figure 9. Median absolute error forecast statistics for the "hybrid" forecast model, the NAQFS numerical model and one of the statistical models (R0302) for 2007.
Appendix A: Skill Score Measures

The determination of the skill of a threshold forecast begins with the creation of a contingency table of the form:

Contingency Table for Threshold Forecasts

                   Observed
Forecast      Yes        No
Yes            a          b
No             c          d

For example, if Code Red O3 concentrations are both observed and forecast ("hit"), then one unit is added to "a". A "false alarm", Code Red forecast but not observed, is added to "b".
Basic Skill Measures:
A basic set of skill measures are determined and then used as the basis for further analysis.
Bias (B) = (a + b) / (a + c) = 0.69

Bias determines whether the same fraction of events are both forecast and observed. If B = 1, then the forecast is unbiased. If B < 1 there is a tendency to under-predict and if B > 1 there is a tendency to over-predict. In this case B = 0.69, which points to a tendency to under-predict Code Red cases.

False Alarm Rate (F) = b / (b + d) = 0.02

This is a measure of the rate at which false alarms (high O3 forecast but not observed) occur. For OAD forecasts, this rate is extremely low.

Hit Rate (H) = a / (a + c) = 0.78

The hit rate is often called the "probability of detection".

Miss Rate = 1 - H = 0.22

Correct null forecasts:

Correct Null (CNull) = d / (c + d) = 0.95

Accuracy:

Accuracy (A) = (a + d) / (a + b + c + d) = 0.78
Other Measures:

Generalized skill scores (SSref) measure the improvement of forecasts over some given reference measure. Typically the reference is persistence (current conditions used as the forecast for tomorrow) or climatology (historical average conditions). Because O3 concentrations have a strong seasonal cycle and concentrations have decreased significantly from the late 1980's, climatology in this case is based on a ten-day running average, centered on the day of interest, from the period 1994-2004.

Skill Score (SSref) = (1 - MSE / MSEref) * 100% = 53%

The skill score is typically reported as a percent improvement over the reference forecast.
Additional measures of skill can be determined. The Heidke skill score (HSS) compares the proportion of correct forecasts to a no-skill random forecast. That is, each event is forecast randomly but is constrained in that the marginal totals (a + c) and (a + b) are equal to those in the original verification table.

HSS = 2(ad - bc) / [(a + c)(c + d) + (a + b)(b + d)] = 0.61

For this measure, the range is [-1,1] with a random forecast equal to zero. The Code Red forecasts show skill by this measure.
Another alternative is the critical success index (CSI), or the Gilbert Skill Score (GSS), also called the "threat" score.

CSI = a / (a + b + c) = H / (1 + B - H) = 0.47

For this measure, the range is [0,1]. Since the correct null forecast is excluded, this type of measure is effective for situations like tornado forecasting where the occurrence is difficult to determine due to observing bias, i.e., tornados may occur but not be observed. This can also be the case for air quality forecasting when the monitor network is less dense. Note, however, that the random forecast will have a non-zero skill.
The Peirce skill score (PSS), also known as the "true skill statistic", is a measure of skill obtained from the difference between the hit rate and the false alarm rate:

PSS = (ad - bc) / [(a + c)(b + d)] = H - F = 0.74

The range of this measure is [-1,1]. If the PSS is greater than zero, then the number of hits exceeds the false alarms and the forecast has some skill. Note, however, that if d is large, as it is in this case, the false alarm value (b) is relatively overwhelmed.
The advantage of the PSS is that determining the standard error is relatively easy.
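The measures defined in this appendix can be checked against Table 8: for the 2007 public forecast at the Code Orange threshold, the counts implied by Table 2 are a = 16 hits, b = 7 false alarms, c = 5 misses and d = 103 correct nulls. A sketch (note that the FAR reported in Table 8 matches the false alarm ratio b / (a + b), while the rate F = b / (b + d) defined above enters the TSS):

```python
def skill_scores(a, b, c, d):
    """Threshold-forecast skill measures from contingency counts:
    a = hits, b = false alarms, c = misses, d = correct nulls."""
    n = a + b + c + d
    pod = a / (a + c)               # hit rate / probability of detection
    f = b / (b + d)                 # false alarm rate (appendix F)
    return {
        "POD": pod,
        "Miss": c / (a + c),
        "FAR": b / (a + b),         # false alarm ratio, as in Table 8
        "Accuracy": (a + d) / n,
        "CNull": d / (c + d),
        "Bias": (a + b) / (a + c),
        "CSI": a / (a + b + c),
        "TSS": pod - f,             # Peirce skill score, H - F
        "HSS": 2 * (a * d - b * c)
               / ((a + c) * (c + d) + (a + b) * (b + d)),
    }

# 2007 public forecast at the Code Orange threshold (counts from Table 2)
scores = skill_scores(a=16, b=7, c=5, d=103)
```

Rounded to two decimals, these counts reproduce the Fcst column of Table 8 (POD 0.76, FAR 0.30, Bias 1.10, CSI 0.57, TSS 0.70, HSS 0.67).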
References

Stephenson, D. B., 2000: Use of the "odds ratio" for diagnosing forecast skill. Wea. Forecasting, 15, 221-232.

Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, 467 pp.