STRATEGIES FOR THE VERIFICATION OF ENSEMBLE FORECASTS
(Laurence J. Wilson)
Environment Canada
Abstract: The subject of ensemble forecast verification naturally divides itself into two parts:
verification of the ensemble distribution of deterministic forecasts and verification of probability
forecasts derived from the ensemble. The verification of the ensemble distribution presents unique
problems because it inevitably involves comparing the full ensemble of forecast values against a
single observation at the verifying time. A strategy to accomplish this kind of evaluation of the
ensemble distribution is described, along with other methods that have been used to evaluate aspects
of the ensemble distribution.
By comparison, evaluation of probability forecasts extracted from the ensemble is simpler; any of the
existing measures applied to probability forecasts can be used. Following the work of Murphy, the
attributes of probability forecasts are described, along with a set of verification measures which are
used to evaluate these attributes. Examples of the application of each technique to ensemble forecasts are
shown, and the interpretation of the verification output is discussed. The verification methods
described include the Brier and rank probability scores, skill scores based on these, reliability tables,
and the relative operating characteristic (ROC), which has been widely used in the past few years to
evaluate ensemble forecasts.
1. Introduction
The subject of evaluation of ensemble forecasts naturally divides into two parts, according to the use
of ensemble forecasts. Since the output of an ensemble forecast system is a distribution of weather
elements, valid at each time and place, it is necessary to consider those methods that apply to the
evaluation of the ensemble distribution itself. These are discussed in section 2. Then, since ensemble
forecasts are often used to estimate probabilities, it is also necessary to consider methods that are
used to evaluate probability forecasts. These are discussed in section 3.
2. Verification of the ensemble distribution
Until the advent of ensemble prediction systems, verification of forecasts from numerical weather
prediction models involved simply matching the model forecast in space and time with the
corresponding observation. With a single model run, there could usually be a one-to-one match
between forecast and observation, on which numerous quantitative verification measures could be
computed. An ensemble system produces a distribution of forecast values for each point in time and
space, but there is still only a single observation value. The challenge of ensemble verification is to
devise a quantitative method to compare the distribution against specific observation values.
As a starting point, one might consider what constitutes an “accurate” forecast distribution. What
characteristics should the forecast distribution possess in order to be considered of high quality as a
forecast? Two desirable characteristics of the ensemble distribution, “consistency” and “non-triviality”,
have been stated by Talagrand (1997). A forecast distribution is said to be consistent if, for each
possible probability distribution f, the a posteriori verifying observations are distributed according to f
in those circumstances when the system predicts the distribution f (Talagrand, 1997). In other words,
if one could compile a sufficiently large set of similar cases, where similar distributions had been
forecast by the ensemble system, then the distribution of observations for those cases should match
the ensemble distribution. In practice, it is nearly impossible to compile a large enough
sample of cases that are similar enough because of the large number of degrees of freedom in each
distribution, so it would be very difficult to verify consistency directly.
The second desirable characteristic of ensemble forecasts stated by Talagrand is “non-triviality”. An
ensemble forecast system is said to be non-trivial if it forecasts different distributions on different
occasions. This is similar in some ways to the concept of sharpness in a forecast system: the system
must give different forecasts for different times and locations. A system which always forecasts the
climatological value of the weather element, or the climatological distribution, would be a trivial
forecast system.
2.1 A probability-based score and skill score
Wilson et al. (1999) proposed a verification system for ensemble forecasts that attempts to evaluate
the ensemble distribution as a basis for estimating the observation value. The concept is illustrated in
Figure 1. In Figure 1, three hypothetical distributions are shown, a relatively sharp distribution that
might be associated with a short range ensemble forecast, a distribution with greater variance, as
might be predicted at medium range by an ensemble system, and a broader distribution which might
be the climatological distribution for that date and place. The verifying observation is indicated as -3
degrees C. If +/- 1 degree C is considered a sufficiently accurate forecast for temperature in this
case, one can determine the forecast probability within one degree of the observation using each of
the distributions (the shaded areas in the figure). Then this probability can be used directly as a
score for the forecast. The probability determined from the climatological distribution can represent
the score value for a climatological forecast, and can be used to build a skill score in the usual format:
$$\text{skill} = \frac{\text{score}_f - \text{score}_c}{1 - \text{score}_c} \qquad (1)$$

where score_f is the score value for the forecast and score_c is the score value for climatology. Figure 1
shows normal distribution curves. If the ensemble is small, better estimates of the probability
can be obtained by first fitting a distribution (Wilson et al. (1999) suggest normal for temperature,
gamma for precipitation and wind speed), then calculating probabilities from the distribution.
Experiments with data from the 51-member ECMWF ensemble system indicated that, for ensembles
of this size, it is not necessary to fit a distribution; use of the empirical ensemble distribution gave
similar results.
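
The probability score and skill score described above can be sketched in a few lines of code. The following is a minimal illustration (in Python, assuming NumPy and SciPy are available); the function names, the +/-1 degree window, and the sample data are illustrative choices, not part of Wilson et al. (1999).

```python
import numpy as np
from scipy import stats

def probability_score(members, obs, half_width=1.0, fit_normal=True):
    """Probability assigned by the ensemble to the window obs +/- half_width."""
    members = np.asarray(members, dtype=float)
    lo, hi = obs - half_width, obs + half_width
    if fit_normal:
        # Fit a normal distribution to the ensemble (suggested for temperature)
        mu, sigma = members.mean(), members.std(ddof=1)
        return stats.norm.cdf(hi, mu, sigma) - stats.norm.cdf(lo, mu, sigma)
    # Empirical alternative: the fraction of members inside the window
    return np.mean((members >= lo) & (members <= hi))

def skill_score(score_f, score_c):
    """Skill of the forecast score relative to the climatological score, eq. (1)."""
    return (score_f - score_c) / (1.0 - score_c)

# Illustrative numbers only: a 51-member ensemble and a climatological sample
rng = np.random.default_rng(0)
ens = rng.normal(loc=10.0, scale=1.5, size=51)
clim = rng.normal(loc=9.0, scale=4.0, size=1000)
obs = 11.0
sf = probability_score(ens, obs)
sc = probability_score(clim, obs)
print(sf, sc, skill_score(sf, sc))
```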
Figure 1. Schematic representation of probability scoring system as it might be applied to ensemble
temperature forecasts at a specific location. Example shows probabilities (hatched areas) for the
observed value +/-1 degree C.
This type of scoring system evaluates the probability distribution in the vicinity of the observation only;
the shape and spread of the distribution far from the observation value are not considered directly.
The score is sensitive to both the spread and the location accuracy of the ensemble with respect to
the observation: If the observation coincides with the ensemble mode, a relatively high score value
can be obtained, especially if the spread (variance) of the ensemble distribution is small. On the other
hand, if the forecast is missed in the sense that the observation does not lie near a mode of the
ensemble, then the probability score will be low. In such cases, greater ensemble spread,
corresponding to an indication of greater uncertainty can lead to higher scores than would be the case
for lower ensemble spread. A perfect score (probability=1.0) occurs when all the ensemble members
predict within the acceptable range; that is, the score is maximized when the ensemble predicts the
observed value with confidence.
Figure 2 shows an example of the scoring system applied to a specific ensemble forecast for a
specific station, Pearson International Airport, Toronto. The histogram shows the actual ensemble
distribution of temperature forecasts for this location, valid May 17, 1996. A normal distribution has
been fitted to the ensemble (crosses in the figure), and to the climatological temperature distribution
for the valid date (circles). The verifying temperature, indicated by an X on the abscissa, was 11
degrees C. For the short range forecast, (left side), the verifying temperature is close to the mean of
the distribution, giving a probability score of 0.27 of occurrence within 1 degree C of the observation.
The observed temperature was near normal; the corresponding score value for climatology is 0.19.
The skill score in this case, obtained from (1) is 0.11. The forecast has achieved positive skill by
forecasting a sharper distribution than the climatological distribution.
The right side of figure 2 shows the fitted distributions and scores for a 7-day ensemble forecast
verifying on the same day. This forecast is about as sharp as the shorter range one (i.e., the spread
of the fitted and empirical distributions is about the same as for the short range forecast), but the
ensemble has totally missed the observation. All the members forecast temperatures that are too low. As a
result, the score value is 0.0 in this case, and the climatological score is 0.19, same as before. The
skill score is negative, -0.23, because the forecast has missed in a situation which is climatologically
normal.
Figure 2. Verification of 72 h (left) and 168 h (right) ensemble 2 m temperature forecasts for Pearson
International Airport. The histogram represents the actual ensemble distribution, the fitted normal
distribution is represented by crosses, and the corresponding climatological distribution is represented
by circles. The computed score (sf), skill score (ss) and climate score (sc) are all shown in the
legend. The observed temperature is shown by the X on the abscissa.
For ensemble precipitation amount forecasts (QPF), it has been suggested that a gamma distribution
might be more appropriate than the normal distribution (Wilson et al., 1999). Figure 3 shows an
example of the probability score and skill score computed for a 72h ensemble forecast of precipitation
accumulation over 12h. A gamma distribution has been fit to both the ensemble forecast and the
climatological distribution of 12h precipitation amounts. The score and skill score were computed in
this case using a geometric window for a correct forecast. The lower boundary of the correct range
was set to 0.5 times the upper boundary, so that the forecast is considered correct if it lies between
(observation/sqrt (2)) and (observation * sqrt(2)). This takes account of the fact that small differences
in the predicted precipitation are more important for small amounts than for large amounts. Under
this scheme, for example, ranges such as (2.0, 4.0), (5.0, 10.0) and (20.0, 40.0) might all be used,
depending on the observed precipitation amount. A window factor of 2 seemed to be strict enough in
tests, but other factors could be used to determine the window for a correct forecast. Tests using a
smaller window factor of 1.5 indicated that the results are not strongly sensitive to the size of the
window in this range. It is nevertheless important to report the selected window size with the results
so that they can be interpreted.
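
As a sketch of the multiplicative window just described (window factor of 2, so the correct range runs from observation/sqrt(2) to observation*sqrt(2)), the following Python fragment uses a gamma fit to the ensemble amounts as one possible way of estimating the probability; the sample values are invented for illustration.

```python
import numpy as np
from scipy import stats

def geometric_window(obs, factor=2.0):
    """Correct-forecast range (obs/sqrt(factor), obs*sqrt(factor))."""
    return obs / np.sqrt(factor), obs * np.sqrt(factor)

def precip_probability_score(members, obs, factor=2.0):
    """Probability the ensemble assigns to the geometric window around obs."""
    members = np.asarray(members, dtype=float)
    lo, hi = geometric_window(obs, factor)
    # Fit a gamma distribution to the positive ensemble amounts
    shape, loc, scale = stats.gamma.fit(members[members > 0], floc=0.0)
    return (stats.gamma.cdf(hi, shape, loc=loc, scale=scale)
            - stats.gamma.cdf(lo, shape, loc=loc, scale=scale))

print(geometric_window(2.0))        # (1.414..., 2.828...), as in the example above
ens_qpf = np.array([0.0, 0.2, 0.4, 0.5, 0.8, 1.0, 1.1, 1.5, 1.8, 2.2])
print(precip_probability_score(ens_qpf, obs=2.0))
```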
In Fig. 3, the ensemble has indicated that some precipitation is likely, whereas the climatological
distribution favours little or no precipitation. 2.0 mm of precipitation was observed on this occasion,
and so the forecast shows positive skill of 0.36. The full resolution operational model (ECMWF) has
also predicted precipitation, a higher amount than all the members of the ensemble. The window for
a correct forecast is (1.414, 2.828), geometrically centered on the observed value of 2.0 mm. The
deterministic forecast from the full resolution model lies outside this window, and would have to be
assigned a score of 0.0. Thus the ensemble has provided a more accurate forecast than the full
resolution model in this case.
Figure 3. Ensemble distribution (bars), fitted gamma distribution (crosses) and corresponding climate
distribution (circles) for a 72 h quantitative precipitation forecast for Pearson International Airport.
Score value (sf), climate score (sc) and skill score (ss) are given in the upper right corner, along with
the deterministic forecast from the ECMWF T213 model and the verifying observation.
The examples shown so far have been for single ensemble forecasts at specific locations. Of course,
the score and skill score can be computed over a set of ensemble forecasts and used to evaluate
average performance over a period and for many locations. One such experiment used the
probability score to compare the performance of the ECMWF ensemble and the Canadian ensemble
on the same period in 1997. Table 1 summarizes the data used in the experiment.
Table 1. Sample used in comparison of the ECMWF 51-member and Canadian 9-member ensemble forecasts.

                      ECMWF Data                                    Canadian Data
Verification period   151 days, Jan to May, 1997                    148 days, Jan to May, 1997
Stations              23 Canadian stations                          23 Canadian stations
Parameters            2m temperature, 12h precipitation, 10m wind   2m temperature, 12h precipitation, 10m wind
Ensemble size         51 member ensembles                           9 member ensembles
Ensemble model        T106 model                                    T63 model
Projections           10 days (12h) from 12 UTC                     10 days (12h) from 00 UTC
Figure 4 shows an example from this verification, again for Pearson International Airport. Figure 4a,
for the Canadian ensemble shows that the skill remains positive to about day 6 with respect to
climatology, and furthermore is asymptotic to 0 skill. One would expect this skill score to be
asymptotic to 0 skill as the ensemble spread approaches the spread of the climatological distribution,
and the conditioning impact of the initial state is lost It is also an indication that the model’s
climatology approaches the observed climatology. (i.e., the model’s temperature forecasts are
unbiased) Score values and skill is higher for forecasts verifying at 12 UTC, which suggests the
Canadian ensemble system forecasts early morning temperatures near the minimum temperature
time more accurately than early evening temperatures. Figure 4b, for the 51-member ECMWF
ensemble, shows positive skill with respect to climatology throughout the 10-day run, though the skill
is near 0 by day 7 of the forecast. The ECMWF model also exhibits a diurnal variation in the accuracy
and skill, but in the opposite sense. That is, forecasts in the evening are more accurate than in the
early morning. This might be expected because the ECMWF model is tuned for the European area,
which has a maritime climate. Toronto’s climate is more continental in nature, with stronger nighttime
cooling than might be experienced in a maritime climate.
Figure 4c shows verification results using the score and skill score for 60-member combined ECMWF
and Canadian ensembles. Since the scores for the separate ensembles are similar in magnitude,
combining the ensembles does not have a large effect overall on the scores. The combined result
seems to be a weighted average of the two individual results. The diurnal effect has been mostly
eliminated.
Figure 4. Verification of Canadian (a), ECMWF (b) and combined (c) ensemble 2m temperature
forecasts for Pearson International Airport for January to May, 1997.
Using the score over a set of precipitation amount forecasts (12h accumulation) indicated somewhat
lower skill for precipitation forecasting than for temperature forecasting. Figure 5 shows three
examples of score and skill score values averaged over the five month test period, for three Canadian
stations. The verification was carried out using a window factor of 2.0. At St. John’s Newfoundland,
on Canada’s east coast, the skill was slightly positive until about day 2 of the forecast, then slightly
negative. Once again, the score values seemed to be asymptotic for longer projections, but tending
towards a negative value rather than 0. For Toronto and Winnipeg, the skill was never positive at any
projection, and the Winnipeg results are poorer than the Toronto results. Both show a strong diurnal
variation, which can be attributed to differences in the observed frequency of precipitation in the 00 UTC
and 12 UTC verifying samples. The performance differences among these stations are most likely
related to the differences in the climatology of precipitation occurrence. At St. John’s, the frequency
of occurrence of precipitation is relatively high; this station has a maritime climate. Toronto and
especially Winnipeg have more continental climates with generally lower frequencies of precipitation
occurrence. In terms of the ensemble forecast, the negative skill is likely caused by too many
ensemble members forecasting small amounts of precipitation in situations when none occurs. These
are the situations when the climatological distribution, which favours the non-occurrence of
precipitation, will have higher accuracy. In other words, it is possible that both models are biased
towards forecasting too much precipitation or forecasting a little precipitation too often, since the skill
score is asymptotic to negative values. To check this, it would be worthwhile to compare the
climatological distribution with the predicted distribution compiled from all the ensemble forecasts.
The above results show that the score and the skill score are quite sensitive to differences in
performance of the ensemble. It is also relatively easy to interpret the score values, even for a single
forecast. The score applies to any variable for which observations exist. Figures 6 to 8 illustrate its
diagnostic use for verification of 500 mb height forecasts. For this experiment, the score was
calculated at every model gridpoint, using the analysis to give the verifying heights. After some
experimentation with the window width, 4 dm was chosen as the best compromise between
smoothness of the result and strictness of the verification. At 2 dm, every small scale deviation in the
score values was visible with the result that the spatial distribution of the score was too noisy, while at
6 dm, forecasts tended to be “perfect” well into the forecast period, and spatial variations in the score
did not show up at all in the shorter range forecasts.
Figure 5. Verification of 60 member combined Canadian-ECMWF 12h quantitative precipitation
forecasts as a function of projection time, for three Canadian stations, over the period January to May,
1997. Both score values (top) and skill score values (bottom) are shown.
Figure 6. Probability score values for a 36 h 500 mb height ensemble forecast, using the ECMWF ensemble.
Figure 6 shows an analysis of the score values for a 36 h 500 mb height forecast over North America
and adjacent oceans, using the ECMWF ensemble and the Canadian analysis. There are two main
features of note in these results. First, there are two large areas, one over the eastern Pacific and the
other over the Great Lakes area where the score values drop from 100 to low values. Second, there
are three small areas in the tropics and a large area in the Arctic where the score values drop
suddenly to 0. These results can be explained in comparison to the verifying analysis (Fig. 7) and the
analysis of the ensemble standard deviation (Fig. 8). Figure 7 shows two 500 mb troughs which
correspond closely in position to the large areas of lower score values in the mid-latitudes. Evidently,
the ensemble was unsure of the location or shape of these features, and this increased uncertainty
results in lower scores in these locations. This is supported by Fig. 8, which shows higher ensemble
spread in the vicinity of the two troughs. On the other hand, there is no indication in Figure 8 of any
increased ensemble spread in the vicinity of the other areas of low score values in the tropics and the
arctic. Thus, it is evident that these are areas where the ensemble forecast has “missed”; all or nearly
all the members lie outside the window for a correct forecast.
Figure 7. Verifying analysis for the case of Figure 6. 500 mb heights are in dm.
Figure 8. Standard deviation of the ensemble forecast for the case shown in Figure 6, in dm.
2.2 Rank Histogram
The rank histogram (sometimes called the Talagrand diagram) is a way of comparing the ensemble
distribution with the observed distribution over a set of cases, preferably a large set of cases. To
construct the histogram, the ensemble forecast values of the element being assessed are first ranked
in increasing order. For an ensemble of N members, this defines N+1 intervals where the first and
last are open-ended and the rest are the intervals between successive pairs of ensemble members.
For each ensemble forecast, the interval containing the observed value is determined and tallied. The
number of occurrences of the observation within each interval over the whole sample is plotted as a
histogram.
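
The construction just described is straightforward to code. The sketch below (a hypothetical helper, not from the paper) assumes arrays of shape (cases, members) for the ensemble and (cases,) for the observations; ties between members and the observation are ignored.

```python
import numpy as np

def rank_histogram(ensemble, observations):
    """Count how often the observation falls in each of the N+1 rank intervals."""
    ensemble = np.asarray(ensemble, dtype=float)
    observations = np.asarray(observations, dtype=float)
    n_members = ensemble.shape[1]
    # Rank of the observation among the members: 0 if below all, N if above all
    ranks = np.sum(ensemble < observations[:, None], axis=1)
    return np.bincount(ranks, minlength=n_members + 1)

rng = np.random.default_rng(0)
ens = rng.normal(size=(500, 10))         # 500 cases, 10-member ensemble
obs = rng.normal(scale=1.5, size=500)    # observations with larger spread
print(rank_histogram(ens, obs))          # U-shape expected: ensemble spread too small
```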
If it is assumed that each of the N+1 intervals is equally likely to contain the observation, then the rank
histogram would be expected to show a uniform distribution across the N+1 categories, and this would
indicate that the spread of the ensemble distribution is on average equal to the spread of the
distribution of the observations in the verifying sample. The rank histogram of Fig. 9 shows a U-shaped distribution, with relatively higher frequencies of occurrence of the observation in the extreme
categories. That means that the observation lies outside the whole ensemble more frequently than
would be expected under the assumption that all intervals are equally likely. This is usually
interpreted to mean that the ensemble spread is too small; that the ensemble does not cover the
whole range of possible outcomes often enough.
Figure 9. An example of a rank histogram, for the ECMWF 50 member ensemble system, 6 day
forecast of 850 mb temperature.
It should be noted that the rank histogram does not constitute a true verification system. It would be
possible to generate a perfect (flat) rank histogram simply by randomly selecting the verifying interval,
without regard for the observed value of the variable.
2.3 Verification of individual members of the ensemble
As discussed above, the major problem of verification of the ensemble is related to the matching of an
ensemble of forecast values with a single observation. To get around this issue, it has always been
tempting to reduce the information in the ensemble to a single value to match to the verifying
observation, for example, by using the ensemble mean. Once the problem is reduced to matching of
single values, then any of the standard verification measures that are used to verify single
deterministic forecasts can be used, such as mean absolute error, root mean square error, anomaly
correlation etc. However, comparing an ensemble mean, which is a statistic of the ensemble, with a
single observation is not appropriate, for two main reasons:
- The ensemble mean has different statistical characteristics than the verifying observation, for
  example, lower associated variance. Because of this, it often verifies well, especially when
  quadratic scoring rules are used, which is misleading.
- The ensemble mean is not a trajectory of the model. That is, it is not necessarily true that the
  ensemble mean field at 24 h represents a physically possible evolution of the atmosphere from
  the ensemble mean field at 12 h.
Aside from the ensemble mean, there are other, more legitimate strategies for comparing single
ensemble members with the observation. One might, for example, wish to compare the unperturbed
control forecast, generated using the ensemble model, with the output from the full resolution model.
The control forecast may begin as the center of the ensemble distribution (when perturbations are
added and subtracted to form the ensemble), but it does not remain so through the integration.
Verification of the control forecast and comparison with the full resolution model may give information
on how the ensemble can be expected to perform with respect to the full resolution model. The
control forecast may also be used instead of the ensemble mean as the reference for computation of
statistics of the ensemble such as standard deviation or variance. The accuracy of the control
forecast is also used in evaluations of the relationship between ensemble spread and forecast
accuracy. It is important to know whether it is the ensemble mean or control that is used in order to
interpret the results of such studies, because the spread about the ensemble mean is always less
than or equal to the spread about the control forecast.
Other individual ensemble members that are sometimes looked at in verification are the “best” and
“worst” members. Best and worst can be defined in terms of any verification score, and they can be
allowed to change from one projection time to the next. This kind of verification is useful to assess
the extent to which the ensemble forecast distribution “covered” the actual outcome, and represents a
fair way of comparing ensemble forecasts based on different sizes of ensemble. It is also useful in
retrospective case studies, to identify the nature of perturbations which either improved or degraded
the forecast. Since the “best” and “worst” members are not known a priori, this kind of verification
does not give information which can be used as feedback to the interpretation of future ensemble
forecasts, but rather is useful as a diagnostic tool.
3. Verification of probability forecasts from the ensemble
One of the principal uses of ensemble forecasts is to give probability forecasts of weather elements.
These can range from simple probability of precipitation (POP) to probabilities of precipitation
categories, probabilities of extreme temperatures or probability of strong winds. Nearly always, the
probability is estimated by first defining a category by setting a threshold, then counting the
percentage of the ensemble forecasts which meet the criterion defined by the threshold. For
instance, a probability forecast of surface temperature anomaly greater than 4 degrees C would be
made by determining the percentage of ensemble members for which the temperature forecast is
greater than 4 degrees above normal. Probability estimation in this way is a form of post-processing
of the ensemble; the information provided by the ensemble distribution has been sampled by
integrating that distribution with respect to a particular value defined by the threshold. Therefore,
verification of probability forecasts from the ensemble is not equivalent to verification of the ensemble
distribution; the distribution is evaluated only in terms of the threshold used for the probability
forecast. If the distribution is sampled more completely by defining several thresholds and several
categories, and if these probabilities are verified by methods which are sensitive to the distribution of
probabilities over the range of categories, then this is closer to an evaluation of the distribution, but it
still remains incomplete with respect to the full distribution. In all cases, the estimation of probabilities
from the ensemble distribution means evaluating the distribution in terms of those estimates only.
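
The threshold-counting step itself is simple. A minimal sketch follows (illustrative names and numbers, using the 4-degree anomaly example from the text):

```python
import numpy as np

def exceedance_probability(members, threshold):
    """Fraction of ensemble members exceeding the threshold."""
    return np.mean(np.asarray(members, dtype=float) > threshold)

# Hypothetical 10-member temperature anomaly forecast (degrees C)
anomalies = np.array([1.2, 3.8, 4.5, 5.1, 2.9, 4.2, 6.0, 3.1, 4.8, 0.5])
print(exceedance_probability(anomalies, threshold=4.0))   # -> 0.5
```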
Probability forecasts from the ensemble can be evaluated using the same methods as are used for
any other probability forecasts. There is quite a large body of literature on this subject, much of it due
to Allan Murphy. The following discussion is therefore relatively brief; further details on verification of
probability forecasts can be found in summary documents such as Stanski et al (1989) and Wilks
(1995).
3.1 Attributes of probability forecasts
Murphy and Winkler (1987) present a framework for verification and point out that all verification
information is contained in the joint distribution of forecasts and observations. In practical terms, all
the verification information one might need is contained in the complete set of paired forecasts and
corresponding observations which forms the verification sample. The framework provides for two
types of factorizations of the joint distribution, or stratifications of the full verification sample. First, it
is possible to stratify by forecast value, that is, to define subsets of the verification sample that share a
certain set of forecast values. For example, one may speak of the conditional distribution of
observations given that the forecast probability value is between 0 and 10%, where the “condition” is
imposed on the forecast value. Second, it is possible to stratify by observation. Since observations
are usually binary (the event occurred or it didn’t), there are only two conditions which can be placed
on the observation. For example, one may speak of the conditional distribution of forecasts given that
the event occurred. Here, the condition is imposed on the observations, that is, the dataset has been
stratified into two subsets according to whether the event occurred or not.
Murphy (1993) defines a set of attributes of probability forecasts that one may wish to assess using
the verification tools available. These are listed in Table 2, along with their definition and some of the
verification measures that are appropriate for each. Some of these attributes do not imply stratification of the
verification dataset, while others refer to stratification or conditioning by forecast or by observation.
Table 2. Attributes of forecasts (after Murphy, 1993)

1. Bias: Correspondence between mean forecast and mean observation. Related measures: bias (mean forecast probability minus sample observed frequency).
2. Association: Strength of the linear relationship between pairs of forecasts and observations. Related measures: covariance, correlation.
3. Accuracy: Average correspondence between individual pairs of observations and forecasts. Related measures: mean absolute error (MAE), mean squared error (MSE), root mean squared error, Brier score (BS).
4. Skill: Accuracy of the forecasts relative to the accuracy of forecasts produced by a standard method. Related measures: Brier skill score, others in the usual format.
5. Reliability: Correspondence of the conditional mean observation and the conditioning forecasts, averaged over all forecasts. Related measures: reliability component of the BS; MAE, MSE of binned data from the reliability table.
6. Resolution: Difference between the conditional mean observation and the unconditional mean observation, averaged over all forecasts. Related measures: resolution component of the BS.
7. Sharpness: Variability of the forecasts as described by the distribution of forecasts. Related measures: variance of the forecasts.
8. Discrimination: Difference between the conditional mean forecast and the unconditional mean forecast, averaged over all observations. Related measures: area under the ROC, measures of separation of the conditional distributions; MAE, MSE of the scatter plot binned by observation value.
9. Uncertainty: Variability of the observations as described by the distribution of observations. Related measures: variance of the observations.
Probability verification methods that are often used to assess these attributes are surveyed in the
following sections.
3.2 Scores and skill scores
The Brier score is most commonly used for assessing the accuracy of binary (two-category)
probability forecasts. It is in fact misleading to use the Brier score for assessing multi-category
probability forecasts; the Rank Probability Score (RPS; see below) is preferred for this purpose. The
Brier score is defined as:

$$PS = \frac{1}{N}\sum_{ij}\left(F_{ij}-O_{ij}\right)^2$$
where the Fij are the forecast probabilities, the observations Oij are binary (0 or 1), and N is the verification sample size. The Brier score
has a range from 0 to 1 and is negatively-oriented. Lower scores represent higher accuracy. Using a
little algebra, the Brier score can be partitioned into three components:

$$PS = \frac{1}{N}\sum_{k=1}^{K} n_k\left(\bar{F}_k-\bar{O}_k\right)^2 \;-\; \frac{1}{N}\sum_{k=1}^{K} n_k\left(\bar{O}_k-\bar{O}\right)^2 \;+\; \frac{1}{N}\sum_{k=1}^{K}\sum_{ij}\left(O_{kij}-\bar{O}\right)^2$$
As represented here, the sample has been stratified into K bins according to the probability forecast
value; the overbars refer to averages over the bins if there is a subscript k and over the whole sample if
there is no subscript. The sample size in bin k is given by nk. The three terms of the partitioned Brier
score are, from left to right, the reliability, the resolution and the uncertainty, all as defined in Table 2.
The first two of these terms depend on the forecast through the binning of the data, but the third term
depends only on the observations, which means that the forecaster has no control over the value of
this term. It is therefore not advisable to compare Brier scores that have been computed on different
samples, because the uncertainty term will vary among samples with different climatological
frequencies of occurrence of the event, and cause variations in the Brier score which have nothing to
do with the accuracy of the forecast.
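
A minimal sketch of the Brier score and its partition is given below, assuming the forecast probabilities are binned into K = 10 equal-width bins; with continuous probabilities the three binned terms reproduce the total score only approximately. Names and data are illustrative.

```python
import numpy as np

def brier_decomposition(p, o, n_bins=10):
    """Return (brier, reliability, resolution, uncertainty) for forecast
    probabilities p and binary observations o."""
    p, o = np.asarray(p, dtype=float), np.asarray(o, dtype=float)
    n = p.size
    brier = np.mean((p - o) ** 2)
    o_bar = o.mean()
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)   # bin index 0..K-1
    rel = 0.0
    res = 0.0
    for k in range(n_bins):
        in_bin = bins == k
        n_k = in_bin.sum()
        if n_k == 0:
            continue
        f_k, o_k = p[in_bin].mean(), o[in_bin].mean()
        rel += n_k * (f_k - o_k) ** 2
        res += n_k * (o_k - o_bar) ** 2
    unc = o_bar * (1.0 - o_bar)
    return brier, rel / n, res / n, unc

rng = np.random.default_rng(0)
p = rng.uniform(size=2000)
o = (rng.uniform(size=2000) < p).astype(float)   # reliable by construction
bs, rel, res, unc = brier_decomposition(p, o)
print(bs, rel, res, unc)       # bs is approximately rel - res + unc
```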
The Brier Skill Score is in the usual skill score format, and may be defined by:

$$BSS = \frac{PS_C - PS_F}{PS_C}\times 100 = \left(1-\frac{\sum_{ij}\left(F_{ij}-O_{ij}\right)^2}{\sum_{ij}\left(C_{ij}-O_{ij}\right)^2}\right)\times 100$$
where the C refers to climatology and F refers to the forecast. It is not necessary to use climatology;
any standard forecast can be used in the formulation of the skill score. More recently, skill scores in
this format have been used with one forecast system as a standard, in order to determine the percent
improvement of a competing forecast system. Probabilities from the ensemble could be compared in
this way to probability forecasts from statistically interpreted NWP model output, for example.
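
A sketch of the Brier skill score in this form follows, assuming the reference is the sample climatological frequency issued on every occasion (any standard forecast could be substituted); names and data are illustrative.

```python
import numpy as np

def brier_score(p, o):
    p, o = np.asarray(p, dtype=float), np.asarray(o, dtype=float)
    return np.mean((p - o) ** 2)

def brier_skill_score(p_forecast, o, p_reference=None):
    """Percent improvement of the forecast Brier score over a reference."""
    o = np.asarray(o, dtype=float)
    if p_reference is None:
        # Default reference: the sample climatological frequency every time
        p_reference = np.full_like(o, o.mean())
    ps_f = brier_score(p_forecast, o)
    ps_c = brier_score(p_reference, o)
    return 100.0 * (ps_c - ps_f) / ps_c

rng = np.random.default_rng(1)
p = rng.uniform(size=1000)
o = (rng.uniform(size=1000) < p).astype(float)
print(brier_skill_score(p, o))       # positive: better than climatology
```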
Figure 10 shows an example of the use of the Brier and Brier skill scores to evaluate ensemble
geopotential height forecasts (from Buizza and Palmer, 1998). In this case, the event was defined as
“500 mb geopotential height anomaly greater than 50 m” over Europe. The scores were calculated to
show the impact of increasing ensemble size on the accuracy and skill of the forecast. Brier skill
drops to 0 by day 7 for some of the tests and by day 8 for the others.
Figure 10. Brier score (top) and Brier skill score (bottom) over Europe for a 45-day period for different
ensemble sizes up to 32 members. (After Buizza and Palmer, 1998)
The Rank Probability Score (RPS) is intended for verification of probability forecasts of multi-category
events. In the computation of this score, the categories are ranked, and the contributions to the score
values are weighted so that proximity of the verifying category to the bulk of the probability weight is
given credit. For example, in a four category problem, if category 3 is observed, a forecast of 90%
probability of occurrence of category 2 will score higher than a forecast which assigns 90% probability
to category 1, even though both forecasts are incorrect.
The RPS for a single forecast of K mutually exclusive and exhaustive categories is defined as follows:

$$RPS = \frac{1}{K-1}\sum_{i=1}^{K}\left[\left(\sum_{n=1}^{i}P_n\right)-\left(\sum_{n=1}^{i}d_n\right)\right]^2$$
where Pn is the probability assigned to category n and dn is 1 or 0 according to whether category n
was observed or not. Like the Brier score, the RPS has a negative orientation and a range of 0 to 1.
The ranked probability skill score (RPSS) is the corresponding skill score, and is the same as the BSS
except it is based on the RPS,
$$RPSS = \frac{RPS_S - RPS_F}{RPS_S}$$

where the subscript S refers to the standard forecast (e.g., climatology) and F to the forecast being evaluated.
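
A sketch of the RPS and RPSS for a single forecast is shown below, following the cumulative form of the equation above; the four-category example reproduces the behaviour described in the text, where a near miss scores better than a distant miss. Names and numbers are illustrative.

```python
import numpy as np

def rps(probabilities, observed_category):
    """RPS for one forecast over K ordered categories; observed_category is the
    0-based index of the category that occurred."""
    p = np.asarray(probabilities, dtype=float)
    k = p.size
    d = np.zeros(k)
    d[observed_category] = 1.0
    cum_p, cum_d = np.cumsum(p), np.cumsum(d)
    return np.sum((cum_p - cum_d) ** 2) / (k - 1)

def rpss(rps_forecast, rps_standard):
    """Skill of the forecast RPS relative to a standard (e.g. climatological) RPS."""
    return (rps_standard - rps_forecast) / rps_standard

# Four categories, category 3 (index 2) observed
print(rps([0.0, 0.9, 0.1, 0.0], 2))            # near miss    -> about 0.27
print(rps([0.9, 0.1, 0.0, 0.0], 2))            # distant miss -> about 0.60
print(rpss(rps([0.1, 0.2, 0.6, 0.1], 2),        # skill relative to a flat
           rps([0.25, 0.25, 0.25, 0.25], 2)))   # (uniform) standard forecast
```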
Figure 11. Rank Probability Skill Score for 10-category probability forecasts of 500 mb height
anomaly over Europe for 45 days. Curves are for 5 different ensemble sizes (after Buizza and
Palmer, 1998).
The RPS and RPSS are quite useful for evaluating ensemble systems because multi-category
probability forecasting involves sampling the ensemble distribution at several points; if a large number
of categories is used, then the RPS and RPSS give a good evaluation of the full ensemble distribution
with respect to the observation. Figure 11 shows an example of the RPSS, based on 10 categories,
for probabilities of ranges of 500 mb height anomaly. The skill is with respect to climatology and
remains positive throughout the 10 day model run period.
3.3 Reliability Tables
A reliability table is a graphical way of evaluating the attributes reliability and sharpness of probability
forecasts, and, to a lesser extent, resolution. The tables are nearly always used for two-category
forecasts (binary), but a multi-category extension has recently been published (Hamill, 1997). The
diagram is constructed by first stratifying the verification sample according to forecast probability,
usually into deciles, or perhaps broader categories if there isn’t enough data to support stratification
into 10 categories. The observed frequency of the event within each subsample is plotted against the
forecast probability at the midpoint of the category. The sample size in each probability category is
plotted as a number next to the corresponding point on the graph and/or a histogram is plotted along
with the graph which gives the sample size or frequency of each category in the overall verification
sample. The climatological frequency of occurrence of the event in the whole sample is represented
on the graph by a horizontal line.
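
A sketch of the binning that underlies a reliability table follows, assuming decile bins; it returns the plotted points together with the per-bin sample sizes (the sharpness histogram). Names and data are illustrative.

```python
import numpy as np

def reliability_table(p, o, n_bins=10):
    """Bin forecast probabilities and return (bin midpoints, observed frequency
    per bin, sample size per bin)."""
    p, o = np.asarray(p, dtype=float), np.asarray(o, dtype=float)
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    midpoints = (np.arange(n_bins) + 0.5) / n_bins
    obs_freq = np.full(n_bins, np.nan)
    counts = np.zeros(n_bins, dtype=int)
    for k in range(n_bins):
        in_bin = bins == k
        counts[k] = in_bin.sum()
        if counts[k] > 0:
            obs_freq[k] = o[in_bin].mean()
    return midpoints, obs_freq, counts

rng = np.random.default_rng(2)
p = rng.uniform(size=5000)
o = (rng.uniform(size=5000) < p ** 1.3).astype(float)   # mildly unreliable forecasts
mid, freq, n = reliability_table(p, o)
print(np.round(freq, 2))   # below the bin midpoints: points under the 45 degree line
```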
Interpretation of reliability tables is discussed fully in Stanski et al (1989). Essentially, perfect reliability
is indicated by forecast probability=observed frequency, which is true for all points on the 45 degree
line. Points above the 45 degree line represent underforecasting of the event (probabilities too low)
and points below the 45 degree line represent overforecasting (probabilities too high). Reliability for
probability forecasts is analogous to bias for continuous variables; it indicates the tendency of the
forecast probability value to agree with the actual frequency of occurrence. Unlike continuous
variables, however, reliability, and many other verification measures for probability forecasts cannot
meaningfully be evaluated on a single forecast. Reliability tables require relatively large verification
sample sizes to get stable estimates of the observed frequency of occurrence in all probability
categories. Reliability is one of the components of the Brier score as shown above, and can be
calculated quantitatively using the first term in the partitioned Brier score.
Sharpness is also indicated on a reliability table, by the distribution of forecast probabilities, displayed
either as a histogram or numerically on the reliability graph. Sharpness is measured by the spread
(variance or standard deviation) of the probability forecasts in the sample. It is a function only of the
forecasts, and therefore does not carry any information about the quality of the forecasts, only about
their “decisiveness”. In terms of the histogram, sharp forecasts (best) are indicated by a u-shaped
distribution: the greater the frequency of use of the extreme probabilities in the sample, the sharper
the forecasts. A categorical (or deterministic) forecast is sharp since probabilities of only 0 or 100%
are assigned to the values of the variable. At the other end of the spectrum, a forecast with no
sharpness is one where the same probability value is used all the time.
Reliability and sharpness can usually be traded off in any forecast system. For example a
conservative forecasting strategy which does not often forecast the extremes of probability may well
be reliable, but is not sharp. Generally, if one attempts to increase the sharpness, forecasts may
become less reliable, especially forecasts of extreme probabilities. Reliability is often regarded as
more important than sharpness; a forecast system that is sharp but not reliable is in a sense an
overstatement of confidence. On the other hand, a forecast of climatology is perfectly reliable but has
no sharpness and is not very useful. An optimal strategy would be to ensure reliability first, then
increase the sharpness as much as possible without significantly sacrificing reliability. The
following are some examples of reliability tables as applied to ensemble forecasts.
Figure 12. Reliability tables for verification of probability of precipitation > 1 mm in 24 h (left) and > 5
mm in 24 h (right), for three months of 6-day European forecasts from the ECMWF ensemble system.
Figure 12 shows two reliability tables for an earlier version of the ECMWF ensemble system, for
events “greater than 1 mm precipitation in 24 h” and “greater than 5 mm precipitation in 24 h”. The
climatological frequency is represented by the horizontal line in each case. For precipitation greater
than 1 mm, there is fair reliability in the forecasts up to about 70%, but the higher probability values
are overforecast. The 0 skill line on the diagram represents the locus of points where the reliability
and resolution components of the Brier score balance each other, leaving only the uncertainty
component. This means that the skill with respect to the climatology goes to 0. The skill scores are
shown on the figure; it can be seen that the forecasts for the 5 mm threshold are indeed near the 0
skill level. Both curves show some tendency to underforecast low probabilities and overforecast high
probabilities. This is a typical pattern for forecasts which exhibit less-than-perfect reliability.
Figure 13. A reliability table for probability forecasts of 850 mb temperature anomalies less than -4
degrees, for the European area, verified against the analysis. Dotted lines are lines of equal Brier
score.
Figure 13 shows a reliability table for probability forecasts of the event “850 mb temperature anomaly
less than -4 degrees”, that is, temperatures more than 4 degrees colder than normal. The graph shows
these forecasts to be quite reliable, with some overforecasting tendency in the larger probabilities.
The dashed lines on the graph are lines of equal Brier score. One could visualize calibration of these
forecasts by “moving” the points horizontally toward the 45 degree line. Effectively, this means
relabelling the probability categories with the actual observed frequency. The Brier score would
improve under such a transformation as can be seen visually from the dotted lines on the figure.
However, the sharpness would decrease considerably. This is an example of the tradeoff mentioned
above; increasing the reliability means reducing the sharpness for a specific set of forecasts. The
Brier score improves because it is a quadratic scoring rule, which penalizes larger probability errors
more than small ones. Figure 13 also indicates that the distribution of forecasts tends toward a U-shape; these forecasts are also quite sharp. This example also shows both sample and long-term
climatological frequencies, which helps in interpretation of the Brier score.
Figure 14. Example of the effect of calibration of probability forecasts on the reliability table and Brier
score. 24 h Ensemble forecasts from the U.S. system for 24 h precipitation totals > 0.1 inch (top) and
> 0.10 inch (bottom). Corrected forecasts were obtained using the rank histogram (after Hamill and
Colucci, 1997).
Calibration of the probabilities from an ensemble system can be easily done using the rank histogram.
The simplest way to calibrate is to compute the rank histogram and replace the equally-spaced
increments in the cumulative distribution with the true increments from the rank histogram. For
example, an ensemble of 49 members gives 50 intervals assumed equal in probability at 2% each, so
that the cumulative probability increases by 2% at each of the ranked ensemble member values. If
the true frequency of occurrence of the event in the lowest-ranked category is 13%, then the first
threshold becomes 13% instead of 2%. Other thresholds are determined by accumulating each of the
actual histogram frequencies in turn. This type of calibration normally leaves open the question of
how to model the distribution in the tails, outside the ensemble. Hamill and Colucci (1997) do this by
fitting a Gumbel distribution in the tails, and using the rank histogram values in the interior of the
distribution. Figure 14 shows an example of reliability tables from their results, for 24 h precipitation
probability forecasts. The uncalibrated (left side) forecasts are really not very reliable. After
correction, the reliability has been substantially improved, and the Brier score is lower (better) as
expected.
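
A sketch of the cumulative-probability replacement just described is given below, using the 49-member example from the text; the treatment of the tails (e.g. the Gumbel fit of Hamill and Colucci) is not included, and the function name is a hypothetical helper.

```python
import numpy as np

def calibrated_cdf_at_members(rank_frequencies):
    """Cumulative probability assigned at each ranked ensemble member.

    Uncalibrated, member i of an N-member ensemble carries cumulative
    probability (i+1)/(N+1); here the equal increments are replaced by the
    observed rank-histogram frequencies (length N+1, summing to 1)."""
    freqs = np.asarray(rank_frequencies, dtype=float)
    freqs = freqs / freqs.sum()
    return np.cumsum(freqs)[:-1]          # length N: CDF value at each member

# 49-member ensemble: the uncalibrated increments are 2% each
flat = np.full(50, 0.02)
print(calibrated_cdf_at_members(flat)[:3])       # [0.02 0.04 0.06]
# If 13% of past observations fell below the lowest member, the first
# threshold becomes 13% instead of 2%:
observed = np.concatenate(([0.13], np.full(49, 0.87 / 49)))
print(calibrated_cdf_at_members(observed)[:3])
```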
3.4 Signal Detection Theory and the Relative Operating Characteristic curve
Signal detection theory (SDT), brought into meteorological application by Mason (1982), is designed
to determine the ability of a diagnostic system (in this case the ensemble forecast) to separate those
situations when a signal is present (for example, the occurrence of rain) from those situations when
only noise is present. SDT measures the attribute “discrimination”, and implies stratification of the
dataset on the basis of the observation. SDT is for two-category forecast problems. However, one of
its greatest advantages when applied to ensemble forecasts is that it can be used to evaluate both
probabilistic and deterministic forecasts in comparison to each other in a consistent fashion. SDT is
thus a good way of comparing the performance of the deterministic model and the ensemble.
In SDT, the relationship between two quantities is explored, the hit rate and the false alarm rate.
These two quantities are computed from the verification dataset. Consider the two-by-two contingency
table in Figure 15. If the forecast is categorical, there are four possibilities: the event is forecast and it
occurs (a “hit”), the event is not forecast and it occurs (a “miss”), the event is forecast but does not occur
(a “false alarm”) and the event is neither forecast nor occurs (a “correct negative”). From the
verification dataset, the values of these four quantities, X, Y, Z, and W respectively are simply the
totals of each of these four possible outcomes in the sample. The hit rate and false alarm rate can
then be defined as shown on the figure. To use this table, however, the probability forecasts must be
turned into categorical forecasts, which means adopting a categorization strategy based on the
forecast probability. For example, one strategy might be to “forecast the occurrence of the event if
the forecast probability is greater than 30%”. Given such a strategy, each pair of forecast probabilities
and occurrences (which are already categorical) can be assigned to one of the four possible
outcomes tallied in the contingency table, and totalled over the whole sample. Each different value of
the decision threshold (30%, 40%, etc.) leads to a different table, and therefore a different hit rate and
false alarm rate.
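
A sketch of the contingency counts and the resulting empirical ROC points follows, using decile decision thresholds as in the text; the trapezoidal calculation at the end gives the simple empirical area estimate discussed later in this section. All names and data are illustrative.

```python
import numpy as np

def roc_points(p, o, thresholds=np.arange(0.1, 1.0, 0.1)):
    """Return (false alarm rates, hit rates), one pair per decision threshold."""
    p, o = np.asarray(p, dtype=float), np.asarray(o, dtype=bool)
    hits, fars = [], []
    for t in thresholds:
        yes = p >= t                      # "forecast the event if prob >= t"
        x = np.sum(yes & o)               # hits
        y = np.sum(~yes & o)              # misses
        z = np.sum(yes & ~o)              # false alarms
        w = np.sum(~yes & ~o)             # correct negatives
        hits.append(x / (x + y))
        fars.append(z / (z + w))
    return np.array(fars), np.array(hits)

rng = np.random.default_rng(3)
p = rng.uniform(size=5000)
o = rng.uniform(size=5000) < p            # forecasts with real discriminating ability
far, hr = roc_points(p, o)
# Trapezoidal area under the empirical ROC, with (0,0) and (1,1) as endpoints
xs = np.concatenate(([0.0], far[::-1], [1.0]))
ys = np.concatenate(([0.0], hr[::-1], [1.0]))
area = np.sum(0.5 * (ys[1:] + ys[:-1]) * np.diff(xs))
print(round(float(area), 3))
```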
Figure 15. 2 by 2 contingency table for determination of hit rate and false alarm rate.
Figure 16. Example of the distribution of occurrences and non-occurrences of an event, for deciles of
forecast probability.
Given the total occurrences of the event and the non-event for each decile of forecast probability, for
example, as in Fig. 16, one can calculate X, Y, Z, and W for the different threshold probabilities and
plot them on a graph, as in Fig. 17. In Fig. 16, for a decision threshold of 30%, X is the total of the
occurrences column below the line, Y is the total of occurrences above the line, Z is the total of non-occurrences below the line, and W is the total of non-occurrences above the line. Hit rates and false alarm
rates are computed in this way for each probability decile as a threshold. Fig 17 is a plot of the hit
rate and false alarm rate. This curve is the (empirical) relative operating characteristic (ROC). One
would hope that, as the threshold decreases from 1.0, the hit rate would increase much faster than
the false alarm rate, so that the curve stays in the upper left half of the diagram. In fact, the closer the
curve is to the left side and top of the diagram, the greater the ability of the system to distinguish
situations where a signal is present from those where only noise is present.
Figure 17. An example of an empirical ROC, formed by plotting the hit rate against the false alarm
rate for probability decision thresholds of 0, 10%, 20%, ..., 90%, 100%.
The most common measure associated with the ROC is the area under the curve, which is the area
between the curve and the lower right corner of the box, expressed as a fraction of the whole area
of the box. The closer the curve is to the upper left corner, the greater this area, and the maximum
value is 1.0. The diagonal, representing hit rate = false alarm rate (area=0.5) is the “0 skill” line in this
context, but we have found in practice that values of the area below 0.7 represent a rather weak
ability to discriminate.
What is really being measured by the ROC is the difference between the two conditional distributions,
the distribution of forecast probabilities given that the event occurred and the distribution of forecast
probabilities given that the event did not occur (for example, Fig. 16). A pair of such distributions is
shown schematically in Fig. 18. Visually, it is clear that the variable X (the forecast probability in this
case) will form a much better basis for determining the occurrence or non-occurrence of the event if
these two distributions are well-separated. The separation of the means of the two distributions is
another measure of the discriminating ability of the system. Normally, this is expressed in terms of
the standard deviation of the forecast distribution for non-occurrences, but it is directly related to the
area measure, and does not provide any additional information.
Figure 18. A schematic of the two conditional distributions, for occurrences, f1(x), and non-occurrences, f0(x). The random variable X is in this case the probability forecast. The distributions
shown are normal with mean and standard deviation (m1,s1) and (m0,s0) respectively.
Experiments with the ROC from many fields have shown that the ROC tends to be very close to linear
in terms of the standard normal deviates corresponding to the hit rate and false alarm rate. In other
words, if one takes the hit rate as a probability (frequency), and calculates (from normal probability
tables, for example) the value of the random variable X that corresponds to that probability, do the
same for the false alarm rate, then plotting these values instead of the original hit rate and false alarm
rate will result in a straight line. To apply the so-called normal-normal model to a specific dataset, one
would first convert the empirical hit rate and false alarm rates to standard normal deviate values using
the normal probability distribution with parameters mean and standard deviation estimated from the
data. Then a straight line can be fitted using a method such as least squares regression.
Alternatively, the maximum likelihood method can be used. Once the straight line is determined, then
the data can be transformed back to linear probability space, which leads to a smooth curve.
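
A sketch of this fit is given below, assuming the probit (inverse standard normal) transform for the conversion and ordinary least squares for the straight line (maximum likelihood is the alternative mentioned above); the points and names are illustrative.

```python
import numpy as np
from scipy import stats

def fit_binormal_roc(far, hr, n_points=101):
    """Fit z(hit rate) = a + b * z(false alarm rate), then map back to probabilities."""
    z_far, z_hr = stats.norm.ppf(far), stats.norm.ppf(hr)
    b, a = np.polyfit(z_far, z_hr, 1)          # slope, intercept of the straight line
    z_grid = np.linspace(-3.0, 3.0, n_points)  # dense grid of standard normal deviates
    far_fit = stats.norm.cdf(z_grid)
    hr_fit = stats.norm.cdf(a + b * z_grid)
    return far_fit, hr_fit, a, b

# Empirical points (0 and 1 excluded, where the probit transform is undefined)
far_pts = np.array([0.05, 0.10, 0.20, 0.35, 0.55])
hr_pts = np.array([0.40, 0.55, 0.70, 0.83, 0.93])
far_fit, hr_fit, a, b = fit_binormal_roc(far_pts, hr_pts)
# Area under the fitted curve (slightly truncated at +/- 3 deviates)
area = np.sum(0.5 * (hr_fit[1:] + hr_fit[:-1]) * np.diff(far_fit))
print(round(a, 2), round(b, 2), round(float(area), 3))
```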
The normal-normal model implies the assumption that the conditional distributions for occurrence and non-occurrence can at least be monotonically transformed to normal. They do not themselves have to be
normal distributions. This means that there can be no multi-modality in the distributions, which is
usually a good assumption. For variables such as precipitation, where the ensemble might indicate
either no rain or a lot of rain on some occasions (a bi-modal distribution), the normal-normal model
might not fit as well, but one is unlikely to incur significant error by using it.
One might ask: why go to the trouble? Well, the difficulty lies in computing the area under the ROC.
If an empirical curve is used, the data points are joined by straight lines. Compared to the normal-normal model, which implies a smooth curve, calculating the area under the empirical curve would
always give an underestimate to a greater or lesser degree depending on the distribution of the points
across the range and on the number of points used to estimate the ROC. (Note that (0,0) and (1,1)
are always the endpoints of the ROC). This problem becomes significant when ROCs are compared
for different forecasts, especially if one of the events has a low frequency of occurrence. The lower
the frequency of occurrence of the event, the greater the tendency of the points on the ROC to cluster
toward the lower left corner of the diagram. The actual discriminating ability may not be affected, but
when the points are not well-spread across the domain of the ROC diagram, the error in the
estimation of the area from the empirical curve will be larger.
Figure 19. A ROC curve for two forecasts of probability of 6 h precipitation accumulation, 0 to 6 h
(circles) and 42-48 h (crosses). Hit rate and false alarm rate are plotted in terms of their
corresponding standard normal deviates, using the mean and standard deviation estimated from the
data.
Figure 19 shows an example of a ROC plotted on normal-normal paper. ROCs are represented there
for two sets of precipitation forecasts, for 0 to six hours and for 42 to 48 hours. The areas are 0.856
and 0.767 respectively. As can be seen from the plot, the points really do lie very close to a straight
line; this has been noted for ROCs in many fields besides meteorology. Provided there are enough
points to estimate the ROC, confidence in the straight-line fit is usually much higher than the 99%
level. Generally, 4 points are needed; it is not advisable to try to estimate the curve with fewer points.
Figure 20 is a schematic of a ROC in linear probability space, represented as a smooth curve. Once
the curve has been fitted in standard normal deviate space, the curve is defined, and it is convenient
to transform back to linear probability space using many more points, e.g. 100 equally-spaced points,
to facilitate drawing the curve. To compare the performance of a deterministic forecast, one can
compute the hit rate and false alarm rate directly from the forecast and plot a single point on the
graph. If the point lies above the curve, then the deterministic forecast gives a better basis for
identifying the occurrence of the event than the ensemble, but only in the vicinity of the plotted point.
The ensemble’s performance can be evaluated over all probability thresholds, which normally makes
it more useful to a wider variety of weather-sensitive users, who may be interested in making
decisions at different thresholds, depending on the nature of their operation.
Figure 20. Schematic of a fitted ROC in probability space. Az, the shaded area, is the area under the
curve and h(p*) and f(p*) are the hit rate and false alarm rate respectively.
Figures 21 and 22 show two examples of ROCs obtained for ECMWF ensemble precipitation
forecasts. Analogous to the reliability table, it is useful to display ROCs along with histograms of
distributions. For the ROC, the relevant distributions are the two conditional distributions for
occurrence and non-occurrence cases of the sample. These histograms, sometimes called likelihood
diagrams, show graphically the separation of the two distributions. In Fig. 21, the ROC is plotted for
three different projection times, where the forecast event is 24 h precipitation accumulation >1 mm.
As expected, the longer the forecast range, the smaller the area under the curve. By 10 days, the
forecasts can be considered to be barely useful to discriminate between precipitation and no
precipitation, with an area of 0.709. With three curves, there are three likelihood diagrams plotted to
the right of the figure. At 4 days, the two distributions are well-separated, with non-occurrences
associated with high probability density near the low end of the range and occurrences associated
with higher probability density near the high end of the range. By 10 days, there is still some
separation of the two distributions, but they are much less clearly different than at the earlier forecast
ranges. If the two distributions were exactly colocated (means equal), then the area under the ROC
would be 0.5, and the system can be said to have no discriminating ability.
Figure 21. ROC curves for 4- 6- and 10-day probability forecasts of 24 h precipitation accumulation
greater than 1 mm, for Europe, using ECMWF ensemble data. The area under the curves and the
separation distance of the means of the two conditional distributions is shown in the lower right of the
ROC graph. The “likelihood diagrams” are histograms of the two conditional distributions of forecast
probabilities.
Fig. 22 compares ROC curves for a set of 3 day precipitation forecasts, but examines the system’s
ability to discriminate occurrences of precipitation above specific thresholds. The four curves, for
1mm, 2 mm, 5 mm, and 10 mm accumulations in 24 h, actually lie quite close to each other,
suggesting that the system performance does not differ greatly for the different thresholds. Sampling
the ensemble distribution at different thresholds is similar to sampling the probability distribution over
many ensembles at different thresholds to generate the ROC. Thus the ROC seems to be insensitive
to changes in the threshold of the physical variable.
Figure 22. An example of a ROC for 3 day probability forecasts of 24h precipitation accumulation
greater than 1mm, 2 mm, 5 mm and 10 mm, based on the ECMWF 51 member ensemble system.
Finally, it should be noted that interpretation of the ROC requires a slightly different way of thinking
than one may use for the reliability table. The ROC is completely insensitive to reliability, and
provides completely different information about the forecasts being verified. With stratification
according to the observation, it is a kind of a posteriori or retrospective verification: how well did the
system perform in distinguishing between occurrences and non-occurrences? Unlike the
reliability table, it does not give information that the forecaster can use directly to alter his probability
estimate the next time a priori. Rather it gives information which is of greater interest to the user of
the forecast, to make decisions based on the ensemble or on probabilities derived from the ensemble.
4. Concluding remarks
Ensemble verification strategies are of two types: those which seek to verify the ensemble distribution
or aspects of it, and those which verify probability forecasts derived from the ensemble. Verification
of an ensemble distribution against single observations requires the development of special methods,
while verification of probability forecasts can be carried out with standard methods applicable to
probability forecasts in general. This paper has presented a survey of verification methods applicable
to both the full ensemble and to probability forecasts. It is not an exhaustive survey, though most of
the frequently-used methods have been briefly described.
The set of verification tools presented here for verification of probability forecasts is sufficient to
evaluate the 9 attributes of probability forecasts as identified by Murphy, except perhaps association.
These verification tools are of three types, summary scores, reliability tables and associated
measures, and the ROC and its associated measures. They evaluate accuracy and skill; reliability
and sharpness; and discrimination and uncertainty, respectively. While the scores summarize the
verification information into a single score value, the graphical presentation of the reliability table and
the ROC permit the extraction of considerable detail about the performance of the ensemble system.
The ROC also permits comparison of probability forecasts with deterministic forecasts. None of these
methods is sufficient on its own; it will always be necessary to use several verification measures to
fully assess the quality of an ensemble forecast.
5. References
Buizza, R., and T.N. Palmer, 1998: Impact of ensemble size on ensemble prediction. Mon. Wea.
Rev., 126, 2503-2518.
Hamill, T. M., 1997: Reliability diagrams for multicategory probabilistic forecasts. Wea. Forecasting,
12, 736-741.
Hamill, T. M., and S. J. Colucci, 1997: Verification of Eta-RSM short range ensemble forecasts. Mon.
Wea. Rev., 125, 1312-1327.
Hamill, T. M., and S. J. Colucci, 1998: Evaluation of Eta/RSM ensemble probabilistic precipitation
forecasts. Mon. Wea. Rev., 126, 711–724.
Mason, I., 1982: A model for assessment of weather forecasts. Aust. Meteor. Mag., 30, 291–303.
Murphy, A.H., 1993: What is a good forecast? An essay on the nature of goodness in weather
forecasting. Wea. Forecasting, 8, 281-293.
Murphy, A.H., and R. L. Winkler, 1987: A general framework for forecast verification. Mon. Wea.
Rev., 115, 1330-1338.
Stanski, H.R., L.J. Wilson and W.R. Burrows, 1989: Survey of common verification methods in
meteorology, WMO, WWW Technical Report No. 8, 115 pp.
Talagrand, O., 1997: Statistical consistency of ensemble prediction systems. Discussion paper, SAC
Working Group on EPS, ECMWF.
Wilks, D.S., 1995: Statistical Methods in the Atmospheric Sciences, Academic Press, New York,
Chapter 7, p. 233-281.