QUASI-OPERATIONAL MULTI-MODEL MULTI-ANALYSIS ENSEMBLES
ON MEDIUM-RANGE TIMESCALES
Kenneth R. Mylne, Robin T. Clark and Ruth E. Evans
UK Meteorological Office, Bracknell, UK
1. Introduction
Uncertainty in medium-range forecasts is caused by errors in both the analysis and the forecast
model. Historically, ensemble forecasting systems have been designed primarily to account for analysis
errors by adding perturbations to the control analysis to generate ensemble members; e.g. Molteni et al
(1996) and Toth and Kalnay (1993) describe the original ensemble systems at ECMWF and NCEP
respectively. Ensemble systems using only initial condition perturbations typically display insufficient
spread in the ensemble, characterised by an excessive proportion of verifying observations falling outside
the range of the ensemble spread. In practical terms for forecasting this results in many events not being
predicted by any members of the ensemble, and in probability forecasts not giving reliable probabilities.
Ensemble research is now increasingly looking at methods of taking account of other uncertainties in the
forecast process. For example, Buizza et al (1999) have incorporated some elements of model uncertainty
into the operational ECMWF ensemble by adding stochastic perturbations to model physics within
individual ensemble members; Houtekamer et al (1996) run a number of separate versions of their model
in an ensemble, for example with different versions of parametrization schemes.
Another relatively cheap approach is to create a multi-model multi-analysis ensemble by combining
ensembles run using different models. The additional use of more than one analysis, to which perturbations
are added, may provide further unstable directions for ensemble spread that are not present with a single analysis.
Evans et al (1999) conducted a number of case studies by combining ensembles from the ECMWF and
UK Met Office (UKMO) NWP systems, using both analyses and both models from the two centres. They
concluded that the joint ensemble significantly outperforms either individual ensemble, giving a gain in
predictability of the order of one day over the ECMWF Ensemble Prediction System (EPS), with better
coverage of the observations and evidence that the UKMO ensemble can include synoptically valuable
information not included in the EPS. One limitation of this work was that it considered only a small number of
case studies, selected in part because of poor EPS performance. Also, the model versions used were of
lower resolution than the current operational EPS. In order to overcome these limitations, a multi-model
multi-analysis ensemble (MMAE) has been implemented at ECMWF in a quasi-operational mode, using
the current operational versions of both the ECMWF and UKMO models, and has been running daily since
October 1998. In this paper we describe some initial results from this system.
2. Joint Ensemble System
The multi-model multi-analysis ensemble has been created by installing the NWP version of the
UKMO Unified Model (UM), (see Cullen, 1993), on the computer system at ECMWF. This version is
identical to the current operational global NWP model run at UKMO, except that the horizontal resolution is
reduced to 288×217 grid-points, or 1.25° E-W by 0.833° N-S, giving a similar resolution to the operational
EPS model at TL159L31.
Initial conditions for the UM ensemble are created by adding the EPS perturbations to the UKMO
analysis. Due to limited computing resources, the UM ensemble is run with 27 members (control plus the
first 13 perturbations added to and subtracted from the analysis to create pairs of perturbed analyses),
compared to the EPS which has 51 members (control plus 25 pairs).
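As a concrete illustration of this construction, the sketch below (hypothetical NumPy code; the array names and shapes are our own, not those of the operational system) builds the 27-member set of initial conditions from a control analysis and the first 13 perturbation fields:

```python
import numpy as np

def build_initial_conditions(analysis, perturbations):
    """Control plus paired perturbed analyses for the UM ensemble.

    analysis:      array (ny, nx), the UKMO control analysis field
    perturbations: array (13, ny, nx), the first 13 EPS perturbations
    Returns an array (27, ny, nx): the control, plus each perturbation
    added to and subtracted from the analysis.
    """
    members = [analysis]                 # member 0: unperturbed control
    for p in perturbations:
        members.append(analysis + p)     # positive member of the pair
        members.append(analysis - p)     # negative member of the pair
    return np.stack(members)             # 1 + 2 * 13 = 27 members
```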
Output from the UM ensemble is combined with EPS output to create the MMAE ensembles. Two
versions are used: (i) a full version combining the 27 UM members with all 51 members of EPS to give a
78 member joint ensemble (referred to as EEUU78) and (ii) a reduced version consisting of the 27 UM
members combined with the corresponding 27 members of EPS (EEUU54). The latter provides a useful
comparison with the EPS for assessing the benefits of the MMAE system, as the total number of members
is similar and it can therefore be run at similar computing cost to the EPS.
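Schematically (again a hypothetical sketch, assuming forecast fields are held as arrays with the member dimension first, and that the EPS members are ordered control first followed by the perturbation pairs), the two joint configurations are simple concatenations:

```python
import numpy as np

def combine_mmae(eps_members, um_members):
    """Form the two joint MMAE configurations by concatenation.

    eps_members: array (51, ...), the full EPS forecast fields
    um_members:  array (27, ...), the UM ensemble forecast fields
    """
    eeuu78 = np.concatenate([eps_members, um_members])       # full 78-member joint ensemble
    eeuu54 = np.concatenate([eps_members[:27], um_members])  # 27 EPS + 27 UM members
    return eeuu54, eeuu78
```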
3. Skill Scores for Comparison of Forecasts
Following normal convention (e.g. Wilks, 1995), the skill of different forecasts is compared in terms of a
skill score. For any verification diagnostic, X, the skill of a forecast relative to some reference forecast is
given by
SS = \frac{X_{ref} - X_{fc}}{X_{ref} - X_{perf}} \times 100\%    (1)
where X_fc is the value of X for the forecast, X_ref for the reference forecast and X_perf for a perfect
deterministic forecast. A skill score has a maximum value of 100% for a perfect forecast (X_fc = X_perf) and a
value of zero for performance equal to that of the reference (X_fc = X_ref). SS has no lower limit, with
negative values representing poorer skill than the reference.
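A minimal sketch of eq. (1) in Python (illustrative only; note that for negatively oriented diagnostics such as RMSE or the Brier Score, X_perf = 0):

```python
def skill_score(x_fc, x_ref, x_perf=0.0):
    """Skill score of eq. (1), expressed as a percentage.

    x_fc   -- value of diagnostic X for the forecast being assessed
    x_ref  -- value of X for the reference forecast (here, the EPS)
    x_perf -- value of X for a perfect forecast (0 for RMSE or Brier Score)
    """
    return 100.0 * (x_ref - x_fc) / (x_ref - x_perf)

# e.g. a hypothetical MMAE RMSE of 4.5 hPa against an EPS RMSE of 5.0 hPa:
#   skill_score(4.5, 5.0)  ->  10.0, i.e. a 10% improvement over the EPS
```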
Normally the reference forecast used is a baseline such as a forecast of climatology, to measure the
skill of a system in making a forecast. In this case we are interested in the benefits of the MMAE
ensemble compared with the EPS, so the EPS forecast is used as the reference. Results in the remainder
of this paper will mostly be presented as skill scores in this way.
4. Deterministic Forecast Skill
To obtain benefit from the full information content of ensemble forecasts, they are best interpreted
probabilistically. However it is also common practice to use the ensemble mean (EM) as a deterministic
forecast tool, and many studies (e.g. Molteni et al, 1996) have shown enhanced forecast skill from the EM
compared to the control forecast. It is therefore useful to consider the EM skill of the MMAE ensemble.
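For concreteness, the following sketch (hypothetical; uniform grid weighting is assumed, whereas an operational verification would weight by latitude) computes the RMSE of the ensemble-mean field, to which the skill score of eq. (1) can then be applied:

```python
import numpy as np

def ensemble_mean_rmse(members, verifying_analysis):
    """RMSE of the ensemble-mean field against a verifying analysis.

    members:            array of shape (n_members, ny, nx)
    verifying_analysis: array of shape (ny, nx)
    """
    em = members.mean(axis=0)                        # ensemble-mean field
    return np.sqrt(np.mean((em - verifying_analysis) ** 2))

# The EM RMSE skill score of, say, EEUU54 relative to the EPS is then
#   skill_score(ensemble_mean_rmse(eeuu54, ana), ensemble_mean_rmse(eps, ana))
```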
In figure 1 the RMSE (root mean square error) skill scores of the two configurations of the MMAE are
plotted as a function of forecast lead-time for MSLP (mean sea-level pressure) over the
North Atlantic and Europe in January 1999. Also shown is the score for the UM ensemble (UU27).
Although the UM ensemble alone is no more skilful than the EPS, the MMAE ensembles both give
between 5 and 10% improvement in EM skill over the EPS for lead-times between about 36 and 204 hours.
When the RMSE is plotted directly, rather than as a skill score (not shown), this can be seen to give a gain
in predictability of the order of 8 hours between T+36 and T+144, and up to 15 hours at longer lead-times.
Similar results (not shown) are obtained for geopotential height at 500hPa (H500). For temperature at
850hPa (T850) gains in skill of between 2 and 5% are obtained. For precipitation there is little gain in skill
(less than 2%), but equally there is no degradation compared to the EPS.

Figure 1: Ensemble Mean RMSE skill score of MSLP, relative to the EPS, calculated over the North
Atlantic and Europe for January 1999. Different ensembles are as shown in the key: UU27 is the 27-member
UM ensemble; EEUU54 and EEUU78 are the two versions of the MMAE ensemble described in section 2.
It appears from figure 1 that the UM ensemble performs very poorly in the first 48 hours of the
forecast, which also results in poor performance of the MMAE ensembles, with negative skill scores in
approximately the first 24 hours. This is believed to be because the verifying analysis used is the
ECMWF analysis. Analysis fields are characterised by the short-period model biases of the model used in
the analysis cycle, and it is therefore to be expected that forecast fields produced by a different model and
using a different analysis will appear to verify poorly in the early stages of the forecast. If the UKMO
analysis is used instead as the verifying analysis, the apparent poor performance of the UM ensembles
and MMAE ensembles is removed, which supports this explanation. After about 48 hours of the forecast,
there is very little difference in results using different verifying analyses. However since the aim of the
verification is to investigate benefits of the MMAE relative to EPS, the ECMWF analysis is used throughout
to avoid any risk of biasing results unfairly in favour of the UM or MMAE ensembles.
In figure 1 there is little difference between results for the two MMAE ensembles, EEUU54 and
EEUU78. This indicates that the gains in EM skill from the MMAE system are due to the greater
effectiveness of the multi-model multi-analysis system, not simply to increasing the ensemble size, and
also that the gains may be obtained without the need to increase the overall size, and therefore cost, of the
ensemble.
Results quoted above are all based on January 1999; similar results were obtained for February 1999
(not shown). However the MMAE system does not always give such large benefits over EPS. In December
1998 (not shown) the individual UM ensemble performed considerably less well than the EPS, with RMSE
EM skill scores for UU27 typically around -20%. Despite this, both configurations of MMAE ensemble
performed as well as the EPS. This variation in performance will be discussed in section 6 below.
5. Probabilistic Forecast Skill
5.1 Brier Scores
The most commonly used verification diagnostic for probabilistic forecasts is the Brier Score,
originally introduced by Brier (1950) and described in its modified standard form by Wilks (1995) as:
BS = \frac{1}{n} \sum_{i=1}^{n} (f_i - o_i)^2    (2)
The Brier Score is essentially the mean square error for probability forecasts of an event; f_i and o_i
are the forecast and observed probabilities respectively, where o_i takes the value 1 when the event occurs
and 0 when it does not. Brier Skill Scores (BSS) are calculated by using BS as X in the general skill score
definition given in eq. (1) above.
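In code, eq. (2) might look as follows (a minimal sketch; the ensemble probability is taken simply as the fraction of members forecasting the event, with no calibration applied):

```python
import numpy as np

def brier_score(forecast_probs, outcomes):
    """Brier Score of eq. (2): mean square error of probability forecasts.

    forecast_probs: forecast probabilities f_i, each in [0, 1]
    outcomes:       observed o_i, 1 where the event occurred and 0 where not
    """
    f = np.asarray(forecast_probs, dtype=float)
    o = np.asarray(outcomes, dtype=float)
    return np.mean((f - o) ** 2)

# Ensemble probabilities can be taken as the fraction of members forecasting
# the event, e.g. MSLP above its climatological normal at each grid point:
#   probs = (members > normal).mean(axis=0)
# and the BSS relative to the EPS follows from eq. (1) with X_perf = 0:
#   bss = skill_score(brier_score(probs_mmae, obs), brier_score(probs_eps, obs))
```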
In figure 2, Brier Skill Scores relative to EPS for MSLP above normal over the North Atlantic and
Europe, in January 1999, are plotted against forecast lead-time for UU27 and the two configurations of
MMAE. Over the first 2 days of the forecast performance of the UM ensemble appears poor, and this is
again believed to be due to using the ECMWF analysis for verification, as described in section 4. After this
period the skill of UU27 is similar to that of EPS, or slightly poorer. However the MMAE ensembles give
between 6 and 10% improvements in BSS relative to the EPS for lead-times between 48 and 192 hours,
and some improvement out to 240 hours. When Brier Scores are plotted, rather than skill scores, it can be
seen that this improvement in skill from the MMAE equates to a gain in predictability of around 10 hours.
Similar results (not shown) are obtained for H500; for T850, improvements in skill are around 5%, and for
24 hour accumulated precipitation between 2 and 5%, with the largest gains for lead-times of 60 to 120 hours.

Figure 2: Brier Skill Score for MSLP above normal, relative to the EPS, calculated over the North Atlantic
and Europe for January 1999. Different ensembles as in figure 1.
From figure 2 it can be seen that there is little difference in BS performance between the two MMAE
ensembles, EEUU54 and EEUU78. As for the deterministic EM benefits, the improvements are due to real
gains in skill, and not simply increasing ensemble size.
In section 4 it was stated that EM performance of the UM and MMAE ensembles was less good in
December 1998 than in January 1999. In figure 3a BSS results are shown for the same events as given in
figure 2, but for December 1998, and it is clear that in this case there is much less benefit from the MMAE
approach. Nevertheless there are skill improvements of 2 to 5% over EPS for lead-times of 48 to 120
hours for EEUU54, and some gains out to 204 hours for the full EEUU78. In figure 3b results are given for
the same period, but calculated only over Europe. In this case the UU27 ensemble performed rather better
over lead-times of 96 to 168 hours, and this resulted in better MMAE performance over the same period,
with gains in skill of between 5 and 10% over EPS for both EEUU54 and EEUU78 in this period.
These results show that the benefits of the multi-model approach vary both with time and with the
geographical area over which they are calculated. This is believed to be due to variations in the relative
performance of the two models in different synoptic situations. Both models are amongst the most skilful
available, but each has its own strengths and weaknesses, and it is therefore likely that in some synoptic
situations one will perform better, and in other situations the other will. The above results
would suggest that in December 1998 the EPS was performing comparatively well over the North Atlantic
and Europe area, and it was therefore difficult for the UM ensemble to add to or improve on the information
available. Consequently the MMAE ensemble performed only slightly better than the EPS. Over the more
limited area of Europe the gains were greater, suggesting that it was largely in this area where the UM
ensemble was adding useful information to the EPS. In January 1999 the UM ensemble performed
individually almost as well over the larger North Atlantic and Europe area as the EPS, and the combination
of information from both in the MMAE ensemble gave greater gains. Results for January for Europe (not
shown) show that the UM ensemble performed better than the EPS for lead-times of 96 to 192 hours, and
the MMAE skill was between 8 and 13% better than EPS between about 48 and 180 hours. These results
are all consistent with the idea that most of the benefits from MMAE ensembles will be gained in regions
where the EPS is performing relatively poorly, and it can be expected that the regions where this occurs
will vary from time to time. It is important to note that where the EPS is performing well, the addition of
information from the UM ensemble, even at the expense of removing some information from the EPS as in
the case of the EEUU54 ensemble, does not appear to significantly reduce the performance.
Figure 3: Brier Skill Score for MSLP above normal, relative to the EPS, for December 1998: (a) over North
Atlantic and Europe; (b) over Europe. Different ensembles as in figure 1.
The time dependence of the relative performances of the different ensembles can be broken down further.
In figure 4 the Brier Scores for 120 hours forecasts of MSLP for individual days are plotted for the EPS
(EE) and the UU27 and EEUU54 ensembles, averaged over the North Atlantic and Europe area.
Figure 4: Brier Scores for individual days during January 1999, for 120 hour forecasts of MSLP above
normal calculated over the North Atlantic and Europe. Different ensembles are as shown in the key: EE is
the EPS; others as in figure 1. Points marked E indicate examples of where the EPS is the better individual
ensemble; points marked U where UU27 is better.
It can be clearly seen that the relative performance of the individual ensembles varies from day to day. The
points marked E indicate examples of days where the EPS performs better, and those marked U where
the UM ensemble is better (noting that the Brier Score is negatively oriented, with a lower score indicating
better performance). However on almost every day the EEUU54 joint ensemble performs as well as the
better individual ensemble, and on occasion, such as the point at the extreme left end of the plot in figure
4, performs better than either individual one. Thus the addition of a second model into the ensemble
frequently adds additional useful information, and importantly does not significantly reduce the
performance of an individual ensemble when it is performing well.
5.2 ROC (Relative Operating Characteristics)
An alternative measure of probabilistic skill which is frequently used is the area under the ROC
(Relative Operating Characteristics) curve, introduced by Stanski et al (1989). ROC measures the skill of a
forecast in terms of a hit rate (HR) and a false alarm rate (FAR), both classified according to the
observations; more skilful forecasts are characterised by higher hit rates and lower false alarm rates.
ROC therefore measures the usefulness of the forecasts as decision-making tools. The ROC curve is
generated by calculating hit rates and false alarm rates for a range of probability thresholds of an event,
and plotting HR against FAR. The area under this curve gives a measure of the overall skill of the forecast
system, with a maximum value of 1.0 for a perfect forecast and a value of 0.5 indicating no skill.
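The sketch below illustrates this calculation (hypothetical code; the threshold set and the trapezoidal integration are our own simplifications, not necessarily those used operationally):

```python
import numpy as np

def roc_area(forecast_probs, outcomes, n_thresholds=10):
    """Area under the ROC curve for probability forecasts of an event.

    Hit rates (HR) and false alarm rates (FAR) are computed for a range
    of probability thresholds; the area under the HR-vs-FAR curve is 1.0
    for a perfect forecast and 0.5 for a no-skill forecast.
    """
    f = np.asarray(forecast_probs, dtype=float)
    o = np.asarray(outcomes, dtype=bool)
    n_occ = max(np.sum(o), 1)        # number of occurrences of the event
    n_non = max(np.sum(~o), 1)       # number of non-occurrences
    hrs, fars = [0.0], [0.0]         # most severe threshold: no warnings at all
    for t in np.linspace(1.0, 0.0, n_thresholds + 1):
        warn = f >= t                # forecast the event where prob >= threshold
        hrs.append(np.sum(warn & o) / n_occ)
        fars.append(np.sum(warn & ~o) / n_non)
    hrs.append(1.0)                  # least severe threshold: warn everywhere
    fars.append(1.0)
    return np.trapz(hrs, fars)       # trapezoidal integration of HR against FAR
```

An ROC-area skill score relative to the EPS then follows from eq. (1), with X_perf = 1.0 for a perfect forecast.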
Skill scores relative to EPS may be calculated for area under the ROC curve using equation (1). In
figure 5 ROC area skill scores of the UU27, EEUU54 and EEUU78 ensembles for MSLP above normal
over the North Atlantic and Europe are plotted for (a) January 1999 and (b) December 1998. Results are
similar to those obtained using Brier Skill Scores, but there are some important differences. In both months
shown the advantages of the MMAE approach are greatest at shorter lead-times. For the January 1999
data, both MMAE ensembles give improvements in skill over EPS of between 10 and 20% up to about 108
hours, decreasing at longer lead-times but remaining positive out to 240 hours. There is little difference
between EEUU54 and EEUU78. For December 1998 the larger joint ensemble, EEUU78, gives similar
benefits at lead-times up to 108 hours, but these fall to near zero or slightly negative beyond about 144 hours.
The reduced joint ensemble EEUU54 is consistently less good by this measure in this period, and the
skill is negative by 5-10% for lead-times greater than 132 hours.
6. Discussion and Conclusions
Previous work has demonstrated the benefits of multi-model multi-analysis ensembles for limited
case studies (Evans et al, 1999). The aim of this study is to determine the extent to which such benefits
may be obtained in a quasi-operational system using the current operational versions of models. Results
shown have illustrated that substantial gains in forecast skill may be achieved by adding a second model
and analysis into the EPS system, and that in most cases these gains may be achieved without increasing
the overall size (and therefore running cost) of the ensemble.
It is notable that the benefits of the MMAE system vary on a range of time-scales, from month to
month and from day to day, and also by geographical region. This is believed to be due to the two models
in the system performing differently in differing synoptic situations. In comparison with a single-model
ensemble (such as the EPS), it could be expected that the MMAE might perform better in synoptic
situations which are generally handled better by the additional model. However the differences in
performances of skilful models are often subtle, and the situations in which one or the other is likely to
perform better are normally impossible to identify synoptically. By combining the two models together into a
joint ensemble, the results demonstrate that the benefits of whichever is the better model may be obtained
all the time. On both monthly and daily time-scales, the MMAE ensembles effectively act as a filter for
the better-performing individual ensemble, and on occasions perform better than either individual
ensemble.
Figure 5: Area under ROC skill scores, relative to EPS, for MSLP above normal over the North Atlantic
and Europe in (a) January 1999 and (b) December 1998.
An important result of this study is that even when the single model ensemble is performing well, and the
second model less well, as in the December 1998 results, the joint ensemble does not significantly
degrade the performance of the single-model ensemble. The only exception to this is in the area under
ROC skill for the EEUU54 configuration of the MMAE at longer lead-times, where it appears that removing
some EPS members and replacing them with UM members did slightly degrade the overall performance.
However this degradation is smaller than the gains achieved in the same month for shorter lead-times, and
in other months the MMAE gave positive gains by this measure right out to 240 hours in the forecast.
Variations in MMAE performance by geographical region are also believed to be due to synoptic
differences. The greatest gains noted in this study have been over Europe, although insufficient data have
yet been analysed to determine if this is a consistent pattern. However it is reasonable to expect that, if
gains are due to the models' different performances for synoptic-scale systems, larger benefits will on
occasion be observed over smaller geographical regions when particular synoptic types prevail in those
regions. Benefits over larger areas, such as the North Atlantic and Europe region or Northern Hemisphere,
would be expected to be smaller, but achieved more consistently over long periods of time. This idea is
generally supported by the results so far analysed.
In the introduction it was noted that Buizza et al (1999) have introduced stochastic perturbations of the
model physics into the EPS in order to represent some aspects of model uncertainties. The benefits of this
change have mostly been limited to improvements in verification of weather parameters, particularly
precipitation. This is perhaps to be expected, as the stochastic perturbations do not alter the overall
attractor of the model. This change was introduced to the EPS before the current study was conducted, so
the benefits of the multi-model ensembles demonstrated here have been achieved over the EPS with the
stochastic physics system in operation. This demonstrates that, although the stochastic physics scheme is
beneficial, there are still substantial further benefits to be gained by addition of one or more other models
with different, but skilful, attractors, into the ensemble system.
One aspect of the MMAE system which has not been addressed above is the extent to which the
benefits of the MMAE are due to incorporating an additional analysis as well as an additional model. Evans
et al (1999) concluded that the model dependence of results is non-negligible in the first 48 hours, becomes
equal to the analysis dependence around Day 8 of the forecast, and dominates thereafter. It is therefore
likely that a significant part of the benefits of the MMAE described above are due to the additional analysis
rather than the additional model. Richardson (1999) has presented a comparison of results of the MMAE
with a multi-analysis ensemble generated using the ECMWF model, and concluded that the multi-analysis
ensemble gave approximately 50% of the benefits of the MMAE over medium-range time-scales. Thus,
while substantial benefits may be gained by incorporating multi-analyses, the multi-model approach is still
required for the full benefits of the MMAE system.
7. References
Brier, G.W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1-3.
Buizza, R., Barkmeijer, J., Palmer, T.N. and Richardson, D., 1999: Current status and future developments of
the ECMWF Ensemble Prediction System. Submitted to Meteorol. Appl., January 1999.
Cullen, M.J.P., 1993: The unified forecast/climate model. Meteor. Mag., 122, 81-93.
Evans, R.E., Harrison, M.S.J. and Graham, R.J., 1999: Joint Medium Range Ensembles from the UKMO
and ECMWF Systems. Submitted to Mon. Wea. Rev., March 1999.
Houtekamer, P.L., Lefaivre, L., Derome, J., Ritchie, H. and Mitchell, H.L., 1996: A system simulation
approach to ensemble prediction. Mon. Wea. Rev., 124, 1225-1242.
Molteni, F., Buizza, R., Palmer, T.N. and Petroliagis, T., 1996: The ECMWF Ensemble Prediction System:
Methodology and Validation. Quart. J. Roy. Meteor. Soc., 122, 73-119.
Richardson, D., 1999: Presentation to ECMWF EPS Experts Meeting, 17-18 June 1999, ECMWF, Reading, UK.
Stanski, H.R., Wilson, L.J. and Burrows, W.R., 1989: Survey of Common Verification Methods in
Meteorology. WMO WWW Tech. Report No. 8, WMO TD No. 358.
Toth, Z. and Kalnay, E., 1993: Ensemble Forecasting at the NMC: The generation of perturbations.
Bull. Amer. Meteor. Soc., 74, 2317-2330.
Wilks, D.S., 1995: Statistical Methods in the Atmospheric Sciences: An Introduction. International
Geophysics Series, Vol. 59, Academic Press.