QUASI-OPERATIONAL MULTI-MODEL MULTI-ANALYSIS ENSEMBLES ON MEDIUM-RANGE TIMESCALES

Kenneth R. Mylne, Robin T. Clark and Ruth E. Evans
UK Meteorological Office, Bracknell, UK

1. Introduction

Uncertainty in medium-range forecasts is caused by errors in both the analysis and the forecast model. Historically, ensemble forecasting systems have been designed primarily to account for analysis errors by adding perturbations to the control analysis to generate ensemble members; e.g. Molteni et al. (1996) and Toth and Kalnay (1993) describe the original ensemble systems at ECMWF and NCEP respectively. Ensemble systems using only initial-condition perturbations typically display insufficient spread, characterised by an excessive proportion of verifying observations falling outside the range of the ensemble. In practical forecasting terms this means that many events are not predicted by any member of the ensemble, and that probability forecasts do not give reliable probabilities.

Ensemble research is now increasingly looking at methods of accounting for other uncertainties in the forecast process. For example, Buizza et al. (1999) have incorporated some elements of model uncertainty into the operational ECMWF ensemble by adding stochastic perturbations to the model physics within individual ensemble members; Houtekamer et al. (1996) run a number of separate versions of their model in an ensemble, for example with different versions of parametrization schemes. Another relatively cheap approach is to create a multi-model multi-analysis ensemble by combining ensembles run using different models. The additional use of more than one analysis to which perturbations are added may provide further unstable directions for ensemble spread that are not present with a single analysis. Evans et al. (1999) conducted a number of case studies combining ensembles from the ECMWF and UK Met Office (UKMO) NWP systems, using both analyses and both models from the two centres. They concluded that the joint ensemble significantly outperforms either individual ensemble, giving a gain in predictability of the order of one day over the ECMWF Ensemble Prediction System (EPS), with better coverage of the observations and evidence that the UKMO ensemble can include synoptically valuable information not present in the EPS. One limitation of this work was that it considered only a small number of case studies, selected in part because of poor EPS performance. Also, the model versions used were of lower resolution than the current operational EPS. In order to overcome these limitations, a multi-model multi-analysis ensemble (MMAE) has been implemented at ECMWF in a quasi-operational mode, using the current operational versions of both the ECMWF and UKMO models, and has been running daily since October 1998. In this paper we describe some initial results from this system.

2. Joint Ensemble System

The multi-model multi-analysis ensemble has been created by installing the NWP version of the UKMO Unified Model (UM) (see Cullen, 1993) on the computer system at ECMWF. This version is identical to the current operational global NWP model run at UKMO, except that the horizontal resolution is reduced to 288x217 grid-points, or 1.25° E-W by 0.833° N-S, giving a resolution similar to that of the operational EPS model at TL159L31. Initial conditions for the UM ensemble are created by adding the EPS perturbations to the UKMO analysis. Due to limited computing resources, the UM ensemble is run with 27 members (the control plus the first 13 perturbations, each added to and subtracted from the analysis to create pairs of perturbed analyses), compared with the 51 members of the EPS (control plus 25 pairs).
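As an illustration of this construction, the following is a minimal sketch in Python that builds the 27 UM initial states from a control analysis and the first 13 EPS perturbation fields. The function name, array shapes and use of plain numpy arrays are hypothetical illustrations under stated assumptions, not the actual ECMWF or UKMO software interfaces.

```python
import numpy as np

N_PAIRS = 13        # first 13 EPS perturbations are used for the UM ensemble
GRID = (217, 288)   # 0.833 deg N-S x 1.25 deg E-W global grid (illustrative)

def make_um_initial_conditions(analysis, perturbations):
    """Build the 27 UM initial states: the control plus 13 +/- pairs.

    Each EPS perturbation is both added to and subtracted from the
    UKMO control analysis, giving 1 + 2 * N_PAIRS = 27 members.
    """
    states = [analysis.copy()]           # member 0: unperturbed control
    for pert in perturbations[:N_PAIRS]:
        states.append(analysis + pert)   # positive member of the pair
        states.append(analysis - pert)   # negative member of the pair
    return states

# Example with random placeholder fields standing in for real analyses:
rng = np.random.default_rng(0)
analysis = rng.normal(size=GRID)
eps_perts = [rng.normal(scale=0.1, size=GRID) for _ in range(25)]
members = make_um_initial_conditions(analysis, eps_perts)
assert len(members) == 27
```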
Output from the UM ensemble is combined with EPS output to create the MMAE ensembles. Two versions are used: (i) a full version combining the 27 UM members with all 51 members of the EPS to give a 78-member joint ensemble (referred to as EEUU78), and (ii) a reduced version consisting of the 27 UM members combined with the corresponding 27 members of the EPS (EEUU54). The latter provides a useful comparison with the EPS for assessing the benefits of the MMAE system: the total number of members is similar, so it can be run at a similar computing cost to the EPS.

3. Skill Scores for Comparison of Forecasts

Following normal convention (e.g. Wilks, 1995), the skill of different forecasts is compared in terms of a skill score. For any verification diagnostic X, the skill of a forecast relative to some reference forecast is given by

$$ SS = \frac{X_{fc} - X_{ref}}{X_{perf} - X_{ref}} \times 100\% \qquad (1) $$

where X_fc is the value of X for the forecast, X_ref for the reference forecast and X_perf for a perfect deterministic forecast. A skill score has a maximum value of 100% for a perfect forecast (X_fc = X_perf) and a value of zero for performance equal to that of the reference (X_fc = X_ref). SS has no lower limit, with negative values representing poorer skill than the reference. Normally the reference forecast is a baseline such as a forecast of climatology, used to measure the skill of a system in making a forecast. In this case we are interested in the benefits of the MMAE ensemble compared with the EPS, so the EPS forecast is used as the reference. Results in the remainder of this paper will mostly be presented as skill scores in this way.

4. Deterministic Forecast Skill

To obtain benefit from the full information content of ensemble forecasts, they are best interpreted probabilistically. However, it is common practice also to use the ensemble mean (EM) as a deterministic forecast tool, and many studies (e.g. Molteni et al., 1996) have shown enhanced forecast skill from the EM compared to the control forecast. It is therefore useful to consider the EM skill of the MMAE ensemble. In figure 1 the RMSE (root mean square error) skill scores of the two configurations of the MMAE are plotted as a function of forecast lead-time for MSLP (mean sea-level pressure) over the North Atlantic and Europe in January 1999. Also shown is the score for the UM ensemble (UU27). Although the UM ensemble alone is no more skilful than the EPS, the MMAE ensembles both give between 5 and 10% improvement in EM skill over the EPS for lead-times between about 36 and 204 hours.

Figure 1: Ensemble Mean RMSE skill score of MSLP, relative to the EPS, calculated over the North Atlantic and Europe for January 1999. Different ensembles are as shown in the key: UU27 is the 27-member UM ensemble; EEUU54 and EEUU78 are the two versions of MMAE ensembles described in section 2.

When the RMSE is plotted directly, rather than as a skill score (not shown), this can be seen to give a gain in predictability of the order of 8 hours between T+36 and T+144, and up to 15 hours at longer lead-times. Similar results (not shown) are obtained for geopotential height at 500hPa (H500). For temperature at 850hPa (T850), gains in skill of between 2 and 5% are obtained. For precipitation there is little gain in skill (less than 2%), but equally there is no degradation compared with the EPS.
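As a concrete illustration of how eq. (1) is applied to the ensemble-mean verification, the sketch below computes the EM RMSE of two ensembles against a verifying analysis and the resulting skill score. It is a minimal sketch assuming gridded fields stored as numpy arrays, with X_perf = 0 since RMSE is zero for a perfect forecast; the random placeholder data are purely illustrative.

```python
import numpy as np

def ensemble_mean_rmse(members, verifying_analysis):
    """RMSE of the ensemble-mean field against the verifying analysis.

    members:            array of shape (n_members, nlat, nlon)
    verifying_analysis: array of shape (nlat, nlon)
    """
    em = members.mean(axis=0)                      # ensemble-mean field
    return np.sqrt(np.mean((em - verifying_analysis) ** 2))

def skill_score(x_fc, x_ref, x_perf=0.0):
    """Eq. (1): SS = (X_fc - X_ref) / (X_perf - X_ref), as a percentage.

    For a negatively oriented diagnostic such as RMSE, X_perf = 0.
    """
    return 100.0 * (x_fc - x_ref) / (x_perf - x_ref)

# Example: a joint ensemble mean scored relative to the EPS ensemble mean,
# using random placeholder fields in place of real forecasts.
rng = np.random.default_rng(1)
truth = rng.normal(size=(73, 144))
eps = truth + rng.normal(scale=1.0, size=(51, 73, 144))
mmae = truth + rng.normal(scale=0.9, size=(54, 73, 144))
ss = skill_score(ensemble_mean_rmse(mmae, truth),
                 ensemble_mean_rmse(eps, truth))
print(f"EM RMSE skill score relative to EPS: {ss:.1f}%")
```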
It appears from figure 1 that the UM ensemble performs very poorly in the first 48 hours of the forecast, and this also results in poor performance of the MMAE ensembles, with negative skill scores in approximately the first 24 hours. This is believed to be because the verifying analysis used is the ECMWF analysis. Analysis fields are characterised by the short-period biases of the model used in the analysis cycle, and it is therefore to be expected that forecast fields produced by a different model from a different analysis will appear to verify poorly in the early stages of the forecast. If the UKMO analysis is used instead as the verifying analysis, the apparent poor performance of the UM and MMAE ensembles is removed, which supports this explanation. After about 48 hours of the forecast there is very little difference between results using different verifying analyses. However, since the aim of the verification is to investigate the benefits of the MMAE relative to the EPS, the ECMWF analysis is used throughout to avoid any risk of biasing results unfairly in favour of the UM or MMAE ensembles.

In figure 1 there is little difference between the results for the two MMAE ensembles, EEUU54 and EEUU78. This indicates that the gains in EM skill from the MMAE system are due to the greater effectiveness of the multi-model multi-analysis approach, not simply to the increase in ensemble size, and also that the gains may be obtained without increasing the overall size, and therefore cost, of the ensemble. The results quoted above are all based on January 1999; similar results were obtained for February 1999 (not shown). However, the MMAE system does not always give such large benefits over the EPS. In December 1998 (not shown) the individual UM ensemble performed considerably less well than the EPS, with RMSE EM skill scores for UU27 typically around -20%. Despite this, both configurations of the MMAE ensemble performed as well as the EPS. This variation in performance is discussed in section 6 below.

5. Probabilistic Forecast Skill

5.1 Brier Scores

The most commonly used verification diagnostic for probabilistic forecasts is the Brier Score, originally introduced by Brier (1950) and described in its modified standard form by Wilks (1995) as

$$ BS = \frac{1}{n} \sum_{i=1}^{n} (f_i - o_i)^2 \qquad (2) $$

The Brier Score is essentially the mean square error of probability forecasts of an event, where f_i and o_i are the forecast and observed probabilities respectively, o_i taking the value 1 when the event occurs and 0 when it does not. Brier Skill Scores (BSS) are calculated by using BS as X in the general skill score definition of eq. (1) above. In figure 2, Brier Skill Scores relative to the EPS for MSLP above normal over the North Atlantic and Europe in January 1999 are plotted against forecast lead-time for UU27 and the two configurations of the MMAE. Over the first two days of the forecast the performance of the UM ensemble appears poor; this is again believed to be due to the use of the ECMWF analysis for verification, as described in section 4. After this period the skill of UU27 is similar to, or slightly poorer than, that of the EPS. However, the MMAE ensembles give improvements of between 6 and 10% in BSS relative to the EPS for lead-times between 48 and 192 hours, and some improvement out to 240 hours.

Figure 2: Brier Skill Score for MSLP above normal, relative to the EPS, calculated over the North Atlantic and Europe for January 1999. Different ensembles as in figure 1.
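The following sketch illustrates the Brier Score of eq. (2) computed for an ensemble, together with a Brier Skill Score relative to a reference ensemble via eq. (1). The forecast probability is taken as the fraction of members predicting the event, a standard choice that the paper does not spell out, so treat that step, and all names and placeholder data, as illustrative assumptions.

```python
import numpy as np

def brier_score(member_forecasts, observations, threshold):
    """Eq. (2): BS = (1/n) * sum((f_i - o_i)^2) over n forecast cases.

    member_forecasts: array (n_cases, n_members) of forecast values
    observations:     array (n_cases,) of verifying values
    threshold:        event is "value above threshold" (e.g. MSLP above normal)
    """
    # Forecast probability: fraction of members predicting the event
    # (an assumed, standard way of forming probabilities from an ensemble).
    f = (member_forecasts > threshold).mean(axis=1)
    o = (observations > threshold).astype(float)  # 1 if event occurred, else 0
    return np.mean((f - o) ** 2)

# Example: BSS of a joint ensemble relative to the EPS via eq. (1),
# with X_perf = 0 since a perfect probability forecast has BS = 0.
rng = np.random.default_rng(2)
truth = rng.normal(size=1000)
eps = truth[:, None] + rng.normal(scale=1.0, size=(1000, 51))
joint = truth[:, None] + rng.normal(scale=0.9, size=(1000, 54))
bs_eps = brier_score(eps, truth, 0.0)
bs_joint = brier_score(joint, truth, 0.0)
bss = 100.0 * (bs_joint - bs_eps) / (0.0 - bs_eps)
print(f"BSS relative to EPS: {bss:.1f}%")
```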
When Brier Scores are plotted, rather than skill scores, it can be seen that this improvement in skill from the MMAE equates to a gain in predictability of around 10 hours. Similar results (not shown) are obtained for H500; for T850, improvements in skill are around 5%, and for 24-hour accumulated precipitation between 2 and 5%, with the largest gains at lead-times of 60 to 120 hours. From figure 2 it can be seen that there is little difference in BS performance between the two MMAE ensembles, EEUU54 and EEUU78. As with the deterministic EM benefits, the improvements are due to real gains in skill, and not simply to increased ensemble size.

In section 4 it was stated that the EM performance of the UM and MMAE ensembles was less good in December 1998 than in January 1999. In figure 3a, BSS results are shown for the same events as in figure 2, but for December 1998, and it is clear that in this case there is much less benefit from the MMAE approach. Nevertheless there are skill improvements of 2 to 5% over the EPS for lead-times of 48 to 120 hours for EEUU54, and some gains out to 204 hours for the full EEUU78. In figure 3b results are given for the same period, but calculated only over Europe. In this case the UU27 ensemble performed rather better at lead-times of 96 to 168 hours, and this resulted in better MMAE performance over the same period, with gains in skill of between 5 and 10% over the EPS for both EEUU54 and EEUU78.

These results show that the benefits of the multi-model approach vary both with time and with the geographical area over which they are calculated. This is believed to be due to variations in the relative performance of the two models in different synoptic situations. Both models are amongst the most skilful available, but each has its own independent strengths and weaknesses, and it is therefore likely that in some synoptic situations one will perform better, and in other situations the other will. The above results suggest that in December 1998 the EPS was performing comparatively well over the North Atlantic and Europe area, and it was therefore difficult for the UM ensemble to add to or improve on the information available. Consequently the MMAE ensemble performed only slightly better than the EPS. Over the more limited area of Europe the gains were greater, suggesting that it was largely in this area that the UM ensemble was adding useful information to the EPS. In January 1999 the UM ensemble individually performed almost as well as the EPS over the larger North Atlantic and Europe area, and the combination of information from both in the MMAE ensemble gave greater gains. Results for January over Europe (not shown) show that the UM ensemble performed better than the EPS for lead-times of 96 to 192 hours, and that the MMAE skill was between 8 and 13% better than the EPS between about 48 and 180 hours. These results are all consistent with the idea that most of the benefits from MMAE ensembles will be gained in regions where the EPS is performing relatively poorly, and the regions where this occurs can be expected to vary from time to time.
It is important to note that where the EPS is performing well, the addition of information from the UM ensemble, even at the expense of removing some information from the EPS as in the case of the EEUU54 ensemble, does not appear to reduce the performance significantly.

Figure 3: Brier Skill Score for MSLP above normal, relative to the EPS, for December 1998: (a) over the North Atlantic and Europe; (b) over Europe. Different ensembles as in figure 1.

The time dependence of the relative performances of the different ensembles can be broken down further. In figure 4 the Brier Scores for 120-hour forecasts of MSLP on individual days are plotted for the EPS (EE) and for the UU27 and EEUU54 ensembles, averaged over the North Atlantic and Europe area.

Figure 4: Brier Scores for individual days during January 1999, for 120-hour forecasts of MSLP above normal calculated over the North Atlantic and Europe. Different ensembles are as shown in the key: EE is the EPS; others as in figure 1. Points marked E indicate examples where the EPS is the better individual ensemble; points marked U where UU27 is better.

It can clearly be seen that the relative performance of the individual ensembles varies from day to day. The points marked E indicate examples of days where the EPS performs better, and those marked U where the UM ensemble is better (noting that the Brier Score is negatively oriented, with a lower score indicating better performance). However, on almost every day the EEUU54 joint ensemble performs as well as the better individual ensemble, and on occasion, such as at the point at the extreme left of figure 4, it performs better than either. Thus the addition of a second model into the ensemble frequently adds useful information and, importantly, does not significantly reduce the performance of an individual ensemble when that ensemble is performing well.

5.2 ROC (Relative Operating Characteristics)

An alternative and frequently used measure of probabilistic skill is the area under the ROC (Relative Operating Characteristics) curve, introduced by Stanski et al. (1989). ROC measures the skill of a forecast in terms of a hit rate (HR) and a false alarm rate (FAR), both classified according to the observations; more skilful forecasts are characterized by higher hit rates and lower false alarm rates. ROC therefore measures the usefulness of the forecasts as decision-making tools. The ROC curve is generated by calculating hit rates and false alarm rates for a range of probability thresholds of an event, and plotting HR against FAR. The area under this curve gives a measure of the overall skill of the forecast system, with a maximum value of 1.0 for a perfect forecast and a value of 0.5 indicating no skill. Skill scores relative to the EPS may be calculated for the area under the ROC curve using equation (1).

In figure 5, ROC area skill scores of the UU27, EEUU54 and EEUU78 ensembles for MSLP above normal over the North Atlantic and Europe are plotted for (a) January 1999 and (b) December 1998. The results are similar to those obtained using Brier Skill Scores, but there are some important differences. In both months the advantages of the MMAE approach are greatest at shorter lead-times. For the January 1999 data, both MMAE ensembles give improvements in skill over the EPS of between 10 and 20% up to about 108 hours, decreasing at longer lead-times but remaining positive out to 240 hours. There is little difference between EEUU54 and EEUU78.
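The ROC construction described above can be sketched as follows: each probability threshold converts the forecast probabilities into yes/no warnings, a hit rate and false alarm rate are computed against the observations, and the area under the resulting curve is obtained by trapezoidal integration. The choice of eleven evenly spaced thresholds and the member-fraction probabilities are illustrative assumptions, not details given in the paper.

```python
import numpy as np

def roc_area(forecast_probs, event_occurred, thresholds=None):
    """Area under the ROC curve for probability forecasts of an event.

    forecast_probs: array (n_cases,) of forecast probabilities in [0, 1]
    event_occurred: boolean array (n_cases,), True where the event verified
    """
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 11)   # decision thresholds 0%..100%
    n_events = event_occurred.sum()
    n_non_events = (~event_occurred).sum()
    hr, far = [], []
    for t in thresholds:
        warn = forecast_probs >= t               # issue a warning if p >= t
        hr.append((warn & event_occurred).sum() / n_events)        # hit rate
        far.append((warn & ~event_occurred).sum() / n_non_events)  # false alarms
    # Anchor the curve at (FAR, HR) = (1, 1) and (0, 0), then integrate
    # HR over FAR with FAR in increasing order.
    far = np.concatenate([[1.0], far, [0.0]])
    hr = np.concatenate([[1.0], hr, [0.0]])
    return np.trapz(hr[::-1], far[::-1])

# Example: probabilities formed as the fraction of members predicting the
# event (an assumed, standard way of deriving probabilities from an ensemble).
rng = np.random.default_rng(3)
truth = rng.normal(size=2000)
members = truth[:, None] + rng.normal(size=(2000, 51))
probs = (members > 0.0).mean(axis=1)
print(f"ROC area: {roc_area(probs, truth > 0.0):.3f}")  # 1.0 perfect, 0.5 no skill
```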
For December 1998 the larger joint ensemble, EEUU78, gives similar benefits at lead-times up to 108 hours, but these fall to near zero or slightly negative beyond about 144 hours. The reduced joint ensemble EEUU54 is consistently less good by this measure in this period, with skill negative by 5-10% for lead-times greater than 132 hours.

Figure 5: Area under ROC skill scores, relative to the EPS, for MSLP above normal over the North Atlantic and Europe in (a) January 1999 and (b) December 1998.

6. Discussion and Conclusions

Previous work has demonstrated the benefits of multi-model multi-analysis ensembles for limited case studies (Evans et al., 1999). The aim of this study is to determine the extent to which such benefits may be obtained in a quasi-operational system using the current operational versions of the models. The results shown illustrate that substantial gains in forecast skill may be achieved by adding a second model and analysis into the EPS system, and that in most cases these gains may be achieved without increasing the overall size (and therefore running cost) of the ensemble.

It is notable that the benefits of the MMAE system vary on a range of time-scales, from month to month and from day to day, and also by geographical region. This is believed to be due to the two models in the system performing differently in different synoptic situations. In comparison with a single-model ensemble (such as the EPS), the MMAE might be expected to perform better in synoptic situations which are generally handled better by the additional model. However, the differences in performance of skilful models are often subtle, and the situations in which one or the other is likely to perform better are normally impossible to identify synoptically. The results demonstrate that, by combining the two models into a joint ensemble, the benefits of whichever is the better model may be obtained all the time. On both monthly and daily time-averages, the MMAE ensembles effectively act as a filter for the better-performing individual ensemble, and on occasion perform better than either individual ensemble.

An important result of this study is that even when the single-model ensemble is performing well and the second model less well, as in the December 1998 results, the joint ensemble does not significantly degrade the performance of the single-model ensemble. The only exception is in the area under ROC skill for the EEUU54 configuration of the MMAE at longer lead-times, where it appears that removing some EPS members and replacing them with UM members did slightly degrade the overall performance. However, this degradation is smaller than the gains achieved in the same month at shorter lead-times, and in other months the MMAE gave positive gains by this measure right out to 240 hours.

Variations in MMAE performance by geographical region are also believed to be due to synoptic differences. The greatest gains noted in this study have been over Europe, although insufficient data have yet been analysed to determine whether this is a consistent pattern. However, it is reasonable to expect that, if gains are due to the models' different performances for synoptic-scale systems, then on occasion larger benefits will be observed over smaller geographical regions when particular synoptic types prevail in those regions.
Benefits over larger areas, such as the North Atlantic and Europe region or the Northern Hemisphere, would be expected to be smaller, but achieved more consistently over long periods of time. This idea is generally supported by the results analysed so far.

In the introduction it was noted that Buizza et al. (1999) have introduced stochastic perturbations of the model physics into the EPS in order to represent some aspects of model uncertainty. The benefits of this change have mostly been limited to improvements in the verification of weather parameters, particularly precipitation. This is perhaps to be expected, as the stochastic perturbations do not alter the overall attractor of the model. This change was introduced to the EPS before the current study was conducted, so the benefits of the multi-model ensembles demonstrated here have been achieved over the EPS with the stochastic physics system in operation. This demonstrates that, although the stochastic physics scheme is beneficial, there are still substantial further benefits to be gained by the addition of one or more other models, with different but skilful attractors, into the ensemble system.

One aspect of the MMAE system which has not been addressed above is the extent to which its benefits are due to incorporating an additional analysis as well as an additional model. Evans et al. (1999) concluded that the model dependencies of results are non-negligible in the first 48 hours, become equal with the analysis dependencies around day 8 of the forecast, and dominate thereafter. It is therefore likely that a significant part of the benefits of the MMAE described above is due to the additional analysis rather than the additional model. Richardson (1999) has presented a comparison of the MMAE with a multi-analysis ensemble generated using the ECMWF model, and concluded that the multi-analysis ensemble gave approximately 50% of the benefits of the MMAE over medium-range time-scales. Thus, while substantial benefits may be gained by incorporating multiple analyses, the multi-model approach is still required for the full benefits of the MMAE system.

7. References

Brier, G.W., 1950: Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 78, 1-3.

Buizza, R., Barkmeijer, J., Palmer, T.N. and Richardson, D., 1999: Current status and future developments of the ECMWF Ensemble Prediction System. Submitted to Meteorol. Appl., January 1999.

Cullen, M.J.P., 1993: The unified forecast/climate model. Meteor. Mag., 122, 81-93.

Evans, R.E., Harrison, M.S.J. and Graham, R.J., 1999: Joint medium-range ensembles from the UKMO and ECMWF systems. Submitted to Mon. Wea. Rev., March 1999.

Houtekamer, P.L., Lefaivre, L., Derome, J., Ritchie, H. and Mitchell, H.L., 1996: A system simulation approach to ensemble prediction. Mon. Wea. Rev., 124, 1225-1242.

Molteni, F., Buizza, R., Palmer, T.N. and Petroliagis, T., 1996: The ECMWF Ensemble Prediction System: methodology and validation. Quart. J. Roy. Meteor. Soc., 122, 73-119.

Richardson, D., 1999: Presentation to ECMWF EPS Experts Meeting, 17-18 June 1999, ECMWF, Reading, UK.

Stanski, H.R., Wilson, L.J. and Burrows, W.R., 1989: Survey of common verification methods in meteorology. WMO WWW Tech. Report No. 8, WMO TD No. 358.

Toth, Z. and Kalnay, E., 1993: Ensemble forecasting at the NMC: the generation of perturbations. Bull. Amer. Meteor. Soc., 74, 2317-2330.

Wilks, D.S., 1995: Statistical Methods in the Atmospheric Sciences: An Introduction. International Geophysics Series Vol. 59, Academic Press.