WORLD METEOROLOGICAL ORGANIZATION

TCM-VI/Doc. 2.4 (13.X.2009)

SIXTH TROPICAL CYCLONE RSMCs/TCWCs TECHNICAL COORDINATION MEETING
ITEM 2.4
BRISBANE, AUSTRALIA, 2 TO 5 NOVEMBER 2009

ENGLISH ONLY

Standard format for verification of TC forecasts
(Submitted by the Secretariat)

This document provides the standard format for the verification of TC forecasts which was proposed by RSMC La Réunion and submitted to TCM-5.

ACTION PROPOSED

It was noted at TCM-3 that the proposed verification methodologies would be acceptable in principle. At its fifth session, TCM reviewed the standard format proposed by RSMC La Réunion in response to the request from TCM-4 and in consideration of the comments from the centres concerned. Nevertheless, TCM-5 reserved acceptance of the format, because the parameters for verification differed between centres, which also held different views on the measurement of gain in skill. It recognized the difficulty of reaching a consensus. The committee may discuss how to deal with this issue.

__________________

Standard format for warning verification statistics of TC forecasts provided by TC RSMCs, TCWCs and NWP Centres

I. Introduction

As requested by the Second TC RSMCs Technical Coordination Meeting (STCM), a methodology to standardize the verification of the forecasts provided by TC RSMCs and TCWCs was submitted to TCM-3 by the representative of RSMC La Réunion. TCM-3 noted that the proposed methodology would be acceptable in principle, but also that it should be further studied by the other TC RSMCs and TCWCs. TCM-4 therefore requested that the centres review the proposed format and forward their comments to RSMC La Réunion (Mr Philippe Caroff) before 1 February 2003, and that the final format be sent to the TC RSMCs and TCWCs by 1 September 2003.

The proposed standard format is as follows.

Global overview of the season (or, in addition, individual verification statistics for each system?)

Parameters: track forecasts, intensity forecasts (winds; optional: pressure).

Forecast periods to verify: errors at the 0h (analysis), 12h, 24h and 48h forecast periods; 72h recommended. Optional: 36h and long-range forecasts.

Sampling of data to be verified

Proposal: to verify all the forecasts disseminated, excluding those concerning a system classified, at the analysis stage, as extratropical, and excluding those concerning a system whose maximum wind is, both at the time of analysis and at the time of forecast, strictly below near-gale force (gale force, respectively) for winds averaged over 10 minutes (1 minute, respectively). A minimal selection sketch is given at the end of this section.

Optional: discrimination by intensities, i.e. to provide equivalent statistics calculated respectively for:
- all forecasts for a system whose maximum wind speed, at analysis or at forecast, is greater than or equal to 64 kt (10-minute average winds);
- all forecasts for a system whose maximum wind speed is between 34 and 63 kt (10-minute average) at analysis (or at forecast) and not greater than or equal to 64 kt at forecast (or at analysis).

The size of the samples (number of forecasts behind the statistics submitted) should be systematically indicated (for example, in brackets adjacent to the corresponding statistic).
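To illustrate the proposed sampling rule, the following minimal Python sketch assumes each disseminated forecast is available as a simple record; the wind thresholds (near gale taken as 28 kt for 10-minute winds, gale as 34 kt for 1-minute winds) and all names are illustrative assumptions, not part of the proposal.

    # Sketch of the proposed sample selection (illustrative names).
    NEAR_GALE_10MIN_KT = 28   # near-gale threshold, 10-minute average winds
    GALE_1MIN_KT = 34         # gale threshold, 1-minute average winds

    def in_verification_sample(analysis_wind_kt, forecast_wind_kt,
                               is_extratropical, one_minute_winds=False):
        """Return True if a disseminated forecast belongs to the sample:
        the system is not analysed as extratropical, and its maximum wind
        reaches the threshold at analysis time or at forecast time."""
        if is_extratropical:
            return False
        threshold = GALE_1MIN_KT if one_minute_winds else NEAR_GALE_10MIN_KT
        return (analysis_wind_kt >= threshold
                or forecast_wind_kt >= threshold)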
II. Statistics on verification of track forecasts

All statistical data on track forecast errors are to be indicated in kilometres.

Simple errors

A global statement on track forecast errors will be provided. This will contain the mean Direct Positional Error (DPE), measured as the great-circle distance between the forecast position and the observed position (Best Track point), together with its standard deviation. Indication of the median is also recommended.

For systems considered individually, a visual representation is recommended, in the form of a graph of tracks and associated forecast errors (at the different forecast ranges). A further, non-compulsory option would be to present a histogram of errors for the different forecast ranges. A distribution of errors by 50 km segments (0-50, 50-100, etc.) and by frequency seems particularly advisable. The percentage of forecasts with errors below typical thresholds might also be included (for example, errors below 100 km at the 12h forecast period, 150 km at 24h, 300 km at 48h, 450 km at 72h).

Measuring biases

Two types of biases may be identified: zonal and meridional biases, and biases calculated relative to the observed track of the system in question. The classic definitions are as follows:

DX = positional error in the east-west direction, with the sign convention that DX is positive when the forecast position is to the east of the observed position.
DY = positional error in the north-south direction, with the sign convention that DY is positive when the forecast position is located on the polar side of the observed position.
AT = positional error along the axis of the track (Along Track), with the sign convention that AT is positive when the forecast is ahead of the observed position.
CT = positional error transverse to the track (Cross Track), with the sign convention that CT is positive when the forecast is located to the right (to the left, respectively) of the observed track in the northern hemisphere (southern hemisphere, respectively).

A code sketch of these sign conventions is given at the end of this subsection.

As the quantities above are signed, a simple arithmetic average of these errors over all forecasts is of little value, since biases can be hidden by artificial compensation between positive and negative values. A scalar average (average of absolute errors), conversely, provides information on the average size of the deviations between forecasts and observations. This can point to major timing differences (low scalar average on CT with high scalar average on AT) or major track errors (high scalar average on CT with low scalar average on AT).

More informative still is to distinguish between positive and negative errors, and to present the average values and standard deviations associated with positive AT errors, negative AT errors, positive CT errors and negative CT errors (and likewise for DX and DY) separately. The relative frequency of occurrence of positive and negative errors is an indication of possible bias and is worth noting here, even if it is easily derived by comparing the sample sizes indicated in brackets.

To avoid an overabundance of numbers, graphical overviews would undoubtedly be more appropriate. To highlight biases towards over-estimation or under-estimation of track speed (linked to AT), or towards forecasts too far left or right of the real track (linked to CT, and useful for revealing a tendency to anticipate, delay or miss track recurvatures), a simple method consists of viewing the sample of forecast errors as a graphical representation (axes of AT, CT or DX, DY coordinates), by forecast period, of the distribution of errors (scatter diagram).

An equally visual method, but one conveying more than purely qualitative information, consists of constructing, for each forecast period, a "wind rose" in which the forecast error is treated vectorially and defined by its norm (the DPE) and by its angle of deviation relative to the real track (the angle derived from AT and CT). For this, errors need to be sorted into classes and the respective frequencies of the various classes calculated. The definition of the classes should be adapted to the size and type of the sample. For 12h forecasts, classes of 30° (or 45° for a limited sample) in angular deviation are proposed, together with 50 km steps for the DPE (0-50 km, 50-100 km, 100-150 km, 150-200 km, > 200 km), which already amounts to 60 classes (40 in the 45° case). The ideal (absence of bias and small errors) is a wind rose whose distribution is balanced between the right and left hemispheres (no directional bias) and between the upper and lower hemispheres (no speed bias), grouped as tightly as possible around the vertical axis (highest frequencies for small deviations) and concentrated in the small-error classes.
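These sign conventions translate directly into code. The following minimal sketch uses a local tangent-plane approximation around the observed position (adequate for typical error magnitudes; an exact great-circle DPE could be substituted); all names are illustrative.

    import math

    EARTH_RADIUS_KM = 6371.0

    def track_errors_km(lat_f, lon_f, lat_o, lon_o, track_dir_deg):
        """DPE, DX, DY, AT and CT (km) with the sign conventions above.
        (lat_f, lon_f): forecast position; (lat_o, lon_o): Best Track position;
        track_dir_deg: observed direction of motion, degrees clockwise from north."""
        de = math.radians(lon_f - lon_o) * EARTH_RADIUS_KM * math.cos(math.radians(lat_o))
        dn = math.radians(lat_f - lat_o) * EARTH_RADIUS_KM
        dpe = math.hypot(de, dn)                    # small-displacement approximation
        dx = de                                     # positive: forecast east of observed
        dy = dn if lat_o >= 0 else -dn              # positive: forecast on the polar side
        th = math.radians(track_dir_deg)
        at = de * math.sin(th) + dn * math.cos(th)  # positive: forecast ahead of observed
        ct = de * math.cos(th) - dn * math.sin(th)  # positive: right of track (NH)
        if lat_o < 0:
            ct = -ct                                # SH convention: positive to the left
        return dpe, dx, dy, at, ct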
The measurement of gain in skill

In order to compare forecasts made in highly variable conditions and to evaluate their quality while taking their degree of difficulty into account (in particular with the aim of detecting trends in forecast quality over time), several options are possible. The first is the measurement of the gain (or loss) in skill achieved by the forecast relative to a reference model.

A reference model must first be chosen. PERSISTENCE (calculated from the Best Track and the movement during the last 12 hours), which is the simplest forecast model, or a more developed model such as CLIMATOLOGY or CLIPER (combining CLImatology and PERsistence), is usually used as the reference. The gain in skill relative to CLIPER is quantified in percentage terms by:

Gain in skill (%) = (CLIPER DPE - DPE) / (CLIPER DPE) × 100

A code sketch of this computation is given at the end of this subsection. The samples used for calculating the gain of official forecasts (or NWP forecasts) over the reference model (CLIPER or PERSISTENCE) will be less comprehensive, because obtaining the reference data generally requires knowledge of the positions observed 12h and 24h before the base time; it is therefore not possible to calculate skill values at the beginning of the trajectory.

These skill calculations do not exactly reflect (they underestimate) the actual gain in skill provided by the forecasters, because they are calculated from Best Track points. This does not, of course, prevent individual centres from taking an interest, at an internal level, in verifying forecasts against real-time data, which are the only true working basis available to forecasters when making forecasts.

One problem is that the definition of CLIPER models can differ between cyclone basins. The ideal would be a universal model serving as the sole reference. The MOCCANA climatological model (a model of analogues) developed at La Réunion, which gives results similar to CLIPER results, could be proposed as this reference. Otherwise, the simplest solution would be to adopt PERSISTENCE as the reference.
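A minimal sketch of the gain-in-skill computation, assuming the mean DPEs have already been computed over the common (reduced) sample described above; names are illustrative.

    def skill_gain_percent(reference_dpe_km, forecast_dpe_km):
        """Gain in skill (%) of the forecast relative to the reference model
        (CLIPER or PERSISTENCE). Positive values mean the forecast beats the
        reference; negative values mean a loss in skill."""
        return (reference_dpe_km - forecast_dpe_km) / reference_dpe_km * 100.0

    # Example: a 24h mean DPE of 140 km against a CLIPER mean DPE of 200 km
    # corresponds to a 30% gain in skill.
    gain = skill_gain_percent(200.0, 140.0)   # -> 30.0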
For a given cyclone basin, once the reference model is chosen, a graph of the time evolution of the gain in skill could be presented; but this alone will not make it possible to validate an improvement (or deterioration) in the quality of forecasts. A season that is rather easy in terms of forecast difficulty, presenting very persistent and/or climatological trajectories, may well yield smaller skill gains than a difficult season, in which large gains are easier to achieve.

Normalization or weighting of forecasts

This is why other options could be explored to try to overcome this natural variation in the degree of difficulty of cyclone seasons. One solution rests on the fact that, with a large enough sample of systems and trajectories, this variability tends to disappear (by integrating a large number of trajectories, it is assumed that all forecast situations are included, from the easiest to the most difficult). Consequently, by establishing a running mean of the gains (or even directly of the forecast errors) over several cyclone seasons, a statistically significant seasonal tendency is likely to appear, without interference from season-to-season variations in difficulty. The period over which this running mean should be computed remains to be determined; the 5-year running mean used by some centres is not necessarily sufficient.

While the previous method appears relevant for assessing the evolution of forecast quality within a given cyclone basin taken individually, other problems could be considered. It appears a priori unrealistic to hope for comparisons between groups of forecasts made in different cyclone basins, which are therefore of differing difficulty (see Neumann's work on the comparative difficulty of the different cyclone basins). However, it is perhaps nevertheless possible to try to quantify the quality of forecasts in relation to their degree of difficulty. To do this, a correction factor could be applied to the average forecast error, with a formula incorporating the season's degree of difficulty relative to a standard reference. Ideally, the coefficient could, for example, be defined by comparing the season's average PERSISTENCE forecast error with the climatological average PERSISTENCE forecast error over a 30-year period. If the season has been easier than usual, the correction coefficient is above 1, and below 1 if the season has been more difficult. In this way, "standardized" average annual forecast errors become comparable from season to season. A sketch of this standardization is given below.
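A minimal sketch of the running mean and of the difficulty correction described above, assuming PERSISTENCE as the reference model and a multiplicative correction factor; names are illustrative.

    def running_mean(seasonal_errors_km, window_years=5):
        """Trailing running mean of seasonal mean errors (km) over a window
        of seasons; the appropriate window length remains to be determined."""
        return [sum(seasonal_errors_km[i - window_years + 1:i + 1]) / window_years
                for i in range(window_years - 1, len(seasonal_errors_km))]

    def standardized_seasonal_error(season_mean_error_km,
                                    season_persistence_error_km,
                                    climo_persistence_error_km):
        """Scale a season's mean forecast error by a difficulty coefficient:
        the 30-year climatological mean PERSISTENCE error divided by this
        season's mean PERSISTENCE error. An easy season (low PERSISTENCE
        error) gives a coefficient above 1, inflating its errors so that
        seasons of differing difficulty become comparable."""
        coefficient = climo_persistence_error_km / season_persistence_error_km
        return season_mean_error_km * coefficient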
III. Verification of intensity forecast statistics

Intensity forecasts issued by RSMCs or warning centres apply to the central pressure and the maximum wind of tropical systems, and in some cases to the forecast Dvorak intensity (CI number). The verification of intensity forecasts will apply more specifically to the first two parameters, with priority given to maximum wind.

Verification of maximum wind forecasts

The first question concerns the choice of units. The m/s (international system unit) or the knot, used almost universally in advisories and forecast bulletins, are both recommended. These forecast errors, being signed, will be treated in the same way as the track forecast errors AT, CT, DX and DY. A distinction will be made between absolute average errors (with standard deviation and median), giving the average magnitude of the forecast error (still relative to Best Track intensities), and data on the biases, for which positive errors (forecast maximum wind above the maximum wind recorded or estimated) and negative errors will be separated, each with its average, standard deviation and relative frequency (giving the possible bias towards over-estimation or under-estimation of intensities).

Optionally, as with track forecasts, histograms of intensity errors at the different forecast periods can be used effectively. A division of errors into steps of 5 kt or 2.5 m/s (…, -2.5 m/s, 0 m/s, +2.5 m/s, etc.) and by frequency appears well adapted and could be recommended.

Verification of central pressure forecasts

The accepted unit will be the hectopascal, and the statistics will be presented following the methodology described above for maximum wind (5 hPa being the recommended step).

IV. Verification of cyclogenesis forecasts

Centres disseminating cyclogenesis forecasts generally issue their forecasts of the formation of tropical depressions in a probabilistic form (probability of cyclogenesis "poor", "fair" or "good"). This particular type of forecast is therefore monitored with adapted statistical tools, such as contingency tables, with aggregated scores calculated on the tables. These aggregated scores will provide the following elements: percentage of correct forecasts, false alarm rate, non-detection rate, Heidke skill index (comparison with a random forecast), Rousseau index (comparison with a random forecast, but one in keeping with climatology).

These quality indices are presented in the form:

Q = (B - H) / (T - H)

where B is the number of correct forecasts, T is the total number of forecasts and H is the number of correct forecasts that the reference forecast would achieve. For the percentage of correct forecasts, the reference forecast is taken as always incorrect (H = 0). For the Heidke index (the Rousseau index, respectively), H is the number of correct forecasts expected from a random forecast (a random forecast in keeping with climatology, respectively) and is deduced from the marginal totals of the rows and columns of the table.

N.B. Other types of forecasts could be subject to similar processing: forecasts of track recurvature, forecasts of intensity above a certain threshold, or forecasts of reaching hurricane intensity, for example.
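As an illustration, a minimal sketch of these aggregated scores for a 2 x 2 contingency table (hits, false alarms, misses, correct negatives); the Rousseau index follows the same Q formula with a climatological reference and is omitted here for brevity. All names are illustrative.

    def contingency_scores(a, b, c, d):
        """Aggregated scores for a 2 x 2 cyclogenesis contingency table:
        a = hits, b = false alarms, c = misses, d = correct negatives."""
        t = a + b + c + d                              # total forecasts (T)
        correct = a + d                                # correct forecasts (B)
        percent_correct = correct / t * 100.0          # Q with H = 0
        false_alarm_rate = b / (a + b) if (a + b) else 0.0
        non_detection_rate = c / (a + c) if (a + c) else 0.0
        # Heidke index: H is the number of correct forecasts expected from
        # a random forecast with the same marginal totals as the table.
        h_random = ((a + b) * (a + c) + (c + d) * (b + d)) / t
        heidke = (correct - h_random) / (t - h_random)
        return percent_correct, false_alarm_rate, non_detection_rate, heidke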