Introduction to Forecast Verification
Laurence J. Wilson, MSC, Dorval, Quebec

Outline
• Why verify: principles, goodness and goals of verification
• General framework for verification (Murphy-Winkler)
  – Joint and conditional distributions
  – Scores and measures in the context of the framework
  – Murphy's attributes of forecasts
• Value of forecasts
• Model verification
  – Data matching issues
  – Verification issues
• Verification of probability distributions
• Tricks of the trade
• Conclusion

References
• Wilks, D.S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press. Chapter 7.
• Stanski, H.R., L.J. Wilson, and W.R. Burrows, 1990: Survey of common verification methods in meteorology. WMO WWW Technical Report No. 8. Also available on the web – see below.
• Jolliffe, I.T. and D.B. Stephenson, 2003: Forecast Verification: A Practitioner's Guide in Atmospheric Science. Wiley.
• Murphy, A.H. and R.L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330-1338.
• Murphy, A.H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Wea. Forecasting, 8, 281-293.
• Web page of the joint WGNE/WWRP Verification Working Group: http://www.bom.gov.au/bmrc/wefor/staff/eee/verif/verif_web_page.html – has many links and much information (and is changing all the time).

ANNOUNCEMENT: This group is organizing a workshop on verification methods in or near Montreal, Quebec, September 13 to 17, 2004.

Evaluation of Forecasts
• Murphy's "goodness":
  – CONSISTENCY: forecasts agree with the forecaster's true belief about the future weather [strictly proper scores]
  – QUALITY: correspondence between observations and forecasts [verification]
  – VALUE: increase or decrease in economic or other kind of value to someone as a result of using the forecast [decision theory]

Evaluation of a Forecast System
• Evaluation of forecast "goodness"
• Evaluation of the delivery system
  – Timeliness (are forecasts issued in time to be useful?)
  – Relevance (are forecasts delivered to intended users in a form they can understand and use?)
  – Robustness (level of errors or failures in the delivery of forecasts)

Principles of (Objective) Verification
• Verification activity has value only if the information generated leads to a decision about the forecast or system being verified:
  – The user of the information must be identified
  – The purpose of the verification must be known in advance
• No single verification measure provides complete information about the quality of a forecast product.
• The forecast must be stated in such a way that it can be verified:
  – "chance" of showers
  – What does that gridpoint value really mean?
• Except for specific validation studies, verification should be carried out independently of the issuer of the product.

Goals of Verification
• Administrative
  – Justify the cost of provision of weather services
  – Justify additional or new equipment
  – Monitor the quality of forecasts and track changes
• Scientific
  – To identify the strengths and weaknesses of a forecast product in sufficient detail that actions can be specified that will lead to improvements in the product, i.e., to provide information to direct R&D.
Verification Model (cont'd)
• Predictand types:
  – Continuous: the forecast is a specific value of the variable
    • wind
    • temperature
    • upper-air variables
  – Categorical/probabilistic: the forecast is the probability of occurrence of ranges of values of the variable (categories)
    • POP and other weather elements (fog, etc.)
    • precipitation type
    • cloud amount
    • precipitation amount
  – Probability distributions (ensembles)

Framework for Verification (Murphy-Winkler)
• All verification information is contained in the joint distribution of forecasts and observations.
• Factorizations:
  – Calibration-refinement: p(f,x) = p(x|f) p(f)
    • p(x|f) = conditional distribution of the observations given the forecast (calibration/reliability)
    • p(f) = marginal distribution of the forecasts (refinement)
  – Likelihood-base rate: p(f,x) = p(f|x) p(x)
    • p(f|x) = conditional distribution of the forecasts given the observation (likelihood/discrimination)
    • p(x) = marginal distribution of the observations (base rate/climatology)

Verification Samples
• The joint distribution of forecasts and observations may be:
  – A time series at points
  – One or more spatial fields
  – A combination of these
• In meteorological applications:
  – The events of the sample are usually not even close to independent in the statistical sense
  – Hence the importance of carefully assessed confidence limits on verification results

[Exercise: spot temperature scatter plot]

Contingency Tables - Basic 2 x 2
(A computational sketch of this table and some derived scores appears after the attribute tables below.)

                  Forecast: Yes        Forecast: No
  Observed: Yes   HITS                 MISSES                  Total events observed
  Observed: No    FALSE ALARMS         CORRECT NEGATIVES       Total non-events observed
                  Total events fcst    Total non-events fcst   Sample size

Verification - A General Model (attributes 1-4)
1. Bias: Correspondence between the mean forecast and the mean observation.
   Related measures: bias (mean forecast probability minus sample observed frequency)
2. Association: Strength of the linear relationship between pairs of forecasts and observations.
   Related measures: covariance, correlation
3. Accuracy: Average correspondence between individual pairs of forecasts and observations.
   Related measures: mean absolute error (MAE), mean squared error (MSE), root mean squared error, Brier score (BS)
4. Skill: Accuracy of the forecasts relative to the accuracy of forecasts produced by a standard method.
   Related measures: Brier skill score, and others in the usual skill-score format

Skill Scores
• Format: Skill = (Sc - SSc) / (PSc - SSc)
  – Sc = score (MAE, Brier, etc.) of the forecasts being verified
  – PSc = score for a perfect forecast
  – SSc = score for a standard (unskilled) forecast

Attributes (cont'd, 5-9)
5. Reliability: Correspondence of the conditional mean observation and the conditioning forecast, averaged over all forecasts.
   Related measures: reliability component of the BS; MAE or MSE of binned data from the reliability table
6. Resolution: Difference between the conditional mean observation and the unconditional mean observation, averaged over all forecasts.
   Related measures: resolution component of the BS
7. Sharpness: Variability of the forecasts, as described by the distribution of forecasts.
   Related measures: variance of the forecasts
8. Discrimination: Difference between the conditional mean forecast and the unconditional mean forecast, averaged over all observations.
   Related measures: area under the ROC; measures of separation of the conditional distributions; MAE or MSE of the scatter plot binned by observation value
9. Uncertainty: Variability of the observations, as described by the distribution of observations.
   Related measures: variance of the observations
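To make attributes 3, 5, 6 and 9 and the skill-score format concrete for probability forecasts of a binary event, here is a minimal computational sketch (mine, not from the lecture): the Brier score, reliability and resolution terms computed from binned forecasts, the uncertainty term, and a Brier skill score with sample climatology as the standard forecast. The function name, bin choice and synthetic data are illustrative assumptions.

```python
import numpy as np

def brier_decomposition(p_fcst, obs, n_bins=10):
    p_fcst = np.asarray(p_fcst, dtype=float)
    obs = np.asarray(obs, dtype=float)
    n = len(obs)
    base_rate = obs.mean()                      # sample climatology
    bs = np.mean((p_fcst - obs) ** 2)           # Brier score (accuracy, attribute 3)

    # Bin the forecasts: the calibration-refinement view, conditioning on the forecast.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(p_fcst, edges[1:-1])      # bin index 0 .. n_bins-1

    reliability = 0.0
    resolution = 0.0
    for k in range(n_bins):
        in_bin = idx == k
        n_k = in_bin.sum()
        if n_k == 0:
            continue
        f_k = p_fcst[in_bin].mean()             # mean forecast probability in the bin
        o_k = obs[in_bin].mean()                # conditional observed frequency
        reliability += n_k * (f_k - o_k) ** 2 / n          # attribute 5
        resolution += n_k * (o_k - base_rate) ** 2 / n     # attribute 6
    uncertainty = base_rate * (1.0 - base_rate)             # attribute 9

    # Brier skill score in the (Sc - SSc) / (PSc - SSc) format:
    # Sc = BS of the forecasts, SSc = BS of climatology (= uncertainty), PSc = 0.
    bss = (bs - uncertainty) / (0.0 - uncertainty)
    return {"BS": bs, "reliability": reliability, "resolution": resolution,
            "uncertainty": uncertainty, "BSS": bss}

# Hypothetical sample: 1000 probability forecasts of a binary event.
rng = np.random.default_rng(1)
obs = (rng.random(1000) < 0.3).astype(float)
p_fcst = np.clip(0.3 + 0.5 * (obs - 0.3) + rng.normal(0.0, 0.15, 1000), 0.0, 1.0)
print(brier_decomposition(p_fcst, obs))
```

With binned forecasts the decomposition BS ≈ reliability - resolution + uncertainty holds only approximately; the point here is simply how each attribute is estimated from the conditional and marginal distributions described above.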
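The 2 x 2 contingency table above also lends itself to a short sketch (again mine, not from the lecture): count hits, misses, false alarms and correct negatives and derive a few common measures. The hit rate and false alarm rate computed here are the quantities that, evaluated over a range of probability thresholds, trace out the ROC curves discussed next; function and variable names and the synthetic data are assumptions.

```python
import numpy as np

def contingency_scores(fcst_yes, obs_yes):
    """fcst_yes, obs_yes: boolean arrays flagging forecast / observed occurrences."""
    fcst_yes = np.asarray(fcst_yes, dtype=bool)
    obs_yes = np.asarray(obs_yes, dtype=bool)
    hits = np.sum(fcst_yes & obs_yes)
    false_alarms = np.sum(fcst_yes & ~obs_yes)
    misses = np.sum(~fcst_yes & obs_yes)
    correct_negatives = np.sum(~fcst_yes & ~obs_yes)
    n = hits + false_alarms + misses + correct_negatives
    return {
        "proportion correct": (hits + correct_negatives) / n,
        "hit rate (POD)": hits / (hits + misses),                     # P(fcst yes | obs yes)
        "false alarm rate": false_alarms / (false_alarms + correct_negatives),
        "frequency bias": (hits + false_alarms) / (hits + misses),    # forecast yes / observed yes
    }

# Hypothetical example: categorical forecasts obtained by thresholding probabilities.
rng = np.random.default_rng(2)
obs = rng.random(500) < 0.25
prob = np.clip(0.25 + 0.5 * (obs - 0.25) + rng.normal(0.0, 0.2, 500), 0.0, 1.0)
print(contingency_scores(prob >= 0.5, obs))
```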
[Figure: Relative Operating Characteristic (ROC) for ECMWF ensemble forecasts of 24 h precipitation > 1 mm (POP), Europe: hit rate vs. false alarm rate for the 96 h, 144 h and 240 h projections, with likelihood diagrams (number of cases vs. forecast probability, stratified into precipitation / no precipitation) for each projection. ROC areas Az = 0.839 (96 h), 0.777 (144 h), 0.709 (240 h); Da = 1.400, 1.077, 0.780.]

[Figure: Current ROC - Canadian EPS, Spring 2003]

Summary
• The choice of verification measure depends on:
  – The purpose of the verification (administrative vs. scientific)
  – The nature of the predictand
  – The attributes of the forecast to be measured

The Meaning of "Value"
• "Weather forecasts possess no intrinsic value in an economic sense. They acquire value by influencing the behaviour of individuals or organizations ("users") whose activities are sensitive to weather."
  – Allan Murphy, Conference on Economic Benefits of Meteorological and Hydrological Services (Geneva, 1994)

Types of "Value"
• Social value: minimization of hazards to human life and health
  – Value to individual users
• Economic value of forecasts
  – Value to a specific business
  – Value to a weather-sensitive industry
  – Value to a weather-sensitive sector
  – Value to the economy of a country
  – Market value (e.g., futures)
• Environmental value
  – Minimizing risk to the environment
  – Optimal use of resources

Value vs. Quality
• Quality refers only to forecast verification; value implicates a user
• A perfect forecast may have no value if no one cares about it
• An imperfect forecast will have less value than a perfect forecast
• See Murphy and Ehrendorfer (1987)

Measuring Value
• The cost-loss decision model (a computational sketch follows this list)
  – Focuses on maximizing gain or loss avoidance
  – Requires objective cost information from the user
  – User-specific, difficult to generalize
  – Economic value to a weather-sensitive operation only
  – Easy to evaluate relative value
• The contingent-valuation method
  – Focuses on demand for the service and "willingness to pay"
  – Requires surveys of users to determine variations in demand as a function of variations in the price and/or quality of the service
  – Less user-specific; a larger cross-section of users/industries can be evaluated in one study
  – Measures value in terms of perception rather than actual accuracy
  – e.g., evaluation of ATADs: Rollins and Shaykewich, Met. Apps., March 2003
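As a concrete illustration of the cost-loss decision model listed above, here is a minimal sketch (my own, not from the lecture): a single user can protect at cost C against a potential loss L, and protects whenever the forecast probability reaches C/L. Relative value is expressed on a scale where climatology scores 0 and perfect forecasts score 1. All numbers, names and the decision rule details are illustrative assumptions.

```python
import numpy as np

def expected_expense(p_fcst, obs, cost, loss, threshold):
    """Mean expense per case when the user protects (paying `cost`) whenever the
    forecast probability meets or exceeds `threshold`; an unprotected event costs `loss`."""
    protect = p_fcst >= threshold
    expense = np.where(protect, cost, np.where(obs == 1, loss, 0.0))
    return expense.mean()

def relative_value(p_fcst, obs, cost, loss):
    """Value of the forecasts on a scale where climatology = 0 and perfect forecasts = 1."""
    base_rate = obs.mean()
    e_clim = min(cost, base_rate * loss)   # always protect or never protect, whichever is cheaper
    e_perf = base_rate * cost              # perfect forecasts: protect only when the event occurs
    e_fcst = expected_expense(p_fcst, obs, cost, loss, threshold=cost / loss)
    return (e_clim - e_fcst) / (e_clim - e_perf)

# Hypothetical user (protection cost 1, potential loss 5) and ten forecast cases.
p_fcst = np.array([0.9, 0.1, 0.6, 0.2, 0.8, 0.05, 0.4, 0.7, 0.3, 0.95])
obs = np.array([1, 0, 1, 0, 1, 0, 0, 1, 0, 1])
print(relative_value(p_fcst, obs, cost=1.0, loss=5.0))
```

This illustrates why the result is user-specific: the same set of forecasts yields a different relative value for every cost/loss ratio.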
Model Verification
• Data matching issues
• Verification issues

Model Verification - Data Matching Issues
• Typical situation: gridded model forecasts, observations at points
  – Point in-situ observations undersample the field and contain information on all spatial and most temporal scales ("representativeness error"? Not really)
• Alternatives:
  – Model to data: what does the model predict at the observation point?
    • Interpolation, if the model gives point predictions
      – Gives answers at all verification points
    • Nearest gridpoint value, if the model prediction is a grid-box average
      – Verify only those grid boxes containing at least one observation
    • UPSCALING: estimate the grid-box average using all observations in the grid box
  – Data to model: analysing the point data onto the model grid
    • NOT RECOMMENDED, because it treats networks of point observations as if they contained information only on the scales represented by the grid on which the analysis is done.

Model Verification - Issues
• Resolution: scale separation
• Spatial verification: object-oriented methods

Model Verification - Scale Separation
• Mesoscale verification: separating model errors due to resolution from other errors.
• If high-resolution spatial data are available:
  – Scale separation by wavelets or another method (Mike Baldwin)
  – Repeat the verification on the same dataset at different scales to get a performance curve
  – Data combination: use the high-resolution data to "inform" statistical estimates such as grid-box averages

Spatial Verification
• Object-oriented methods
• Calculation of displacement and size errors for specific objects (e.g., rain areas, fronts)
  – Hoffman, 1995; Ebert and McBride, 2000 (CRA method)
  – Decomposition of errors into location, shape and size components
  – Others (Mike Baldwin)
  – The problem is always the matching of the forecast and observed objects.

Verification of Probability Distributions
• Problem:
  – Comparison of a distribution with a single outcome
• Solutions:
  – Verify the density in the vicinity of the observation (Wilson, Burrows and Lanzinger, 1999)
  – Verify the CDF against the observation represented as a CDF (CRPS; Hersbach, 2000)
  – Extract probabilities from the distribution and verify them as probability forecasts (sample several thresholds)
  – Compare parameters of the pooled distribution with the sample climatology (Talagrand diagram)

[Figure: Ensemble verification - distribution. CRPS example: forecast and observed CDFs (probability vs. x).]
[Figure: Rank histogram example]
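To illustrate the CDF-against-CDF approach, here is a minimal sketch of the CRPS for a single ensemble forecast, not taken from Hersbach (2000): it uses the identity CRPS = E|X - y| - 0.5 E|X - X'|, which for the empirical ensemble distribution equals the integral of the squared difference between the forecast CDF and the observation's step-function CDF. The ensemble values and names are invented for the example.

```python
import numpy as np

def crps_ensemble(members, obs_value):
    """CRPS of one ensemble forecast against a single observation, using
    CRPS = E|X - y| - 0.5 * E|X - X'| for the empirical ensemble distribution."""
    members = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(members - obs_value))
    term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return term1 - term2

# Hypothetical 16-member temperature ensemble (deg C) and its verifying observation.
members = [14.2, 15.1, 13.8, 16.0, 15.5, 14.9, 13.5, 15.8,
           14.0, 15.3, 16.2, 14.6, 15.0, 13.9, 15.7, 14.4]
print(crps_ensemble(members, obs_value=15.2))
```

Averaged over many cases, the CRPS reduces to the MAE for a deterministic (single-member) forecast, which is one reason it is a convenient score for comparing ensembles with deterministic guidance.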
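Similarly, a minimal sketch of the Talagrand (rank) histogram, mine rather than the lecturer's: count, over many cases, where the observation falls among the sorted ensemble members; a roughly flat histogram is consistent with the observation behaving like one more member of the ensemble. The data below are synthetic and only illustrative.

```python
import numpy as np

def rank_histogram(ens, obs):
    """ens: (n_cases, n_members) ensemble values; obs: (n_cases,) observations.
    Returns counts of the observation's rank within each ensemble (0 .. n_members)."""
    ens = np.asarray(ens, dtype=float)
    obs = np.asarray(obs, dtype=float)
    ranks = np.sum(ens < obs[:, None], axis=1)   # members strictly below the observation
    return np.bincount(ranks, minlength=ens.shape[1] + 1)

# Hypothetical: 1000 cases with 10-member ensembles drawn from the same
# distribution as the observations, so the histogram should be roughly flat.
rng = np.random.default_rng(3)
obs = rng.normal(size=1000)
ens = rng.normal(size=(1000, 10))
print(rank_histogram(ens, obs))
```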
Tricks of the Trade
• "How can I get a better (higher) number?"
  – Remove the bias before calculating the scores (works really well for quadratic scoring rules), and don't tell anyone.
  – Claim that the model predicts grid-box averages, even if it doesn't. Make the boxes as large as possible.
  – Never use observation data; it only contains a lot of "noise". As an alternative:
    • Verify against an analysis that uses the model being verified as the trial field. Works best in data-sparse areas.
    • Use a shorter-range forecast from the model being verified and call it observation data.
  – Design a new or modified score. Don't be bothered by restrictions such as strict propriety. Then the values can be as high as desired.
  – Stratify the verification data using a posteriori rules. One can always get rid of the pathological cases that bring down the average.
  – When comparing the performance of your model against others, make sure it is your analysis that is used as the verifying one.
  – Always insist on doing the verification of your own products…
    • Remember, you already know you have a good product. The goal of the verification is to show it "objectively".

Conclusions
• A quick survey of verification and value assessment
• A data-oriented perspective