Forecast Verification

Introduction to Forecast Verification
Laurence J. Wilson
MSC Dorval Quebec
Outline
• Why verify: Principles, goodness and goals of verification
• General Framework for verification (Murphy-Winkler)
– Joint and conditional distributions
– Scores and measures in context of the framework
– Murphy’s attributes of forecasts
• Value of forecasts
• Model Verification
– Data matching issues
– Verification issues
• Verification of probability distributions
• Tricks of the trade
• Conclusion
References
• Wilks, D.S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press. Chapter 7.
• Stanski, H.R., L.J. Wilson, and W.R. Burrows, 1990: Survey of Common Verification Methods in Meteorology. WMO WWW Technical Report No. 8. (Also available on the web; see below.)
• Jolliffe, I.T. and D.B. Stephenson, 2003: Forecast Verification: A Practitioner's Guide in Atmospheric Science. Wiley.
• Murphy, A.H. and R.L. Winkler, 1987: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330-1338.
• Murphy, A.H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Wea. Forecasting, 8, 281-293.
• Web page of the Joint WGNE/WWRP Verification Working Group: http://www.bom.gov.au/bmrc/wefor/staff/eee/verif/verif_web_page.html (has many links and much information, and is changing all the time)
ANNOUNCEMENT: This group is organizing a workshop on Verification
methods in or near Montreal, Quebec, September 13 to 17, 2004.
Evaluation of forecasts
• Murphy’s “goodness”
– CONSISTENCY: forecasts agree with forecaster’s true
belief about the future weather [strictly proper]
– QUALITY: correspondence between observations and
forecasts [verification]
– VALUE: increase or decrease in economic or other kind
of value to someone as a result of using the forecast
[decision theory]
Evaluation of forecast system
• Evaluation of forecast “goodness”
• Evaluation of delivery system
– timeliness (are forecasts issued in time to be useful?)
– relevance (are forecasts delivered to intended users in a
form they can understand and use?)
– robustness (level of errors or failures in the delivery of
forecasts)
Principles of (Objective) Verification
• Verification activity has value only if the information generated leads
to a decision about the forecast or system being verified
– User of the information must be identified
– Purpose of the verification must be known in advance
• No single verification measure provides complete information about
the quality of a forecast product.
• Forecast must be stated in such a way that it can be verified
– “chance” of showers
– What does that gridpoint value really mean?
• Except for specific validation studies, verification should be carried
out independently of the issuer of the product.
Goals of Verification
• Administrative
– Justify cost of provision of weather services
– Justify additional or new equipment
– Monitor the quality of forecasts and track changes
• Scientific
– To identify the strengths and weaknesses of a forecast
product in sufficient detail that actions can be specified
that will lead to improvements in the product, i.e. to
provide information to direct R&D.
Verification Model (cont’d)
• Predictand Types
– Continuous: Forecast is a specific value of the variable
• wind
• temperature
• upper air variables
– Categorical/probabilistic: Forecast is the probability of
occurrence of ranges of values of the variable
(categories)
• POP and other weather elements (fog etc.)
• Precipitation type
• Cloud amount
• Precipitation amount
– Probability distributions (ensembles)
Framework for Verification (Murphy-Winkler)
• All verification information contained in the joint distribution of
forecasts and observations
• Factorizations:
– calibration-refinement
p(f,x) = p(x|f) p(f)
p(x|f) = conditional distribution of observation given forecast
(calibration/reliability)
p(f) = marginal distribution of forecasts (refinement)
– likelihood-base rate
p(f,x) = p(f|x)p(x)
p(f|x) = conditional distribution of forecasts given observations
(likelihood/discrimination)
p(x) = marginal distribution of observations (base rate/climatology)
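As a concrete illustration of the two factorizations, the short Python sketch below (not from the lecture; the forecast and observation values are made up) builds the joint distribution from a small sample of binned probability forecasts and binary observations, then recovers both factorizations.

# Sketch of the Murphy-Winkler factorizations for binned probability forecasts f
# and binary observations x (illustrative data only).
import numpy as np

f = np.array([0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.1])  # forecast probabilities (binned)
x = np.array([0, 0, 1, 0, 1, 1, 0, 1])                   # observed event (1) or non-event (0)

f_vals, x_vals = np.unique(f), np.unique(x)

# Joint distribution p(f, x): relative frequency of each (forecast, observation) pair
joint = np.array([[np.mean((f == fv) & (x == xv)) for xv in x_vals] for fv in f_vals])

# Calibration-refinement: p(f, x) = p(x|f) p(f)
p_f = joint.sum(axis=1)                 # marginal distribution of forecasts (refinement)
p_x_given_f = joint / p_f[:, None]      # conditional distribution of obs given forecast (calibration)

# Likelihood-base rate: p(f, x) = p(f|x) p(x)
p_x = joint.sum(axis=0)                 # marginal distribution of observations (base rate)
p_f_given_x = joint / p_x[None, :]      # conditional distribution of forecasts given obs (likelihood)

print(p_f, p_x)
print(p_x_given_f)
print(p_f_given_x)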
Verification Samples
• Joint distribution of forecasts and observations may be:
– A time series at points
– One or more spatial fields
– A combination of these
• In meteorological applications:
– The events of the sample are not usually even close to
independent in the statistical sense
– Importance of carefully assessed confidence limits of
verification results
Spot Temperature Scatter Plot Exercise
Contingency Tables - Basic 2 x 2

                     Observed: Yes          Observed: No             Total
Forecast: Yes        HITS                   FALSE ALARMS             Total events fcst
Forecast: No         MISSES                 CORRECT NEGATIVES        Total non-events fcst
Total                Total events obs       Total non-events obs     Sample size
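The table entries, and the hit rate and false alarm rate that appear later as the ROC axes, follow directly from paired yes/no forecasts and observations. The Python sketch below is a minimal illustration with made-up data.

# Sketch: filling the basic 2 x 2 contingency table and deriving hit rate and
# false alarm rate (illustrative data only).
import numpy as np

fcst = np.array([1, 1, 0, 0, 1, 0, 1, 0])  # 1 = event forecast, 0 = no event forecast
obs  = np.array([1, 0, 0, 1, 1, 0, 0, 0])  # 1 = event observed, 0 = no event observed

hits              = np.sum((fcst == 1) & (obs == 1))
false_alarms      = np.sum((fcst == 1) & (obs == 0))
misses            = np.sum((fcst == 0) & (obs == 1))
correct_negatives = np.sum((fcst == 0) & (obs == 0))

hit_rate         = hits / (hits + misses)                             # fraction of observed events that were forecast
false_alarm_rate = false_alarms / (false_alarms + correct_negatives)  # fraction of observed non-events that were forecast

print(hits, misses, false_alarms, correct_negatives, hit_rate, false_alarm_rate)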
Verification - A general model

1. Bias
   Definition: Correspondence between mean forecast and mean observation
   Related measures: bias (mean forecast probability minus sample observed frequency)
2. Association
   Definition: Strength of linear relationship between pairs of forecasts and observations
   Related measures: covariance, correlation
3. Accuracy
   Definition: Average correspondence between individual pairs of observations and forecasts
   Related measures: mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), Brier score (BS)
4. Skill
   Definition: Accuracy of forecasts relative to accuracy of forecasts produced by a standard method
   Related measures: Brier skill score, others in the usual format
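The bias, association and accuracy measures in the table can be computed in a few lines for a continuous predictand; the sketch below (made-up temperature pairs, not from the lecture) shows bias, correlation, MAE, MSE and RMSE.

# Sketch: bias, correlation, MAE, MSE and RMSE for paired forecasts and observations.
import numpy as np

fcst = np.array([12.0, 15.5, 9.0, 20.1, 17.3])   # forecast temperatures (made up)
obs  = np.array([11.2, 16.0, 10.5, 19.0, 18.1])  # observed temperatures (made up)

bias = fcst.mean() - obs.mean()        # 1. Bias: mean forecast minus mean observation
corr = np.corrcoef(fcst, obs)[0, 1]    # 2. Association: linear correlation
mae  = np.mean(np.abs(fcst - obs))     # 3. Accuracy: mean absolute error
mse  = np.mean((fcst - obs) ** 2)      #    mean squared error
rmse = np.sqrt(mse)                    #    root mean squared error

print(bias, corr, mae, mse, rmse)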
Skill Scores
• Format:

  Skill = (Sc - SSc) / (PSc - SSc)

• where Sc = score (MAE, Brier, etc.)
• PSc = score for a perfect forecast
• SSc = score for a standard (unskilled) forecast
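As an example of the skill-score format, the sketch below (made-up data) uses the Brier score with sample climatology as the standard forecast; for the Brier score the perfect-forecast value PSc is 0, so the formula reduces to the usual Brier skill score.

# Sketch: Brier skill score in the form Skill = (Sc - SSc) / (PSc - SSc), with
# climatology as the standard forecast and PSc = 0 (illustrative data only).
import numpy as np

p   = np.array([0.9, 0.7, 0.2, 0.1, 0.8, 0.3])  # probability forecasts (made up)
obs = np.array([1, 1, 0, 0, 0, 1])              # observed event (1) or non-event (0)

sc   = np.mean((p - obs) ** 2)     # Sc: Brier score of the forecasts
clim = obs.mean()                  # sample climatological frequency
ssc  = np.mean((clim - obs) ** 2)  # SSc: Brier score of the climatology forecast
psc  = 0.0                         # PSc: Brier score of a perfect forecast

skill = (sc - ssc) / (psc - ssc)
print(sc, ssc, skill)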
5. Reliability
   Definition: Correspondence of conditional mean observation and conditioning forecast, averaged over all forecasts
   Related measures: reliability component of BS; MAE, MSE of binned data from reliability table
6. Resolution
   Definition: Difference between conditional mean observation and unconditional mean observation, averaged over all forecasts
   Related measures: resolution component of BS
7. Sharpness
   Definition: Variability of forecasts as described by distribution of forecasts
   Related measures: variance of forecasts
8. Discrimination
   Definition: Difference between conditional mean forecast and unconditional mean forecast, averaged over all observations
   Related measures: area under ROC, measures of separation of conditional distributions; MAE, MSE of scatter plot binned by observation value
9. Uncertainty
   Definition: Variability of observations as described by the distribution of observations
   Related measures: variance of observations
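Reliability, resolution and uncertainty are most easily seen through Murphy's decomposition of the Brier score, BS = reliability - resolution + uncertainty, computed over forecast bins. The sketch below (made-up data, not from the lecture) illustrates the decomposition.

# Sketch: decomposition of the Brier score into reliability, resolution and
# uncertainty over forecast probability bins (illustrative data only).
import numpy as np

p   = np.array([0.1, 0.1, 0.3, 0.3, 0.7, 0.7, 0.9, 0.9])  # probability forecasts (binned)
obs = np.array([0, 0, 0, 1, 1, 0, 1, 1])                  # observed event (1) or non-event (0)

obar = obs.mean()                        # base rate (sample climatology)
reliability = resolution = 0.0
for pv in np.unique(p):                  # loop over forecast bins
    idx = (p == pv)
    n_k, obar_k = idx.sum(), obs[idx].mean()
    reliability += n_k * (pv - obar_k) ** 2       # forecast vs conditional mean observation
    resolution  += n_k * (obar_k - obar) ** 2     # conditional vs unconditional mean observation
reliability /= p.size
resolution  /= p.size
uncertainty = obar * (1.0 - obar)                 # variance of the binary observations

bs = np.mean((p - obs) ** 2)
print(bs, reliability - resolution + uncertainty)  # the two should agree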
ROC - ECMWF Ensemble Forecasts
24 h POP (>1 mm)
[Figure: Relative Operating Characteristic (hit rate vs false alarm rate, with the no-skill diagonal) for 24 h precipitation > 1 mm against European observations at 96 h, 144 h and 240 h projections, with likelihood diagrams (number of cases vs forecast probability, for pcpn and no pcpn) at each projection. Az (area under ROC): 96 h 0.839, 144 h 0.777, 240 h 0.709. Da: 1.400, 1.077, 0.780.]
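A ROC like the one above can be built by thresholding the ensemble-based probability forecasts at a series of values, forming a 2 x 2 table at each threshold, and plotting hit rate against false alarm rate; the area under the curve (Az) can then be estimated with the trapezoid rule. The sketch below uses made-up probabilities and observations.

# Sketch: ROC points and area from probability forecasts of a yes/no event
# (illustrative data only).
import numpy as np

p   = np.array([0.05, 0.25, 0.45, 0.65, 0.85, 0.95, 0.15, 0.55, 0.75, 0.35])  # forecast probabilities
obs = np.array([0, 0, 1, 1, 1, 1, 0, 0, 1, 0])                                # observed event / non-event

hit_rate, false_alarm_rate = [], []
for t in np.linspace(1.0, 0.0, 11):     # sweep threshold from strict to lenient
    yes = p >= t
    hits   = np.sum(yes & (obs == 1))
    misses = np.sum(~yes & (obs == 1))
    fa     = np.sum(yes & (obs == 0))
    cn     = np.sum(~yes & (obs == 0))
    hit_rate.append(hits / (hits + misses))
    false_alarm_rate.append(fa / (fa + cn))

az = np.trapz(hit_rate, false_alarm_rate)   # area under the ROC
print(az)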
Current ROC – Canadian EPS
Spring 2003
Summary
• The choice of verification measure depends on:
– The purpose of the verification (admin – science)
– The nature of the predictand
– The attributes of the forecast to be measured
The meaning of ‘Value’
• “Weather forecasts possess no intrinsic value in an
economic sense. They acquire value by
influencing the behaviour of individuals or
organizations (“users”) whose activities are
sensitive to weather.”
– Allan Murphy, Conference on economic benefits of
Meteorological and Hydrological services (Geneva,
1994)
Types of “Value”
• Social value - Minimization of Hazards to human
life and health
– Value to individual users
• Economic value of forecasts
– Value to a specific business
– Value to a weather-sensitive industry
– Value to a weather-sensitive sector
– Value to the economy of a country
– Market value (e.g. futures)
• Environmental value
– minimizing risk to the environment
– optimal use of resources
Value vs. Quality
• Quality refers only to forecast verification; Value
implicates a user
• A perfect forecast may have no value if no one
cares about it
• An imperfect forecast will have less value than a
perfect forecast
• See Murphy and Ehrendorfer 1987
Measuring value
• The cost-loss decision model
– focus on maximizing gain or loss-avoidance
– requires objective cost information from the user
– user specific, difficult to generalize
– economic value to a weather-sensitive operation only
– easy to evaluate relative value (see the sketch after this list)
• Contingent-valuation method
– focuses on demand for service and “willingness to pay”
– requires surveys of users to determine variations in demand as
function of variations in price and/or quality of service
– less user-specific; a larger cross-section of users/industries can be
evaluated in one study
– measures in terms of perception rather than actual accuracy.
– e.g. evaluation of ATADs, Rollins and Shaykewich, Met Apps Mar.
03
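For the cost-loss model, a common summary is the relative value computed from the 2 x 2 contingency table: a user pays cost C to protect or risks loss L if the event occurs unprotected, and the expense using the forecast is compared with the expenses of climatology (the better of always or never protecting) and of a perfect forecast. The sketch below is a minimal illustration of this standard formulation with made-up counts; it is not taken from the lecture.

# Sketch: relative economic value in the static cost-loss model, from contingency
# table counts and a cost/loss ratio (illustrative numbers only).
import numpy as np

def relative_value(hits, misses, false_alarms, correct_negatives, C, L):
    n = hits + misses + false_alarms + correct_negatives
    h, m, fa = hits / n, misses / n, false_alarms / n   # relative frequencies
    s = h + m                                           # base rate of the event
    e_fcst = C * (h + fa) + L * m                       # mean expense using the forecast
    e_clim = min(C, s * L)                              # best fixed action without forecasts
    e_perf = s * C                                      # protect only when the event occurs
    return (e_clim - e_fcst) / (e_clim - e_perf)

print(relative_value(hits=30, misses=10, false_alarms=20, correct_negatives=140, C=2.0, L=10.0))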
Model Verification
• Data matching issues
• Verification issues
Model Verification – Data Matching Issues
• Typical situation: Model gridded forecasts, observations at
points
– Point in situ observations undersample the field, contain
information on all spatial and most temporal scales.
(“representativeness error”? Not really)
• Alternatives:
– Model to data: What does the model predict at the obs point?
• Interpolation – if the model gives point predictions
– Gives answers at all verification points
• Nearest gridpoint value – if the model prediction is a grid box average
– Verify only those grid boxes where there is at least one obs
– UPSCALING: - estimate grid box average using all obs in grid
box.
– Data to model: Analysing point data:
• NOT RECOMMENDED because treats networks of point
observations as if they contain information only on the scales
represented by the grid on which the analysis is done.
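The "model to data" and upscaling options can both be written compactly; the sketch below (made-up grid and station data, not from the lecture) takes the nearest-gridpoint value of a model field at each station, and averages station observations falling in the same grid box to form an upscaled value for boxes that contain at least one observation.

# Sketch: nearest-gridpoint matching and grid-box upscaling of point observations
# (illustrative grid and station data only).
import numpy as np

lats = np.arange(40.0, 50.0, 1.0)              # grid-box centre latitudes
lons = np.arange(-80.0, -70.0, 1.0)            # grid-box centre longitudes
field = np.random.rand(lats.size, lons.size)   # model forecast field

stn_lat = np.array([42.3, 45.6, 47.1])         # station latitudes
stn_lon = np.array([-75.2, -73.4, -78.9])      # station longitudes
stn_obs = np.array([0.4, 1.2, 0.0])            # station observations

# Model to data: nearest gridpoint value at each station
i = np.abs(lats[:, None] - stn_lat[None, :]).argmin(axis=0)
j = np.abs(lons[:, None] - stn_lon[None, :]).argmin(axis=0)
fcst_at_stn = field[i, j]

# Data to model (upscaling): average all station obs falling in each grid box
box_obs = {}
for k in range(stn_obs.size):
    box_obs.setdefault((i[k], j[k]), []).append(stn_obs[k])
upscaled = {box: np.mean(vals) for box, vals in box_obs.items()}

print(fcst_at_stn, upscaled)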
Model Verification- Issues
• Resolution – scale separation
• Spatial verification – object-oriented
Model Verification – Scale separation
• Mesoscale verification: Separating model errors
due to resolution from other errors.
• IF high resolution spatial data is available:
– Scale separation, wavelets or other method (Mike
Baldwin)
– Repeat verification on the same dataset at different scales
to get a performance curve (see the sketch below)
– Data combination: Use high resolution data to “inform”
statistical estimates such as grid box averages.
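One simple way to build such a performance curve, assuming a gridded high-resolution observed or analysed field is available, is to block-average both fields to successively coarser grids and recompute a score at each scale; the sketch below (made-up fields) does this for the RMSE.

# Sketch: verification repeated at several spatial scales by block-averaging both
# fields and recomputing the RMSE (illustrative fields only).
import numpy as np

n = 64
fcst = np.random.rand(n, n)   # high-resolution forecast field
obs  = np.random.rand(n, n)   # high-resolution observed/analysed field

def block_average(a, k):
    """Average a square field over non-overlapping k x k blocks."""
    m = a.shape[0] // k
    return a[:m * k, :m * k].reshape(m, k, m, k).mean(axis=(1, 3))

for k in (1, 2, 4, 8, 16):    # 1 = native resolution
    rmse = np.sqrt(np.mean((block_average(fcst, k) - block_average(obs, k)) ** 2))
    print(k, rmse)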
Spatial Verification
• Object-oriented methods
• The calculation of displacement, size errors for
specific objects (e.g. rain areas, fronts)
– Hoffman, 1995; Ebert and McBride 2000 CRA method
• Decomposition of errors into location, shape, size components
– Others (Mike Baldwin)
– Problem always is the matching of the forecast and
observed object.
Verification of probability distributions
• Problem:
– Comparison of distribution with a single outcome
• Solutions:
– Verify density in vicinity of observation (Wilson, Burrows and
Lanzinger, 1999)
– Verify the CDF against the observation represented as a CDF (CRPS; Hersbach,
2000); see the sketch below
– Extract probabilities from distribution and verify as probability
forecasts (sample several thresholds)
– Compare parameters of pooled distribution with sample
climatology (Talagrand diagram)
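For a single case, the CRPS compares the forecast CDF (here the empirical CDF of the ensemble) with the observation written as a step-function CDF; in practice it is averaged over all cases. The sketch below (made-up ensemble and observation) uses the equivalent "energy" form of the score, CRPS = mean|x_i - y| - 0.5 mean|x_i - x_j|.

# Sketch: CRPS of one ensemble forecast against a single observation
# (illustrative data only).
import numpy as np

ens = np.array([11.2, 12.5, 9.8, 13.1, 10.4, 12.0])   # ensemble member forecasts
y = 12.8                                               # verifying observation

crps = np.mean(np.abs(ens - y)) - 0.5 * np.mean(np.abs(ens[:, None] - ens[None, :]))
print(crps)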
Ensemble verification - distribution
CRPS example
[Figure: Forecast and observed CDFs for one case; the observation appears as a step function at the observed value X, with probability on the vertical axis.]
Rank Histogram example
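A rank (Talagrand) histogram counts, over many cases, where the verifying observation falls among the sorted ensemble members; a flat histogram indicates the observation is statistically indistinguishable from a member of the ensemble. The sketch below uses made-up data.

# Sketch: rank (Talagrand) histogram for an ensemble (illustrative data only).
import numpy as np

n_cases, n_members = 1000, 10
ens = np.random.randn(n_cases, n_members)   # ensemble forecasts, one row per case
obs = np.random.randn(n_cases)              # verifying observations

ranks = np.sum(ens < obs[:, None], axis=1)          # rank of the obs among the members (0 .. n_members)
hist = np.bincount(ranks, minlength=n_members + 1)  # counts in the n_members + 1 bins
print(hist)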
Tricks of the trade
• “How can I get a better (higher) number?”
– Remove the bias before calculating scores (works really well for quadratic
scoring rules) and don’t tell anyone.
– Claim that the model predicts grid box averages, even if it doesn’t . Make
the boxes as large as possible.
– Never use observation data. It only contains a lot of "noise". As an
alternative:
• Verify against an analysis that uses the model being verified as a trial field.
Works best in data-sparse areas
• Use a shorter range forecast from the model being verified and call it
observation data.
– Design a new or modified score. Don't be bothered by restrictions such as
strict propriety. Then the values can be as high as desired.
– Stratify the verification data using a posteriori rules. One can always get rid
of pathological cases that bring down the average.
– When comparing the performance of your model against others, make
sure it is your analysis that is used as the verifying one.
– Always insist on doing the verification of your own products….
• Remember, you already know you have a good product. The goal of
the verification is to show it “objectively”
Conclusions
• Quick survey of verification and value assessment
• A data-oriented perspective