OFFICIAL FORECAST VERSUS MODEL GUIDANCE: COMPARATIVE
VERIFICATION FOR MAXIMUM AND MINIMUM TEMPERATURES
By
Roman Krzysztofowicz and W. Britt Evans
University of Virginia
Charlottesville, Virginia
Research Paper RK–0703
http://www.faculty.virginia.edu/rk/
August 2007
Revised December 2008
Copyright © 2007 by R. Krzysztofowicz and W.B. Evans
———————————————–
Corresponding author address: Professor Roman Krzysztofowicz, University of Virginia, P.O.
Box 400747, Charlottesville, VA 22904–4747. E-mail: rk@virginia.edu
ABSTRACT
A comparative verification is reported of 13,034 matched pairs of the National Weather Service official forecasts and MOS guidance forecasts of the daily maximum temperature prepared
between 2 October 2004 and 28 February 2006. The total sample is arranged into 420 cases (5
stations in diverse climates, 7 lead times of 24–168 h, and 12 sampling windows of four-month
length). The attributes being verified are informativeness, calibration, and accuracy. In addition,
the performance of forecasts for extreme temperature events is examined in detail, and the potential
marginal gain from combining the official forecast with the MOS guidance is evaluated.
The verification measures and the statistical tests of significance support these conclusions.
(i) The official forecast is consistently (in 79% of all cases) and significantly (in 32% of all cases)
less informative than the MOS guidance; only for short lead times (24–48 h) and a few months per
year is the relation reversed. (ii) Neither product is well calibrated (with significant miscalibration
in 30–40% of the cases); the official forecast is slightly better calibrated as the median, while the
MOS guidance is slightly better calibrated as the mean. (iii) For extreme day-to-day changes in
the maximum temperature (having the climatic exceedance probability less than 0.1), the official
forecast actually depreciates the informativeness and the accuracy of the MOS guidance. (iv) Combining the two forecasts would yield mostly sporadic and small marginal gains because the
two forecasts are conditionally dependent, strongly and consistently in 96% of the cases, and the
official forecast is uninformative (economically worthless), given the MOS guidance, in 36% of
the cases.
Similar patterns of performance are found in the daily minimum temperature forecasts: 9,799
matched pairs arranged into 300 cases (5 stations, 5 lead times of 36–132 h, and 12 sampling
windows). While not exhaustive, the 420 + 300 = 720 cases are representative enough to call for
further investigation of the revealed patterns of forecast performance. If confirmed, these patterns
should prompt re-examining and re-designing the role of the field forecasters to better suit the
improved guidance products and the emerging paradigm of probabilistic forecasting.
TABLE OF CONTENTS
ABSTRACT
1. INTRODUCTION
   1.1 Basic Questions
   1.2 Experimental Design
   1.3 The Importance of Temperature Forecasts
   1.4 The Verifications of Temperature Forecasts
2. VERIFICATION METHODOLOGY
   2.1 Joint Samples
   2.2 Informativeness
   2.3 Calibration
   2.4 Accuracy
3. PERFORMANCE IN GENERAL
   3.1 Informativeness
   3.2 Calibration as Median
   3.3 Calibration as Mean
   3.4 Accuracy
4. PERFORMANCE FOR EXTREMES
   4.1 Forecast Accuracy
   4.2 Noisy Channel
5. WOULD COMBINING ADD VALUE?
   5.1 Conditional Independence and Uninformativeness
   5.2 Tests and Results
6. CLOSURE
   6.1 Summary
   6.2 Conclusion
   6.3 Discussion
ACKNOWLEDGMENTS
APPENDIX: RESULTS FOR DAILY MINIMUM TEMPERATURE
REFERENCES
TABLES
FIGURES
1. INTRODUCTION
1.1 Basic Questions
“... the local officers might be trained with advantage to supplement the present reports with
forecasts for their several communities ...” — so wrote Walter S. Nichols on the pages of the
American Meteorological Journal more than a century ago (Nichols, 1890), thus expressing the
intellectual foundation for the modus operandi of the National Weather Service (NWS) to this day:
that estimates of future weather calculated centrally from numerical weather prediction models and
statistical post-processors offer only guidance to the forecasters in some 120 field offices; that the
field forecasters retain the authority to select and adjust the guidance based on recent observations,
local analysis, knowledge of local influences, and experience; and that the adjustments are made
judgmentally (subjectively), notwithstanding various computer aids available these days to perform
the task.
The premise of this modus operandi, of course, is that the field forecasters can and do improve
upon the guidance estimates. But as the numerical weather prediction models continue to improve
as well, it is wise to re-check periodically the validity of this premise. This is the objective of this
paper, albeit limited in scope to two predictands, the daily maximum and minimum temperatures,
and to a set of representative stations.
The paper describes a general statistical methodology
and reports results of a matched comparative verification of the NWS official forecasts, produced
subjectively by the field forecasters, and the guidance forecasts, produced by the Model Output
Statistics (MOS) technique (Glahn and Lowry, 1972) and used in the NWS field offices, along
with other guidance products, to initialize the digital forecast fields (Glahn and Ruth, 2003).
The methodology for the matched comparative verification is structured to answer five basic
questions:
1. Is the official forecast more informative than the model guidance? Or, in other words,
does the forecaster’s judgment add economic value to the guidance?
2. Is the official forecast better calibrated than the model guidance? Or, in other words,
can the user take the official forecast (or the model guidance) at face value?
3. Is the official forecast more accurate than the model guidance?
4. Is the official forecast better than the model guidance at predicting extremes?
5. Can a more informative forecast be obtained by fusing the official forecast with the model
guidance?
1.2 Experimental Design
To answer these questions, 22,833 pairs of official and guidance forecasts of daily maximum
and minimum temperatures are verified at five climatically diverse stations throughout the United
States: Savannah, Georgia (KSAV); Portland, Maine (KPWM); Kalispell, Montana (KFCA); San
Antonio, Texas (KSAT); Fresno, California (KFAT). With the exception of KFCA, the samples
contain data from 2 October 2004 through 28 February 2006; for KFCA, the sample contains
data only through 30 June 2005. The sample sizes for each forecast, official and guidance, are
identical but vary with the predictand and the station. For the daily maximum temperature, they
are 2,887 (KSAV), 2,890 (KPWM), 1,484 (KFCA), 2,892 (KSAT), 2,881 (KFAT); for the daily
minimum temperature, they are 2,187 (KSAV), 2,162 (KPWM), 1,115 (KFCA), 2,168 (KSAT),
2,167 (KFAT); they are about evenly distributed among the lead times.
The body of the paper reports results for the daily maximum temperature. The 13,034 pairs of
official and guidance forecasts are arranged into 420 cases: 5 stations, 7 lead times (24, 48, 72, 96,
120, 144, 168 h), and 12 sampling windows of four-month length. The appendix reports selected
results for the daily minimum temperature. The 9,799 pairs of official and guidance forecasts are
arranged into 300 cases: 5 stations, 5 lead times (36, 60, 84, 108, 132 h), and 12 sampling windows
of four-month length. For both predictands, the revealed patterns of forecast performance are
similar and thus support the same answers to the five basic questions.
1.3 The Importance of Temperature Forecasts
Temperature forecasts are important to many sectors of the nation’s economy: agriculture,
transportation, energy production, and healthcare. For example, orchardists must decide whether
or not to protect their orchards each night during the frost season (Baquet et al., 1976). Since the
cost of heating an orchard is substantial, and since an entire season’s harvest is at stake, forecasts
of minimum temperature are needed to optimally weigh the tradeoffs (Murphy and Winkler, 1979).
Forecasts of maximum temperature during the warm season and of minimum temperature
during the cool season are used regularly by electric utility companies whose operators must decide
when to commit additional generating capacity, or to purchase supplemental power from other
utilities, or to schedule maintenance and repairs. When used in an optimal decision procedure
for power generation planning, the deterministic temperature forecasts yield substantial economic
gains; the probabilistic forecasts yield even higher gains (Alexandridis and Krzysztofowicz, 1982,
1985).
Minimum temperature forecasts can be of value in scheduling aircraft de-icing, outdoor painting, artificial snow production, and service calls for cars (Murphy and Winkler, 1979). Extreme
maximum temperatures need to be forecasted because they can be dangerous: The Centers for
Disease Control and Prevention (2004) report that excessive heat exposure caused 8,966 deaths in
the United States between 1979 and 2002. So, extreme heat caused more deaths than hurricanes,
lightning, tornadoes, floods, and earthquakes combined.
1.4 The Verifications of Temperature Forecasts
Murphy et al. (1989) compared MOS guidance forecasts of maximum temperature to subjective forecasts produced by the NWS forecasters in Minneapolis, Minnesota; no conclusion was
reached regarding which forecast is superior.
Roebber and Bosart (1996) compared the value
of MOS guidance forecasts of daily maximum temperature to the value of official forecasts for
Albany, New York, in the years 1970–1993 for several potential users and found that human intervention to produce the official forecasts “has generally led to minimal gains in value beyond that
which is obtainable through direct use of numerical-statistical guidance.”
2. VERIFICATION METHODOLOGY
2.1 Joint Samples
The essence of the verification methodology is to compare three continuous variates: W —
the predictand, which is the uncertain quantity being forecasted, XO — the official forecast variate,
and XM — the model guidance variate. Their realizations (observations) are denoted w, xO , xM ,
respectively. Their joint sample of size N is denoted {(w(n), xO (n), xM (n)) : n = 1, ..., N}.
For each station and lead time, the joint observations from the available record are allocated to
twelve 4-month verification windows, designated by the end month. For example, the joint sample
for May contains all the forecasts issued between 1 February and 31 May. The 3-month overlap of consecutive verification windows implies that any joint observation affects the values of a verification
measure in four consecutive months. As a result, the sample size for each month is increased,
and the time series of a measure behaves similarly to a moving average.
This is a statistical
compromise, of course. It reduces the month-to-month sampling variability of the measure (which
helps to discern seasonal patterns in forecast performance), while risking a degree of heterogeneity
because of the non-stationarity of the predictand (due to seasonality) — but not as much as the
common verifications for 6-month seasons (cool and warm). In this design, the joint samples are
formed and the basic verification measures are calculated and compared for cases: 420 cases for
daily maximum temperature (5 stations, 7 lead times, 12 months), and 300 cases for daily minimum
temperature (5 stations, 5 lead times, 12 months). The measures characterize three attributes of
forecasts: informativeness, calibration, and accuracy. Within the Bayesian decision theory, which
represents the viewpoint of a rational decision maker, only informativeness and calibration matter;
accuracy is included herein because of its traditional usage.
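As an aside, the window construction lends itself to a few lines of code. The following is a minimal sketch in Python, under the assumption that each joint observation carries its issuance date; the record layout and function name are ours, not the paper's.

```python
# Minimal sketch of the twelve 4-month verification windows of Section 2.1.
# Assumes each record is (issue_date, w, x_o, x_m) with a datetime.date
# first element; the layout and names are illustrative only.
from collections import defaultdict

def window_samples(records):
    """Map each end month (1..12) to the joint sample of its 4-month window."""
    windows = defaultdict(list)
    for rec in records:
        m = rec[0].month
        # A forecast issued in month m falls in the windows ending in months
        # m, m+1, m+2, m+3 (wrapping around the year), so each observation
        # affects the verification measures of four consecutive months.
        for offset in range(4):
            windows[(m - 1 + offset) % 12 + 1].append(rec)
    return windows
```

For example, window_samples collects under the end month May exactly the forecasts issued between 1 February and 31 May.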
2.2 Informativeness
Informativeness of a forecast system with respect to a given predictand is a concept defined
within the Bayesian decision theory for the purpose of ordering forecast systems according to
their economic values. The theoretic foundation of informativeness was laid down by Blackwell
(1951, 1953), while the specific measure of informativeness to be employed herein was derived by
Krzysztofowicz (1987, 1992, 1996). The gist of the concept is this.
Forecast system A is said to be more informative than forecast system B if and only if the
value of the forecast produced by A is at least as high as the value of the forecast produced by
B for all rational decision makers faced with structurally similar decision problems. (The value
of the forecast is to be understood in the mathematical sense, as defined in the Bayesian decision
theory (e.g., Alexandridis and Krzysztofowicz, 1982, 1985).)
The informativeness score, IS, whose calculation is detailed below, is bounded, 0 ≤ IS ≤ 1,
with IS = 0 implying an uninformative (worthless) forecast system, and IS = 1 implying a perfect
forecast system. When it is determined for each of two forecast systems, the following inference
can be made: If ISA > ISB , then forecast system A is more informative than forecast system
B. If ISA = ISB , then the two systems are equivalent, and one should be indifferent between
selecting A or B. (The informativeness score was called the Bayesian correlation score in the
original publication by Krzysztofowicz (1992).)
This inference rule establishes an ordinal correspondence between a statistical performance
measure and an economic performance measure, which has a profound implication: the forecast
system having the maximum informativeness score ensures maximum economic value to every
rational decision maker and, therefore, should be preferred by the utilitarian society. (The informativeness score can also be interpreted as a measure of the degree by which the forecast produced
by a given system reduces the uncertainty about the predictand, relative to the prior (climatic) uncertainty.)
The informativeness scores of the official forecast, ISO , and of the model guidance, ISM , are
specified as follows. Let G, KO , KM denote the marginal distribution functions of variates W ,
XO , XM , respectively; let Q−1 denote the inverse of the standard normal distribution function; and
let the normal quantile transform (NQT) of each variate be defined by
$$V = Q^{-1}(G(W)), \tag{1a}$$
$$Z_O = Q^{-1}(K_O(X_O)), \tag{1b}$$
$$Z_M = Q^{-1}(K_M(X_M)). \tag{1c}$$
The informativeness scores are equal to the Pearson’s product-moment correlation coefficients
from the correlation matrix R = {rij } of the standard normal variates (V, ZO , ZM ):
$$IS_O = r_{12} = \mathrm{Cor}(V, Z_O), \tag{2a}$$
$$IS_M = r_{13} = \mathrm{Cor}(V, Z_M). \tag{2b}$$
The statistical procedures for implementing (1)–(2) can be found in Krzysztofowicz (1992).
To determine if there is a statistically significant difference between ISO and ISM , Williams’
test statistic TW can be used (Williams, 1959):
$$T_W = (r_{12} - r_{13}) \sqrt{\frac{(N-1)(1 + r_{23})}{2\left(\frac{N-1}{N-3}\right)|R| + \bar{r}^{\,2}(1 - r_{23})^3}}, \tag{3}$$
where |R| is the determinant of the correlation matrix, r̄ = (r12 + r13 )/2, r23 = Cor(ZO , ZM ),
and N is the sample size. The statistic TW has the t distribution with N − 3 degrees of freedom.
This statistic is ideally suited for testing the null hypothesis that two correlation coefficients are
equal (r12 = r13 ) under the trivariate normal distribution when one of the variates is common and
the sample size is small or moderate (Neill and Dunn, 1975; Sakaori, 2002).
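As a computational sketch of (1)–(3), the following Python function estimates the informativeness scores and the Williams test P-value. It assumes the NQT is implemented with the empirical plotting position rank/(N+1), a common choice that the paper does not prescribe, and the function names are ours.

```python
# Sketch of the informativeness computation, Eqs. (1)-(3), using NumPy/SciPy.
import numpy as np
from scipy.stats import norm, rankdata, t as t_dist

def nqt(x):
    """Normal quantile transform via the empirical plotting position."""
    x = np.asarray(x, dtype=float)
    return norm.ppf(rankdata(x) / (len(x) + 1.0))

def informativeness(w, x_o, x_m):
    """Return (IS_O, IS_M, P-value of Williams' test of r12 = r13)."""
    v, z_o, z_m = nqt(w), nqt(x_o), nqt(x_m)
    R = np.corrcoef([v, z_o, z_m])        # correlation matrix of (V, Z_O, Z_M)
    r12, r13, r23 = R[0, 1], R[0, 2], R[1, 2]
    n = len(v)
    rbar = 0.5 * (r12 + r13)
    det_R = np.linalg.det(R)
    # Williams' statistic, Eq. (3); t-distributed with n-3 degrees of freedom
    t_w = (r12 - r13) * np.sqrt(
        (n - 1) * (1 + r23)
        / (2 * ((n - 1) / (n - 3)) * det_R + rbar**2 * (1 - r23) ** 3)
    )
    return r12, r13, 2 * t_dist.sf(abs(t_w), n - 3)
```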
2.3 Calibration
A forecast system is said to be well calibrated if the forecast has a well-defined interpretation
which is consistently maintained over time (Krzysztofowicz and Sigrest, 1999a, 1999b). A well
calibrated forecast can be taken at face value by every user — a basic requirement in communicating scientific information. The fundamental deficiency of the NWS forecast is the lack of any
official interpretation. Therefore, two interpretations are considered: the median and the mean.
Only the marginal calibration is verified, which is necessary for the conditional calibration. (A
discussion of the conditional calibration can be found in the references cited above.)
A condition for the marginal calibration of X as the median of W is P (W > X) = 0.5,
where P stands for probability. A measure of calibration is the exceedance frequency:
$$F = \frac{m}{N}, \tag{4}$$
where m is the number of times w(n) > x(n) in the joint sample, and N is the sample size. The
forecast may be interpreted as the median of the predictand if F = 0.5. A two-sided exact binomial
test of the null hypothesis that F = 0.5 can be used to determine if the forecast is significantly
uncalibrated as the median. A measure of the degree of miscalibration is the calibration score:
$$CS = |F - 0.5|. \tag{5}$$
A condition for the marginal calibration of X as the mean of W is E(X) = E(W ), where E
stands for expectation. A measure of calibration is the forecast bias, or the mean error:
$$B = \frac{1}{N}\sum_{n=1}^{N} [x(n) - w(n)]. \tag{6}$$
The forecast may be interpreted as the mean of the predictand if B = 0. A two-sided t-test of the
null hypothesis that B = 0 can be used to determine if the forecast is significantly uncalibrated as
the mean.
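The two calibration checks of this section reduce to a few lines of code. A minimal sketch follows, assuming SciPy's exact binomial test and one-sample t-test stand in for the tests described; the names are ours.

```python
# Sketch of the calibration measures and tests of Section 2.3.
import numpy as np
from scipy.stats import binomtest, ttest_1samp

def calibration(w, x):
    """Return (F, CS, P for F = 0.5, B, P for B = 0) per Eqs. (4)-(6)."""
    w, x = np.asarray(w, float), np.asarray(x, float)
    n = len(w)
    m = int(np.sum(w > x))                    # exceedances of the forecast
    F = m / n                                 # exceedance frequency, Eq. (4)
    CS = abs(F - 0.5)                         # calibration score, Eq. (5)
    p_median = binomtest(m, n, p=0.5).pvalue  # exact two-sided binomial test
    errors = x - w
    B = errors.mean()                         # bias (mean error), Eq. (6)
    p_mean = ttest_1samp(errors, popmean=0.0).pvalue  # two-sided test of B = 0
    return F, CS, p_median, B, p_mean
```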
2.4 Accuracy
Accuracy is a popular attribute of forecasts.
Herein it is measured in terms of the mean
absolute error:
$$MAE = \frac{1}{N}\sum_{n=1}^{N} |x(n) - w(n)|. \tag{7}$$
This measure is consistent with the NWS Operations Manual (National Weather Service, 2000),
and is easier to interpret than the mean square error. A two-sided paired t-test can be performed
of the null hypothesis that MAEO = MAEM .
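A minimal sketch of this comparison follows, reading the test as a paired t-test on the absolute errors of the two forecasts (an assumption on our part; the names are ours).

```python
# Sketch of the accuracy comparison of Section 2.4, Eq. (7).
import numpy as np
from scipy.stats import ttest_rel

def compare_mae(w, x_o, x_m):
    """Return (MAE_O, MAE_M, P-value of the paired t-test of equal MAE)."""
    w = np.asarray(w, float)
    ae_o = np.abs(np.asarray(x_o, float) - w)  # |x(n) - w(n)| for the forecast
    ae_m = np.abs(np.asarray(x_m, float) - w)  # |x(n) - w(n)| for the guidance
    return ae_o.mean(), ae_m.mean(), ttest_rel(ae_o, ae_m).pvalue
```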
In advocating the MAE, Brooks and Doswell (1996) note that bias says nothing about the accuracy of a forecast. For example, a forecast system that makes five forecasts each 20° too warm and five forecasts each 20° too cold has the same bias as a forecast system that makes ten perfect forecasts. We must note that accuracy confounds informativeness with calibration. For example, a forecast system that makes every forecast 20° too high is inaccurate; however, once the bias is detected, it can easily be subtracted to obtain perfect forecasts. The MAE = 20°, as in Brooks and Doswell's example, but the two Bayesian verification measures offer a superior diagnosis: the system is miscalibrated, because B = 20°, yet most informative, because IS = 1.
3. PERFORMANCE IN GENERAL
To recall, there are 420 cases (5 stations, 7 lead times, 12 verification windows designated
by the end months). In each case, four basic performance measures are computed for the official
forecast and the MOS guidance. Then the corresponding measures are compared directly and via
statistical tests. The overall results of the case-by-case comparisons are reported in Table 1; they
are discussed below together with the results of more detailed analyses.
3.1 Informativeness
The time series (Fig. 1) of the informativeness score, IS, of the official forecast and the MOS
guidance at every station for lead times of 24, 96, and 168 h reveal four properties. First, the
official forecast IS generally tracks the MOS guidance IS. Second, the IS decreases with lead
time, as expected. Third, at every station, the month-to-month variability of IS increases with lead
time: for the 24 h lead time, the IS is nearly stationary (except for MOS at KFAT), but for the 168
h lead time, it is definitely non-stationary. Fourth, the degree of non-stationarity varies across the
stations: it is the largest at San Antonio, TX, where the forecasts with the 96–168 h lead times in
August–September are the least informative.
In the case-by-case comparisons (Table 1), ISM > ISO as many as 332 times (79% of the
cases), implying that the MOS guidance is more informative than the official forecast; ISO > ISM
only 88 times (21% of the cases). Using Williams’ test statistic (3), ISM is superior 135 times
(32% of the cases) at the 0.05 significance level; ISO is superior only 15 times (4% of the cases)
at the 0.05 significance level. This yields the winning ratio 135/15 = 9/1 in favor of the MOS
guidance.
Whereas the inequality ISM > ISO is statistically significant in only 32% of the cases, it is
present in 79% of the cases. In other words, the MOS guidance is superior consistently, though not
always significantly, in the majority of the cases. What are these cases and is the consistent superiority statistically significant? To find the answers, Fig. 2 compares the average informativeness scores of the MOS guidance $\overline{IS}_M$ and the official forecast $\overline{IS}_O$ across stations and months for each lead time. Also shown are $P$-values from the t-test of the null hypothesis $\overline{IS}_M = \overline{IS}_O$ against the one-sided alternative hypothesis $\overline{IS}_M > \overline{IS}_O$. For lead times of 24 h and 48 h, the hypothesis that neither
forecast is more informative cannot be rejected (P -values 0.491, 0.143). For lead times greater
than 48 h, the MOS guidance is significantly more informative than the official forecast with a
P -value near zero. As lead time increases, the difference between the average informativeness
scores also increases.
A closer examination of the individual cases reveals that at four out of five stations, there
exists a season of consistently improved informativeness (CII); it applies to short lead times, either
24 h, or 24 h and 48 h (Table 2). Within the CII season, $\overline{IS}_O > \overline{IS}_M$ with a $P$-value of nearly zero. However, outside this season, $\overline{IS}_M > \overline{IS}_O$ with a $P$-value of about zero. Thus, a statistically
significant explanation of the consistent superiority is this: At each station, there exists a CII
season during which the official forecasts with lead times of up to 48 h are more informative than
the MOS guidance; there are 32 such cases (8%). For lead times longer than those within the CII
season, and for all lead times outside the CII season, the MOS guidance is more informative; there
are 332 such cases (79%).
Finally, let us digress that the CII season varies from station to station, and only at Portland,
ME, and at Fresno, CA, does it fully overlap the official cool season (October–March).
This
underscores the deficiency of verification studies that assume the stationarity of forecast system
performance during a fixed six-month season (cool or warm) at every station and pool the samples
from many stations (even from all stations in the U.S.). Such studies may misrepresent the system performance because they may wash out the statistically significant differences between the
stations and the sub-seasons.
3.2 Calibration as Median
The exceedance frequency F measures the degree to which the forecast is calibrated as the
median of the predictand. The time series (Fig. 3) of F for the official forecast and the MOS
guidance at every station for lead times of 24, 96, and 168 h exhibit four properties. First, both the
MOS guidance and the official forecast lack a consistent probabilistic interpretation; for instance,
the 96-h official forecast at San Antonio, TX, constitutes the 0.32 exceedance probability quantile
of the predictand (essentially the third tercile) in July, and the 0.67 exceedance probability quantile
of the predictand (the first tercile) in December. Second, the differences in the interpretations of
forecasts at various stations in the same month are equally large. Third, a seasonal trend is present
as F generally declines in the summer, below 0.5 in 26 out of 30 time series, indicating that the
forecasts are notoriously too high in the summer. Fourth, the cross-station and the month-to-month
variability of F appears to be unaffected by the lead time — this is how it should be.
The last property is validated formally in Fig. 4, which shows the average calibration scores
of the MOS guidance $\overline{CS}_M$ and the official forecast $\overline{CS}_O$ across stations and months for each lead time. Also shown are $P$-values from the two-sided t-test of the null hypothesis $\overline{CS}_M = \overline{CS}_O$. While the smallest scores are recorded for 96 h, $\overline{CS}_M > \overline{CS}_O$ for lead times up to 72 h, and $\overline{CS}_M < \overline{CS}_O$ for lead times longer than 72 h, the pattern is weak and the null hypothesis is never
rejected.
A forecast system is said to be uncalibrated if the null hypothesis F = 0.5 is rejected at the
0.05 significance level. The official forecast is uncalibrated 153 times and the MOS guidance
is uncalibrated 136 times; however, when FO is compared directly to FM , the official forecast is
better calibrated 214 times, while the MOS guidance is better calibrated 187 times (Table 1).
Because the MOS guidance is better calibrated as the mean (the fact to be demonstrated in
Section 3.3), but the official forecast is better calibrated as the median, it appears the forecasters
make their judgmental forecasts as if they were re-calibrating the MOS guidance from one interpretation to another. This type of adjustment to guidance has been documented before (Krzysztofowicz and Sigrest, 1999b). It is natural from the cognitive standpoint because an estimate of the
median has an intuitive interpretation, which the forecaster can validate judgmentally in real time
(via the question: Is the actual observation equally likely to fall below or above my estimate?).
On the contrary, an estimate of the mean (as the mathematical expectation) is an abstraction, which
has no intuitive interpretation and cannot be validated judgmentally. This is the first reason for
adopting the median for the official interpretation of the guidance and the forecast.
3.3 Calibration as Mean
The bias B measures the degree to which the forecast is calibrated as the mean of the predictand. Of the 420 cases (Table 1), the official forecast bias is less than the MOS guidance bias in
201 cases (48%), while the reverse is true in 214 cases (51%). Thus, the MOS guidance is generally better calibrated as the mean of the predictand than the official forecast, but the difference is
not substantial.
The test of the null hypothesis B = 0 against the two-sided alternative hypothesis sharpens
the contrast. At the 0.05 significance level, the official forecast has a non-zero bias in 156 cases
(37%), while the MOS guidance has a non-zero bias in 129 cases (31%). Thus, the MOS guidance
is calibrated as the mean slightly more often than the official forecast.
Whereas the extent to which the forecasters actually anchored their judgments on the MOS
guidance cannot be inferred from the present data, it is still instructive to examine the official
forecasts as if they were derived from the MOS guidance (Table 3). The most distressful are the
cases wherein the forecasters switched the sign of the bias and increased the magnitude: 32+36 =
68.
Next are the cases wherein the forecasters retained the sign of the bias but increased the
magnitude: 77 + 66 = 143. Lastly, there are 3 cases wherein the MOS guidance was sans bias
and the forecasters introduced one. All in all, as if tossing a coin, the forecasters worsened the
bias in 50% of the cases, and reduced the bias in 48% of the cases.
The time series of B, not shown herein, exhibit some characteristics similar to those seen in
Fig. 3. The bias is station-dependent and month-dependent. Thus, in general, no spatial and no
temporal stationarity of the bias can be assumed. But unlike the time series of F in Fig. 3, the time
series of B grow more and more variable as the lead time increases. This is confirmed by plotting
(Fig. 5) the average bias magnitudes of the MOS guidance $\overline{|B_M|}$ and the official forecast $\overline{|B_O|}$ for each lead time, along with $P$-values from the two-sided t-test of the null hypothesis $\overline{|B_M|} = \overline{|B_O|}$.
From the comparison of Fig. 5 with Fig. 4, it is apparent that the calibration of the MOS guidance
and the official forecast is more stable in the median than in the mean. This is the second reason
for adopting the median for the official interpretation of the guidance and the forecast.
3.4 Accuracy
In terms of the accuracy measured by MAE (Table 1), the MOS guidance is superior 306
times (73% of the cases), whereas the official forecast is superior just 111 times (26% of the
cases). At the 0.05 significance level, the corresponding numbers are 48 (11%) and 5 (1%).
The time series of MAE, not shown herein, for all stations and lead times reveal the already
familiar properties: non-stationarity and station-to-station variability, both of which increase with
the lead time.
4. PERFORMANCE FOR EXTREMES
4.1 Forecast Accuracy
A common justification of subjective adjustments by the NWS forecasters to guidance forecasts rests on the presumption that, thanks to their expertise and experience, the forecasters can
diagnose a particular weather pattern and assess model performance in evolving that pattern.
Therefore, the argument goes, the forecasters can improve upon the guidance, especially when
the weather forecasts matter most — in times of extremes; however, verifications performed on
samples comprising all observations may not reveal this presumed advantage of the official forecasts. We set out to test this hypothesis statistically.
An extreme temperature event is said to occur on day n
if the absolute difference
|w(n) − w(n − 1)| between the maximum temperatures on two consecutive days, n − 1 and n,
has the exceedance probability of 0.1 or lower. The objective is to verify the forecaster’s ability to
predict the extreme day-to-day changes in the daily maximum temperature. Because such changes
are rare, the four-month sampling window is abandoned. Instead, for every station and lead time,
a subsample containing all extreme events is formed by selecting from the entire joint sample the
10% of the days on which the largest absolute differences were recorded. There are thus 35 cases
(5 stations and 7 lead times) with the sample sizes between 35 and 40, except at KFCA where
the sample sizes are 20 or 21. In each case, MAEM and MAEO are calculated and used in the
two-sided t-test of the null hypothesis MAEM = MAEO .
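A sketch of the subsample selection follows, under the assumption that the joint sample is held in day-ordered arrays; the names and the tie-breaking by sort order are ours.

```python
# Sketch of the extreme-event subsample of Section 4.1: keep the 10% of days
# with the largest absolute day-to-day change |w(n) - w(n-1)|.
import numpy as np

def extreme_subsample(w, x_o, x_m, frac=0.10):
    w = np.asarray(w, float)
    change = np.abs(np.diff(w))            # |w(n) - w(n-1)| for n = 1..N-1
    k = max(1, int(round(frac * change.size)))
    days = np.argsort(change)[-k:] + 1     # indices n of the largest changes
    return w[days], np.asarray(x_o)[days], np.asarray(x_m)[days]
```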
The results for two stations offering the strongest (KSAV) and the weakest (KFCA) support
for the null hypothesis are reported in Table 4. Out of all 35 cases, the MOS guidance is superior
to the official forecast 28 times (80% of the cases).
In the case-by-case tests, the difference
MAEM − MAEO is significant only twice at the 0.05 level. However, the consistency of the
difference is overwhelming, as the null hypothesis on the averages, $\overline{MAE}_M = \overline{MAE}_O$, is rejected by the t-test in favor of the alternative hypothesis $\overline{MAE}_M < \overline{MAE}_O$ with the $P$-value of 0.030.
In conclusion, the above analysis does not support the hypothesis that human judgment is
superior to the meteorological models in forecasting the extreme day-to-day changes in the daily
maximum temperature. On the contrary, it is the MOS guidance that is superior: not significantly in individual cases, but consistently across all stations and lead times.
4.2 Noisy Channel
Another way to draw an inference from Table 4 is to note that the best improvement the forecasters could muster in the MAE is 0.42°F (which is statistically insignificant at the 0.05 level) at Savannah, GA, for the 24-h lead time. At the same time, by making their own forecasts rather than adopting the MOS guidance as the official forecast, the forecasters deteriorated the MAE by more than 0.42°F in 23 cases, with the largest deterioration of 4.15°F (which is statistically significant at the 0.05 level) at Kalispell, MT, for the 120-h lead time.
To visualize the failure of the official forecasts in this last case, all sample points are shown
in Fig. 6.
The scatter plot in the upper left corner shows the mapping of the MOS guidance
into the official forecast. The other two scatter plots compare each of the two forecasts with the
observation. The overall impression they convey is that the scatter of the official forecasts around
the diagonal (on which perfect forecasts would lie) is larger than the scatter of the MOS guidance.
Statisticians have a technical term for such a mapping of one forecast into another (DeGroot, 1970, Chapter 14; Krzysztofowicz, 1992) — an auxiliary randomization. It is as if the forecaster
took the MOS guidance and processed it through a noisy channel. Of course, the output from
the noisy channel is always less informative than the input. The informativeness score quantifies
exactly that: ISO = 0.732 < 0.915 = ISM .
5. WOULD COMBINING ADD VALUE?
Even though in most of the cases the official forecast XO is less informative than the MOS
guidance XM , it is still possible that the information imparted to XO by the field forecaster supplements in some ways the information contained in XM . If this is so, then combining XO with XM
through a Bayesian processor may yield a forecast XC which is more informative than either XO
or XM (and which, therefore, has the economic value at least as high as either XO or XM ). Our
objective is to test this hypothesis.
5.1 Conditional Independence and Uninformativeness
Let f(xO , xM |w) denote the joint density function of variates (XO , XM ), evaluated at the
point (xO , xM ), and conditional on the hypothesis that W = w. When viewed as a function of
w at a fixed point (xO , xM ), it is called the likelihood function of predictand W . All points (xO ,
xM ) specify a family of likelihood functions.
The family of likelihood functions is the key construct in a Bayesian combining model. It
is also the construct that allows us to determine if there would be any gain from combining the
official forecast XO with the MOS guidance XM . Toward this end, the likelihood function is
factorized:
$$f(x_O, x_M \mid w) = f(x_O \mid x_M, w)\, f(x_M \mid w). \tag{8}$$
One can interpret this factorization as a framework for developing a combining model in two
stages.
First, the MOS guidance is introduced as a predictor through the likelihood function
f(xM |w). Second, with XM already used, the official forecast XO is introduced as the second
predictor through the conditional likelihood function f (xO |xM , w). Two particular situations are
of special interest.
Definition 1 (Conditional Independence). The predictor XO is independent of predictor XM ,
conditional on predictand W , if at every point (xO , xM , w) of the sample space,
$$f(x_O \mid x_M, w) = f(x_O \mid w). \tag{9}$$
Definition 2 (Conditional Uninformativeness). The predictor XO is uninformative for predictand W , conditional on predictor XM , if at every point (xO , xM , w) of the sample space,
$$f(x_O \mid x_M, w) = f(x_O \mid x_M). \tag{10}$$
These two situations bound the economic gain from combining two predictors. The maximum gain results if XO is conditionally independent of XM . No gain results if XO is conditionally
uninformative, given XM ; and if XM is also more informative than XO , then XO is worthless.
5.2 Tests and Results
To test the hypotheses of conditional independence and conditional uninformativeness, each
of the three variates is subjected to the NQT, as defined in Section 2.2, and then ZO is regressed on
V and ZM in the standard normal space:
$$Z_O = aV + bZ_M + c + \Theta, \tag{11}$$
where a and b are regression coefficients, c is the intercept, and Θ is the residual. A two-sided
t-test is next performed on the significance of each regression coefficient. If the null hypothesis
b = 0 cannot be rejected, then the official forecast XO is conditionally independent of the MOS
guidance XM . If the null hypothesis a = 0 cannot be rejected, then the official forecast XO is
conditionally uninformative, given the MOS guidance XM . As before, all results are reported at
the 0.05 significance level.
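A sketch of these regression tests follows, assuming statsmodels provides the ordinary least squares fit and the coefficient P-values; nqt is the transform sketched in Section 2.2, and all names are ours.

```python
# Sketch of the tests of Section 5.2: regress Z_O on V and Z_M, Eq. (11),
# and read off the two-sided P-values of the coefficients a and b.
import numpy as np
import statsmodels.api as sm

def conditional_tests(w, x_o, x_m):
    v, z_o, z_m = nqt(w), nqt(x_o), nqt(x_m)
    X = sm.add_constant(np.column_stack([v, z_m]))  # columns: intercept c, a, b
    fit = sm.OLS(z_o, X).fit()
    p_a, p_b = fit.pvalues[1], fit.pvalues[2]
    # b = 0 not rejected -> X_O conditionally independent of X_M, Eq. (9);
    # a = 0 not rejected -> X_O conditionally uninformative given X_M, Eq. (10)
    return p_a, p_b
```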
Among the 420 cases, the official forecast is conditionally independent of the MOS guidance
only 8 times (2%).
Most significantly, the P -values less than 0.005 occur 405 times (96%),
making it convincing to accept the alternative hypothesis that the official forecast is conditionally
dependent on the MOS guidance.
Among the 420 cases, the official forecast is conditionally
uninformative, given the MOS guidance, 150 times (36%).
Overall, these results imply that only sporadic gains can be expected from combining the
official forecast with the MOS guidance of the daily maximum temperature. The official forecast
provides independent information in only 2% of the cases, offers no additional information (is
economically worthless) in 36% of the cases, and is conditionally dependent on the MOS guidance
in the remaining 62% of the cases; the strong significance of this dependence suggests a small
marginal gain from combining.
6. CLOSURE
6.1 Summary
The 420 + 300 = 720 verification cases reported herein comprise 22,833 pairs of forecasts
for 12 lead times (24–168 h), and span nearly 1½ years (October 2004 – February 2006) of daily
maximum and minimum temperatures at five stations in diverse climates. While not exhaustive,
these cases are representative enough to reveal any patterns in forecast performance across stations,
lead times, and seasons, which are worth attention. Four such patterns have emerged.
1. In terms of informativeness, the official forecast not only fails to improve upon the MOS
guidance, but performs consistently worse (in 79–81% of the cases) and often significantly worse
(in 32–43% of the cases). Only for short lead times (24–48 h) and during a few months per year,
which vary from station to station, is the official forecast consistently more informative than the
MOS guidance.
2. In terms of calibration, neither product is particularly well calibrated (with significant
miscalibration present in 30–40% of the cases); the official forecast is slightly better calibrated
as the median of the daily maximum temperature, whereas the MOS guidance is slightly better
calibrated as the mean. When the official forecast is viewed as an adjustment of the MOS guidance,
adjusting has the statistical effect of tinkering: the calibration of the forecast as the mean is
improved in 45–48% of the cases and worsened in 50–54% of the cases.
3. Contrary to the popular notion that field forecasters can predict extreme events better than
models do, the official forecast actually depreciates the informativeness and the accuracy of the
MOS guidance for extreme events (those having the climatic exceedance probability less than
0.1). From the viewpoint of a forecast user, the official forecast appears like the MOS guidance
processed through a noisy channel.
4. Combining the official forecast with the MOS guidance statistically, in the hope of producing a forecast superior to either one of those combined, would yield mostly sporadic and small
marginal gains because the two forecasts are conditionally dependent, strongly and consistently
(in 96% of the cases), and the official forecast does not provide any additional information beyond
that contained in the MOS guidance in 36% of the cases.
6.2 Conclusion
The verification measures and the statistical tests of significance summarized above imply that
the answers to the five basic questions (Section 1.1) are most likely: No, No, No, No, No. Together
with the meaning of the informativeness score (Section 2.2), these answers imply mathematically
that a rational decision maker may expect the economic value of the model guidance to be at least
as high as the economic value of the official forecast. Ergo, the model guidance for the daily
maximum or minimum temperature is the preferred forecast from the viewpoint of the utilitarian
society.
6.3 Discussion
Inasmuch as the advances in numerical weather prediction models and in statistical post-processors improve the predictive performance of guidance products, the role of the field forecasters should be re-examined periodically, and the tasks they perform should be re-designed to suit the improved information they receive. The basic question is: What tasks should a field forecaster perform in order to make the optimal use of his judgmental capabilities towards improving the
local weather forecasts? As the verification results reported herein imply, at least for forecasting
the daily maximum and minimum temperatures, the task that has been the staple of the field forecaster’s job throughout the 20th century — the judgmental adjustment of a deterministic guidance
forecast produced centrally to account for information available locally — is about to be rendered
purposeless by the improvements in the numerical-statistical models, as anticipated (Bosart, 2003).
A major technical innovation at the onset of the 21st century is the quantification of uncertainty in operational weather forecasts.
Whereas ensemble forecasting methods and statistical
post-processors are at the center of attention, they are far from maturity: their assessments of
uncertainty utilize only a fraction of the total weather information available at any place and time.
Thus arises a significant new opportunity: to re-design the role of the field forecaster to make it
compatible with, and beneficial to, the emerging paradigm of probabilistic forecasting.
Research has shown that weather forecasters can judgmentally assess vast amounts of complex information to detect and quantify the uncertainty (e.g., Murphy and Winkler, 1979; Winkler
and Murphy, 1979; Murphy, 1981; Murphy and Daan, 1984; Krzysztofowicz and Sigrest, 1999a).
This research has been ahead of its time. Now is the opportunity to harness its results in modernizing the role of the field forecaster for the 21st century. As a prelude to systemic re-design, two
steps are recommended, after Krzysztofowicz and Sigrest (1999b, p.452): (i) The deterministic
guidance forecast of every continuous predictand should be given an official probabilistic interpretation. The median of the predictand (not the mean) is the preferred interpretation because it
conveys at least some rudimentary assessment of uncertainty (the 50% chance of the actual observation being either below or above the forecast estimate), which the field forecasters and the
decision makers can grasp intuitively. (ii) The official forecast should be given the same probabilistic interpretation as the guidance forecast, so that the field forecasters can channel their skills
toward improving the calibration of an estimate rather than re-calibrating the estimate from one interpretation to another (and degrading the guidance forecast in the process — as they do currently
for the daily maximum and minimum temperatures).
Acknowledgments.
This material is based upon work supported by the National Science
Foundation under Grant No. ATM-0135940, “New Statistical Techniques for Probabilistic Weather
Forecasting”. The Meteorological Development Laboratory of the National Weather Service provided the data. Mark S. Antolik researched a set of representative stations for the U.S. from which
the five stations used herein were selected.
An anonymous reviewer suggested expanding the
study to daily minimum temperature forecasts.
APPENDIX: RESULTS FOR DAILY MINIMUM TEMPERATURE
A reviewer of an earlier manuscript, which reported verification results for daily maximum
temperature only, advised expanding the study to daily minimum temperature based on the following argument: “The maximum temperature should be the easiest predictand for MOS because
it is generally well related to large-scale lower tropospheric variables that one expects to be handled well by numerical weather prediction models. On the other hand, the minimum temperature
is much more subject to local influences.
If the human-mediated minimum temperature forecasts also do not show improvement over MOS, then the thesis of this paper would be much more
strongly supported.”
In parallel to the results reported in the body of this paper, we performed a comparative
verification of the NWS official forecasts and MOS guidance forecasts of the daily minimum temperature. Having found similar patterns of performance, we report herein the main results of the
case-by-case comparisons.
There are 300 cases (5 stations, 5 lead times, 12 verification windows designated by the end
months). In each case, four basic performance measures are computed for the official forecast and
the MOS guidance. Then the corresponding measures are compared directly and via statistical
tests.
A.1 Informativeness
In the case-by-case comparisons (Table A1), ISM > ISO as many as 244 times (81% of the
cases), implying that MOS guidance is more informative than the official forecast; ISO > ISM
only 56 times (19% of the cases). Using Williams’ test statistic (3), ISM is superior 130 times
(43% of the cases) at the 0.05 significance level; ISO is superior only 10 times (3% of the cases)
at the 0.05 significance level. This yields the winning ratio 130/10 = 13/1 in favor of the MOS
guidance. For the maximum temperature, this winning ratio is 9/1 in favor of the MOS guidance
(Table 1). Thus, even if the minimum temperature is subject to local influences, the forecasters
appear incapable of taking advantage of this knowledge.
A.2 Calibration as Median
Of the 300 cases (Table A1), the official forecast is better calibrated 140 times (47% of the
cases), while the MOS guidance is better calibrated 145 times (48% of the cases). The difference
in favor of the MOS guidance (by 1%) is negligible, but is one of the small distinctions between the
daily minimum temperature and the daily maximum temperature (Table 1), for which the official
forecast is better calibrated (by 6%) than the MOS guidance.
A.3 Calibration as Mean
Of the 300 cases (Table A1), the official forecast bias is less than the MOS guidance bias
in 135 cases (45%), while the reverse is true in 162 cases (54%). Thus, the MOS guidance is
generally better calibrated as the mean of the predictand than the official forecast. Though not
substantial, this difference of 9% for the daily minimum temperature is even slightly larger than
the difference of 3% for the daily maximum temperature (Table 1).
In parallel to Table 3 for the daily maximum temperature, Table A2 compares the official
forecast to the MOS guidance for the daily minimum temperature; the comparison is in terms of
the bias, when both the forecast and the guidance are biased. The purpose of this comparison is to
evaluate the goodness of the modifications through which the official forecast is derived from the
MOS guidance under the hypothesis that the forecasters actually anchored their judgments on the
MOS guidance. The most distressful are the cases wherein the forecasters switched the sign of
the bias and increased the magnitude: 17 + 16 = 33. Next are the cases wherein the forecasters
retained the sign of the bias but increased the magnitude: 77 + 51 = 128. All in all, as if tossing
a coin, the forecasters worsened the bias in 54% of the cases, and reduced the bias in 45% of the
cases. In general, this performance of the forecasters for the daily minimum temperature follows
the pattern similar to that seen for the daily maximum temperature (Table 3).
A.4 Accuracy
In terms of the accuracy (Table A1), the MOS guidance is superior 239 times (80% of the
cases), whereas the official forecast is superior just 57 times (19% of the cases). This difference
of 61% for the daily minimum temperature is even larger than the difference of 47% for the daily
maximum temperature (Table 1).
REFERENCES
Alexandridis, M.G., and R. Krzysztofowicz, 1982: Economic gains from probabilistic temperature
forecasts. Proceedings of the International Symposium on Hydrometeorology, A.I. Johnson
and R.A. Clark (eds.), American Water Resources Association, Bethesda, Maryland, 263–
266.
Alexandridis, M.G., and R. Krzysztofowicz, 1985: Decision models for categorical and probabilistic weather forecasts. Applied Mathematics and Computation, 17, 241–266.
Baquet, A.E., A.N. Halter, and F.S. Conklin, 1976: The value of frost forecasting: A Bayesian
appraisal. American Journal of Agricultural Economics, 58, 511–520.
Blackwell, D., 1951: Comparison of experiments. Proceedings of the Second Berkeley Symposium
on Mathematical Statistics and Probability, J. Neyman (ed.), University of California Press,
Berkeley, pp. 93–102.
Blackwell, D., 1953: Equivalent comparisons of experiments. Annals of Mathematical Statistics,
24, 265–272.
Bosart, L.F., 2003: Whither the weather analysis and forecasting process? Weather and Forecasting, 18, 520–529.
Brooks, H.E., and C.A. Doswell, III, 1996: A comparison of measures-oriented and distributions-oriented approaches to forecast verification. Weather and Forecasting, 11, 288–303.
Centers for Disease Control and Prevention, 2004: About extreme heat. Available online at
http://www.bt.cdc.gov.
DeGroot, M.H., 1970: Optimal Statistical Decisions. McGraw Hill, 490 pp.
Glahn, H.R., and D.A. Lowry, 1972: The use of model output statistics (MOS) in objective weather
forecasting. Journal of Applied Meteorology, 11, 1203–1211.
Glahn, H.R., and D.P. Ruth, 2003: The new digital forecast database of the National Weather
Service. Bulletin of the American Meteorological Society, 84, 195–201.
Krzysztofowicz, R., 1987: Markovian forecast processes. Journal of the American Statistical
Association, 82, 31–37.
Krzysztofowicz, R., 1992: Bayesian correlation score: A utilitarian measure of forecast skill.
Monthly Weather Review, 120, 208–219.
Krzysztofowicz, R., 1996: Sufficiency, informativeness, and value of forecasts. Proceedings,
Workshop on the Evaluation of Space Weather Forecasts, Space Environment Center, NOAA,
Boulder, Colorado, 103–112.
Krzysztofowicz, R., and A.A. Sigrest, 1999a: Calibration of probabilistic quantitative precipitation
forecasts. Weather and Forecasting, 14, 427–442.
Krzysztofowicz, R., and A.A. Sigrest, 1999b: Comparative verification of guidance and local quantitative precipitation forecasts: Calibration analyses. Weather and Forecasting, 14, 443–454.
Krzysztofowicz, R., W.J. Drzal, T.R. Drake, J.C. Weyman, and L.A. Giordano, 1993: Probabilistic
quantitative precipitation forecasts for river basins. Weather and Forecasting, 8, 424–439.
Murphy, A.H., 1981: Subjective quantification of uncertainty in weather forecasts in the United
States. Meteorologische Rundschau, 34, 65–77.
Murphy, A.H., and R.L. Winkler, 1979: Probabilistic temperature forecasts: The case for an operational program. Bulletin of the American Meteorological Society, 60, 12–19.
Murphy, A.H., and H. Daan, 1984: Impacts of feedback and experience on the quality of subjective
probability forecasts: Comparison of results from the first and second years of the Zierikzee
experiment. Monthly Weather Review, 112, 413–423.
Murphy, A.H., B.G. Brown, and Y. Sheng Chen, 1989: Diagnostic verification of temperature
forecasts. Weather and Forecasting, 4, 485–501.
National Weather Service, 2000: National Weather Service operations manual. Available online at
http://www.nws.noaa.gov/wsom/manual/archives/NC750006.pdf.
Neill, J.J., and O.J. Dunn, 1975: Equality of dependent correlation coefficients. Biometrics, 31,
531–543.
Nichols, W.S., 1890: The mathematical elements in the estimation of the Signal Service reports.
American Meteorological Journal, 6, 386–392.
Roebber, P.J., and L.F. Bosart, 1996: The complex relationship between forecast skill and forecast
value: A real-world analysis. Weather and Forecasting, 11, 544–559.
Sakaori, F., 2002: A nonparametric test for the equality of dependent correlation coefficients under
normality. Communications in Statistics: Theory and Methods, 31, 2379–2389.
Williams, E.J., 1959: The comparison of regression variables. Journal of the Royal Statistical
Society, Series B, 21, 396–399.
Winkler, R.L., and A.H. Murphy, 1979: The use of probabilities in forecasts of maximum and
minimum temperatures. Meteorological Magazine, 108, 317–329.
Zhu, Y., Z. Toth, R. Wobus, D. Richardson, and K. Mylne, 2002: The economic value of ensemble-based weather forecasts. Bulletin of the American Meteorological Society, 83, 73–83.
Table 1. Overall results of the comparative verification of the MOS guidance and the official forecast at five stations, seven lead times, and twelve windows; predictand: daily maximum temperature.

                               Any significance level    0.05 significance level
Performance Attribute           Count   Percentage 1/     Count   Percentage 1/
1. Informativeness
   MOS superior                  332       79%             135       32%
   Forecast superior              88       21%              15        4%
2. Calibration as Median
   MOS superior                  187       45%
   Forecast superior             214       51%
   MOS uncalibrated                                        136       32%
   Forecast uncalibrated                                   153       36%
3. Calibration as Mean
   MOS superior                  214       51%
   Forecast superior             201       48%
   MOS uncalibrated                                        129       31%
   Forecast uncalibrated                                   156       37%
4. Accuracy
   MOS superior                  306       73%              48       11%
   Forecast superior             111       26%               5        1%

1/ Total number of cases is 420.
Table 2. Season of consistently improved informativeness of the official forecast relative to the MOS guidance at each station; predictand: daily maximum temperature.

Station   Lead times [h]   Months       No. of cases
KSAV      24               Jan – Aug         8
KPWM      24               Oct – Mar         6
KFCA      —                —                 0
KSAT      24, 48           Dec – Jan         4
KFAT      24, 48           Oct – Apr        14
Table 3. Comparison of the official forecast (O) to the MOS guidance (M) in terms of the bias, when both the forecast and the guidance are biased; predictand: daily maximum temperature.

Sign of bias      Magnitude
BM      BO        of bias      Count   Percentage 1/
−       −         Worsened       77       18%
−       −         Improved       62       15%
−       +         Worsened       32        8%
−       +         Improved        9        2%
+       −         Worsened       36        8%
+       −         Improved       41       10%
+       +         Worsened       66       16%
+       +         Improved       88       21%
Total             Worsened      211       50%
Total             Improved      200       48%

1/ Total number of cases is 420.
Table 4. Comparison of the official forecast (O) to the MOS guidance (M) for extreme temperature events in terms of the mean absolute errors (MAE) and their difference ∆MAE = MAEM − MAEO; predictand: daily maximum temperature.

          Lead       Sample   MAEM    MAEO    ∆MAE    P-value
Station   time [h]   size     [°F]    [°F]    [°F]    from t-test
KSAV        24        40      3.45    3.03     0.42     0.50
            48        35      4.40    4.09     0.31     0.71
            72        40      4.68    4.73    −0.05     0.95
            96        39      4.77    5.95    −1.18     0.23
           120        39      5.72    6.10    −0.38     0.75
           144        39      6.64    7.67    −1.03     0.42
           168        39      8.64    8.31     0.33     0.80
KFCA        24        21      3.67    4.24    −0.57     0.58
            48        21      3.24    5.00    −1.76     0.15
            72        20      3.90    6.55    −2.65     0.06
            96        20      4.70    5.35    −0.65     0.66
           120        20      4.75    8.90    −4.15     0.04
           144        20      7.35    9.70    −2.35     0.26
           168        20      8.45   10.65    −2.20     0.28
Table A1. Overall results of the comparative verification of the MOS guidance and the official forecast at five stations, five lead times, and twelve windows; predictand: daily minimum temperature.

                               Any significance level    0.05 significance level
Performance Attribute           Count   Percentage 1/     Count   Percentage 1/
1. Informativeness
   MOS superior                  244       81%             130       43%
   Forecast superior              56       19%              10        3%
2. Calibration as Median
   MOS superior                  145       48%
   Forecast superior             140       47%
   MOS uncalibrated                                        112       37%
   Forecast uncalibrated                                   100       33%
3. Calibration as Mean
   MOS superior                  162       54%
   Forecast superior             135       45%
   MOS uncalibrated                                        122       41%
   Forecast uncalibrated                                   109       36%
4. Accuracy
   MOS superior                  239       80%              41       14%
   Forecast superior              57       19%               7        2%

1/ Total number of cases is 300.
Table A2. Comparison of the official forecast (O) to the MOS guidance (M) in terms of the bias, when both the forecast and the guidance are biased; predictand: daily minimum temperature.

Sign of bias      Magnitude
BM      BO        of bias      Count   Percentage 1/
−       −         Worsened       77       26%
−       −         Improved       54       18%
−       +         Worsened       17        6%
−       +         Improved       26        9%
+       −         Worsened       16        5%
+       −         Improved       17        6%
+       +         Worsened       51       17%
+       +         Improved       37       12%
Total             Worsened      161       54%
Total             Improved      134       45%

1/ Total number of cases is 300.
FIGURE CAPTIONS
Figure 1. Informativeness of the official forecast and the MOS guidance for five stations and
three lead times; predictand: daily maximum temperature.
Figure 2. Average informativeness scores, $\overline{IS}_M$ and $\overline{IS}_O$, for each lead time, and the $P$-value from the t-test of the null hypothesis $\overline{IS}_M = \overline{IS}_O$ against the one-sided alternative hypothesis $\overline{IS}_M > \overline{IS}_O$; predictand: daily maximum temperature.
Figure 3. Calibration of the official forecast and the MOS guidance as the median of the
predictand for five stations and three lead times; predictand: daily maximum temperature.
Figure 4. Average calibration scores, $\overline{CS}_M$ and $\overline{CS}_O$, for each lead time, and the $P$-value from the two-sided t-test of the null hypothesis $\overline{CS}_M = \overline{CS}_O$; predictand: daily maximum temperature.
Figure 5. Average bias magnitudes, $\overline{|B_M|}$ and $\overline{|B_O|}$, for each lead time, and the $P$-value from the two-sided t-test of the null hypothesis $\overline{|B_M|} = \overline{|B_O|}$; predictand: daily maximum temperature.
Figure 6. Scatter plots of the forecasts and observations of the daily maximum temperature on the
days of extreme day-to-day changes; KFCA, lead time of 120 h.
[Figure 1: for each of the lead times 24 h, 96 h, and 168 h, paired panels of the informativeness score (0.2–1.0) by month, Jan–Dec, for the official forecast IS_O and the MOS guidance IS_M at KSAV, KPWM, KFCA, KSAT, KFAT.]
Figure 1. Informativeness of the official forecast and the MOS guidance for five stations and three lead times; predictand: daily maximum temperature.
[Figure 2: average informativeness score for the MOS guidance and the official forecast, and the $P$-value, plotted against lead time (24–168 h).]
Figure 2. Average informativeness scores, $\overline{IS}_M$ and $\overline{IS}_O$, for each lead time, and the $P$-value from the t-test of the null hypothesis $\overline{IS}_M = \overline{IS}_O$ against the one-sided alternative hypothesis $\overline{IS}_M > \overline{IS}_O$; predictand: daily maximum temperature.
[Figure 3: for each of the lead times 24 h, 96 h, and 168 h, paired panels of the exceedance frequency (0.2–1.0) by month, Jan–Dec, for the official forecast F_O and the MOS guidance F_M at the five stations.]
Figure 3. Calibration of the official forecast and the MOS guidance as the median of the predictand for five stations and three lead times; predictand: daily maximum temperature.
[Figure 4: average calibration score (0.00–0.12) for the MOS guidance and the official forecast, and the $P$-value, plotted against lead time (24–168 h).]
Figure 4. Average calibration scores, $\overline{CS}_M$ and $\overline{CS}_O$, for each lead time, and the $P$-value from the two-sided t-test of the null hypothesis $\overline{CS}_M = \overline{CS}_O$; predictand: daily maximum temperature.
[Figure 5: average bias magnitude (0.0–1.4 °F) for the MOS guidance and the official forecast, and the $P$-value, plotted against lead time (24–168 h).]
Figure 5. Average bias magnitudes, $\overline{|B_M|}$ and $\overline{|B_O|}$, for each lead time, and the $P$-value from the two-sided t-test of the null hypothesis $\overline{|B_M|} = \overline{|B_O|}$; predictand: daily maximum temperature.
[Figure 6: three scatter plots in °F: the official forecast vs. the MOS guidance; the observation vs. the official forecast ($IS_O = 0.732$); and the observation vs. the MOS guidance ($IS_M = 0.915$).]
Figure 6. Scatter plots of the forecasts and observations of the daily maximum temperature on the days of extreme day-to-day changes; KFCA, lead time of 120 h.