OFFICIAL FORECAST VERSUS MODEL GUIDANCE: COMPARATIVE VERIFICATION FOR MAXIMUM AND MINIMUM TEMPERATURES

By Roman Krzysztofowicz and W. Britt Evans
University of Virginia, Charlottesville, Virginia

Research Paper RK-0703
http://www.faculty.virginia.edu/rk/

August 2007
Revised December 2008

Copyright © 2007 by R. Krzysztofowicz and W.B. Evans

Corresponding author address: Professor Roman Krzysztofowicz, University of Virginia, P.O. Box 400747, Charlottesville, VA 22904-4747. E-mail: rk@virginia.edu

ABSTRACT

A comparative verification is reported of 13,034 matched pairs of the National Weather Service official forecasts and MOS guidance forecasts of the daily maximum temperature prepared between 2 October 2004 and 28 February 2006. The total sample is arranged into 420 cases (5 stations in diverse climates, 7 lead times of 24–168 h, and 12 sampling windows of four-month length). The attributes being verified are informativeness, calibration, and accuracy. In addition, the performance of forecasts for extreme temperature events is examined in detail, and the potential marginal gain from combining the official forecast with the MOS guidance is evaluated. The verification measures and the statistical tests of significance support these conclusions. (i) The official forecast is consistently (in 79% of all cases) and significantly (in 32% of all cases) less informative than the MOS guidance; only for short lead times (24–48 h) and a few months per year is the relation reversed. (ii) Neither product is well calibrated (with significant miscalibration in 30–40% of the cases); the official forecast is slightly better calibrated as the median, while the MOS guidance is slightly better calibrated as the mean. (iii) For extreme day-to-day changes in the maximum temperature (having the climatic exceedance probability less than 0.1), the official forecast actually depreciates the informativeness and the accuracy of the MOS guidance. (iv) Combining the two forecasts would yield mostly sporadic and small marginal gains because the two forecasts are conditionally dependent, strongly and consistently in 96% of the cases, and the official forecast is uninformative (economically worthless), given the MOS guidance, in 36% of the cases. Similar patterns of performance are found in the daily minimum temperature forecasts: 9,799 matched pairs arranged into 300 cases (5 stations, 5 lead times of 36–132 h, and 12 sampling windows). While not exhaustive, the 420 + 300 = 720 cases are representative enough to call for further investigation of the revealed patterns of forecast performance. If confirmed, these patterns should prompt re-examining and re-designing the role of the field forecasters to better suit the improved guidance products and the emerging paradigm of probabilistic forecasting.

TABLE OF CONTENTS

ABSTRACT
1. INTRODUCTION
   1.1 Basic Questions
   1.2 Experimental Design
   1.3 The Importance of Temperature Forecasts
   1.4 The Verifications of Temperature Forecasts
2. VERIFICATION METHODOLOGY
   2.1 Joint Samples
   2.2 Informativeness
   2.3 Calibration
   2.4 Accuracy
3. PERFORMANCE IN GENERAL
   3.1 Informativeness
   3.2 Calibration as Median
   3.3 Calibration as Mean
   3.4 Accuracy
4. PERFORMANCE FOR EXTREMES
   4.1 Forecast Accuracy
   4.2 Noisy Channel
5. WOULD COMBINING ADD VALUE?
   5.1 Conditional Independence and Uninformativeness
   5.2 Tests and Results
6. CLOSURE
   6.1 Summary
   6.2 Conclusion
   6.3 Discussion
ACKNOWLEDGMENTS
APPENDIX: RESULTS FOR DAILY MINIMUM TEMPERATURE
REFERENCES
TABLES
FIGURES

1. INTRODUCTION

1.1 Basic Questions

"... the local officers might be trained with advantage to supplement the present reports with forecasts for their several communities ..." — so wrote Walter S. Nichols on the pages of the American Meteorological Journal more than a century ago (Nichols, 1890), thus expressing the intellectual foundation for the modus operandi of the National Weather Service (NWS) to this day: that estimates of future weather calculated centrally from numerical weather prediction models and statistical post-processors offer only a guidance to the forecasters in some 120 field offices; that the field forecasters retain the authority to select and adjust the guidance based on recent observations, local analysis, knowledge of local influences, and experience; and that the adjustments are made judgmentally (subjectively), notwithstanding various computer aids available these days to perform the task.

The premise of this modus operandi, of course, is that the field forecasters can and do improve upon the guidance estimates. But as the numerical weather prediction models continue to improve as well, it is wise to re-check periodically the validity of this premise. This is the objective of this paper, albeit limited in scope to two predictands, the daily maximum and minimum temperatures, and to a set of representative stations.

The paper describes a general statistical methodology and reports results of a matched comparative verification of the NWS official forecasts, produced subjectively by the field forecasters, and the guidance forecasts, produced by the Model Output Statistics (MOS) technique (Glahn and Lowry, 1972) and used in the NWS field offices, along with other guidance products, to initialize the digital forecast fields (Glahn and Ruth, 2003). The methodology for the matched comparative verification is structured to answer five basic questions:

1. Is the official forecast more informative than the model guidance? Or, in other words, does the forecaster's judgment add economic value to the guidance?
2. Is the official forecast better calibrated than the model guidance? Or, in other words, can the user take the official forecast (or the model guidance) at face value?
3. Is the official forecast more accurate than the model guidance?
4. Is the official forecast better than the model guidance at predicting extremes?
5. Can a more informative forecast be obtained by fusing the official forecast with the model guidance?

1.2 Experimental Design

To answer these questions, 22,833 pairs of official and guidance forecasts of daily maximum and minimum temperatures are verified at five climatically diverse stations throughout the United States: Savannah, Georgia (KSAV); Portland, Maine (KPWM); Kalispell, Montana (KFCA); San Antonio, Texas (KSAT); Fresno, California (KFAT). With the exception of KFCA, the samples contain data from 2 October 2004 through 28 February 2006; for KFCA, the sample contains data only through 30 June 2005. The sample sizes for each forecast, official and guidance, are identical but vary with the predictand and the station.
For the daily maximum temperature, they are 2,887 (KSAV), 2,890 (KPWM), 1,484 (KFCA), 2,892 (KSAT), 2,881 (KFAT); for the daily minimum temperature, they are 2,187 (KSAV), 2,162 (KPWM), 1,115 (KFCA), 2,168 (KSAT), 2,167 (KFAT); they are about evenly distributed among the lead times.

The body of the paper reports results for the daily maximum temperature. The 13,034 pairs of official and guidance forecasts are arranged into 420 cases: 5 stations, 7 lead times (24, 48, 72, 96, 120, 144, 168 h), and 12 sampling windows of four-month length. The appendix reports selected results for the daily minimum temperature. The 9,799 pairs of official and guidance forecasts are arranged into 300 cases: 5 stations, 5 lead times (36, 60, 84, 108, 132 h), and 12 sampling windows of four-month length. For both predictands, the revealed patterns of forecast performance are similar and thus support the same answers to the five basic questions.

1.3 The Importance of Temperature Forecasts

Temperature forecasts are important to many sectors of the nation's economy: agriculture, transportation, energy production, and healthcare. For example, orchardists must decide whether or not to protect their orchards each night during the frost season (Baquet et al., 1976). Since the cost of heating an orchard is substantial, and since an entire season's harvest is at stake, forecasts of minimum temperature are needed to optimally weigh the tradeoffs (Murphy and Winkler, 1979). Forecasts of maximum temperature during the warm season and of minimum temperature during the cool season are used regularly by electric utility companies whose operators must decide when to commit additional generating capacity, or to purchase supplemental power from other utilities, or to schedule maintenance and repairs. When used in an optimal decision procedure for power generation planning, the deterministic temperature forecasts yield substantial economic gains; the probabilistic forecasts yield even higher gains (Alexandridis and Krzysztofowicz, 1982, 1985). Minimum temperature forecasts can be of value in scheduling aircraft de-icing, outdoor painting, artificial snow production, and service calls for cars (Murphy and Winkler, 1979). Extreme maximum temperatures need to be forecasted because they can be dangerous: the Centers for Disease Control and Prevention (2004) report that excessive heat exposure caused 8,966 deaths in the United States between 1979 and 2002. So, extreme heat caused more deaths than hurricanes, lightning, tornadoes, floods, and earthquakes combined.

1.4 The Verifications of Temperature Forecasts

Murphy et al. (1989) compared MOS guidance forecasts of maximum temperature to subjective forecasts produced by the NWS forecasters in Minneapolis, Minnesota; no conclusion was reached regarding which forecast is superior. Roebber and Bosart (1996) compared the value of MOS guidance forecasts of daily maximum temperature to the value of official forecasts for Albany, New York, in the years 1970–1993 for several potential users and found that human intervention to produce the official forecasts "has generally led to minimal gains in value beyond that which is obtainable through direct use of numerical-statistical guidance."

2. VERIFICATION METHODOLOGY

2.1 Joint Samples

The essence of the verification methodology is to compare three continuous variates: W — the predictand, which is the uncertain quantity being forecasted; X_O — the official forecast variate; and X_M — the model guidance variate.
Their realizations (observations) are denoted w, x_O, x_M, respectively. Their joint sample of size N is denoted {(w(n), x_O(n), x_M(n)) : n = 1, ..., N}. For each station and lead time, the joint observations from the available record are allocated to twelve 4-month verification windows, designated by the end month. For example, the joint sample for May contains all the forecasts issued between 1 February and 31 May. The 3-month stagger of the verification windows implies that any joint observation affects the values of a verification measure in four consecutive months. As a result, the sample size for each month is increased, and the time series of a measure behaves similarly to a moving average. This is a statistical compromise, of course. It reduces the month-to-month sampling variability of the measure (which helps to discern seasonal patterns in forecast performance), while risking a degree of heterogeneity because of the non-stationarity of the predictand (due to seasonality) — but not as much as the common verifications for 6-month seasons (cool and warm).

In this design, the joint samples are formed and the basic verification measures are calculated and compared for cases: 420 cases for daily maximum temperature (5 stations, 7 lead times, 12 months), and 300 cases for daily minimum temperature (5 stations, 5 lead times, 12 months). The measures characterize three attributes of forecasts: informativeness, calibration, and accuracy. Within the Bayesian decision theory, which represents the viewpoint of a rational decision maker, only informativeness and calibration matter; accuracy is included herein because of its traditional usage.
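To make the staggered-window design concrete, the following Python sketch allocates a station's joint observations to the twelve overlapping 4-month windows; the data layout and names are illustrative assumptions, not the authors' code.

```python
from collections import defaultdict

def four_month_windows(joint_sample):
    """Allocate joint observations to twelve 4-month verification windows,
    each designated by its end month (1..12). Because the windows are
    staggered by 3 months, every observation falls into four of them.

    joint_sample: iterable of (w, x_o, x_m, issue_date) tuples,
    where issue_date is a datetime.date.
    """
    windows = defaultdict(list)
    for w, x_o, x_m, date in joint_sample:
        # A forecast issued in month m belongs to the windows ending in
        # months m, m+1, m+2, m+3 (e.g., February -> Feb, Mar, Apr, May).
        for offset in range(4):
            end_month = (date.month - 1 + offset) % 12 + 1
            windows[end_month].append((w, x_o, x_m))
    return windows
```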
2.2 Informativeness

Informativeness of a forecast system with respect to a given predictand is a concept defined within the Bayesian decision theory for the purpose of ordering forecast systems according to their economic values. The theoretical foundation of informativeness was laid down by Blackwell (1951, 1953), while the specific measure of informativeness to be employed herein was derived by Krzysztofowicz (1987, 1992, 1996). The gist of the concept is this. Forecast system A is said to be more informative than forecast system B if and only if the value of the forecast produced by A is at least as high as the value of the forecast produced by B for all rational decision makers faced with structurally similar decision problems. (The value of the forecast is to be understood in the mathematical sense, as defined in the Bayesian decision theory (e.g., Alexandridis and Krzysztofowicz, 1982, 1985).) The informativeness score, IS, whose calculation is detailed below, is bounded, 0 ≤ IS ≤ 1, with IS = 0 implying an uninformative (worthless) forecast system, and IS = 1 implying a perfect forecast system. When it is determined for each of two forecast systems, the following inference can be made: if IS_A > IS_B, then forecast system A is more informative than forecast system B; if IS_A = IS_B, then the two systems are equivalent, and one should be indifferent between selecting A or B. (The informativeness score was called the Bayesian correlation score in the original publication by Krzysztofowicz (1992).)

This inference rule establishes an ordinal correspondence between a statistical performance measure and an economic performance measure, which has a profound implication: the forecast system having the maximum informativeness score ensures maximum economic value to every rational decision maker and, therefore, should be preferred by the utilitarian society. (The informativeness score can also be interpreted as a measure of the degree by which the forecast produced by a given system reduces the uncertainty about the predictand, relative to the prior (climatic) uncertainty.)

The informativeness scores of the official forecast, IS_O, and of the model guidance, IS_M, are specified as follows. Let G, K_O, K_M denote the marginal distribution functions of variates W, X_O, X_M, respectively; let Q^{-1} denote the inverse of the standard normal distribution function; and let the normal quantile transform (NQT) of each variate be defined by

  $V = Q^{-1}(G(W))$,  (1a)
  $Z_O = Q^{-1}(K_O(X_O))$,  (1b)
  $Z_M = Q^{-1}(K_M(X_M))$.  (1c)

The informativeness scores are equal to the Pearson product-moment correlation coefficients from the correlation matrix R = {r_ij} of the standard normal variates (V, Z_O, Z_M):

  $IS_O = r_{12} = \mathrm{Cor}(V, Z_O)$,  (2a)
  $IS_M = r_{13} = \mathrm{Cor}(V, Z_M)$.  (2b)

The statistical procedures for implementing (1)–(2) can be found in Krzysztofowicz (1992).

To determine if there is a statistically significant difference between IS_O and IS_M, Williams' test statistic T_W can be used (Williams, 1959):

  $T_W = (r_{12} - r_{13}) \sqrt{\frac{(N-1)(1+r_{23})}{2\left(\frac{N-1}{N-3}\right)|R| + \bar{r}^2(1-r_{23})^3}}$,  (3)

where |R| is the determinant of the correlation matrix, $\bar{r} = (r_{12} + r_{13})/2$, $r_{23} = \mathrm{Cor}(Z_O, Z_M)$, and N is the sample size. The statistic T_W has the t distribution with N − 3 degrees of freedom. This statistic is ideally suited for testing the null hypothesis that two correlation coefficients are equal (r_12 = r_13) under the trivariate normal distribution when one of the variates is common and the sample size is small or moderate (Neill and Dunn, 1975; Sakaori, 2002).
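As an illustration of (1)–(3), the sketch below estimates the informativeness scores with an empirical NQT and evaluates Williams' statistic. The plotting-position estimate of the marginal distributions is an assumption made here for brevity; the paper defers the exact procedures to Krzysztofowicz (1992).

```python
import numpy as np
from scipy import stats

def nqt(x):
    """Empirical normal quantile transform: map a sample to standard normal
    scores via the plotting positions rank/(N+1), per equations (1a)-(1c)."""
    ranks = stats.rankdata(x)
    return stats.norm.ppf(ranks / (len(x) + 1.0))

def informativeness_scores(w, x_o, x_m):
    """IS_O = Cor(V, Z_O) and IS_M = Cor(V, Z_M) of equations (2a)-(2b),
    plus r_23 = Cor(Z_O, Z_M), which Williams' test needs."""
    v, z_o, z_m = nqt(w), nqt(x_o), nqt(x_m)
    r = np.corrcoef(np.vstack([v, z_o, z_m]))
    return r[0, 1], r[0, 2], r[1, 2]          # IS_O, IS_M, r_23

def williams_test(r12, r13, r23, n):
    """Williams' statistic (3) for H0: r12 = r13 when one variate is common;
    under H0, T_W has the t distribution with N - 3 degrees of freedom."""
    det_r = 1.0 - r12**2 - r13**2 - r23**2 + 2.0 * r12 * r13 * r23  # |R|
    r_bar = 0.5 * (r12 + r13)
    t_w = (r12 - r13) * np.sqrt(
        (n - 1) * (1 + r23)
        / (2.0 * ((n - 1) / (n - 3)) * det_r + r_bar**2 * (1 - r23) ** 3)
    )
    return t_w, 2.0 * stats.t.sf(abs(t_w), df=n - 3)   # statistic, P-value
```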
2.3 Calibration

A forecast system is said to be well calibrated if the forecast has a well-defined interpretation which is consistently maintained over time (Krzysztofowicz and Sigrest, 1999a, 1999b). A well calibrated forecast can be taken at face value by every user — a basic requirement in communicating scientific information. The fundamental deficiency of the NWS forecast is the lack of any official interpretation. Therefore, two interpretations are considered: the median and the mean. Only the marginal calibration is verified, which is necessary for the conditional calibration. (A discussion of the conditional calibration can be found in the references cited above.)

A condition for the marginal calibration of X as the median of W is P(W > X) = 0.5, where P stands for probability. A measure of calibration is the exceedance frequency:

  $F = m/N$,  (4)

where m is the number of times w(n) > x(n) in the joint sample, and N is the sample size. The forecast may be interpreted as the median of the predictand if F = 0.5. A two-sided exact binomial test of the null hypothesis that F = 0.5 can be used to determine if the forecast is significantly uncalibrated as the median. A measure of the degree of miscalibration is the calibration score:

  $CS = |F - 0.5|$.  (5)

A condition for the marginal calibration of X as the mean of W is E(X) = E(W), where E stands for expectation. A measure of calibration is the forecast bias, or the mean error:

  $B = \frac{1}{N}\sum_{n=1}^{N}[x(n) - w(n)]$.  (6)

The forecast may be interpreted as the mean of the predictand if B = 0. A two-sided t-test of the null hypothesis that B = 0 can be used to determine if the forecast is significantly uncalibrated as the mean.

2.4 Accuracy

Accuracy is a popular attribute of forecasts. Herein it is measured in terms of the mean absolute error:

  $MAE = \frac{1}{N}\sum_{n=1}^{N}|x(n) - w(n)|$.  (7)

This measure is consistent with the NWS Operations Manual (National Weather Service, 2000), and is easier to interpret than the mean square error. A two-sided paired t-test can be performed of the null hypothesis that MAE_O = MAE_M. In advocating the MAE, Brooks and Doswell (1996) note that bias says nothing about the accuracy of a forecast. For example, a forecast system that makes five forecasts each 20° too warm and five forecasts each 20° too cold has the same bias as a forecast system that makes ten perfect forecasts.

We must note that accuracy confounds informativeness with calibration. For example, a forecast system that makes every forecast 20° too high is inaccurate; however, once the bias is detected, it can easily be subtracted to obtain perfect forecasts. The MAE = 20°, as in Brooks and Doswell's example, but the two Bayesian verification measures offer a superior diagnosis: the system is miscalibrated, because B = 20°, yet most informative, because IS = 1.
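The calibration and accuracy measures of Sections 2.3–2.4, with their significance tests, reduce to a few lines of code. The sketch below uses SciPy's exact binomial test and one-sample t-test; the function name and return layout are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def calibration_and_accuracy(w, x):
    """Exceedance frequency (4), calibration score (5), bias (6), and
    MAE (7) of a forecast sample x against the predictand sample w,
    with the tests of Sections 2.3-2.4."""
    w, x = np.asarray(w, float), np.asarray(x, float)
    n = len(w)
    m = int(np.sum(w > x))
    f = m / n                                            # exceedance frequency (4)
    return {
        "F": f,
        "CS": abs(f - 0.5),                              # calibration score (5)
        "p_median": stats.binomtest(m, n, 0.5).pvalue,   # exact test of F = 0.5
        "B": np.mean(x - w),                             # bias (6)
        "p_mean": stats.ttest_1samp(x - w, 0.0).pvalue,  # test of B = 0
        "MAE": np.mean(np.abs(x - w)),                   # mean absolute error (7)
    }

# Two-sided paired t-test of H0: MAE_O = MAE_M on a matched joint sample:
# stats.ttest_rel(np.abs(x_o - w), np.abs(x_m - w))
```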
3. PERFORMANCE IN GENERAL

To recall, there are 420 cases (5 stations, 7 lead times, 12 verification windows designated by the end months). In each case, four basic performance measures are computed for the official forecast and the MOS guidance. Then the corresponding measures are compared directly and via statistical tests. The overall results of the case-by-case comparisons are reported in Table 1; they are discussed below together with the results of more detailed analyses.

3.1 Informativeness

The time series (Fig. 1) of the informativeness score, IS, of the official forecast and the MOS guidance at every station for lead times of 24, 96, and 168 h reveal four properties. First, the official forecast IS generally tracks the MOS guidance IS. Second, the IS decreases with lead time, as expected. Third, at every station, the month-to-month variability of IS increases with lead time: for the 24 h lead time, the IS is nearly stationary (except for MOS at KFAT), but for the 168 h lead time, it is definitely non-stationary. Fourth, the degree of non-stationarity varies across the stations: it is the largest at San Antonio, TX, where the forecasts with the 96–168 h lead times in August–September are the least informative.

In the case-by-case comparisons (Table 1), IS_M > IS_O as many as 332 times (79% of the cases), implying that the MOS guidance is more informative than the official forecast; IS_O > IS_M only 88 times (21% of the cases). Using Williams' test statistic (3), IS_M is superior 135 times (32% of the cases) at the 0.05 significance level; IS_O is superior only 15 times (4% of the cases) at the 0.05 significance level. This yields the winning ratio 135/15 = 9/1 in favor of the MOS guidance.

Whereas the inequality IS_M > IS_O is statistically significant in only 32% of the cases, it is present in 79% of the cases. In other words, the MOS guidance is superior consistently, though not always significantly, in the majority of the cases. What are these cases, and is the consistent superiority statistically significant? To find the answers, Fig. 2 compares the average informativeness scores of the MOS guidance, $\overline{IS}_M$, and the official forecast, $\overline{IS}_O$, across stations and months for each lead time. Also shown are P-values from the t-test of the null hypothesis against the one-sided alternative hypothesis $\overline{IS}_M > \overline{IS}_O$. For lead times of 24 h and 48 h, the hypothesis that neither forecast is more informative cannot be rejected (P-values 0.491, 0.143). For lead times greater than 48 h, the MOS guidance is significantly more informative than the official forecast with a P-value near zero. As lead time increases, the difference between the average informativeness scores also increases.

A closer examination of the individual cases reveals that at four out of five stations, there exists a season of consistently improved informativeness (CII); it applies to short lead times, either 24 h, or 24 h and 48 h (Table 2). Within the CII season, $\overline{IS}_O > \overline{IS}_M$ with a P-value of nearly zero. However, outside this season, $\overline{IS}_M > \overline{IS}_O$ with a P-value of about zero. Thus, a statistically significant explanation of the consistent superiority is this: at each station, there exists a CII season during which the official forecasts with lead times of up to 48 h are more informative than the MOS guidance; there are 32 such cases (8%). For lead times longer than those within the CII season, and for all lead times outside the CII season, the MOS guidance is more informative; there are 332 such cases (79%).

Finally, let us digress that the CII season varies from station to station, and only at Portland, ME, and at Fresno, CA, does it fully overlap the official cool season (October–March). This underscores the deficiency of verification studies that assume the stationarity of forecast system performance during a fixed six-month season (cool or warm) at every station and pool the samples from many stations (even from all stations in the U.S.). Such studies may misrepresent the system performance because they may wash out the statistically significant differences between the stations and the sub-seasons.

3.2 Calibration as Median

The exceedance frequency F measures the degree to which the forecast is calibrated as the median of the predictand. The time series (Fig. 3) of F for the official forecast and the MOS guidance at every station for lead times of 24, 96, and 168 h exhibit four properties. First, both the MOS guidance and the official forecast lack a consistent probabilistic interpretation; for instance, the 96-h official forecast at San Antonio, TX, constitutes the 0.32 exceedance probability quantile of the predictand (essentially the third tercile) in July, and the 0.67 exceedance probability quantile of the predictand (the first tercile) in December. Second, the differences in the interpretations of forecasts at various stations in the same month are equally large. Third, a seasonal trend is present as F generally declines in the summer, below 0.5 in 26 out of 30 time series, indicating that the forecasts are notoriously too high in the summer. Fourth, the cross-station and the month-to-month variability of F appears to be unaffected by the lead time — this is how it should be.

The last property is validated formally in Fig. 4, which shows the average calibration scores of the MOS guidance, $\overline{CS}_M$, and the official forecast, $\overline{CS}_O$, across stations and months for each lead time. Also shown are P-values from the two-sided t-test of the null hypothesis $\overline{CS}_M = \overline{CS}_O$.
While the smallest scores are recorded for 96 h, $\overline{CS}_M > \overline{CS}_O$ for lead times up to 72 h, and $\overline{CS}_M < \overline{CS}_O$ for lead times longer than 72 h, the pattern is weak and the null hypothesis is never rejected.

A forecast system is said to be uncalibrated if the null hypothesis F = 0.5 is rejected at the 0.05 significance level. The official forecast is uncalibrated 153 times and the MOS guidance is uncalibrated 136 times; however, when F_O is compared directly to F_M, the official forecast is better calibrated 214 times, while the MOS guidance is better calibrated 187 times (Table 1).

Because the MOS guidance is better calibrated as the mean (the fact to be demonstrated in Section 3.3), but the official forecast is better calibrated as the median, it appears the forecasters make their judgmental forecasts as if they were re-calibrating the MOS guidance from one interpretation to another. This type of adjustment to guidance has been documented before (Krzysztofowicz and Sigrest, 1999b). It is natural from the cognitive standpoint because an estimate of the median has an intuitive interpretation, which the forecaster can validate judgmentally in real time (via the question: is the actual observation equally likely to fall below or above my estimate?). On the contrary, an estimate of the mean (as the mathematical expectation) is an abstraction, which has no intuitive interpretation and cannot be validated judgmentally. This is the first reason for adopting the median for the official interpretation of the guidance and the forecast.

3.3 Calibration as Mean

The bias B measures the degree to which the forecast is calibrated as the mean of the predictand. Of the 420 cases (Table 1), the official forecast bias is less than the MOS guidance bias in 201 cases (48%), while the reverse is true in 214 cases (51%). Thus, the MOS guidance is generally better calibrated as the mean of the predictand than the official forecast, but the difference is not substantial. The test of the null hypothesis B = 0 against the two-sided alternative hypothesis sharpens the contrast. At the 0.05 significance level, the official forecast has a non-zero bias in 156 cases (37%), while the MOS guidance has a non-zero bias in 129 cases (31%). Thus, the MOS guidance is calibrated as the mean slightly more often than the official forecast.

Whereas the extent to which the forecasters actually anchored their judgments on the MOS guidance cannot be inferred from the present data, it is still instructive to examine the official forecasts as if they were derived from the MOS guidance (Table 3). The most distressful are the cases wherein the forecasters switched the sign of the bias and increased the magnitude: 32 + 36 = 68. Next are the cases wherein the forecasters retained the sign of the bias but increased the magnitude: 77 + 66 = 143. Lastly, there are 3 cases wherein the MOS guidance was sans bias and the forecasters introduced one. All in all, as if tossing a coin, the forecasters worsened the bias in 50% of the cases, and reduced the bias in 48% of the cases.

The time series of B, not shown herein, exhibit some characteristics similar to those seen in Fig. 3. The bias is station-dependent and month-dependent. Thus, in general, no spatial and no temporal stationarity of the bias can be assumed. But unlike the time series of F in Fig. 3, the time series of B grow more and more variable as the lead time increases.
This is confirmed by plotting (Fig. 5) the average bias magnitudes of the MOS guidance, $\overline{|B_M|}$, and the official forecast, $\overline{|B_O|}$, for each lead time, along with P-values from the two-sided t-test of the null hypothesis $\overline{|B_M|} = \overline{|B_O|}$. From the comparison of Fig. 5 with Fig. 4, it is apparent that the calibration of the MOS guidance and the official forecast is more stable in the median than in the mean. This is the second reason for adopting the median for the official interpretation of the guidance and the forecast.

3.4 Accuracy

In terms of the accuracy measured by MAE (Table 1), the MOS guidance is superior 306 times (73% of the cases), whereas the official forecast is superior just 111 times (26% of the cases). At the 0.05 significance level, the corresponding numbers are 48 (11%) and 5 (1%). The time series of MAE, not shown herein, for all stations and lead times reveal the already familiar properties: non-stationarity and station-to-station variability, both of which increase with the lead time.

4. PERFORMANCE FOR EXTREMES

4.1 Forecast Accuracy

A common justification of subjective adjustments by the NWS forecasters to guidance forecasts rests on the presumption that, thanks to their expertise and experience, the forecasters can diagnose a particular weather pattern and assess model performance in evolving that pattern. Therefore, the argument goes, the forecasters can improve upon the guidance, especially when the weather forecasts matter most — in times of extremes; however, verifications performed on samples comprising all observations may not reveal this presumed advantage of the official forecasts. We set out to test this hypothesis statistically.

An extreme temperature event is said to occur on day n if the absolute difference |w(n) − w(n−1)| between the maximum temperatures on two consecutive days, n − 1 and n, has the exceedance probability of 0.1 or lower. The objective is to verify the forecaster's ability to predict the extreme day-to-day changes in the daily maximum temperature. Because such changes are rare, the four-month sampling window is abandoned. Instead, for every station and lead time, a subsample containing all extreme events is formed by selecting from the entire joint sample the 10% of the days on which the largest absolute differences were recorded (a code sketch of this selection appears at the end of this subsection). There are thus 35 cases (5 stations and 7 lead times) with the sample sizes between 35 and 40, except at KFCA where the sample sizes are 20 or 21.

In each case, MAE_M and MAE_O are calculated and used in the two-sided t-test of the null hypothesis MAE_M = MAE_O. The results for the two stations offering the strongest (KSAV) and the weakest (KFCA) support for the null hypothesis are reported in Table 4. Out of all 35 cases, the MOS guidance is superior to the official forecast 28 times (80% of the cases). In the case-by-case tests, the difference MAE_M − MAE_O is significant only twice at the 0.05 level. However, the consistency of the difference is overwhelming, as the null hypothesis on the averages, $\overline{MAE}_M = \overline{MAE}_O$, is rejected by the t-test in favor of the alternative hypothesis $\overline{MAE}_M < \overline{MAE}_O$ with the P-value of 0.030.

In conclusion, the above analysis does not support the hypothesis that human judgment is superior to the meteorological models in forecasting the extreme day-to-day changes in the daily maximum temperature. On the contrary, it is the MOS guidance that is superior, while not significantly in individual cases, consistently across all stations and lead times.
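A minimal sketch of the extreme-event subsample selection just described, assuming a gap-free daily series of observed maxima for one station and lead time (the paper matches consecutive calendar days; handling of missing days is omitted here). The names are illustrative.

```python
import numpy as np

def extreme_event_subsample(w, x_o, x_m, frac=0.10):
    """Return the joint observations for the `frac` of days with the largest
    absolute day-to-day change |w(n) - w(n-1)| in the observed maximum."""
    w, x_o, x_m = (np.asarray(a, float) for a in (w, x_o, x_m))
    change = np.abs(np.diff(w))            # |w(n) - w(n-1)| for n = 2, ..., N
    k = max(1, int(round(frac * change.size)))
    days = np.argsort(change)[-k:] + 1     # indices of day n in the series
    return w[days], x_o[days], x_m[days]
```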
4.2 Noisy Channel

Another way to draw an inference from Table 4 is to note that the best improvement the forecasters could muster in the MAE is 0.42°F (which is statistically insignificant at the 0.05 level) at Savannah, GA, for the 24-h lead time. At the same time, by making their own forecasts rather than adopting the MOS guidance as the official forecast, the forecasters deteriorated the MAE by more than 0.42°F in 23 cases, with the largest deterioration of 4.15°F (which is statistically significant at the 0.05 level) at Kalispell, MT, for the 120-h lead time.

To visualize the failure of the official forecasts in this last case, all sample points are shown in Fig. 6. The scatter plot in the upper left corner shows the mapping of the MOS guidance into the official forecast. The other two scatter plots compare each of the two forecasts with the observation. The overall impression they convey is that the scatter of the official forecasts around the diagonal (on which perfect forecasts would lie) is larger than the scatter of the MOS guidance. Statisticians have a technical term for such a mapping of one forecast into another (DeGroot, 1970, Chapter 14; Krzysztofowicz, 1992) — an auxiliary randomization. It is as if the forecaster took the MOS guidance and processed it through a noisy channel. Of course, the output from the noisy channel is always less informative than the input. The informativeness score quantifies exactly that: IS_O = 0.732 < 0.915 = IS_M.
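The noisy-channel effect is easy to reproduce on synthetic data: adding independent noise to a forecast can only lower its correlation with the predictand. Everything below (sample size, correlation, noise level) is an arbitrary assumption for illustration; it is not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
v = rng.standard_normal(n)                        # NQT-transformed predictand
z_m = 0.9 * v + np.sqrt(1 - 0.9**2) * rng.standard_normal(n)       # "guidance"
z_o = (z_m + 0.5 * rng.standard_normal(n)) / np.sqrt(1 + 0.5**2)   # noisy channel
print(round(np.corrcoef(v, z_m)[0, 1], 3))        # about 0.90 (input IS)
print(round(np.corrcoef(v, z_o)[0, 1], 3))        # about 0.80 (output IS, lower)
```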
5. WOULD COMBINING ADD VALUE?

Even though in most of the cases the official forecast X_O is less informative than the MOS guidance X_M, it is still possible that the information imparted to X_O by the field forecaster supplements in some ways the information contained in X_M. If this is so, then combining X_O with X_M through a Bayesian processor may yield a forecast X_C which is more informative than either X_O or X_M (and which, therefore, has the economic value at least as high as either X_O or X_M). Our objective is to test this hypothesis.

5.1 Conditional Independence and Uninformativeness

Let f(x_O, x_M | w) denote the joint density function of variates (X_O, X_M), evaluated at the point (x_O, x_M), and conditional on the hypothesis that W = w. When viewed as a function of w at a fixed point (x_O, x_M), it is called the likelihood function of predictand W. All points (x_O, x_M) specify a family of likelihood functions. The family of likelihood functions is the key construct in a Bayesian combining model. It is also the construct that allows us to determine if there would be any gain from combining the official forecast X_O with the MOS guidance X_M. Toward this end, the likelihood function is factorized:

  $f(x_O, x_M | w) = f(x_O | x_M, w) f(x_M | w)$.  (8)

One can interpret this factorization as a framework for developing a combining model in two stages. First, the MOS guidance is introduced as a predictor through the likelihood function f(x_M | w). Second, with X_M already used, the official forecast X_O is introduced as the second predictor through the conditional likelihood function f(x_O | x_M, w). Two particular situations are of special interest.

Definition 1 (Conditional Independence). The predictor X_O is independent of predictor X_M, conditional on predictand W, if at every point (x_O, x_M, w) of the sample space,

  $f(x_O | x_M, w) = f(x_O | w)$.  (9)

Definition 2 (Conditional Uninformativeness). The predictor X_O is uninformative for predictand W, conditional on predictor X_M, if at every point (x_O, x_M, w) of the sample space,

  $f(x_O | x_M, w) = f(x_O | x_M)$.  (10)

These two situations bound the economic gain from combining two predictors. The maximum gain results if X_O is conditionally independent of X_M. No gain results if X_O is conditionally uninformative, given X_M; and if X_M is also more informative than X_O, then X_O is worthless.

5.2 Tests and Results

To test the hypotheses of conditional independence and conditional uninformativeness, each of the three variates is subjected to the NQT, as defined in Section 2.2, and then Z_O is regressed on V and Z_M in the standard normal space:

  $Z_O = aV + bZ_M + c + \Theta$,  (11)

where a and b are regression coefficients, c is the intercept, and Θ is the residual. A two-sided t-test is next performed on the significance of each regression coefficient. If the null hypothesis b = 0 cannot be rejected, then the official forecast X_O is conditionally independent of the MOS guidance X_M. If the null hypothesis a = 0 cannot be rejected, then the official forecast X_O is conditionally uninformative, given the MOS guidance X_M. As before, all results are reported at the 0.05 significance level.
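A sketch of the test of regression (11) in the NQT space follows, reusing the `nqt` helper from the earlier sketch; the use of the `statsmodels` package is an assumption made here (any OLS routine reporting coefficient t-tests would serve).

```python
import numpy as np
import statsmodels.api as sm

def combining_tests(v, z_o, z_m):
    """Fit Z_O = a*V + b*Z_M + c + error, per equation (11), and report the
    two-sided t-test P-values of the coefficients:
      H0: b = 0  ->  X_O conditionally independent of X_M, given W;
      H0: a = 0  ->  X_O conditionally uninformative, given X_M."""
    design = sm.add_constant(np.column_stack([v, z_m]))  # columns: c, a, b
    fit = sm.OLS(z_o, design).fit()
    p_a = fit.pvalues[1]   # test of a = 0 (conditional uninformativeness)
    p_b = fit.pvalues[2]   # test of b = 0 (conditional independence)
    return {"a": fit.params[1], "b": fit.params[2], "p_a": p_a, "p_b": p_b}
```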
When the official forecast is viewed as an adjustment of the MOS guidance, adjusting has the statistical effect of tinkering: the calibration of the forecast as the mean is improved in 45–48% of the cases and worsened in 50–54% of the cases. 3. Contrary to the popular notion that field forecasters can predict extreme events better than models do, the official forecast actually depreciates the informativeness and the accuracy of the MOS guidance for extreme events (those having the climatic exceedance probability less than 0.1). From the viewpoint of a forecast user, the official forecast appears like the MOS guidance processed through a noisy channel. 20 4. Combining the official forecast with the MOS guidance statistically, in the hope of producing a forecast superior to either one of those combined, would yield mostly sporadic and small marginal gains because the two forecasts are conditionally dependent, strongly and consistently (in 96% of the cases), and the official forecast does not provide any additional information beyond that contained in the MOS guidance in 36% of the cases. 6.2 Conclusion The verification measures and the statistical tests of significance summarized above imply that the answers to the five basic questions (Section 1.1) are most likely: No, No, No, No, No. Together with the meaning of the informativeness score (Section 2.2), these answers imply mathematically that a rational decision maker may expect the economic value of the model guidance to be at least as high as the economic value of the official forecast. Ergo, the model guidance for the daily maximum or minimum temperature is the preferred forecast from the viewpoint of the utilitarian society. 6.3 Discussion Inasmuch as the advances in numerical weather prediction models and in statistical postprocessors improve the predictive performance of guidance products, the role of the field forecasters should be re-examined periodically, and the tasks they perform should be re-designed to suit the improved information they receive. The basic question is What tasks should a field forecaster perform in order to make the optimal use of his judgmental capabilities towards improving the local weather forecasts? As the verification results reported herein imply, at least for forecasting the daily maximum and minimum temperatures, the task that has been the staple of the field forecaster’s job throughout the 20th century — the judgmental adjustment of a deterministic guidance forecast produced centrally to account for information available locally — is about to be rendered 21 purposeless by the improvements in the numerical-statistical models, as anticipated (Bosart, 2003). A major technical innovation at the onset of the 21st century is the quantification of uncertainty in operational weather forecasts. Whereas ensemble forecasting methods and statistical post-processors are in the center of attention, they are far from maturity: their assessments of uncertainty utilize only a fraction of the total weather information available at any place and time. Thus arises a significant new opportunity: to re-design the role of the field forecaster to make it compatible with, and beneficial to, the emerging paradigm of probabilistic forecasting. Research has shown that weather forecasters can judgmentally assess vast amounts of complex information to detect and quantify the uncertainty (e.g., Murphy and Winkler, 1979; Winkler and Murphy, 1979; Murphy, 1981; Murphy and Daan, 1984; Krzysztofowicz and Sigrest, 1999a). 
This research has been ahead of its time. Now is the opportunity to harness its results in modernizing the role of the field forecaster for the 21st century. As a prelude to systemic re-design, two steps are recommended, after Krzysztofowicz and Sigrest (1999b, p. 452): (i) The deterministic guidance forecast of every continuous predictand should be given an official probabilistic interpretation. The median of the predictand (not the mean) is the preferred interpretation because it conveys at least some rudimentary assessment of uncertainty (the 50% chance of the actual observation being either below or above the forecast estimate), which the field forecasters and the decision makers can grasp intuitively. (ii) The official forecast should be given the same probabilistic interpretation as the guidance forecast, so that the field forecasters can channel their skills toward improving the calibration of an estimate rather than re-calibrating the estimate from one interpretation to another (and degrading the guidance forecast in the process — as they do currently for the daily maximum and minimum temperatures).

ACKNOWLEDGMENTS

This material is based upon work supported by the National Science Foundation under Grant No. ATM-0135940, "New Statistical Techniques for Probabilistic Weather Forecasting". The Meteorological Development Laboratory of the National Weather Service provided the data. Mark S. Antolik researched a set of representative stations for the U.S. from which the five stations used herein were selected. An anonymous reviewer suggested expanding the study to daily minimum temperature forecasts.

APPENDIX: RESULTS FOR DAILY MINIMUM TEMPERATURE

A reviewer of an earlier manuscript, which reported verification results for daily maximum temperature only, advised expanding the study to daily minimum temperature based on the following argument: "The maximum temperature should be the easiest predictand for MOS because it is generally well related to large-scale lower tropospheric variables that one expects to be handled well by numerical weather prediction models. On the other hand, the minimum temperature is much more subject to local influences. If the human-mediated minimum temperature forecasts also do not show improvement over MOS, then the thesis of this paper would be much more strongly supported."

In parallel to the results reported in the body of this paper, we performed a comparative verification of the NWS official forecasts and MOS guidance forecasts of the daily minimum temperature. Having found similar patterns of performance, we report herein the main results of the case-by-case comparisons. There are 300 cases (5 stations, 5 lead times, 12 verification windows designated by the end months). In each case, four basic performance measures are computed for the official forecast and the MOS guidance. Then the corresponding measures are compared directly and via statistical tests.

A.1 Informativeness

In the case-by-case comparisons (Table A1), IS_M > IS_O as many as 244 times (81% of the cases), implying that the MOS guidance is more informative than the official forecast; IS_O > IS_M only 56 times (19% of the cases). Using Williams' test statistic (3), IS_M is superior 130 times (43% of the cases) at the 0.05 significance level; IS_O is superior only 10 times (3% of the cases) at the 0.05 significance level. This yields the winning ratio 130/10 = 13/1 in favor of the MOS guidance. For the maximum temperature, this winning ratio is 9/1 in favor of the MOS guidance (Table 1).
Thus, even if the minimum temperature is subject to local influences, the forecasters appear incapable of taking advantage of this knowledge.

A.2 Calibration as Median

Of the 300 cases (Table A1), the official forecast is better calibrated 140 times (47% of the cases), while the MOS guidance is better calibrated 145 times (48% of the cases). The difference in favor of the MOS guidance (by 1%) is negligible, but it is one of the small distinctions between the daily minimum temperature and the daily maximum temperature (Table 1), for which the official forecast is better calibrated (by 6%) than the MOS guidance.

A.3 Calibration as Mean

Of the 300 cases (Table A1), the official forecast bias is less than the MOS guidance bias in 135 cases (45%), while the reverse is true in 162 cases (54%). Thus, the MOS guidance is generally better calibrated as the mean of the predictand than the official forecast. Though not substantial, this difference of 9% for the daily minimum temperature is even slightly larger than the difference of 3% for the daily maximum temperature (Table 1).

In parallel to Table 3 for the daily maximum temperature, Table A2 compares the official forecast to the MOS guidance for the daily minimum temperature; the comparison is in terms of the bias, when both the forecast and the guidance are biased. The purpose of this comparison is to evaluate the goodness of the modifications through which the official forecast is derived from the MOS guidance under the hypothesis that the forecasters actually anchored their judgments on the MOS guidance. The most distressful are the cases wherein the forecasters switched the sign of the bias and increased the magnitude: 17 + 16 = 33. Next are the cases wherein the forecasters retained the sign of the bias but increased the magnitude: 77 + 51 = 128. All in all, as if tossing a coin, the forecasters worsened the bias in 54% of the cases, and reduced the bias in 45% of the cases. In general, this performance of the forecasters for the daily minimum temperature follows a pattern similar to that seen for the daily maximum temperature (Table 3).

A.4 Accuracy

In terms of the accuracy (Table A1), the MOS guidance is superior 239 times (80% of the cases), whereas the official forecast is superior just 57 times (19% of the cases). This difference of 61% for the daily minimum temperature is even larger than the difference of 47% for the daily maximum temperature (Table 1).

REFERENCES

Alexandridis, M.G., and R. Krzysztofowicz, 1982: Economic gains from probabilistic temperature forecasts. Proceedings of the International Symposium on Hydrometeorology, A.I. Johnson and R.A. Clark (eds.), American Water Resources Association, Bethesda, Maryland, 263–266.

Alexandridis, M.G., and R. Krzysztofowicz, 1985: Decision models for categorical and probabilistic weather forecasts. Applied Mathematics and Computation, 17, 241–266.

Baquet, A.E., A.N. Halter, and F.S. Conklin, 1976: The value of frost forecasting: A Bayesian appraisal. American Journal of Agricultural Economics, 58, 511–520.

Blackwell, D., 1951: Comparison of experiments. Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, J. Neyman (ed.), University of California Press, Berkeley, 93–102.

Blackwell, D., 1953: Equivalent comparisons of experiments. Annals of Mathematical Statistics, 24, 265–272.

Bosart, L.F., 2003: Whither the weather analysis and forecasting process? Weather and Forecasting, 18, 520–529.

Brooks, H.E., and C.A. Doswell III, 1996: A comparison of measures-oriented and distributions-oriented approaches to forecast verification. Weather and Forecasting, 11, 288–303.
Centers for Disease Control and Prevention, 2004: About extreme heat. Available online at http://www.bt.cdc.gov.

DeGroot, M.H., 1970: Optimal Statistical Decisions. McGraw-Hill, 490 pp.

Glahn, H.R., and D.A. Lowry, 1972: The use of model output statistics (MOS) in objective weather forecasting. Journal of Applied Meteorology, 11, 1203–1211.

Glahn, H.R., and D.P. Ruth, 2003: The new digital forecast database of the National Weather Service. Bulletin of the American Meteorological Society, 84, 195–201.

Krzysztofowicz, R., 1987: Markovian forecast processes. Journal of the American Statistical Association, 82, 31–37.

Krzysztofowicz, R., 1992: Bayesian correlation score: A utilitarian measure of forecast skill. Monthly Weather Review, 120, 208–219.

Krzysztofowicz, R., 1996: Sufficiency, informativeness, and value of forecasts. Proceedings, Workshop on the Evaluation of Space Weather Forecasts, Space Environment Center, NOAA, Boulder, Colorado, 103–112.

Krzysztofowicz, R., and A.A. Sigrest, 1999a: Calibration of probabilistic quantitative precipitation forecasts. Weather and Forecasting, 14, 427–442.

Krzysztofowicz, R., and A.A. Sigrest, 1999b: Comparative verification of guidance and local quantitative precipitation forecasts: Calibration analyses. Weather and Forecasting, 14, 443–454.

Krzysztofowicz, R., W.J. Drzal, T.R. Drake, J.C. Weyman, and L.A. Giordano, 1993: Probabilistic quantitative precipitation forecasts for river basins. Weather and Forecasting, 8, 424–439.

Murphy, A.H., 1981: Subjective quantification of uncertainty in weather forecasts in the United States. Meteorologische Rundschau, 34, 65–77.

Murphy, A.H., and R.L. Winkler, 1979: Probabilistic temperature forecasts: The case for an operational program. Bulletin of the American Meteorological Society, 60, 12–19.

Murphy, A.H., and H. Daan, 1984: Impacts of feedback and experience on the quality of subjective probability forecasts: Comparison of results from the first and second years of the Zierikzee experiment. Monthly Weather Review, 112, 413–423.

Murphy, A.H., B.G. Brown, and Y. Sheng Chen, 1989: Diagnostic verification of temperature forecasts. Weather and Forecasting, 4, 485–501.

National Weather Service, 2000: National Weather Service operations manual. Available online at http://www.nws.noaa.gov/wsom/manual/archives/NC750006.pdf.

Neill, J.J., and O.J. Dunn, 1975: Equality of dependent correlation coefficients. Biometrics, 31, 531–543.

Nichols, W.S., 1890: The mathematical elements in the estimation of the Signal Service reports. American Meteorological Journal, 6, 386–392.

Roebber, P.J., and L.F. Bosart, 1996: The complex relationship between forecast skill and forecast value: A real-world analysis. Weather and Forecasting, 11, 544–559.

Sakaori, F., 2002: A nonparametric test for the equality of dependent correlation coefficients under normality. Communications in Statistics: Theory and Methods, 31, 2379–2389.

Williams, E.J., 1959: The comparison of regression variables. Journal of the Royal Statistical Society, Series B, 21, 396–399.

Winkler, R.L., and A.H. Murphy, 1979: The use of probabilities in forecasts of maximum and minimum temperatures. Meteorological Magazine, 108, 317–329.

Zhu, Y., Z. Toth, R. Wobus, D. Richardson, and K. Mylne, 2002: The economic value of ensemble-based weather forecasts. Bulletin of the American Meteorological Society, 83, 73–83.
TABLES

Table 1. Overall results of the comparative verification of the MOS guidance and the official forecast at five stations, seven lead times, and twelve windows; predictand: daily maximum temperature. Counts are numbers of cases; percentages are of the 420 cases.

                             Any significance level    0.05 significance level
1. Informativeness
   MOS superior              332 (79%)                 135 (32%)
   Forecast superior          88 (21%)                  15 (4%)
2. Calibration as Median
   MOS superior              187 (45%)
   Forecast superior         214 (51%)
   MOS uncalibrated                                    136 (32%)
   Forecast uncalibrated                               153 (36%)
3. Calibration as Mean
   MOS superior              214 (51%)
   Forecast superior         201 (48%)
   MOS uncalibrated                                    129 (31%)
   Forecast uncalibrated                               156 (37%)
4. Accuracy
   MOS superior              306 (73%)                  48 (11%)
   Forecast superior         111 (26%)                   5 (1%)

Table 2. Season of consistently improved informativeness (CII) of the official forecast relative to the MOS guidance at each station; predictand: daily maximum temperature.

   Station   Lead times [h]   Months      No. of cases
   KSAV      24               Jan – Aug    8
   KPWM      24               Oct – Mar    6
   KFCA      —                —            0
   KSAT      24, 48           Dec – Jan    4
   KFAT      24, 48           Oct – Apr   14

Table 3. Comparison of the official forecast (O) to the MOS guidance (M) in terms of the bias, when both the forecast and the guidance are biased; predictand: daily maximum temperature. Percentages are of the 420 cases.

   Sign of B_M   Sign of B_O   Magnitude of bias   Count   Percentage
   −             −             Worsened             77      18%
   −             −             Improved             62      15%
   −             +             Worsened             32       8%
   −             +             Improved              9       2%
   +             −             Worsened             36       8%
   +             −             Improved             41      10%
   +             +             Worsened             66      16%
   +             +             Improved             88      21%
   Total                       Worsened            211      50%
   Total                       Improved            200      48%

Table 4. Comparison of the official forecast (O) to the MOS guidance (M) for extreme temperature events in terms of the mean absolute errors (MAE) and their difference ∆MAE = MAE_M − MAE_O; predictand: daily maximum temperature.

   Station   Lead time [h]   Sample size   MAE_M [°F]   MAE_O [°F]   ∆MAE [°F]   P-value from t-test
   KSAV      24              40            3.45          3.03          0.42       0.50
   KSAV      48              35            4.40          4.09          0.31       0.71
   KSAV      72              40            4.68          4.73         −0.05       0.95
   KSAV      96              39            4.77          5.95         −1.18       0.23
   KSAV      120             39            5.72          6.10         −0.38       0.75
   KSAV      144             39            6.64          7.67         −1.03       0.42
   KSAV      168             39            8.64          8.31          0.33       0.80
   KFCA      24              21            3.67          4.24         −0.57       0.58
   KFCA      48              21            3.24          5.00         −1.76       0.15
   KFCA      72              20            3.90          6.55         −2.65       0.06
   KFCA      96              20            4.70          5.35         −0.65       0.66
   KFCA      120             20            4.75          8.90         −4.15       0.04
   KFCA      144             20            7.35          9.70         −2.35       0.26
   KFCA      168             20            8.45         10.65         −2.20       0.28

Table A1. Overall results of the comparative verification of the MOS guidance and the official forecast at five stations, five lead times, and twelve windows; predictand: daily minimum temperature. Counts are numbers of cases; percentages are of the 300 cases.

                             Any significance level    0.05 significance level
1. Informativeness
   MOS superior              244 (81%)                 130 (43%)
   Forecast superior          56 (19%)                  10 (3%)
2. Calibration as Median
   MOS superior              145 (48%)
   Forecast superior         140 (47%)
   MOS uncalibrated                                    112 (37%)
   Forecast uncalibrated                               100 (33%)
3. Calibration as Mean
   MOS superior              162 (54%)
   Forecast superior         135 (45%)
   MOS uncalibrated                                    122 (41%)
   Forecast uncalibrated                               109 (36%)
4. Accuracy
   MOS superior              239 (80%)                  41 (14%)
   Forecast superior          57 (19%)                   7 (2%)

Table A2. Comparison of the official forecast (O) to the MOS guidance (M) in terms of the bias, when both the forecast and the guidance are biased; predictand: daily minimum temperature. Percentages are of the 300 cases.

   Sign of B_M   Sign of B_O   Magnitude of bias   Count   Percentage
   −             −             Worsened             77      26%
   −             −             Improved             54      18%
   −             +             Worsened             17       6%
   −             +             Improved             26       9%
   +             −             Worsened             16       5%
   +             −             Improved             17       6%
   +             +             Worsened             51      17%
   +             +             Improved             37      12%
   Total                       Worsened            161      54%
   Total                       Improved            134      45%

FIGURE CAPTIONS
Figure 1. Informativeness of the official forecast and the MOS guidance for five stations and three lead times; predictand: daily maximum temperature.

Figure 2. Average informativeness scores, $\overline{IS}_M$ and $\overline{IS}_O$, for each lead time, and the P-value from the t-test of the null hypothesis $\overline{IS}_M = \overline{IS}_O$ against the one-sided alternative hypothesis $\overline{IS}_M > \overline{IS}_O$; predictand: daily maximum temperature.

Figure 3. Calibration of the official forecast and the MOS guidance as the median of the predictand for five stations and three lead times; predictand: daily maximum temperature.

Figure 4. Average calibration scores, $\overline{CS}_M$ and $\overline{CS}_O$, for each lead time, and the P-value from the two-sided t-test of the null hypothesis $\overline{CS}_M = \overline{CS}_O$; predictand: daily maximum temperature.

Figure 5. Average bias magnitudes, $\overline{|B_M|}$ and $\overline{|B_O|}$, for each lead time, and the P-value from the two-sided t-test of the null hypothesis $\overline{|B_M|} = \overline{|B_O|}$; predictand: daily maximum temperature.

Figure 6. Scatter plots of the forecasts and observations of the daily maximum temperature on the days of extreme day-to-day changes; KFCA, lead time of 120 h.

FIGURES

[Figure 1 not reproducible in text: paired panels of monthly time series of IS_O and IS_M at KSAV, KPWM, KFCA, KSAT, and KFAT for lead times of 24, 96, and 168 h; axes: month versus informativeness score.]

[Figure 2 not reproducible in text: average informativeness score of the MOS guidance and the official forecast, and the P-value, versus lead time (24–168 h).]

[Figure 3 not reproducible in text: paired panels of monthly time series of F_O and F_M at the five stations for lead times of 24, 96, and 168 h; axes: month versus exceedance frequency.]
[Figure 4 not reproducible in text: average calibration score of the MOS guidance and the official forecast, and the P-value, versus lead time (24–168 h).]

[Figure 5 not reproducible in text: average bias magnitude [°F] of the MOS guidance and the official forecast, and the P-value, versus lead time (24–168 h).]

[Figure 6 not reproducible in text: three scatter plots [°F] for KFCA at the 120-h lead time: MOS guidance versus official forecast, official forecast versus observation (IS_O = 0.732), and MOS guidance versus observation (IS_M = 0.915).]