Performance of National Weather Service Forecasts Compared to Operational, Consensus, and Weighted Model Output Statistics

Jeff Baars
Cliff Mass

ABSTRACT

Model Output Statistics (MOS) have been a useful tool for forecasters for years and have shown improving forecast performance over time. A more recent advancement in the use of MOS is the application of "consensus" MOS (CMOS), an average of MOS from two or more models. CMOS has shown additional skill over individual MOS forecasts and has performed particularly well in comparison with human forecasters in forecasting contests. Versions of a weighted consensus MOS (WMOS) have been studied but have shown little improvement over simple averaging, although these WMOS's were rather simplistic in nature. This study compares MOS, CMOS, and WMOS forecasts of temperature and precipitation with the subjective forecasts of the National Weather Service (NWS). Data from 29 locations throughout the United States for the August 1, 2003 – August 1, 2004 period are used. MOS forecasts from the GFS (GMOS), Eta (EMOS), and NGM (NMOS) models are included, with CMOS being a consensus of these three models. WMOS is calculated using weights derived from a minimum variance method, with training periods that vary by station and variable. Performance is analyzed for various forecast periods, by region of the U.S., and by time of year. Additionally, performance is analyzed during periods of large daily temperature change and times of large departure from climatology. The results show that by most verification measures CMOS is competitive with or superior to the human forecasts at nearly all locations, and that WMOS is superior to the simple averaged CMOS. NWS performs well for short-range forecasts during times of temperature extremes.

1. INTRODUCTION

Since their advent in the 1970s, Model Output Statistics (MOS) have greatly improved the efficacy of numerical forecast models and have become an integral part of weather forecasting. MOS applies multiple linear regression to numerical model output to yield more accurate forecasts at points of interest. The increased accuracy results from the ability of MOS to correct model biases and to account for subgrid-scale differences between the model and the real world (e.g., topography). MOS has the added benefit of producing probabilistic as well as deterministic forecasts (Wilks 1995). Over time, MOS has shown steady improvement as models and MOS equations have improved. Dallavalle and Dagostaro (2004) showed that in recent years the forecasting skill of MOS is approaching the skill of National Weather Service (NWS) forecasters, particularly for longer forecast projections. As priorities continually change at the NWS, it is important to understand just when and how human forecasters perform better than models and MOS. Using this information, NWS forecasters can direct their efforts where they can be most beneficial.

It has long been recognized that a consensus of forecasters, be they human or machine, often performs quite well relative to the individual forecasters themselves. This benefit was initially noted in human forecasting contests (Sanders 1973; Bosart 1975; Gyakum 1986), and Vislocky and Fritsch (1995) first showed the increased skill of a consensus of MOS products. Subsequently, Vislocky and Fritsch (1997) demonstrated that a more advanced consensus MOS, using MOS, model output, and surface weather observations, performed well in a national forecasting competition.
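To make the regression step described above concrete, the following is a purely illustrative sketch in Python with NumPy. The predictors, data, and coefficients are synthetic placeholders and do not represent the operational MOS predictor sets or equations; the sketch only shows the least-squares fit of observations to model output that underlies MOS.

```python
import numpy as np

# Synthetic developmental sample (hypothetical predictors): model 2-m
# temperature, model dew point, and model wind speed at a station, with
# the observed MAX-T as the predictand.
rng = np.random.default_rng(0)
n_days = 365
model_t2m = 60.0 + 20.0 * rng.standard_normal(n_days)
model_td = model_t2m - 10.0 + 5.0 * rng.standard_normal(n_days)
model_wind = np.abs(10.0 + 5.0 * rng.standard_normal(n_days))
obs_maxt = (0.9 * model_t2m + 0.1 * model_td - 0.2 * model_wind
            + 5.0 + 3.0 * rng.standard_normal(n_days))

# MOS-style step: multiple linear regression of observations on model output.
X = np.column_stack([np.ones(n_days), model_t2m, model_td, model_wind])
coefs, *_ = np.linalg.lstsq(X, obs_maxt, rcond=None)

# Applying the fitted equation to new model output gives the statistically
# corrected point forecast (model bias and local effects partially accounted for).
new_model_output = np.array([1.0, 72.0, 60.0, 8.0])
mos_maxt = float(new_model_output @ coefs)
```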
Given that a simple average of different MOS predictions shows good forecast skill, and that performance varies among the individual MOS forecasts, it seems reasonable that a system of weighting the individual MOS's should improve on CMOS performance. Vislocky and Fritsch (1995), for example, utilized a simple weighting scheme in which NGM and LFM MOS were given fixed weights (one set for temperature and one for precipitation) across all stations, derived from a year of developmental data. This weighted consensus MOS showed "no meaningful improvement" over simple averaging. Several studies have documented an increase over time in the skill of both human forecasts (Murphy and Sabin 1986) and model output statistics (Dallavalle and Dagostaro 2004), and recent work indicates that MOS is approaching or matching the skill of human forecasters (Dallavalle and Dagostaro 2004). Thus, it is timely to compare the recent performance of MOS and NWS forecasters, and to determine the performance of combined and weighted MOS.

In addition to examining simple measures-based verification statistics such as mean absolute error (MAE), mean squared error (MSE), and bias, it is informative to investigate the circumstances under which a given forecast performs well or poorly (Brooks and Doswell 1996; Murphy and Winkler 1986). Statistics such as MAE often do not give a complete picture of a forecast's skill and can be quite misleading in assessing the overall quality of forecasts. For instance, one forecast may perform well on most days and have the best MAE, yet show poor performance during periods of large departure from climatology. Such periods may be of the most interest to certain users, such as agricultural interests or energy companies, if extreme weather conditions affect them severely.

Section 2 describes the data and quality control used in this study. Section 3 details the methods used in generating the statistics and in calculating WMOS. Results are presented in section 4, and section 5 summarizes and interprets them.

2. DATA

Daily forecasts of maximum temperature (MAX-T), minimum temperature (MIN-T), and probability of precipitation (POP) were gathered from July 1, 2003 – July ??, 2004 for 30 stations spread across the U.S. (Figure 1). Forecasts were taken from the NWS, the GFS MOS (GMOS), the Eta MOS (EMOS), and the NGM MOS (NMOS). Two average or "consensus" MOS's were also calculated, one from GMOS, EMOS, and NMOS (CMOS) and one from only GMOS and EMOS (CMOS_GE). In addition, a weighted MOS (WMOS) was calculated by combining the three MOS's based on their prior performance. Further discussion of how the consensus and weighted MOS's were calculated is given in the Methods section.

MOS data were gathered for the 0000 UTC model run cycle, and NWS forecast data were taken from the early morning (~1000 UTC, or about 0400 PST) forecast issued the following morning. This was done so that NWS forecasters would have access to the 0000 UTC model output and the corresponding MOS guidance. This gives some advantage to the NWS forecasters, who have additional time to observe the development of the current weather; and, of course, while the models do not have access to the NWS forecasts, the NWS is able to consider the MOS output in making its forecasts. Data were gathered for 48 hours, providing two MAX-T forecasts, two MIN-T forecasts, and four 12-hr POP forecasts.

During the study period, the Meteorological Development Laboratory (MDL) implemented changes to two of the MOS's used in the study.
On December 15, 2003, the Aviation MOS (AMOS) was phased out and became the Global Forecast System (GFS) MOS, or GMOS. Output from AMOS/GMOS was treated as a single MOS across the December 15, 2003 transition, as the model and equations are essentially the same; it is labeled "GMOS" throughout this paper. The second change occurred on February 17, 2004, when the EMOS equations were rederived from the higher-resolution archive of the Eta model (NWS 2004). There was also an attempt to correct a known warm bias in EMOS maximum temperatures (Paul Dallavalle, personal communication, January 23, 2004). As with the AMOS/GMOS change, EMOS was treated as one continuous model throughout the study.

Stations were chosen primarily to be at or near major weather forecast offices (WFOs) and to represent a wide range of geographical areas. The definitions of observed maximum and minimum temperatures follow the National Weather Service MOS definitions (Jensenius et al. 1993): maximum temperatures occur between 7 AM and 7 PM local time, and minimum temperatures occur between 7 PM and 8 AM local time. For POP, two forecasts per day were gathered: 0000 – 1200 UTC and 1200 – 0000 UTC. The NWS definitions of MAX-T, MIN-T, and POP are similar (Chris Hill, personal communication, July 17, 2003).

While quality control measures are implemented at the agencies from which the data were gathered, simple range checking was performed to ensure the quality of the data used in the analysis. Temperatures below -85F and above 140F were removed, POP values were required to be in the range of 0 to 100%, and quantitative precipitation amounts had to be in the range of 0.0 in to 25.0 in for a 12-hr period. On occasion, forecasts and/or observations were not available for a given time period, and these cases were removed from the analysis. In addition, absolute forecast minus observation differences, used in calculating many of the statistics in this study, were removed from the dataset if they exceeded 40F; differences this large were considered unrealistic and a sign of error in either the observations or the forecasts. A few periods of such large forecast-observation differences were found and were apparently due to erroneous MOS data.

The data set was analyzed to determine the percentage of days on which all forecasts and required verification observations were available; each station had complete data for about 85-90% of the days. On many days (more than 50%), at least one observation and/or forecast was missing from at least one station, making it impractical to remove a day entirely from the analysis whenever any data were missing. Therefore, only the individual missing data (and not the corresponding entire day) were removed from the analysis. For a date to be used for a given station and variable, each MOS forecast and the NWS forecast had to be present.

3. METHODS

CMOS was calculated by simple averaging of GMOS, EMOS, and NMOS for MAX-T, MIN-T, and POP. CMOS values were calculated only if two or more of the three MOS's were available and were considered missing otherwise. A second consensus (CMOS_GE) was also calculated using only GMOS and EMOS; both had to be available for CMOS_GE to be calculated. CMOS_GE was calculated in an attempt to improve upon the original CMOS by eliminating its consistently weakest member (NMOS).
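As a concrete illustration of the range checks and consensus rules just described, a minimal sketch in Python with NumPy is given below. The array layout and the use of NaN for missing values are assumptions for illustration, not a description of the study's actual processing code.

```python
import numpy as np

def range_check(temp_f=None, pop_pct=None, qpf_in=None):
    """Apply the simple range checks used in this study; out-of-range
    values are flagged as missing (NaN)."""
    if temp_f is not None:
        temp_f = np.where((temp_f < -85.0) | (temp_f > 140.0), np.nan, temp_f)
    if pop_pct is not None:
        pop_pct = np.where((pop_pct < 0.0) | (pop_pct > 100.0), np.nan, pop_pct)
    if qpf_in is not None:
        qpf_in = np.where((qpf_in < 0.0) | (qpf_in > 25.0), np.nan, qpf_in)
    return temp_f, pop_pct, qpf_in

def consensus(gmos, emos, nmos):
    """CMOS: average of the available members, set to missing if fewer than
    two of the three MOS's are present.  CMOS_GE: average of GMOS and EMOS,
    requiring both."""
    members = np.stack([gmos, emos, nmos])
    n_avail = np.sum(~np.isnan(members), axis=0)
    member_sum = np.nansum(members, axis=0)
    cmos = np.where(n_avail >= 2, member_sum / np.maximum(n_avail, 1), np.nan)
    both_present = ~np.isnan(gmos) & ~np.isnan(emos)
    cmos_ge = np.where(both_present, (gmos + emos) / 2.0, np.nan)
    return cmos, cmos_ge
```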
WMOS was calculated using minimum variance estimated weights (Daley 1991). Weights were determined for a set training period by blending the MOS outputs so as to minimize the mean squared error of the combination, using

w_n = \frac{1/\sigma_n^2}{\sum_{m=1}^{3} 1/\sigma_m^2},

where \sigma_n^2 is the mean squared error of model n (GMOS, EMOS, or NMOS) over the training period. WMOS was calculated for each station and variable using training periods of 10, 15, 20, 25, and 30 days. The training period that produced the minimum mean squared error over July 1, 2003 – July 1, 2004 for each station and variable was then saved and used in the final WMOS calculation. With eight variables (two MAX-T's, two MIN-T's, and four POP forecasts) and 30 stations, 240 sets of weights were stored. The average weights across all stations are given in Table 1. NMOS has the lowest weights of the three MOS's for all variables, and GMOS generally has the highest. A typical time series of weights for the period 1 MAX-T at one station (KBNA, Nashville, TN) is given in Figure 2; the training period for this station and variable is 30 days. GMOS has high weights during October and November 2003, with a more even blend of weights for the remainder of the year. Each model shows higher weights than the others for periods of time, and the weights for the three MOS's mostly fall between 0.2 and 0.5.

Variable     GMOS    EMOS    NMOS
MAX-T pd1    0.368   0.322   0.310
MAX-T pd2    0.374   0.324   0.303
MIN-T pd1    0.373   0.334   0.294
MIN-T pd2    0.394   0.329   0.278
PRC1         0.332   0.372   0.279
PRC2         0.346   0.342   0.280
PRC3         0.358   0.355   0.274
PRC4         0.355   0.333   0.291

Table 1. Weights used in WMOS, averaged over all 30 stations.

Bias, or mean error, is defined as

Bias = \frac{1}{n} \sum_{i=1}^{n} (f_i - o_i),

where f is the forecast, o is the observation, and n is the total number of forecast minus observation pairs. MAE is defined as

MAE = \frac{1}{n} \sum_{i=1}^{n} |f_i - o_i|.

Precipitation observations were converted to binary rain/no-rain values (with trace amounts treated as no rain), which are needed for calculating Brier Scores. The Brier Score is defined as

BS = \frac{1}{n} \sum_{i=1}^{n} (f_i - o_i)^2,

where f is the forecast probability of rain and o is the binary rain/no-rain observation. Brier Scores range from 0.0 (perfect forecasting) to 1.0 (worst possible forecasting).

To better understand the circumstances under which each forecast performed well or poorly, a type of "distributions-based" verification was performed (Brooks and Doswell 1996; Murphy and Winkler 1986). One circumstance examined was periods of large one-day temperature change, which are expected to be more challenging for both the forecaster and the models. To determine forecast skill during these periods, forecast minus observation temperature differences were gathered for days on which the observed MAX-T or MIN-T changed by 10F or more from the previous day. MAE's and biases were then calculated from these differences for each forecast and forecast variable.
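A minimal sketch of the weight computation and training-period selection described above is given below (Python with NumPy). The function and variable names are hypothetical, and missing-data handling is omitted for brevity.

```python
import numpy as np

def inverse_mse_weights(errors):
    """Minimum-variance weights for one training window.

    errors: array of shape (n_models, n_days) holding forecast minus
    observation differences for GMOS, EMOS, and NMOS over the training period.
    """
    mse = np.mean(errors ** 2, axis=1)        # sigma_n^2 for each model
    return (1.0 / mse) / np.sum(1.0 / mse)    # w_n = (1/sigma_n^2) / sum(1/sigma_m^2)

def wmos_series(forecasts, obs, train_len):
    """Weighted consensus for each day, using the preceding train_len days
    as the training period.

    forecasts: (n_models, n_days) MOS forecasts; obs: (n_days,) observations.
    """
    n_models, n_days = forecasts.shape
    wmos = np.full(n_days, np.nan)
    for t in range(train_len, n_days):
        train_err = forecasts[:, t - train_len:t] - obs[t - train_len:t]
        w = inverse_mse_weights(train_err)
        wmos[t] = np.dot(w, forecasts[:, t])
    return wmos

def best_training_length(forecasts, obs, candidates=(10, 15, 20, 25, 30)):
    """Select the training length giving the lowest WMOS mean squared error
    over the full period, as done separately for each station and variable."""
    scores = {}
    for n in candidates:
        wmos = wmos_series(forecasts, obs, n)
        valid = ~np.isnan(wmos)
        scores[n] = np.mean((wmos[valid] - obs[valid]) ** 2)
    return min(scores, key=scores.get)
```

In the study, this selection would be repeated for each of the 30 stations and eight forecast variables, yielding the station- and variable-specific training periods whose resulting weights are summarized in Table 1.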
This study also examined periods when observed temperatures departed significantly from climatology, since it is hypothesized that human forecasters might have an advantage during such times. Days showing a large departure from climatology were identified using daily average maximum and minimum temperature data from the National Climatic Data Center for the 1971-2000 period, with monthly average maximum and minimum temperatures interpolated to each date. A large departure from climatology was defined as a departure of 20F or more.

As another measure of forecast quality, the number of days on which each forecast was the most or least accurate was determined. A given forecast may have a low MAE for a forecast variable yet rarely be the most accurate forecast. Conversely, a given forecast may be the most accurate on more days than the other forecasts yet have a worse average error owing to infrequent large errors. This type of information remains hidden when only standard verification measures such as MAE are considered. The smallest forecast error was determined for each day, with ties counting toward each forecast's total.

4. RESULTS

4.1 Temperature

Total MAE scores were calculated using all stations, both MAX-T and MIN-T, for all forecast periods available (Table 2). WMOS has the lowest total MAE, followed by CMOS_GE, CMOS, NWS, GMOS, EMOS, and NMOS. MAE's are notably lower than those found by Vislocky and Fritsch (1995), presumably due in part to more than ten years of model improvement. The National Weather Service's National Verification Program, using data from 2003, reports total MAE's for GMOS, MMOS, and NMOS similar to those shown here (Taylor and Stram 2003).

Forecast    Total MAE (F)
NWS         2.54
CMOS        2.50
CMOS_GE     2.49
WMOS        2.41
GMOS        2.64
EMOS        2.84
NMOS        2.97

Table 2. Total MAE for each forecast type, August 1, 2003 – August 1, 2004. Totals include data for all stations, all forecast periods, and both MAX-T and MIN-T.

Figure 3 shows the distribution of the absolute forecast errors over all stations and the entire period of the study. Bins are 1F wide, with points plotted at the bin centers (0.5F, 1.5F, and so on). The similarity of the error distributions among the forecasts is notable; subtle differences in the curves account for the differences in the mean error statistics. NWS, CMOS, CMOS_GE, and WMOS are all very similar to each other, while the individual MOS's show some separation in the number of occurrences of small differences (< 2F) and larger differences (> 4F).

Figure 4 shows MAE by MAX-T and MIN-T for each of the forecast periods. Period 1 ("pd1") is the day 1 MAX-T, period 2 ("pd2") is the day 2 MIN-T, period 3 ("pd3") is the day 2 MAX-T, and period 4 ("pd4") is the day 3 MIN-T. WMOS has the lowest MAE's over all periods. NWS has a lower MAE than both CMOS and CMOS_GE for the period 1 MAX-T, but CMOS and CMOS_GE have similar MAE's for the period 3 MAX-T and lower MAE's for both MIN-T's. The individual MOS's have higher MAE's than WMOS, NWS, CMOS, and CMOS_GE for all periods, and GMOS has the lowest MAE's of the individual MOS's.

To determine forecast skill during periods of large temperature change, MAE's were calculated for days having a 10F or greater change in MAX-T or MIN-T from the previous day. Results of these calculations are shown in Figure 5. MAE's for the seven forecast types are about 1.0 to 1.5F higher than the MAE's for all times (Fig. 4). WMOS again has the lowest MAE for all periods except, interestingly, the period 4 MIN-T, for which GMOS shows the lowest MAE. NWS has lower MAE's than CMOS for all periods but similar or higher MAE's than CMOS_GE.
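The conditional subsets used in the comparisons of Figures 5 and 6 can be formed as sketched below (Python with NumPy). The variable names and the source of the interpolated daily climatology are placeholders.

```python
import numpy as np

def mae(forecast, obs):
    """Mean absolute error, ignoring missing (NaN) pairs."""
    return float(np.nanmean(np.abs(forecast - obs)))

def large_change_days(obs, threshold_f=10.0):
    """Mask of days whose observed value changed by at least threshold_f (F)
    from the previous day; the first day is excluded."""
    change = np.abs(np.diff(obs, prepend=np.nan))
    return change >= threshold_f

def climo_departure_days(obs, daily_climo, threshold_f=20.0):
    """Mask of days departing from the interpolated daily climatological
    value by at least threshold_f (F)."""
    return np.abs(obs - daily_climo) >= threshold_f

# Example usage for one forecast source and one station (hypothetical arrays):
# mask = large_change_days(obs_maxt)
# print(mae(nws_maxt[mask], obs_maxt[mask]))
```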
MAE's were also calculated for days on which observed temperatures departed by more than 20F from the daily climatological value (see the Methods section). Results from these calculations are given in Figure 6. NWS shows considerably higher skill (lower MAE) relative to the other forecasts for the period 1 MAX-T, with MAE's about 0.5F lower than those of the next best forecast, WMOS. For the period 2 MIN-T and period 3 MAX-T, CMOS_GE performs best, and the poor performance of NMOS apparently increases the MAE's of CMOS and even of WMOS. The weighting scheme in WMOS appears unable to compensate sufficiently during times of large departure from climatology for these variables. As was seen in the one-day temperature change statistics, GMOS performs best of all forecasts for the period 4 MIN-T.

Figure 7 shows the number of days that each forecast was the most accurate (i.e., had the smallest absolute forecast minus observation difference). When two or more forecasts tied for the most accurate result, each was credited with the day. For the period 1 MAX-T, NWS was most frequently the best forecast, with over 50% more days than CMOS, about 40% more days than CMOS_GE and WMOS, and about 15% more best-forecast days than the individual MOS's. For the remaining periods, the individual MOS's are more comparable to NWS, and GMOS actually shows more days with the most accurate forecast for the period 4 MIN-T. The consensus and weighted MOS forecasts show the fewest days with the most accurate forecast for all periods and variables. Similar results for this type of statistic for a consensus MOS have been reported previously (Wilk 1998).

Figure 8 shows the number of days each forecast was the least accurate. Again, when two or more forecasts tied for the least accurate result, each received one day toward its total. NWS and the individual MOS's (GMOS, EMOS, and NMOS) all have higher totals of least accurate days, just as they did for most accurate days, revealing the conservative nature of the CMOS, CMOS_GE, and WMOS forecasts. The averaging in these forecasts tends to eliminate extremes, so that they are seldom the most or least accurate forecast relative to NWS and the individual MOS's. CMOS_GE has notably more least-accurate days than CMOS or WMOS because it averages over only two models. NWS has considerably fewer least-accurate forecasts than the individual MOS's for most periods.
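The day counts shown in Figures 7 and 8 can be tallied as in the short sketch below (Python with NumPy). Exact equality is used to identify ties, which is an assumption about how ties were detected.

```python
import numpy as np

def best_worst_counts(abs_errors):
    """Count, for each forecast source, the days on which it had the smallest
    and the largest absolute error.  Ties are credited to every forecast
    sharing the extreme value, as in the study.

    abs_errors: array of shape (n_forecasts, n_days) of |forecast - obs|.
    """
    day_min = np.nanmin(abs_errors, axis=0)
    day_max = np.nanmax(abs_errors, axis=0)
    most_accurate = np.sum(abs_errors == day_min, axis=1)
    least_accurate = np.sum(abs_errors == day_max, axis=1)
    return most_accurate, least_accurate
```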
Figure 9 shows a time series of bias over all stations for August 1, 2003 – August 1, 2004 for NWS, CMOS, and WMOS. The correlation among the three forecast biases is quite evident, with a pronounced positive (warm) bias in CMOS and WMOS through the 2003-2004 cold season. This presumably reflects, to some degree, the use of the models and of MOS by forecasters, and perhaps their knowledge of biases within the MOS.

Figure 10 compares the performance of the various NWS forecast offices with that of the MOS forecasts, showing MAE's for the period 1 MAX-T for August 1, 2003 – August 1, 2004. The stations are sorted by broad geographic region, starting in the West and moving through the Intermountain West and Southwest, the Southern Plains, the Southeast, the Midwest, and the Northeast (see the map in Figure 1). MAE's vary more from location to location in the western U.S., while MAE's in the northeastern U.S. are more uniform. Not surprisingly, MIA (Miami, FL) shows the lowest MAE of all stations, and LAS (Las Vegas, NV) and PHX (Phoenix, AZ) also have low MAE's. Higher MAE's are seen at individual stations in the West (Missoula, MSO, and Denver, DEN).

Figure 11 shows biases for the period 1 MAX-T for each of the 30 stations in the study for August 1, 2003 – August 1, 2004. A prominent feature is the positive (warm) biases through much of the western U.S., particularly in the Intermountain West and Southwest, extending through the Southern Plains and into the South. Small negative (cool) biases are observed in much of the Midwest. NWS forecasts have biases that are more than 0.5F smaller than those of CMOS at stations such as MSO, SLC, DEN, PHX, ABQ, and BNO.

4.2 Precipitation

Total Brier Scores for the study period for the seven forecasts, over all stations and forecast periods, are given in Table 6. The scores do not vary greatly from each other. WMOS and CMOS_GE show the lowest scores, however, followed by CMOS, GMOS, NWS and EMOS, and NMOS. Although WMOS should give low weight to the poorly performing NMOS, WMOS scores are probably still affected by the use of this MOS; as a result, its scores are the same as those of CMOS_GE.

Forecast    Total Brier Score
NWS         0.096
CMOS        0.092
CMOS_GE     0.091
WMOS        0.091
GMOS        0.094
EMOS        0.096
NMOS        0.105

Table 6. Total Brier Scores for August 1, 2003 – August 1, 2004. Totals include data for all stations and all forecast periods.

Figure 12 shows the distribution of the forecast POP minus binary observed precipitation (1 for rain, 0 for no rain) over all stations and the entire period of record for forecast period 1. Bins are 0.10 wide, centered every 0.10. The forecasts report POP differently: the MOS's have a resolution of 1%, while NWS generally reports only to the nearest 10% (there were occurrences of 5% POP forecasts in the NWS data as well). Figure 12 can be viewed in two parts, one for occurrences of rain (less than 0 on the x-axis) and one for periods of no rain (greater than 0 on the x-axis). For rain cases, there is a relatively uniform distribution from -0.7 to -0.2 for all forecasts, with a small maximum near -0.5 to -0.4 (POP forecasts of 50-60%). During no-rain occasions (again, greater than 0 on the x-axis), more separation can be seen between the forecasts. NWS shows a higher frequency of 30% POP forecasts, which can be seen at -0.7 and at 0.3, where there is more separation between the NWS curve and those of the other forecasts. NWS has a lower frequency in the 10% bin during no-rain cases relative to the other forecasts, and a higher frequency of 0% POP forecasts.

Figure 13 shows the distribution of the forecast POP minus binary observed precipitation (1 for rain, 0 for no rain), normalized by each forecast's total error. This plot shows the situations contributing to the total error of each forecast. Only CMOS, WMOS, and NWS are shown so that each can be seen more clearly. The preponderance of 30% POP forecasts by NWS is seen again, as in Figure 12; these produce the two maxima in the NWS curve. Much of the error for CMOS and WMOS occurs when no rain is observed and a small POP is forecast: CMOS can only be zero if the average of all three individual MOS's is zero, and WMOS can only be zero if the weighted average of the MOS's is zero, leading to many cases in which these POP forecasts are small but nonzero, a reflection of their lack of sharpness. All three forecasts tend to incur most of their error from POP forecasts of 20-40%, which is reasonable, as these values approach the climatological frequency at many stations. NWS incurs more error than CMOS and WMOS from forecasts of 100% POP in no-rain situations (1.0 on the x-axis) and 0% POP in rain situations (-1.0 on the x-axis).
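The precipitation verification quantities behind Table 6 and Figures 12 and 13 can be computed as in the sketch below (Python with NumPy). The trace threshold used to convert observations to binary rain/no-rain is an assumed value, and the per-forecast normalization used in Figure 13 is not reproduced here.

```python
import numpy as np

def brier_score(pop_forecast, precip_obs_in, trace_in=0.01):
    """Brier Score.  POP forecasts are probabilities in [0, 1]; observed
    precipitation amounts below trace_in (assumed trace threshold, inches)
    are treated as no rain."""
    binary_obs = (precip_obs_in >= trace_in).astype(float)
    return float(np.mean((pop_forecast - binary_obs) ** 2))

def signed_pop_error_histogram(pop_forecast, precip_obs_in, trace_in=0.01):
    """Histogram of forecast POP minus binary observed precipitation in
    0.1-wide bins centered on -1.0, -0.9, ..., 1.0, as in Figures 12 and 13."""
    binary_obs = (precip_obs_in >= trace_in).astype(float)
    err = pop_forecast - binary_obs
    edges = np.arange(-1.05, 1.10, 0.1)
    counts, _ = np.histogram(err, bins=edges)
    centers = (edges[:-1] + edges[1:]) / 2.0
    return centers, counts
```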
Figure 13 shows Brier Scores for each of the four 12-hr precipitation forecast periods. WMOS has the lowest (most skillful) scores for all periods, and there is a substantial increase in Brier Scores with forecast projection for all forecasts. CMOS_GE has the second lowest Brier Scores, followed by CMOS, NWS, and the individual MOS's. NMOS has particularly poor Brier Scores relative to the other forecasts.

Figure 14 shows a time series of Brier Scores for all stations in the data set for period 1. A very high correlation between CMOS and NWS is seen with time, far more so than for temperature. An increase in Brier Scores is seen in the warm seasons, when convection is more prominent. Comparing Figure 14 with Figure 9 clearly shows that humans add less skill to precipitation forecasts than to temperature forecasts.

Figure 15 shows Brier Scores by station. Low-precipitation locations, such as LAS, PHX, and ABQ, show the lowest (most skillful) Brier Scores. MIA shows the highest Brier Scores, owing to the considerable convective precipitation at that location. CMOS shows the greatest skill improvement over NWS at some Midwest locations such as STL, ORD, and IND. NWS outperforms CMOS at a few locations in the South and at select stations in the Intermountain West.

5. CONCLUSIONS

This study compared the skill of National Weather Service forecasts throughout the United States with the predictions of individual, consensus, and weighted Model Output Statistics. As with model ensemble averaging, increased skill is obtained by averaging MOS from several models. The removal of the weakest member from a consensus of forecasts also shows some increase in skill for some forecast variables. A simple weighted-average consensus shows a further increase in skill in terms of MAE and Brier Score. The use of training periods determined specifically for each station and forecast variable appears to give more favorable results than were seen for weighted consensus MOS in other studies.

CMOS shows forecast performance equal or superior, in terms of overall MAE's and Brier Scores, to that of the NWS and of the individual MOS's. WMOS shows forecast performance superior to that of CMOS. NWS forecasts add less value to the individual MOS's for precipitation than for temperature, as even GMOS outperforms NWS for precipitation for all but the first 12-hr forecast. Comparison of time series of NWS and CMOS MAE's and biases shows the apparently extensive use of MOS by forecasters, as well as an awareness by forecasters of seasonal temperature biases in the models. Regional variations in MAE and bias are also apparent in the data.

Further investigation shows that NWS forecasters perform particularly well for short-term forecasts during periods when temperatures show a large departure from climatology. During periods of large one-day temperature change, the consensus MOS's and WMOS are competitive with or have lower MAE's than NWS. Counting the total number of days on which each forecast produced the most or least accurate temperatures revealed that the NWS predictions are less conservative than the MOS forecasts, in that NWS has the most days on which it was the most accurate as well as the least accurate. The consensus and weighted MOS's have the fewest days of being either the most or the least accurate forecast. Of course, the consensus and weighted MOS's can only have a most or least accurate day in a "tie" situation, when all forecasts (MOS and NWS) agree.
Results presented here suggest that, at the current stage of model development, human forecasters are best served by directing their effort toward short-term weather forecasting, particularly during times of temperature extremes. It is very difficult to outperform a consensus of MOS's in the short term during other periods, and it is difficult even to outperform the better of the individual MOS's (e.g., GMOS) in longer-term forecasts. Given the varying error distributions among the forecasts, the question arises as to whether a lower long-term MAE is the end goal of a forecast or model, or whether more days of higher accuracy is the desired result. Obviously, it is the end user who determines the answer to this question, and the answer varies from user to user. It is difficult to imagine, however, a user who requires slightly better forecasts on most days at the cost of more high-error days; perhaps if more were known about when and where these high-error days occur, such a user would be willing to accept this type of forecast. Conversely, it is easy to imagine a user concerned with short-term forecasts during extreme weather conditions. National Weather Service and human forecasts thus still fill a role in forecasting for arguably the most important forecast situations (during extremes). In one sense, however, advancements in numerical models and MOS have caught up with human forecasters for longer-term forecasts and during periods of more "regular" weather.

6. REFERENCES

Bosart, L. F., 2003: Whither the weather analysis and forecasting process? Wea. Forecasting, 18, 520-529.

Brooks, H. E., and C. A. Doswell, 1996: A comparison of measures-oriented and distributions-oriented approaches to forecast verification. Wea. Forecasting, 11, 288-303.

Brooks, H. E., A. Witt, and M. D. Eilts, 1997: Verification of public weather forecasts available via the media. Bull. Amer. Meteor. Soc., 78, 2167-2177.

Daley, R., 1991: Atmospheric Data Analysis. Cambridge University Press, 457 pp.

Dallavalle, J. P., and V. J. Dagostaro, 2004: Objective interpretation of numerical weather prediction model output – a perspective based on verification of temperature and precipitation guidance. Symposium on the 50th Anniversary of Operational Numerical Weather Prediction, June 14-17, 2004, College Park, MD.

Jensenius, J. S., Jr., J. P. Dallavalle, and S. A. Gilbert, 1993: The MRF-based statistical guidance message. NWS Technical Procedures Bulletin No. 411, NOAA, U.S. Dept. of Commerce, 11 pp.

Jolliffe, I. T., and D. B. Stephenson, 2003: Forecast Verification: A Practitioner's Guide in Atmospheric Science. John Wiley & Sons, 240 pp.

Murphy, A. H., and R. L. Winkler, 1986: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330-1338.

Murphy, A. H., and T. E. Sabin, 1986: Trends in the quality of National Weather Service forecasts. Wea. Forecasting, 1, 42-55.

NWS, 2004: http://nws.noaa.gov/om/notifications/tin03-48eta_mos_aaa.txt

Taylor, A., and M. Stram, 2003: MOS verification. Retrieved August 11, 2004 from the NWS Meteorological Development Laboratory's MOS verification webpage: http://www.nws.noaa.gov/mdl/verif/.

Vislocky, R. L., and J. M. Fritsch, 1997: Performance of an advanced MOS system in the 1996-97 National Collegiate Weather Forecasting Contest. Bull. Amer. Meteor. Soc., 78, 2851-2857.
Vislocky, R. L., and J. M. Fritsch, 1995: Improved model output statistics forecasts through model consensus. Bull. Amer. Meteor. Soc., 76, 1157-1164.

Wilk, G. E., 1998: Intercomparisons among the NGM-MOS, AVN-MOS, and consensus temperature forecasts for West Texas. Retrieved August 16, 2004 from the NWS Southern Region Headquarters webpage: http://www.srh.noaa.gov/topics/attach/html/ssd98-39.htm.

Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, 467 pp.

Figure 1. Map of the U.S. showing the National Weather Service locations used in the study.

Figure 2. Time series of the weights used for each of the three MOS's in WMOS for station BNA over the study period.

Figure 3. Distribution of the absolute value of forecast minus observation for all stations and all time periods, August 1, 2003 – August 1, 2004. Bins are 1F wide, with data plotted at the center of each bin.

Figure 4. MAE (F) for the seven forecasts for all stations and all time periods, August 1, 2003 – August 1, 2004.

Figure 5. Total MAE for each forecast during periods of large temperature change (10F over 24 hr; see text), August 1, 2003 – August 1, 2004. Totals include data for all stations.

Figure 6. Total MAE for each forecast type during periods of large departure (20F) from the daily climatological value, August 1, 2003 – August 1, 2004. Totals include data for all stations.

Figure 7. Number of days each forecast is the most accurate, all stations, August 1, 2003 – August 1, 2004. Ties count as one day for each forecast.

Figure 8. Number of days each forecast is the least accurate, all stations, August 1, 2003 – August 1, 2004. Ties count as one day for each forecast.

Figure 9. Time series of bias in MAX-T for period 1 for all stations, August 1, 2003 – August 1, 2004. The mean temperature over all stations is shown with a dotted line. 3-day smoothing is applied to the data.

Figure 10. MAE for all stations, August 1, 2003 – August 1, 2004, sorted by geographic region.

Figure 11. Bias for all stations, August 1, 2003 – August 1, 2004, sorted by geographic region.

Figure 12. Forecast POP minus binary observed precipitation for all stations and all time periods, July 1, 2003 – July ??, 2004. Bins are 10% in size, with data plotted at the center of each bin.

Figure 13. Distribution of forecast POP minus binary observed precipitation, normalized by each forecast's total error.

Figure 13. Brier Scores for the seven forecasts for all stations and all time periods, August 1, 2003 – August 1, 2004, separated by forecast period.

Figure 14. Time series of Brier Score for period 1 for all stations, August 1, 2003 – August 1, 2004. 3-day smoothing is applied to the data.

Figure 15. Brier Score for all stations, August 1, 2003 – August 1, 2004, sorted by geographic region.