Performance of National Weather Service Forecasts Compared to Operational,
Consensus, and Weighted Model Output Statistics
Jeff Baars
Cliff Mass
ABSTRACT
Model Output Statistics (MOS) have been a useful tool for forecasters for years and have shown
improving forecast performance over time. A more recent advancement in the use of MOS is
the application of "consensus" MOS (CMOS), which is an average of MOS from two or more
models. CMOS has shown additional skill over individual MOS forecasts and has performed
particularly well in comparison with human forecasters in forecasting contests. Versions of a
weighted consensus MOS (WMOS) have been studied but have shown little improvement over
simple averaging, although these WMOS’s were rather simplistic in nature. This study compares
MOS, CMOS, and WMOS forecasts for temperature and precipitation to those of the National
Weather Service (NWS) subjective forecasts. Data from 30 locations throughout the United States for the August 1, 2003 – August 1, 2004 time period are used. MOS forecasts from the
GFS (GMOS), Eta (EMOS) and NGM (NMOS) models are included, with CMOS being a
consensus of these three models. WMOS is calculated using weights determined by a minimum variance method, with training periods that vary by station and variable. Performance
is analyzed at various forecast periods, by region of the U.S., and by time/season. Additionally,
performance is analyzed during periods of large daily temperature changes and times of large
departure from climatology. The results show that, by most verification methods, CMOS is competitive with or superior to human forecasts at nearly all locations and that WMOS is superior to the simple-average CMOS. NWS performs well for short-range forecasts during times of
temperature extremes.
1. INTRODUCTION
Since their advent in the 1970s, Model Output Statistics (MOS) have greatly improved the
efficacy of numerical forecast models, and have become an integral part of weather forecasting.
MOS is a large multiple regression on numerical model output which yields a more accurate
forecast at a point of interest. The increased accuracy results from MOS correcting model biases and accounting for subgrid-scale differences between the model and the real world (e.g., topography). MOS has the added benefit of producing probabilistic forecasts as
well as deterministic forecasts (Wilks 1995).
Over time, MOS has shown steady improvement as models and MOS equations have
improved. Dallavalle and Dagostaro (2004) showed that in recent years the forecasting skill of
MOS is actually approaching the skill of National Weather Service (NWS) forecasters,
particularly for longer forecast projections. As priorities continually change at the NWS, it is
important to understand just when and how human forecasters perform better than models and
MOS. Using this information, forecasters at the NWS can allocate their skills where they are most beneficial.
It has long been recognized that a consensus of forecasters, be it human or machine, often
performs quite well relative to the individual forecasters themselves. Initially noted in human
forecasting contests (Sanders 1973; Bosart 1975; Gyakum 1986), Vislocky and Fritsch (1995)
first showed the increased skill from a consensus of MOS products. Subsequently, Vislocky and
Fritsch (1997) demonstrated that a more advanced consensus MOS using MOS, model output,
and surface weather observations, performed well in a national forecasting competition.
Given that a simple average of different MOS predictions shows good forecast skill, and that
performance varies between the individual MOS forecasts, it seems obvious that a system of
weighting the individual MOS’s should show improvement over CMOS performance. Vislocky
and Fritsch (1995), for example, utilized a simple weighting scheme in which NGM and LFM MOS
were given the same weights (one for temperature and one for precipitation) across all stations
using a year of developmental data. This weighted consensus MOS showed “no meaningful
improvement” over simple averaging.
Several studies have shown an increase over time in the skill of both human forecasts (Murphy and Sabin 1986) and model output statistics (Dallavalle and Dagostaro 2004). Recent work has shown that MOS is approaching or matching the skill of human forecasters (Dallavalle and Dagostaro 2004). Thus, it is timely to compare the recent performance of MOS and NWS forecasters, and to determine the performance of combined and weighted MOS.
In addition to examining simple measures-based verification statistics such as mean
absolute error (MAE), mean squared error (MSE), and bias, it is informative to investigate the
circumstances under which a given forecast performs well or performs poorly (Brooks and
Doswell 1996; Murphy and Winkler 1986). Statistics such as MAE often do not give a complete
picture of a forecast’s skill and can be quite misleading in assessing the overall quality of
forecasts. For instance, one forecast may perform well on most days and have the best MAE
compared to others, but show poor performance during periods of large departure from
climatology. These periods may be of the most interest to a given user, such as agricultural
interests or energy companies, if extreme weather conditions affect them severely.
Section 2 describes the data and quality control procedures used in this study. Section 3
details the methods used in generating statistics, and how WMOS is calculated. Results are
shown in section 4, and Section 5 summarizes and interprets the results.
2. DATA
Daily forecasts of maximum temperature (MAX-T), minimum temperature (MIN-T) and
probability of precipitation (POPs) were gathered from July 1, 2003 – July ??, 2004 for 30
stations spread across the U.S. (Figure 1). Forecasts were taken from the NWS, GFS MOS
(GMOS), Eta MOS (EMOS), and the NGM MOS (NMOS). Two average or “consensus” MOS’s
were also calculated, one from GMOS, EMOS and NMOS (CMOS) and one from only the
GMOS and EMOS (CMOS_GE). Also, a weighted MOS (WMOS) was calculated by combining
the three MOSs based on their prior performance. Further discussion on how the consensus
and weighted MOS’s were calculated is given in the Methods section.
MOS data was gathered for the 0000 UTC model run cycle, and NWS forecast data was
gathered from the early morning (~1000 UTC or about 0400 PST) forecast issued the following
morning. This was done so that NWS forecasters would have access to the 0000 UTC
model output and corresponding MOS data. This gives some advantages to the NWS
forecasters, who have additional time to observe the development of the current weather. And,
of course, the models do not have access to NWS forecasts, but the NWS is able to consider
the MOS output in making their forecasts. Forecasts were gathered out to 48 hours, providing two
MAX-T forecasts, two MIN-T forecasts, and four 12-hr POP forecasts.
During the study period, the Meteorological Development Laboratory (MDL) implemented
changes to two of the MOS’s used in the study. On December 15, 2003, Aviation MOS was
phased out and became the Global Forecast System (GFS) MOS, or GMOS. Output from the
AMOS/GMOS was treated as one MOS across the December 15, 2003 boundary, as the model and equations are essentially the same. This MOS is labeled "GMOS" throughout
this paper. The second change occurred on February 17, 2004 when EMOS equations were
changed, being derived from the higher-resolution archive of the Eta model (NWS 2004). There
was also an attempt to correct a known warm bias in EMOS maximum temperatures as well
(Paul Dallavalle, personal communication, January 23, 2004). As with the AMOS/GMOS
change, EMOS was treated as one continuous model throughout the study.
Stations were primarily chosen to be at or near major weather forecast offices (WFOs) and
to represent a wide range of geographical areas.
The definitions of observed maximum and minimum temperatures follow the National
Weather Service MOS definitions (Jensenius et al. 1993): maximum temperatures occur from 7 AM through 7 PM local time and minimum temperatures occur from 7 PM through 8 AM local time. For POP, two forecasts per day were gathered: 0000 – 1200 UTC
and 1200 – 0000 UTC. Definitions of MAX-T, MIN-T, and POP from the NWS follow similar
definitions (Chris Hill, personal communication, July 17, 2003).
While quality control measures are implemented at the agencies from which the data were
gathered, simple range checking was performed to ensure quality of the data used in the
analysis. Temperatures below –85F and above 140F were removed, POP data were required
to be in the range of 0 to 100%, and quantitative precipitation amounts had to be in the range of
0.0 to 25.0 in. for a 12-hr period. On occasion, forecasts and/or observation data were not
available for a given time period and these days were removed from analysis. Also, absolute
forecast – observation differences, used in calculating many of the statistics in this study, were
removed from the dataset if they exceeded 40F. Differences greater than 40F were
considered unrealistically large and a sign of error in either the observations or the forecasts. A
few periods of such large forecast-observations differences were found and were apparently
due to erroneous MOS data.
The data set was analyzed to determine the percentage of days when all forecasts and
required verification observations were available, and it was found that each station had
complete data for about 85-90% of the days. On many days (greater than 50%), at least one observation and/or forecast was missing from at least one station, making it impractical to remove a day entirely from the analysis whenever any data were missing. Therefore, only the individual missing data (and not the corresponding entire day) were removed from the analysis. Each MOS forecast and the NWS forecast were required to be present for a date to be used for a given station and variable.
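To make these screening rules concrete, the following is a minimal Python sketch of the range checks and completeness requirement described above. The record layout, field names, and helper functions are hypothetical illustrations and not the processing code actually used in the study.

def passes_range_checks(record):
    """Simple range checks applied before analysis (temperatures in F,
    POPs in percent, 12-hr precipitation amounts in inches)."""
    if not (-85.0 <= record["obs_temp"] <= 140.0):
        return False
    if not all(0.0 <= p <= 100.0 for p in record["pops"]):
        return False
    if not (0.0 <= record["qpf_12hr"] <= 25.0):
        return False
    return True

def usable_for_station_variable(record, forecast_names=("NWS", "GMOS", "EMOS", "NMOS")):
    """A date is used for a given station and variable only if the verifying
    observation and every forecast are present, and no absolute
    forecast-observation difference exceeds 40F."""
    if record.get("obs_temp") is None:
        return False
    for name in forecast_names:
        fcst = record["forecasts"].get(name)
        if fcst is None:
            return False
        if abs(fcst - record["obs_temp"]) > 40.0:
            return False
    return True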
3. METHODS
CMOS was calculated by simple averaging of GMOS, EMOS, and NMOS for MAX-T, MIN-T, and POP. CMOS values were calculated only if two or more of the three MOS's were available and were considered "missing" otherwise. A second CMOS (CMOS_GE) was also calculated which used only GMOS and EMOS; both GMOS and EMOS had to be available for calculation of CMOS_GE. CMOS_GE was calculated in an attempt to improve the original CMOS by eliminating the consistently weakest member (NMOS) of the individual MOS's.
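As a concrete illustration, a minimal Python sketch of the two consensus calculations follows; the dictionary-based input format is an assumption made for the example and is not taken from the study's actual code.

def cmos(values):
    """Simple average of GMOS, EMOS, and NMOS; at least two of the three
    members must be available, otherwise the consensus is missing."""
    members = [values[m] for m in ("GMOS", "EMOS", "NMOS") if values.get(m) is not None]
    return sum(members) / len(members) if len(members) >= 2 else None

def cmos_ge(values):
    """Two-member consensus; both GMOS and EMOS must be available."""
    if values.get("GMOS") is None or values.get("EMOS") is None:
        return None
    return 0.5 * (values["GMOS"] + values["EMOS"])

# Example: CMOS falls back to a two-member average when NMOS is missing.
print(cmos({"GMOS": 71.0, "EMOS": 73.0, "NMOS": None}))  # 72.0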
WMOS was calculated using minimum-variance estimated weights (Daley 1991). For a given training period, the weight applied to each MOS was determined from its mean squared error over that period:

w_n = \frac{1/\sigma_n^2}{\sum_{m=1}^{3} 1/\sigma_m^2} ,

where \sigma_n^2 is the mean squared error of MOS n (GMOS, EMOS, or NMOS) over the training period, so that the most accurate MOS receives the largest weight. WMOS was
calculated for each station and variable using training day periods of 10, 15, 20, 25, and 30
days. The training period that produced the minimum mean squared error from July 1 2003 –
July 1 2004 for each station and variable was then saved and used in the final WMOS
calculation. With eight variables (two MAX-T’s, two MIN-T’s and four POP forecasts) and 30
stations, there were 240 different weights stored. A table of the average weights across all
stations is given in Table 1. NMOS has the lowest weights of the three MOS’s for all variables.
GMOS generally has the highest weights. A plot showing a typical time series of weights for
MAX-T period 1 for a station (Nashville, TN KBNA in this case) is given in Figure 2. The training
period for this station and variable is 30 days. GMOS has high weights for October and
November 2003, and a more even blend of weights is seen for the remainder of the year. Each
model shows higher weights than the others for periods of time, and weights for the three
MOS’s mostly fall between 0.2 and 0.5.
Variable      GMOS    EMOS    NMOS
MAX-T pd1:    0.368   0.322   0.310
MAX-T pd2:    0.374   0.324   0.303
MIN-T pd1:    0.373   0.334   0.294
MIN-T pd2:    0.394   0.329   0.278
PRC1:         0.332   0.372   0.279
PRC2:         0.346   0.342   0.280
PRC3:         0.358   0.355   0.274
PRC4:         0.355   0.333   0.291
Table 1. Weights used in WMOS, averaged over all 30 stations.
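The following minimal Python sketch illustrates the minimum-variance weighting and the selection of the training-period length; the data structures (a chronological list of per-day MOS values paired with the verifying observation) are assumptions made for the example and are simplified relative to the actual processing.

import numpy as np

def min_variance_weights(errors_by_model):
    """Weights proportional to the inverse mean squared error of each MOS
    over the training window, normalized to sum to one."""
    mse = {m: float(np.mean(np.square(e))) for m, e in errors_by_model.items()}
    inv = {m: 1.0 / max(v, 1e-9) for m, v in mse.items()}  # guard against zero MSE
    total = sum(inv.values())
    return {m: v / total for m, v in inv.items()}

def wmos_value(mos_values, weights):
    """Weighted consensus forecast for one station, variable, and day."""
    return sum(weights[m] * mos_values[m] for m in mos_values)

def best_training_length(history, candidates=(10, 15, 20, 25, 30)):
    """Choose the training length that minimizes the mean squared error of the
    resulting WMOS over the study period. 'history' is a chronological list of
    (mos_values, observation) pairs for one station and variable."""
    best_len, best_mse = None, float("inf")
    for n in candidates:
        sq_errors = []
        for day in range(n, len(history)):
            window = history[day - n:day]
            errors = {m: [obs - vals[m] for vals, obs in window]
                      for m in history[0][0]}
            weights = min_variance_weights(errors)
            vals, obs = history[day]
            sq_errors.append((wmos_value(vals, weights) - obs) ** 2)
        if sq_errors and float(np.mean(sq_errors)) < best_mse:
            best_len, best_mse = n, float(np.mean(sq_errors))
    return best_len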
Bias, or mean error, is defined as

\mathrm{Bias} = \frac{1}{n}\sum_{i=1}^{n}(f_i - o_i) ,

where f_i is the forecast, o_i is the corresponding observation, and n is the total number of forecast-observation pairs. MAE is defined as

\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left| f_i - o_i \right| .

Precipitation observations were converted to binary rain/no-rain data (with trace amounts treated as no-rain cases), which are needed for calculating Brier Scores. The Brier Score is defined as

\mathrm{BS} = \frac{1}{n}\sum_{i=1}^{n}(f_i - o_i)^2 ,

where f_i is the forecast probability of rain, o_i is the observation converted to binary rain/no-rain data, and n is the total number of forecasts. Brier Scores range from 0.0 (perfect forecasting) to 1.0 (worst possible forecasting).
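A minimal Python sketch of these three measures follows, assuming paired arrays of forecasts and verifying observations; for the Brier Score, POPs expressed in percent would first be divided by 100. The function names and array layout are illustrative assumptions.

import numpy as np

def bias(forecasts, observations):
    """Mean error (forecast minus observation)."""
    return float(np.mean(np.asarray(forecasts) - np.asarray(observations)))

def mae(forecasts, observations):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(forecasts) - np.asarray(observations))))

def brier_score(pop_forecasts, rain_observed):
    """Mean squared difference between forecast rain probability (0-1) and
    binary observed precipitation (1 for rain, 0 for no rain)."""
    f = np.asarray(pop_forecasts, dtype=float)
    o = np.asarray(rain_observed, dtype=float)
    return float(np.mean((f - o) ** 2))

# Example: a 30% POP on a dry day and a 70% POP on a wet day each
# contribute 0.09, giving a Brier Score of 0.09.
print(brier_score([0.3, 0.7], [0, 1]))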
To better understand the circumstances under which each forecast performed well or
poorly, a type of “distributions-based” verification was performed (Brooks and Doswell 1996;
Murphy and Winkler 1986). One circumstance examined was periods of large one-day
temperature change. These periods are expected to be more challenging for the forecaster and
the models. To determine forecast skill during these periods, forecast minus observations
temperature differences were gathered for days when the observed MAX-T or MIN-T varied by
+/-10F from the previous day. MAE's and biases were then calculated from these differences
for each forecast and forecast variable.
This study also examined periods when observed temperatures departed significantly from climatology, since it is hypothesized that human forecasters might have an advantage during such times. Days showing a large departure from climatology were identified using daily climatological maximum and minimum temperatures obtained by interpolating 1971-2000 monthly average maximum and minimum temperatures from the National Climatic Data Center to each date. A large departure from climatology was defined to be +/- 20F.
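A minimal Python sketch of how these stratification days might be identified is given below; the assumption that the monthly normals are valid at mid-month, and the simple 30-day interpolation, are illustrative choices rather than the exact procedure used with the NCDC data.

import datetime

def interpolated_normal(monthly_normals, date):
    """Linearly interpolate monthly-mean temperatures (a dict keyed 1-12,
    assumed valid at mid-month) to a specific date."""
    mid = datetime.date(date.year, date.month, 15)
    if date >= mid:
        m0, m1 = date.month, date.month % 12 + 1
        frac = (date - mid).days / 30.0
    else:
        m0, m1 = (date.month - 2) % 12 + 1, date.month
        prev_mid = datetime.date(date.year if date.month > 1 else date.year - 1, m0, 15)
        frac = (date - prev_mid).days / 30.0
    return monthly_normals[m0] + frac * (monthly_normals[m1] - monthly_normals[m0])

def large_change_day(obs_today, obs_yesterday, threshold=10.0):
    """Day-to-day change in observed MAX-T or MIN-T of 10F or more."""
    return abs(obs_today - obs_yesterday) >= threshold

def large_departure_day(obs, monthly_normals, date, threshold=20.0):
    """Observed temperature at least 20F from the interpolated daily normal."""
    return abs(obs - interpolated_normal(monthly_normals, date)) >= threshold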
As another measure of forecast quality, the number of days each forecast was the most or least accurate was determined. A given forecast may have a low MAE for a forecast variable yet rarely be the most accurate forecast. Conversely, a forecast may be the most accurate on more days than the other forecasts but have a worse average error because of infrequent large errors. This type of information remains hidden when only standard verification measures such as MAE are considered. The smallest (and largest) forecast error was determined for each day, with ties counting towards each forecast's total.
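This counting procedure can be sketched in a few lines of Python; the input format (one dictionary of absolute errors per day, keyed by forecast name) is assumed for illustration.

def count_extreme_days(daily_abs_errors):
    """Count, for each forecast, the days on which it had the smallest and the
    largest absolute error. Ties are credited to every forecast sharing the
    extreme value."""
    best_days, worst_days = {}, {}
    for errors in daily_abs_errors:
        smallest, largest = min(errors.values()), max(errors.values())
        for name, err in errors.items():
            if err == smallest:
                best_days[name] = best_days.get(name, 0) + 1
            if err == largest:
                worst_days[name] = worst_days.get(name, 0) + 1
    return best_days, worst_days

# Example: NWS and CMOS tie for most accurate; NMOS is least accurate.
best, worst = count_extreme_days([{"NWS": 1.0, "CMOS": 1.0, "NMOS": 3.5}])
print(best)   # {'NWS': 1, 'CMOS': 1}
print(worst)  # {'NMOS': 1}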
4. RESULTS
4.1 Temperature
Total MAE scores were calculated using all stations, both MAX-T and MIN-T’s, for all
forecast periods available (Table 2). It can be seen that WMOS has the lowest total MAE,
followed by the CMOS_GE, CMOS, NWS, GMOS, EMOS and NMOS. MAE’s are notably lower
than was found by Vislocky and Fritsch (1995). This is presumably due in part to more than ten
years of model improvement. The National Weather Service’s National Verification Program,
using data from 2003, reports total MAE's for GMOS, EMOS and NMOS similar to those shown
here (Taylor and Stram 2003).
Forecast      Total MAE (F)
NWS           2.54
CMOS          2.50
CMOS_GE       2.49
WMOS          2.41
GMOS          2.64
EMOS          2.84
NMOS          2.97
Table 2. Total MAE for each forecast type, August 1 2003 – August 1 2004. Totals include data
for all stations, all forecast periods, and both MAX-T and MIN-T for temperature.
Figure 3 shows the distribution of absolute forecast errors over all stations and the entire period of the study. Bins are 1F in size, with values plotted at bin centers (0.5F, 1.5F, etc.). It is interesting to note the similarity of the error distributions for all forecasts; subtle differences between the curves account for the differences in mean-error statistics. NWS, CMOS, CMOS_GE, and WMOS are all very similar to each other, with the individual MOS's showing some separation in the number of occurrences of small differences (< 2F) and larger differences (> 4F).
Figure 4 shows MAE by MAX-T and MIN-T for each of the forecast periods. Period 1, "pd1," is the day 1 MAX-T; period 2, "pd2," is the day 2 MIN-T; period 3, "pd3," is the day 2 MAX-T; and period 4, "pd4," is the day 3 MIN-T. WMOS has the lowest MAE's for all periods. NWS has a
lower MAE than both CMOS and CMOS_GE for the period 1 MAX-T, but CMOS and
CMOS_GE have similar MAE’s for period 3 MAX-T and lower MAE’s for both MIN-T’s. The
individual MOSs have higher MAE’s than WMOS, NWS, CMOS and CMOS_GE for all periods
and GMOS has the lowest MAE’s of the individual MOS’s.
To determine the forecast skill during periods of large temperature change, MAE’s were
calculated for days having a MAX-T or MIN-T change of at least 10F from the previous day. Results of
these calculations are shown in Fig. 5. There is about a 1.0 to 1.5F increase in MAE’s in the
seven forecast types over MAE’s for all times (Fig. 4). WMOS again has the lowest MAE for all
periods except, interestingly, for MIN-T for period 4, when GMOS shows the lowest MAE. NWS
has lower MAE’s than CMOS for all periods but similar or higher MAE’s than CMOS_GE.
MAEs were also calculated for days on which observed temperatures departed by more
than 20F from the daily climatological value (see Methods section). Results from these
calculations are given in Figure 6. NWS shows considerably higher skill (lower MAE) relative to
the other forecasts for the period 1 MAX-T, with MAE’s being about 0.5F lower than the next
best forecast, WMOS. For MIN-T period 2 and MAX-T period 3, CMOS_GE performs the best,
and apparently the poor performance of NMOS increases MAE’s for CMOS and even for
WMOS. The weighting scheme in WMOS appears to be unable to compensate enough during
times of large departure from climatology for these variables. As was seen for the one-day
temperature change statistics, GMOS performs best out of all forecasts for MIN-T period 4.
Figure 7 shows the number of days that each forecast was most accurate (i.e. had the
smallest absolute forecast minus observation difference). Situations when two or more
forecasts had the most accurate result were awarded to each such forecast. For MAX-T period
1, NWS was most frequently the best forecast, with over 50% more days than CMOS, about
40% more days than CMOS_GE and WMOS, and about 15% more best-day forecasts than the
individual MOS’s. For the remaining periods, the individual MOS’s are more comparable to
NWS, and GMOS actually shows more days with the most accurate forecast than NWS by MIN-T period 4. The consensus and weighted MOS forecasts show the fewest days with the most accurate forecast for all periods and variables. Similar results for this type of statistic for a
consensus MOS have been reported (Wilk 1998).
Figure 8 shows the number of days each forecast had the least accurate forecast.
Again, when two or more forecasts had equal least accurate results, each received one day
towards their totals. This figure shows that NWS and the individual MOS's (GMOS, EMOS, and NMOS) all have higher totals of least-accurate days, just as they had for most-accurate days, revealing the conservative nature of the CMOS, CMOS_GE, and WMOS forecasts. The averaging in these forecasts tends to eliminate extremes, so that they are seldom the most or least accurate forecast relative to NWS and the individual MOS's (GMOS, EMOS, NMOS). CMOS_GE has notably more least-accurate days than CMOS or WMOS due to averaging over only two models. NWS has considerably fewer least-accurate forecasts than the individual MOS's for most periods.
Figure 9 shows a time series of bias for all stations for August 1, 2003 – August 1, 2004 for
NWS, CMOS, and WMOS. The correlation between the three forecast biases is quite evident,
with a pronounced positive (warm) bias by CMOS and WMOS seen through the 2003-2004 cold
season. This presumably reflects, to some degree, forecasters' use of the models and MOS, and perhaps their awareness of biases within the MOS.
Figure 10 compares the performance of the various NWS forecast offices to that of
MOS, showing MAE’s for MAX-T, period 1 for August 1, 2003 – August 1, 2004. The stations
are sorted by broad geographic region, starting in the West and moving through the Intermountain West and Southwest, the Southern Plains, the Southeast, the Midwest, and the
Northeast (see map, Figure 1). MAE’s vary more from location to location in the western U.S.,
while MAE’s in the northeastern U.S. are more uniform. Not surprisingly, MIA (Miami, FL)
shows the lowest MAE of all stations, and LAS (Las Vegas, NV) and PHX (Phoenix, AZ) also
have low MAE's. Higher MAE's are seen at individual stations in the West (Missoula-MSO and Denver-DEN); the relatively poor performance at DEN may reflect the highly variable weather at that location.
Figure 11 shows biases for MAX-T, period 1, for each of the 30 individual stations in the
study for August 1, 2003 – August 1, 2004. A prominent feature is the positive (warm) biases
through much of the Western U.S., particularly in the Intermountain West and Southwest, and
extending through the southern plains and into the South. Small negative (cool) biases are
observed in much of the Midwest. NWS forecasts have biases that are over 0.5F smaller than
CMOS at stations such as MSO, SLC, DEN, PHX, ABQ and BNO.
4.2 Precipitation
Total Brier Scores for the study period for the seven forecasts, for all stations and forecast periods, are given in Table 3. The scores do not vary greatly from each other. WMOS and CMOS_GE show the lowest scores, however, followed by CMOS, GMOS, NWS and EMOS, and NMOS. Although WMOS should give low weight to the poor-performing NMOS, its scores are probably still affected by the use of this MOS; as a result, scores for WMOS are no better than those for CMOS_GE.
Forecast      Total Brier Score
NWS           0.096
CMOS          0.092
CMOS_GE       0.091
WMOS          0.091
GMOS          0.094
EMOS          0.096
NMOS          0.105
Table 3. Total Brier Scores for August 1 2003 – August 1 2004. Totals include data for all stations and all forecast periods.
Figure 12 shows the distribution of the absolute value of forecast POP minus binary
observed precipitation (1 for rain, 0 for no rain) over all stations and over the entire period
record for forecast period 1. Bins are 0.10 in size, centered every 0.10. The forecasts report
POP differently in that MOS’s have a resolution of 1% while NWS only reports to the nearest
10% (there were occurrences of 5% POP forecasts in the NWS data as well). Figure 12 can be
viewed in two parts, one for occurrences of rain (less than 0 on the x-axis), and one for periods
of no rain (greater than 0 on the x-axis). For rain cases, there is a relatively uniform distribution from –0.7 to –0.2 for all forecasts, with a small maximum near –0.5 to –0.4 (POP forecasts of 50-60%). During no-rain occasions (again, greater than 0.0 on the x-axis), more separation can be seen between forecasts. NWS shows a higher frequency of 30% POP forecasts, which can be seen at –0.7 and at 0.3; at these points there is more separation between the NWS curve and those of the other forecasts. NWS has a lower frequency in the 10% bin during no-rain cases relative to the other forecasts. NWS also has a higher frequency of 0% POP forecasts.
Figure 13 shows the distribution of the forecast POP minus binary observed precipitation (1
for rain, 0 for no rain), normalized by each forecast’s total error. This plot shows the situations
contributing to the total error of each forecast. Only CMOS, WMOS, and NWS are shown to
allow each to be seen more clearly. The preponderance of 30% POP forecasts by NWS is seen
again as in Figure 12. These are the two maxima in the NWS curve. Much of the error for
CMOS and WMOS occurs when no rain is observed and small POP is forecast. Obviously,
CMOS can only be zero if the average of all three individual MOS’s is zero, and WMOS can
only be zero if the weighted average of the MOS's is zero. This leads to many cases in which these POP forecasts are small but non-zero, reflecting a lack of sharpness in the consensus forecasts. All three forecasts tend to incur most of their error from POP forecasts of 20-40%. This is reasonable, as this value approaches
the climatological value for many stations. NWS incurs more error than CMOS and WMOS for
forecasting 100% POP in no-rain situations (1.0 on x-axis) and 0% POP in rain situations (-1.0
on x-axis).
Figure 14 shows Brier Scores for each of the four 12-hr precipitation forecast periods. WMOS has the lowest (most skillful) scores for all periods. There is a substantial increase in
Brier Scores with time for all forecasts. CMOS_GE has the second lowest Brier Scores,
followed by CMOS and NWS and the individual MOS’s. NMOS has particularly poor Brier
Scores relative to the other forecasts.
Figure 15 shows a time series of Brier Scores for all stations in the data set for period 1. A very high correlation between CMOS and NWS is seen with time, far more so than for temperature. An increase in Brier Scores is seen in the warm season, when convection is more prominent. Comparing Fig. 15 to Fig. 9 clearly shows that humans add less skill to precipitation forecasts than to temperature forecasts.
Figure 16 shows Brier Scores by station. Low-precipitation locations show the lowest (most
skillful) Brier Scores, such as at LAS, PHX and ABQ. MIA shows the highest Brier Scores due
to considerable convective precipitation at that location. CMOS shows the greatest skill
improvement over NWS at some Midwest locations like STL, ORD and IND. NWS outperforms
CMOS at a few locations in the South, and at select stations in the Intermountain West.
5. CONCLUSIONS
This study compared the skill of National Weather Service forecasts throughout the United
States with the predictions of individual, composite, and weighted Model Output Statistics.
Similar to model ensemble averaging, increased skill is obtained by averaging MOS from
several models. Also, the removal of the weakest model from a consensus of forecasts shows
some increase in skill for some forecast variables. A simple weighted average consensus
shows further increase in skill in terms of MAE and Brier Score. The use of training periods
determined specifically for each station and forecast variable appears to give more favorable
results than seen for a weighted consensus MOS in other studies. CMOS shows equal or
superior forecast performance in terms of overall MAE’s and Brier Scores to that of the NWS
and of individual MOS’s. WMOS shows superior forecast performance to that of CMOS. NWS
forecasts add less value to individual MOS’s for precipitation than for temperature, as even
GMOS outperforms NWS for precipitation for all but the first 12-hr forecast period.
Comparison of time series of NWS and CMOS MAE's and biases shows the apparent
extensive use of MOS by forecasters, as well as an awareness by forecasters of seasonal
temperature biases in the models. Regional variations in MAE and bias are also apparent in the
data.
Further investigation shows that NWS forecasters perform particularly well for short-term
forecasts for periods when temperatures show a large departure from climatology. During
periods of large one-day temperature change, the consensus MOS's and WMOS are competitive with or have lower MAE's than NWS.
Calculating the total number of days that each forecast predicted the most or least accurate
temperatures revealed that the NWS predictions are less conservative than the consensus and weighted MOS forecasts, in that NWS has more days on which it was the most accurate as well as the least accurate. The consensus and weighted MOS's have the fewest days of being either the most or the least accurate forecast; indeed, they can be the least accurate forecast only in a "tie" situation, when all forecasts (MOS and NWS) agree.
Results presented here suggest that at the current stage of model development, human
forecasters are best served in directing effort towards short-term weather forecasting,
particularly in times of temperature extremes. It is very difficult to outperform a consensus of
MOS’s in the short term during other periods, and it is even difficult to outperform the better of
the individual MOS’s (e.g. GMOS) in longer-term forecasts.
Given the varying distributions of errors among the forecasts, the question arises as to whether the goal of a forecast system is a lower long-term MAE or more days with the most accurate forecast. Obviously, it is the end user that determines the answer to this question, and the answer varies from user to user. It is difficult to imagine, however, a user that requires slightly better forecasts on most days at the cost of having more high-error days. Perhaps if more were known as to when and where these high-error days occur, the user would be willing to use this type of forecast. Conversely, it is easy to imagine a user concerned with short-term forecasts during extreme weather conditions. National Weather Service and human forecasts thus still fill a role in forecasting, for arguably the most important forecast situations (during extremes). Advances in numerical models and MOS have, however, in one sense caught up with human forecasters for longer-term forecasts and during periods of more "regular" weather.
6. REFERENCES
Bosart, L. F., 2003: Whither the weather analysis and forecasting process? Wea. Forecasting, 18, 520-529.
Brooks, H. E., and C. A. Doswell, 1996: A comparison of measures-oriented and distributions-oriented approaches to forecast verification. Wea. Forecasting, 11, 288-303.
Brooks, H. E., A. Witt, and M. D. Eilts, 1997: Verification of public weather forecasts available via the media. Bull. Amer. Meteor. Soc., 78, 2167-2177.
Daley, R., 1991: Atmospheric Data Analysis. Cambridge University Press, 457 pp.
Dallavalle, J. P., and V. J. Dagostaro, 2004: Objective interpretation of numerical weather prediction model output – a perspective based on verification of temperature and precipitation guidance. Symposium on the 50th Anniversary of Operational Numerical Weather Prediction, College Park, MD, 14-17 June 2004.
Jensenius, J. S., Jr., J. P. Dallavalle, and S. A. Gilbert, 1993: The MRF-based statistical guidance message. NWS Technical Procedures Bulletin No. 411, NOAA, U.S. Dept. of Commerce, 11 pp.
Jolliffe, I. T., and D. B. Stephenson, 2003: Forecast Verification: A Practitioner's Guide in Atmospheric Science. John Wiley & Sons, 240 pp.
Murphy, A. H., and R. L. Winkler, 1986: A general framework for forecast verification. Mon. Wea. Rev., 115, 1330-1338.
Murphy, A. H., and T. E. Sabin, 1986: Trends in the quality of the National Weather Service forecasts. Wea. Forecasting, 1, 42-55.
NWS, 2004: http://nws.noaa.gov/om/notifications/tin03-48eta_mos_aaa.txt.
Taylor, A., and M. Stram, 2003: MOS verification. Retrieved August 11, 2004, from the NWS Meteorological Development Laboratory's MOS Verification webpage: http://www.nws.noaa.gov/mdl/verif/.
Vislocky, R. L., and J. M. Fritsch, 1995: Improved model output statistics forecasts through model consensus. Bull. Amer. Meteor. Soc., 76, 1157-1164.
Vislocky, R. L., and J. M. Fritsch, 1997: Performance of an advanced MOS system in the 1996-97 National Collegiate Weather Forecasting Contest. Bull. Amer. Meteor. Soc., 78, 2851-2857.
Wilk, G. E., 1998: Intercomparisons among the NGM-MOS, AVN-MOS, and consensus temperature forecasts for West Texas. Retrieved August 16, 2004, from the NWS Southern Region Headquarters webpage: http://www.srh.noaa.gov/topics/attach/html/ssd98-39.htm.
Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, 467 pp.
Figure 1. Map of U.S. showing National Weather Service locations used in the study.
Figure 2. Time series of weights used for each of the three MOS’s in WMOS for station BNA
over the study period.
Figure 3. Absolute value of forecast minus observations for all stations, all time periods, August 1 2003 – August 1 2004. Bins are 1F in size, with data plotted in the center of each bin.
Figure 4. MAE (F) for the seven forecasts for all stations, all time periods, August 1 2003 –
August 1 2004.
Figure 5. Total MAE for each forecast during periods of large temperature change (10 F over
24-hr—see text), August 1 2003 – August 1 2004. Totals include data for all stations.
Figure 6. Total MAE for each forecast type during periods of large departure (20F) from daily
climatological value, August 1 2003 – August 1 2004. Totals include data for all stations.
Figure 7. Number of days each forecast is the most accurate, all stations, August 1 2003 –
August 1 2004. Ties count as one day for each forecast.
Figure 8. Number of days each forecast is the least accurate, all stations, August 1 2003 –
August 1 2004. Ties count as one day for each forecast.
Figure 9. Time series of bias in MAX-T for period one for all stations, August 1 2003 – August 1
2004. Mean temperature over all stations is shown with a dotted line. 3-day smoothing is
performed on the data.
Figure 10. MAE for all stations, August 1, 2003 – August 1 2004, sorted by geographic region.
Figure 11. Bias for all stations, August 1, 2003 – August 1 2004, sorted by geographic region.
Figure 12. Absolute value of forecast minus binary observed precipitation for all stations, all time periods, July 1 2003 – July ?? 2004. Bins are 10% in size, with data plotted in the center of each bin.
Figure 13. Distribution of forecast POP minus binary observed precipitation (1 for rain, 0 for no rain), normalized by each forecast's total error, for all stations; only NWS, CMOS, and WMOS are shown.
Figure 14. Brier Scores for the seven forecasts for all stations, all time periods, August 1 2003 – August 1 2004, separated by forecast period.
Figure 15. Time series of Brier Score for period 1 for all stations, August 1, 2003 – August 1, 2004. 3-day smoothing is performed on the data.
Figure 16. Brier Score for all stations, August 1, 2003 – August 1, 2004, sorted by geographic region.