WXSIM Temperature Forecast Verification Study

In Comparison with NGM MOS, AVN MOS, NWS Coded City Forecasts

(c) Copyright Thomas J. Ehrensperger 1999

----------------------------------------------------------------

NOTE: Additional data, obtained in 2002, appears at the end of this document as an appendix. Preliminary findings from this recent data are presented. Further analysis and perhaps more data will be presented at a later date.

----------------------------------------------------------------

Introduction

I have spent much of the last 16 years developing a weather forecasting/modeling program called Weather Simulator (WXSIM). This software package is, as far as I can tell, unique, so that it does not readily fall into any pre-existing category. The most concise description I can think of for it is an 'interactive local weather modeling system for personal computers'. It originated as a simple attempt to model diurnal temperature curves, but grew to include advection, user input (during execution as well as for initialization), and eventually certain aspects of output from the NGM and Eta models.

While this program started out mostly as an experiment in modeling single-site surface temperatures, it has become a very useful short range temperature forecasting tool. Its output includes other variables as well, such as dew point, diurnal variations in low cloud cover and wind speed - among others - but its strength is temperature forecasting, which is the subject of this study.

Purpose

The purpose of this study is twofold: (1) to identify WXSIM's strengths and weaknesses (partly to guide further development of the model) and (2) to establish, as objectively as I can, its accuracy and usefulness for operational short-term forecasting. Regarding the latter, it would seem that this study would be more objective if done by someone other than myself, but I am presently the only person sufficiently interested and motivated to do it. I will simply state here that I have been very careful to avoid biasing the outcome in favor of my model; the truth is what I'm after.

(NOTE: For a quick look at some overall results if you don't wish to read all the details, skip down to the section titled 'Analysis of all three sets together'.)

Methodology

The data analyzed here actually consists of three sets of four-period forecasts, numbering 80, 100, and 50 forecasts (i.e. 920 forecast numbers in all) for Atlanta's Hartsfield International Airport (ATL), covering essentially the same season (winter and early spring), and using three slightly different versions of the model. The first spans the period November 29, 1995 through February 3, 1996 and uses an older DOS version of the program. The second spans December 17, 1997 through February 26, 1998 and uses a newer Windows-based version which, while essentially the same at the core, does have some additional features. The third spans January 14, 1999 to April 3, 1999, with the model having undergone only slight changes since the second set.

My goal was to generate forecasts at about the same times of day that the National Weather Service's Coded City Forecasts (CCF's) are published, generally between 4 and 5 o'clock E.S.T., AM and PM. I did include some forecasts initialized at other times, partly because my schedule doesn't always permit me to make them at the desired times, but also in order to find how well the program could work in an update mode in between these times.

Along with my forecasts, I included for comparison the latest NGM and AVN MOS products, along with the CCF's. I almost always downloaded and viewed at least one (usually both) of the MOS products before making my forecasts in order to study their wind, cloud cover, and precipitation probability outputs, as changes in these can be interjected into WXSIM as it runs. I usually downloaded NGM and Eta FOUS data along with RAOB soundings (usually for FFC, but often including BMX and BNA), too. I often (but not always) also saw the latest CCF numbers, zone forecasts, and state forecast discussions (sometimes including those from adjacent states) as well. Less often, I viewed satellite pictures (occasionally from as much as 15 minutes after WXSIM's initialization time), forecast graphics from other models (such as the 29 km Meso-Eta model), and MOS numbers for other cities (to help in some changing advection situations).

The program was initialized using hourly surface reports for ATL and, usually, a number of upwind sites. In the earlier study employing the DOS version, this data had to be entered by hand. In the more recent study, most of this data was ingested by the program using its import features, then reviewed and modified if necessary (i.e. modifications in wind speed or direction to simulate averages over a few hours as opposed to only the most recent hour's conditions). All this downloading and review of data generally took half an hour or more, including waiting for files to be downloaded from the internet and the various distractions of running the program in a busy household! Many times, in fact, over an hour passed from the time of latest surface observations until the completion of the forecast. I consciously tried to avoid looking out the window or checking the thermometer in the meantime.

For most of the second set of data, I have made copious notes detailing my inputs to the program. In most cases, I used the 'Use FOUS' routine, which allows a user-defined mixture of NGM and Eta FOUS to exert a user-defined degree of influence on WXSIM's output. In a sense, this 'contaminates' WXSIM with 'foreign' data, but of course one might expect an improvement due to a sort of consensus effect. In most cases I favored the Eta slightly (60% to 40%) over the NGM, since Eta is newer, occasionally varying this in accordance with opinions expressed in various forecast discussions regarding which model was favored at that particular time. I kept the 'contamination' level fairly low, only 20% effective weighting for FOUS in most cases for boundary layer temperature, with 50% usually for upper air temperatures (less WXSIM's specialty), and 60% for surface pressure (which WXSIM only handles to the extent of empirically modeling diurnal pressure variations anyway). I believe inclusion of this outside data helps WXSIM more at 36 and 48 hours than it does at 12 and 24, but in any case the output is closer to pure WXSIM than to what it would be with WXSIM acting as a FOUS 'puppet'.
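The weighting scheme just described amounts to a two-stage weighted average. The sketch below is purely illustrative (WXSIM is not written in Python, and the function and parameter names here are my own), but it shows the arithmetic:

```python
def blend_guidance(wxsim_value, ngm_value, eta_value,
                   fous_weight=0.20, eta_share=0.60):
    """Blend a WXSIM-computed value with NGM/Eta FOUS guidance.

    fous_weight -- effective weighting given to the FOUS consensus
                   (e.g. 0.20 for boundary layer temperature, 0.50 for
                   upper air temperature, 0.60 for surface pressure)
    eta_share   -- fraction of the FOUS consensus taken from the Eta
                   (0.60 here, favoring Eta 60/40 over the NGM)
    """
    # First form the NGM/Eta consensus, then mix it into WXSIM's value.
    fous_consensus = eta_share * eta_value + (1.0 - eta_share) * ngm_value
    return (1.0 - fous_weight) * wxsim_value + fous_weight * fous_consensus
```

With the default weights, a WXSIM value of 50 against NGM 54 and Eta 56 moves only about one degree toward the guidance, keeping the output much closer to pure WXSIM than to a FOUS 'puppet'.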

The methodology for the third set of data was very similar to that of the second, with the exception of use of MAPS analysis soundings (essentially the RUC-2 model) to help initialize many of the runs.

A usually straightforward, but occasionally touchy, issue is that of official verification numbers. Often, especially in winter, the low or high temperature for the day occurs at odd times, like midnight. The exact period covered in a max/min temperature forecast therefore has to be defined carefully. From various sources I learned that NGM MOS maximum temperature forecasts refer to the period from 7 AM through 7 PM local standard time. For minimum temperatures, it is 7 PM through 8 AM, the extra hour presumably considering the fact that in the cold season many places have not yet had sunrise at 7 AM. AVN MOS, according to what I've learned, is similar except with the morning minimum period extended to approximately 9 AM. As for the CCF forecasts, the verification periods are similar, though I was told by Terry Murphy at the NWSFO in Peachtree City, GA (FFC) that they cut off the low at 7 AM.
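The NGM MOS windows described above can be sketched as follows (an illustration only, assuming hourly temperatures keyed by local standard hour; the function names are my own):

```python
def daytime_high(day_temps):
    """Maximum temperature from 7 AM through 7 PM LST.

    day_temps: {hour (0-23, LST): temperature} for one calendar day.
    """
    return max(t for h, t in day_temps.items() if 7 <= h <= 19)

def nighttime_low(evening_temps, morning_temps):
    """Minimum temperature from 7 PM LST through 8 AM LST the next morning.

    The extra morning hour allows for pre-sunrise lows in the cold season.
    """
    night = [t for h, t in evening_temps.items() if h >= 19]
    night += [t for h, t in morning_temps.items() if h <= 8]
    return min(night)
```

Extending the morning cutoff from 8 to 9 would give the AVN MOS convention; cutting it at 7 would give the CCF convention as described to me.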

For verification purposes, I have adopted the NGM MOS period definitions, and have assumed these to apply to AVN MOS and the CCF's as well. WXSIM is configured to report temperature extremes for these periods accordingly, being worded as 'daytime highs' and 'nighttime lows'. My sources for this data are various NWS reports. Climate summaries are released typically at about 7:45 AM and 4:45 PM, and I use these, with the following exceptions. If the temperature continues to rise after the afternoon summary, I use the 'State Max and Min' table, published about 7:30 PM, or (apparently the same) the maximum temperature for the period 12Z-00Z. For minimum temperatures, I use the morning climate summary, then check the 13Z (usually 1253Z) hourly report, and use the lower of these two.

Not to be presumptuous, but I have had some concerns regarding the accuracy of ATL's temperatures at times. First of all, I have monitored ATL's temperatures for most of the last 25 years, having lived at three locations within about 3 miles of the airport during the entire period. I've used the same max and min ('U-tube') recording thermometer the whole time (though in three locations) and have compared my home readings to ATL on countless occasions. When ATL changed over to ASOS on August 1, 1995, I noticed a sudden cooling of ATL's temperatures relative to mine. This appeared to be about 2 degrees F relative to before. Comparison of departures from normal for monthly means at ATL and other nearby sites, such as AHN, for October and November, 1995 shows ATL having about 2 degrees more negative a departure from normal than the average of the other sites, suggesting that this anomaly is indeed real (I saved this data somewhere and need to find it!). This anomaly could also explain why all four forecast sources had a positive net error, ranging from 0.23 degrees for the (often cold-biased) AVN MOS to 3.69 degrees for the (often warm-biased) NGM MOS during the 1995-1996 study. In the analysis presented later, I present an optional 'correction' to these numbers, which helps all sources except the AVN.

The more recent temperatures seem much more consistent with historical trends. I suspect re-calibration has been done at times, and now the temperatures, relative to others in the area, are closer to what I'd expect. In very uniform-temperature conditions, such as rainy, cloudy, and/or windy weather, ATL still appears to come out a little on the low side, but by probably less than a degree and not enough to presume the need for correction. On two occasions, however, strange negative 'spikes' appear to have occurred at ATL, each time in conditions that would seem to make such events highly unlikely. The first of these occurred on 12/28/97 when, after a low so far of 24 reported on the morning climate summary (issued almost exactly at sunrise), the low temperature reported later in the day was 20! Conditions were clear with light-moderate winds and this 20 was, if I recall, the lowest temperature of any of the metro Atlanta reporting stations. At my house nearby, the low was 26, with no such weird fluctuations. I have therefore recorded this low as 24.

Another such case occurred on 1/19/98, when on a cloudy, windy, rainy night with no hourly temperatures reported below 39 F, a low of 36 was reported. My home thermometer and the Davis Weather Monitor at my work (only a mile or two from ATL) each recorded lows between 39 and 40. I recorded this low as 39, in keeping with the average of numerous stations nearby on what should have been both spatially and temporally a very uniform night. Other than these two well-thought-out exceptions, I have stuck strictly to the above discussed standard procedures to the best of my knowledge.

Another minor note may be worth making. I saved each of WXSIM's forecasts using the 'Save Data' option. I was producing the forecasts with an output time interval of 30 minutes with three iterations per interval (i.e. model updated every 10 minutes). To save disk space, however, only the every-30-minute outputs were saved. While the data retrieval program does attempt to statistically 'fill in the gap', there were some occasions on which a retrieved max or min differed from that originally output by a degree (never more than that). In each case I used the retrieved value, since it was documented in the saved file.

Data Analysis (First Set)

First I will present some overall figures for all available data, and then break it down into subsets to evaluate specific strengths and weaknesses. Although my main interest was WXSIM itself, the data also provides useful and interesting information about the other three sources' relative merits.

A total of 80 forecasts were made during the period 11/29/95 through 2/3/96. Of these 41 were AM forecasts, initialized with surface data as of an average of 7:08 AM EST, for the upcoming 12 hour max, 24 hour min, 36 hour max, and 48 hour min. The remaining 39 were PM forecasts, initialized with surface data averaging 5:08 PM EST, for the upcoming 12 hour min, 24 hour max, 36 hour min, and 48 hour max. All 80 entries include WXSIM, but 2 CCF's, 1 NGM, and 7 AVN (plus the 48 hour forecast from another) forecasts were not obtained (i.e. due to availability and downloading problems). Care was taken in the analysis to analyze only the data present; in other words, no error exists when no forecast was made, and only those forecasts made and verified went into the averaging.

For the first 7 columns, a correction was applied to the suspiciously low (see discussion above) verification numbers from ATL. The correction consists of adding 0.9, 1.5, 1.8, and 2.0 degrees to the 12, 24, 36, and 48 hour verification numbers, respectively. The idea here is that ATL's readings were 2 degrees too cold and that, while 12 hour forecasts would pay some heed to current (too low) surface temperatures, by 48 hours this effect would be lost against the background of advection and upper air data, revealing the full extent of the error. The last 3 columns repeat columns 5, 6, and 7, but WITHOUT the 'correction' applied. The net error in this case could be interpreted as evidence for the suspected low readings. Note: the root mean square is the square root of the average of the squares of the errors, averaged here for the four periods.
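As a concrete illustration of these measures (a sketch only; the actual study was tabulated in a spreadsheet), the mean absolute, root-mean-square, and net errors, with the optional correction, might be computed as:

```python
import math

# Degrees F added to the 12, 24, 36, and 48 hour verification numbers.
CORRECTION = [0.9, 1.5, 1.8, 2.0]

def error_stats(forecasts, verifications, correct=True):
    """Return (MAE, RMS, net error) over a set of forecast cycles.

    forecasts, verifications: lists of 4-value sequences, one per cycle,
    holding the 12, 24, 36, and 48 hour temperatures in order.
    """
    errors = []
    for fc, ob in zip(forecasts, verifications):
        for i, (f, o) in enumerate(zip(fc, ob)):
            adj = CORRECTION[i] if correct else 0.0
            errors.append(f - (o + adj))  # positive = forecast too warm
    mae = sum(abs(e) for e in errors) / len(errors)
    rms = math.sqrt(sum(e * e for e in errors) / len(errors))
    net = sum(errors) / len(errors)
    return mae, rms, net
```

Note that the RMS, by squaring before averaging, penalizes occasional large busts more heavily than the MAE does, which is why the RMS columns below run consistently higher than the MAE columns.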

Table 1: Mean Absolute, Root-Mean-Square, and Net Errors:

                 ------------ Correction Applied ------------    --- No Correction ---
        12 hr   24 hr   36 hr   48 hr    AVG    RMS    NET        AVG    RMS    NET
NGM      3.08    3.33    3.90    4.74   3.77   4.63   1.78       4.41   5.43   3.33
AVN      3.60    3.30    3.50    3.29   3.42   4.24  -1.20       3.30   4.14   0.31
NWS      2.76    2.62    3.29    3.56   3.06   3.93   0.25       3.34   4.32   1.80
WXSIM    2.18    2.50    2.98    3.59   2.81   3.57  -0.10       2.99   3.84   1.44

WXSIM is the clear winner here, especially at 12 hours. In the 'corrected' data, at least, only NWS' and AVN's 48 hour numbers were better. It may be interesting to note that NGM MOS averaged about 3 degrees colder than AVN MOS, and that AVN did somewhat better. Also interesting is the fact that AVN had the worst 12 hour forecasts, but the best 48 hour forecasts. If further data should bear this out, it might be that AVN should be the model of choice for medium range forecasts.

It might be objected that WXSIM had an advantage with the somewhat later initialization times, averaging nearly 2 hours after the CCF's are decided upon and perhaps 9 hours after the last data goes into NGM and AVN. Indeed, some forecasts made as late as 10 AM or 10 PM are included, matched against older NGM, AVN, and NWS forecasts. Of course, you might note that WXSIM's 24 hour forecasts beat any of the competing 12 hour forecasts, but the objection would be valid. Also, the fact that some of the other sources' forecasts were missing may also bring the comparison into question.

To answer such objections, consider a smaller set of forecasts, all made before 9 AM for morning forecasts and before 6 PM for afternoon ones. Here the average initialization times are 5:40 AM and 3:58 PM, affording a fair comparison between WXSIM and NWS (though still several hours after NGM and AVN are produced). This set contains 60 forecasts (31 PM and 29 AM) and NWS and NGM are fully represented, though 3 AVN forecasts are still missing from each of the AM and PM sets. In parallel with the table above, we have:

Table 2: Mean Absolute, Root-Mean-Square, and Net Errors:

                 ------------ Correction Applied ------------    --- No Correction ---
        12 hr   24 hr   36 hr   48 hr    AVG    RMS    NET        AVG    RMS    NET
NGM      3.11    3.40    3.98    4.64   3.78   4.70   1.47       4.31   5.39   3.02
AVN      3.46    3.30    3.80    3.37   3.48   4.21  -1.30       3.32   4.06   0.24
NWS      2.88    2.72    3.66    3.56   3.20   4.09  -0.08       3.37   4.41   1.63
WXSIM    2.32    2.57    3.36    3.71   2.99   3.80  -0.20       3.09   4.01   1.33

This more equitable analysis narrows the gap only slightly, by about 0.11 degrees on the average for the corrected MAE averages, after making the forecasts an average of 1 hour and 19 minutes earlier. Again, though, WXSIM's 12 and 24 hour forecasts are better than any of the other sources' 12 hour forecasts, suggesting that WXSIM's advantage is not due solely to the more timely data. Of course, the ability to use more timely data is one of WXSIM's advantages anyway.

Another type of bias I was looking for was that of diurnal range. It is possible that a given model might forecast lows that are too low and highs that are too high, for example, while showing little overall net error. Such ranges, however, are very sensitive to proper input of winds, and especially, clouds. A few blown cloud forecasts could quickly skew such results. I do not have data on predicted versus actual cloud cover for this period, but assuming such errors occurred in roughly equal amounts in each direction, the following numbers may be useful (details of the method of calculation available on request, but I will say here that they employ the set of 60):
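The exact calculation used in the study is, as noted, available on request; a plausible version of such a measure (my own reconstruction for illustration, not necessarily the one used) compares forecast and observed high-minus-low ranges:

```python
def diurnal_range_bias(pairs):
    """Average error in forecast diurnal range, in degrees.

    pairs: list of (forecast_high, forecast_low, actual_high, actual_low).
    Positive result = forecast ranges too wide (highs too high and/or
    lows too low); negative = forecast ranges too narrow.
    """
    diffs = [(fh - fl) - (ah - al) for fh, fl, ah, al in pairs]
    return sum(diffs) / len(diffs)
```

Because highs and lows enter with opposite signs, a model could show a large range bias here while having nearly zero net temperature error, which is exactly the effect this check is meant to catch.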

Table 3: Net Error in Diurnal Range (degrees F):

NGM 0.32 AVN -0.10 NWS 0.76 WXSIM -0.30

It appears all models did quite well in this regard, so that calibration errors with regard to the diurnal variability were not a significant contribution to the overall errors of any of these sources.

Yet another question is that of whether the models work better for PM forecasts (12 hour min, etc.) or AM forecasts (12 hour max, etc.). For brevity here, I will show only the results from the set of 60 forecasts run 'early' and for which all data (except some AVN) were available, as described earlier. Shown are the average MAE for each source, followed by the 12 hour MAE in parentheses.

Table 4:

29 AM Forecasts: 12-48 hour average MAE, (MAE of 12 hour max):

NGM 3.92 (3.69) AVN 3.76 (3.89) NWS 3.33 (2.91) WXSIM 3.27 (2.58)

31 PM Forecasts: 12-48 hour average MAE, (MAE of 12 hour min):

NGM 3.65 (2.54) AVN 3.20 (3.03) NWS 3.07 (2.85) WXSIM 2.71 (2.05)

Interestingly, all sources verify better for the PM forecasts. I would have expected this behavior for WXSIM, since it relies to a significant extent on surface data in characterizing the air mass, and the atmosphere is generally better mixed, both vertically and horizontally, in the afternoon than in the early morning. As for the other models, the reasons may be different, but there still seems to be such an effect, especially for NGM's 12 hour lows. It also appears that we may have identified a major strength of WXSIM, namely its very good performance on 12 hour overnight lows. In fact, inclusion of the other eight afternoon/evening forecasts (sending the average initialization time from 3:58 PM to 5:08 PM) actually brings this error down from 2.05 to 1.93 degrees.

One more item to check with this older data set is how the various sources' performance depends on temperature departures from normal. Generally, the period in question was a bit colder than normal, and a couple of significant outbreaks of arctic air occurred. Listed below are the mean absolute and net errors for each model for two sets of data: 30 (15 AM and 15 PM) 'warm' forecasts with 12-48 hour average 'corrected' verification temperatures 37.5 F or higher, and 30 (15 AM and 15 PM) 'cold' forecasts with verification temperatures 37.5 F or lower.

Table 5: 30 Forecasts, average corrected temperature = 46.7 F (about 4-5 degrees above normal):

        Net Error   12-48 hr MAE   12 hr MAE
NGM        0.08         3.36          2.75
AVN       -1.20         3.74          2.81
NWS       -1.10         2.96          2.34
WXSIM     -1.10         3.08          2.49

Table 6: 30 Forecasts, average corrected temperature = 31.5 F (about 10-11 degrees below normal):

        Net Error   12-48 hr MAE   12 hr MAE
NGM        2.70         4.18          3.38
AVN       -1.50         3.24          4.15
NWS        1.07         3.48          3.40
WXSIM      0.56         2.93          2.14

All forecast sources did fairly well in the normal to above normal conditions of the first set, with the NWS forecasts being the best. Interestingly, while the other three models all performed less well in the below normal conditions, WXSIM's output was better than it was in the warmer weather and better than the other models as well. The AVN MOS had the rather odd feature of doing substantially better at 24-48 hours than at 12 in this colder weather. I have no hypothesis ready to explain this.

Finally, I list the three worst and three best forecasts made with WXSIM, along with the corresponding forecasts from the other sources and the actual and corrected verification data. The criterion here was the total mean absolute errors of the 12-48 hour forecasts, with respect to the corrected verification data.

Table 7: Three worst, with worst overall listed last (each cell lists the 12, 24, 36, and 48 hour forecasts in order):

Date/Time         NGM           AVN           NWS           WXSIM         ACTUAL        ACTUAL (corr.)
1/22/96 7:56 AM   53 40 61 50   54 38 62 48   55 38 62 50   55 40 61 51   49 29 64 52   50 31 66 54
1/6/96  7:56 AM   54 33 33 16   46 27 27 17   52 35 35 16   45 30 30 11   45 21 22 18   46 23 24 20
1/6/96  3:56 PM   32 32 24 36   24 24 15 35   31 32 16 39   32 32 16 27   21 22 18 30   22 24 20 32

The second and third of these refer to the same arctic air outbreak and the main source of error was a failure in timing (except for the fairly good job done by AVN) during a rapidly falling temperature scenario, with the min and max temperature both occurring between 7 AM and 8 AM EST. The cold air arrived a few hours sooner than expected; the 36 and 48 hour forecasts were better, with the cold air already in place, though WXSIM initially expected an eventual low of 11, which was 7-9 degrees off the mark (depending on whether or not you choose to accept the 'corrected' verification data).

Table 8: Three best, with best overall listed last (each cell lists the 12, 24, 36, and 48 hour forecasts in order):

Date/Time          NGM           AVN           NWS           WXSIM         ACTUAL        ACTUAL (corr.)
12/22/95 4:56 PM   24 40 23 42   23 38 21 40   23 39 22 41   25 38 25 39   25 37 24 35   26 39 26 37
12/9/95 10:56 AM   48 18 41 25   38 13 33 18   40 18 38 25   40 15 33 21   41 13 31 19   42 15 33 21
2/2/96   3:56 PM   24 28 18 32   19 21  8 23   27 27 15 27   19 20  9 19   18 19  7 18   19 21  9 20

The first of these was a fairly 'easy' forecast in a well-established, moderately cold airstream. All models did well, with WXSIM the best. In the second, NGM shows a strong warm-bias bust, with AVN close, the NWS compromising between the two, and WXSIM correctly siding with AVN. The last case is very similar, but more extreme. The NWS apparently compromises between what turns out to be a bad warm-bias bust on the part of NGM and a rather good handling of the situation by AVN. WXSIM correctly sides with AVN again, but outdoes even it, using either the actual or the corrected verification numbers. I would hypothesize here that WXSIM's focus on surface conditions in what may have been rather shallow cold air, along with its more direct, relatively non-statistical approach (as opposed to using what was probably rather rare historical data), allowed it to perform well in this fairly extreme situation.

Data Analysis (Second Set)

This data set, spanning December 17, 1997 through February 26, 1998, differs in a few ways from the first set. Perhaps the most important change is that the newer, Windows version of WXSIM was used. The program's core algorithms are essentially unchanged from the DOS version, but the user interface is much different and FOUS data can be used to affect upper level and boundary layer temperatures as well as surface pressure, whereas before these models' 'opinions' had to be manually entered during program execution. In addition, some very small changes were made to ATL's entry in the site data file (CTY.FDT).

Another difference is the type of winter we've been having. The much-talked-about 'El Nino' event has delivered frequent unsettled weather with no extremes of cold and few of warmth. Frequent and rapidly varying cloud cover has presented somewhat of a forecasting difficulty, but on the other hand, the moderate temperatures can perhaps be better forecast than could be extreme ones.

Finally, I have documented this forecast data rather thoroughly and also attempted to keep the initialization time comparable to the release times of the NWS coded city forecasts. As in the first data set, some other initialization times were used on occasion to test the program's usefulness in an update mode, but most of the forecasts were initialized about the CCF release time or earlier.

In the interest of parallelism with the earlier data set, I present this data in the same order as before. One difference, however, is that the apparent calibration problems which required consideration in the first data set seem to have been largely solved, so that in this second set no correction has been applied.

A total of 100 forecasts were made during the period 12/17/97 through 2/26/98. Of these 50 were AM forecasts, initialized with surface data as of an average of 4:09 AM EST (achieved by making some forecasts at midnight or 1 AM, and others typically at 6 or 7 AM), for the upcoming 12 hour max, 24 hour min, 36 hour max, and 48 hour min. The remaining 50 were PM forecasts, initialized with surface data averaging 4:54 PM EST, for the upcoming 12 hour min, 24 hour max, 36 hour min, and 48 hour max. This time the record is complete, with no missing forecasts from any source. Note that the average initialization time for this data set is 1 hour and 36 minutes earlier than that of the first set, perhaps handicapping it slightly with respect to the earlier data.

To compare with Table 1, above, we have, for all of the second data set:

Table 9: Mean Absolute, Root-Mean-Square, and Net Errors:

        12 hr   24 hr   36 hr   48 hr    AVG    RMS    NET
NGM      3.53    3.36    3.49    4.20   3.65   4.63   1.82
AVN      3.45    3.21    3.93    3.69   3.57   4.67   0.05
NWS      3.08    3.05    3.19    3.70   3.26   4.15   0.98
WXSIM    2.40    3.22    3.42    3.57   3.15   4.00   0.25

The overall 'order of finish' here is essentially the same as before. Curiously, NGM, AVN, and NWS all did better at 24 hours than at 12; this was also true of AVN and NWS in the first set. Also, the slight warm bias of all the sources might be considered suggestive of continued slightly low readings at ATL. If this is in fact the case, then 'correcting' this might slightly reduce the gap between NWS and WXSIM.

Next, as done above, we consider a subset consisting of only 'early' forecasts. These are defined here as those initialized by 8 AM or 6 PM, excluding any that were not the first AM or first PM of the day (i.e., 7 AM was excluded if a forecast was made 7 hours earlier at 12 AM). The result of this culling is a set of 87 forecasts - 44 AM averaging 3:34 AM and 43 PM averaging 4:11 PM, or an average of 56 minutes earlier than the corresponding first set data.

Table 10: Mean Absolute, Root-Mean-Square, and Net Errors:

        12 hr   24 hr   36 hr   48 hr    AVG    RMS    NET
NGM      3.48    3.48    3.61    4.36   3.73   4.75   1.76
AVN      3.56    3.33    3.97    3.79   3.66   4.82   0.09
NWS      3.20    3.08    3.31    3.90   3.37   4.27   0.95
WXSIM    2.49    3.32    3.48    3.59   3.22   4.11   0.27

These results are almost identical to those obtained for the entire set. If anything, WXSIM may have done better relative to the others with these early runs. One might suspect that this would imply that running WXSIM as a later update would have failed to add value to the forecast, but a valid check on this involves direct comparison of forecasts with later-updated versions of the same forecasts; this will be done below.

As before, we take a look at the diurnal ranges (using the set of 87 forecasts):

Table 11: Net Error in Diurnal Range (degrees F):

NGM -0.69 AVN -0.91 NWS -0.49 WXSIM -1.41

As before, it appears all the sources handle diurnal range fairly well, though all (especially WXSIM) show a slightly too-small range. I believe that this is, at least in part, a result of the rapidly changing cloud cover this winter. My very distinct impression was that I overestimated cloud cover more often than I underestimated it, and was generally trying to follow MOS guidance and zone forecasts. The fact that WXSIM's range was a bit smaller than the others' could imply a slight calibration error, or perhaps a misinterpretation of cloud opacities.

In order to compare AM versus PM forecasts, we can look at the data from the separate, culled sets:

Table 12:

44 AM Forecasts: 12-48 hour average MAE, (MAE of 12 hour max):

NGM 3.61 (3.43) AVN 3.66 (4.16) NWS 3.25 (3.18) WXSIM 3.29 (2.93)

43 PM Forecasts: 12-48 hour average MAE, (MAE of 12 hour min):

NGM 3.86 (3.53) AVN 3.66 (2.95) NWS 3.49 (3.21) WXSIM 3.15 (2.05)

These numbers are quite similar to those in Table 4, showing generally better results at 12 hours for PM forecasts (of overnight lows) than at 12 hours for AM forecasts (of daytime highs). This tendency is seen quite strongly with both AVN and WXSIM. Once again, WXSIM's PM forecasts of overnight lows are considerably better than those of any of the other sources.

Continuing the parallel analysis with the older data set, we investigate how the various sources' performance depends on temperature departures from normal. The recent data cover a period significantly warmer than the first set, with no real arctic air outbreaks. Listed below are the mean absolute and net errors for each model for two sets of data: 44 (22 AM and 22 PM) 'warm' forecasts with 12-48 hour average verification temperatures 45.50 F and higher, and 43 (22 AM and 21 PM) 'cold' forecasts with verification temperatures 45.25 F and lower.

Table 13: 44 Forecasts, average temperature = 51.3 F (about 8-9 degrees above normal):

        Net Error   12-48 hr MAE   12 hr MAE
NGM        1.05         3.84          3.50
AVN       -0.56         4.32          3.68
NWS        0.23         3.57          3.36
WXSIM     -0.26         3.54          2.50

Table 14: 43 Forecasts, average temperature = 40.7 F (about 2 degrees below normal):

        Net Error   12-48 hr MAE   12 hr MAE
NGM        2.49         3.63          3.47
AVN        0.76         2.99          3.44
NWS        1.67         3.16          3.02
WXSIM      0.80         2.90          2.49

In contrast to the situation in the older data (see Tables 5 and 6), all forecast sources did better in the cooler set. Perhaps this shouldn't be surprising, though, as in this recent data, the cooler set was the one closer to normal. WXSIM and (to a greater extent) AVN seem to 'prefer' the colder conditions. As in the older data, AVN displays the curious feature of doing worse (in cold weather) at 12 hours than at later periods. Note that WXSIM's 12 hour forecasts remain rather dependable in both warm and cold weather.

Finally, I set up the spreadsheet for the recent data in such a way as to allow one more type of analysis. In considering how one might use WXSIM's output in operational forecasting, it occurs to me that a conservative approach would be to simply "fudge" one's previously made forecast by a small amount (say one degree) in the direction of WXSIM's numbers, should they be different. If doing so improves one's forecast more often than it degrades it, WXSIM should be deemed useful. To test WXSIM in this way, I set up the spreadsheet to add a degree to the NWS forecast numbers if WXSIM's forecast temperature were higher, subtract a degree if WXSIM's numbers were lower, and, of course, leave the NWS forecast alone in cases where WXSIM and it agree. The number of times this procedure helped, hurt, and made no difference were tallied, for 12, 24, 36, and 48 hours, and overall.
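The tallying procedure can be sketched as follows (illustrative Python with names of my own choosing; the study itself used a spreadsheet):

```python
def fudge_tally(base, toward, actual, step=1.0):
    """Count cases where nudging 'base' one step toward 'toward' helped.

    base, toward, actual: parallel lists of forecast and verified
    temperatures for one period. Returns (helped, hurt, no_difference).
    """
    helped = hurt = same = 0
    for b, t, a in zip(base, toward, actual):
        if t > b:
            fudged = b + step
        elif t < b:
            fudged = b - step
        else:
            same += 1  # the two sources agreed; forecast left alone
            continue
        if abs(fudged - a) < abs(b - a):
            helped += 1
        elif abs(fudged - a) > abs(b - a):
            hurt += 1
        else:
            same += 1
    return helped, hurt, same
```

Reversing the roles of the first two arguments gives the complementary experiment of fudging WXSIM toward NWS, reported in Table 16.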

Table 15: Effect of "fudging" NWS towards WXSIM by 1 degree F (set of 87 forecasts):

                     Helped                     Hurt                No Difference
(Period)    12  24  36  48  All     12  24  36  48  All     12  24  36  48  All
44 AM       24  25  20  22   91     13  14  19  10   56      7   5   5  12   29
43 PM       30  18  21  23   92      8  19  15  15   57      5   6   7   5   23
All 87      54  43  41  45  183     21  33  34  25  113     12  11  12  17   52

Overall, in 183 (62%) of the 296 cases in which WXSIM and NWS disagreed, fudging NWS 1 degree towards WXSIM would have helped, while in only 38% of cases would it have hurt. In every category except the 24 hour PM forecasts of the next day's high, this procedure would have helped on balance. The effect is most pronounced with 12 hour forecasts: overall, 54 (72%) of the 75 NWS 12 hour forecasts with which WXSIM disagreed would have been helped. The figure rises to 79% (30 out of 38) if only PM forecasts of overnight lows are considered.

Careful consideration will show that a forecast source does not actually have to be better than NWS for such fudging to be helpful. If the two sources (i.e. NWS and WXSIM) are sufficiently independent of each other, they may tend to complement each other, and the fudging would be helpful in either direction. The question might then arise: which one helps the other more? To answer this, I reversed the process, fudging WXSIM towards NWS by a degree and checking whether this helped or hurt. Note that this is NOT the same as simply reversing the 'help' and 'hurt' columns in Table 15, because often the two help each other. Here are the results:

Table 16: Effect of "fudging" WXSIM towards NWS by 1 degree F (set of 87 forecasts):

                     Helped                    Hurt               No Difference
Set (Period)  12  24  36  48  All    12  24  36  48  All    12  24  36  48  All
44 AM         22  21  27  15   85    15  18  12  17   62     7   5   5  12   29
43 PM         15  26  23  20   84    23  11  13  18   65     5   6   7   5   23
All 87        37  47  50  35  169    38  29  25  35  127    12  11  12  17   52

Overall, NWS does indeed help WXSIM, but generally by a smaller margin than vice versa. In 57% of the 296 cases in which the sources disagreed, fudging towards NWS helped; in 43% of the cases it hurt. It was quite helpful in some categories (e.g. 26 helped vs. 11 hurt for 24 hour highs), but hurt in a couple of others (e.g. 15 helped vs. 23 hurt for 12 hour lows).

I will also briefly note here that in my spreadsheet analysis I included columns with weighted averages of WXSIM and NWS. These weightings were 2:1 WXSIM:NWS and 2:1 NWS:WXSIM, each rounded off to the nearest whole number. For the group of 87 forecasts, recall from Table 10 that NWS had an overall MAE of 3.37 as compared with WXSIM's 3.22. To give an idea of the degree of improvement, the MAE for the 2:1 NWS:WXSIM average is 3.13, while the 2:1 WXSIM:NWS average yields an MAE of 3.05 degrees.
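A sketch of one such weighted column follows. The exact rounding rule in the original spreadsheet is not stated in this study, so this is an assumption; note in particular that Python's built-in round() rounds exact halves to the nearest even integer, which may differ from a spreadsheet's behavior:

```python
def blend(primary, secondary, w1=2, w2=1):
    """2:1 weighted average of two forecast temperatures (deg F),
    rounded off to the nearest whole degree.
    Call as blend(wxsim, nws) for 2:1 WXSIM:NWS, or swap for 2:1 NWS:WXSIM."""
    return round((w1 * primary + w2 * secondary) / (w1 + w2))
```

For example, with a WXSIM forecast of 54 and an NWS forecast of 60, the 2:1 WXSIM:NWS blend is 56 and the 2:1 NWS:WXSIM blend is 58.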

It appears that WXSIM's slightly smaller overall error, along with a degree of independence from NWS forecasts (and the sources that contribute to them), allows consideration of its output to be of significant value in improving NWS forecasts.

Data Analysis (Third Set)

This set of data is much like the second, though - due to convenience - afternoon forecasts dominate (30 PM versus 20 AM forecasts). The average morning forecast time was 5:30 AM EST and the average afternoon forecast time 4:02 PM EST. Here are the results for all 50 forecasts:

Table 17: Mean Absolute, Root-Mean-Square, and Net Errors:

12 hr 24 hr 36 hr 48 hr AVG RMS NET

NGM 3.18 3.26 3.48 3.94 3.47 4.72 1.26

AVN 3.60 3.18 3.90 4.36 3.76 4.90 0.51

NWS 2.94 2.94 3.12 3.80 3.20 4.24 0.53

WXSIM 2.46 3.42 3.52 4.48 3.47 4.30 -0.67
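For reference, the three summary statistics in these tables can be computed as follows (a minimal sketch; errors are taken here as forecast minus verification, so a positive net error indicates a warm bias, consistent with the sign conventions above):

```python
import math

def error_stats(forecasts, verifications):
    """Mean absolute error, root-mean-square error, and net (bias) error
    for paired forecast and verification temperatures (deg F)."""
    errs = [f - v for f, v in zip(forecasts, verifications)]
    n = len(errs)
    mae = sum(abs(e) for e in errs) / n
    rms = math.sqrt(sum(e * e for e in errs) / n)
    net = sum(errs) / n
    return mae, rms, net
```

Since RMS weights large misses more heavily than MAE, RMS is always at least as large as MAE, as seen in each row of the tables.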

In order to compare AM versus PM forecasts, we can look at the data from the separate, culled sets:

Table 18:

20 AM Forecasts: 12-48 hour average MAE, (MAE of 12 hour max):

NGM 3.41 (3.25) AVN 3.74 (4.15) NWS 3.21 (3.10) WXSIM 3.58 (2.70)

30 PM Forecasts: 12-48 hour average MAE, (MAE of 12 hour min):

NGM 3.50 (3.13) AVN 3.78 (3.23) NWS 3.19 (2.83) WXSIM 3.40 (2.30)

Once more, we see generally better results at 12 hours for PM forecasts (of overnight lows) than at 12 hours for AM forecasts (of daytime highs). Also, WXSIM's PM forecasts of overnight lows continue to be better than those of any of the other sources.

Analysis of all three sets together

Perhaps the most meaningful results consist of averages of the three sets of data. To keep WXSIM's initialization times in line with the times at which the NWS CCF forecasts are produced, it is appropriate to use the sets of 60 and 87 forecasts, as opposed to the sets of 80 and 100, which include forecasts made at odd times and (in the earlier set of 80) are missing a possibly significant number of non-WXSIM forecasts. The most recent set - containing 50 forecasts - had rather early start times anyway, and was not further reduced by the small number of late morning or early evening forecasts. This combined set of 197 forecasts is also appropriate because it covers three different winters. A weighted average (50:87:60 recent:medium:old) seems appropriate also, not only because there is more recent data, but also because, with slight changes in the model and its method of use, the recent data are more representative of the state of WXSIM today. Note that I have chosen to use the 'corrected' verification temperatures for the older data because I honestly feel that ATL's new ASOS temperatures were too low during that period.

Table 19: Mean Absolute, Root-Mean-Square, and Net Errors (Combined Data):

12 hr 24 hr 36 hr 48 hr AVG RMS NET

NGM 3.29 3.40 3.69 4.34 3.68 4.73 +1.54

AVN 3.54 3.28 3.90 3.81 3.63 4.65 -0.23

NWS 3.04 2.93 3.37 3.77 3.28 4.21 +0.53

WXSIM 2.43 3.12 3.45 3.85 3.21 4.06 -0.11

This strongly suggests that WXSIM is the best source at 12 hours, and about as good as or better than the MOS products out through 48 hours, though NWS was the best at 24-48 hours. Also worth noting is that AVN seems to be nearly as good at 48 hours as it is at 12, and that NGM's accuracy falls off sharply at 48 hours. NGM is noticeably warm-biased relative to AVN, and WXSIM had the smallest overall net error.

As for a breakdown between AM and PM forecasts (weighted 20:44:29 recent:medium:old, with a mean initialization time of 4:38 AM, and 30:43:31 recent:medium:old, with a mean initialization time of 4:05 PM, respectively):

Table 20:

93 AM Forecasts: 12-48 hour average MAE, (MAE of 12 hour max):

NGM 3.66 (3.47) AVN 3.71 (4.07) NWS 3.27 (3.08) WXSIM 3.35 (2.77)

104 PM Forecasts: 12-48 hour average MAE, (MAE of 12 hour min):

NGM 3.69 (3.12) AVN 3.56 (3.05) NWS 3.28 (2.99) WXSIM 3.09 (2.12)

Note again that WXSIM is more sensitive to initialization time than are the others, with its most significant advantage occurring with 12 hour overnight low forecasts.

One last bit of analysis is in order. There were a number of forecasts in the first two data sets made at odd times, especially around 10 AM and 10 PM, potentially appropriate for updates. The question is whether WXSIM has any special value for updating forecasts at such times.

To investigate this, I matched 15 early AM forecasts (including 3 'midnight' ones from the previous day, actually based on 11:53 PM surface data) with later AM forecasts for the same days. Nine of these pairs came from the old data and 6 from the new. Average initialization times were 3:59 AM and 9:47 AM for the early and late forecasts, respectively. I also matched 18 early PM forecasts (8 from old data, 10 from new) with later PM forecasts. The average initialization times here were 4:01 PM and 10:11 PM.

While keeping in mind the rather small sample size here, the results are (overall MAE, then 12 hour MAE, then overall net error):

AM Forecasts

Original: 2.88, 2.40, 0.22

Update: 2.57, 2.00, -0.20

PM Forecasts

Original: 2.99, 2.11, 0.63

Update: 2.92, 1.67, 0.03
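The matching step matters here: an update run is only credited or penalized when an original run exists for the same day. A minimal sketch of that pairing (keying the runs by forecast date is my own choice of representation):

```python
def matched_mae(early, late):
    """Compare original and update runs on only the days that have both.
    early, late: dicts mapping forecast date -> signed forecast error (deg F).
    Returns (MAE of original runs, MAE of update runs) over the matched days."""
    common = sorted(set(early) & set(late))
    mae_early = sum(abs(early[d]) for d in common) / len(common)
    mae_late = sum(abs(late[d]) for d in common) / len(common)
    return mae_early, mae_late
```

Any late run with no earlier counterpart simply drops out of this comparison, though it still counts in the overall verification statistics.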

Unless these are statistical flukes, the improvement is small but real. There is a tendency for updates to be a bit colder than original runs. I am currently looking into this, and suspect it is due to a (correctable) miscalibration of the 'midpoint' times in the program (when the temperature should be near its daily mean). More work in this area might improve the update ability still further.

It might seem odd that successful updates failed to result in better performance relative to NWS. One reason for this is that a few late forecasts had no earlier matches, and hence affected the overall results without entering into the update data just analyzed.

Conclusions

I believe that the data above strongly suggest that WXSIM can be a valuable tool in operational temperature forecasting, especially (but not exclusively) for '12 hour' forecasts. In addition, the ability to run multiple scenarios (not studied here) and run last-minute updates necessitated by sudden changes in wind, cloud cover, or precipitation make it even more useful.

One might object that the program's partially subjective nature (due to the capability for user interaction during the run) renders the above data less valid. Against this I would respond that, while I am the program's author and perhaps its most skilled user at this point, a competent meteorologist (I'm only an amateur) with access to a wider range of real time data could get it to work even better.

In the meantime, I am committed to making the program even better. I have had some input from a few users who have done much smaller-scale and somewhat different types of studies at other sites, with very encouraging results, but I don't have sufficient details of the data or methodologies employed to present them here. I would welcome results from any registered users who wish to take the time to make a careful study of how the program works at other customized sites.

Tom Ehrensperger

July 27, 1999

----------------------------------------------------------------------

Appendix (August 1, 2004):

I collected 126 additional forecasts (48 AM and 78 PM), made during the period March 24 to January 26, 2003. These forecasts, as before, are for Atlanta Hartsfield Airport (ATL). This time, the temperature forecasts extend out to 72 hours (3 lows and 3 highs). Also recorded were NWS Coded City Forecasts, NGM MOS, AVN MOS (both "old" and "new" - or GFS - versions, until the old version was apparently discontinued after April 30), The Weather Channel, and my own forecasts (made after inspecting all the other products).

A fairly standard procedure was used, in which METAR data, MAPS soundings, NGM and ETA FOUS, and 191 km AVN (from NOAA's READY site) were imported and used in the forecast run. Manual changes were sometimes made to haze and wind direction. Occasionally, other features such as Recent Precip and Recent Temperatures were used. A few times, MOS forecast data for other cities were used for advection data after wind shifts. The first 25 forecasts were made with Versions 8.3.x, during which no changes were made which would affect forecast output. The rest were made with Versions 8.4.x and 8.5, which incorporated improvements to the program, inspired partly by early results of the study. In principle, accuracy would have been greater had the first 25 forecasts been made with the later versions of the program, since the changes should generally produce improvement. In practice, with this sample size, I don't think the difference is statistically significant, so I have included all 126.

The AM forecasts were made based on data from about 3 AM to 8 AM (average 6:27) local (Eastern) standard time and run within about 45 minutes or so, using whatever data was available at that point. Likewise, the forecasts recorded from other sources were the latest available. This always meant NGM MOS from 00Z. AVN MOS was either from 00Z or 06Z. The NWS and Weather Channel forecasts were those just released, generally around 4 or 5 AM.

Similarly, the PM forecasts were based on data from generally between 3 and 5 PM (average 3:49), and were compared to 12Z NGM MOS, 12Z or 18Z AVN MOS, and NWS and TWC forecasts released about 4 to 5 PM. My own personal forecasts were recorded just after the WXSIM runs, giving me a chance to take all these sources into consideration.

I have left "old" AVN MOS out of this analysis, because it was available for only about the first half of the data. At the time it was discontinued, it was the least accurate source overall and, significantly, was less accurate than the "new" version.

Here are the overall results:

Mean Absolute and Net Errors:

       12 hr  24 hr  36 hr  48 hr  60 hr  72 hr  12-48 hr avg  12-48 hr net
NGM     2.88   2.94   2.90   3.29    -      -        3.00          1.21
GFS     2.27   2.43   2.67   3.15   3.30    -        2.63         -0.27
NWS     2.17   2.33   2.48   2.91   3.47    -        2.47          0.18
TWC     1.99   2.04   2.33   3.11   4.00   3.59      2.37         -0.14
WXS     1.83   2.49   2.95   3.42   4.03   3.92      2.67          0.10
ME      1.55   2.05   2.29   2.98   3.31   3.31      2.22          0.06

WXSIM is clearly the best source (except for myself) at 12 hours. The advantage over the NWS and TWC humans is pretty much lost at 24 hours, but WXSIM does retain slight superiority over the NGM and AVN MOS products out through 36 hours. Beyond that, WXSIM, while still respectable, is the least accurate source.
