Figure 1: Previous day’s temperature (persistence) used as a forecast of 24-hr temperature, Salt Lake City airport data, 1979 to 2001. The red line is the fit for the central tendency (mean) using standard linear regression; the middle black line is the fit of the median (0.5 quantile); the upper black line is the 0.9 quantile; the lower black line is the 0.1 quantile. The median and mean fits are similar overall but diverge noticeably at higher temperatures. Note also the heteroscedastic behavior of the persistence fit, seen in the convergence of the 0.1 and 0.9 quantile lines at higher temperatures.

Figure 2: Time series of the daily uncalibrated 15-member ensemble temperature forecasts (colors) versus the observation (black) at station KSLC over the period 1990-2001 for: a) 24-hr lead-time January forecasts; b) 24-hr July; c) 360-hr January; d) 360-hr July. Note the strong underbias of the forecasts for both seasons and lead-times. (Red oval in panel b discussed in text and Figure 7.)

Figure 3: Rank histograms for the same data shown in Figure 2, but for the complete data set (1979-2001), sub-sampled to remove temporal autocorrelations (see text). Red dotted lines show 95% confidence limits for a perfectly calibrated forecast. Note the strong underbias of the forecasts for both seasons and lead-times.

Figure 4: Schematic of the logistic regression (LR) ensemble fitting procedure: step 1 – prescribe the climatological temperature thresholds at which probabilities are estimated (99 chosen); step 2 – fit the LR model and generate out-of-sample conditional probabilities (CDF) of being less than or equal to each threshold; step 3 – use the CDF to estimate (by linear interpolation) an evenly spaced 15-member ensemble for each day and each lead-time. The final result is a “sharper” posterior forecast PDF than the climatological prior; here it is used as an independent regressor set in the QR procedure.

Figure 5: Schematic of the QR post-processing procedure. See text for details.
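As a rough illustration of step 3 in Figure 4, the sketch below linearly interpolates a threshold CDF to an evenly spaced ensemble. The function name `ensemble_from_cdf`, the quantile levels k/(n+1) (so that 15 members define 16 equal-probability intervals), and the clamping of levels outside the fitted CDF range to the end thresholds are illustrative assumptions, not details given in the text.

```python
from bisect import bisect_left


def ensemble_from_cdf(thresholds, cdf, n_members=15):
    """Interpolate a threshold CDF to n_members evenly spaced quantiles.

    thresholds: increasing temperature thresholds (e.g. the 99 climatological ones).
    cdf: P(T <= threshold) for each threshold, e.g. from the fitted LR model.
    """
    # Assumption: force the CDF to be non-decreasing, since separately
    # fitted threshold probabilities need not be monotone.
    mono, running = [], 0.0
    for p in cdf:
        running = max(running, p)
        mono.append(running)

    members = []
    for k in range(1, n_members + 1):
        level = k / (n_members + 1)  # assumed evenly spaced probability levels
        if level <= mono[0]:
            members.append(float(thresholds[0]))   # clamp below the fitted range
            continue
        if level >= mono[-1]:
            members.append(float(thresholds[-1]))  # clamp above the fitted range
            continue
        j = bisect_left(mono, level)  # first index with mono[j] >= level
        p0, p1 = mono[j - 1], mono[j]
        t0, t1 = thresholds[j - 1], thresholds[j]
        members.append(t0 + (t1 - t0) * (level - p0) / (p1 - p0))
    return members
```

For example, `ensemble_from_cdf([0, 10, 20], [0.0, 0.5, 1.0], n_members=3)` places the three members at the 0.25, 0.5, and 0.75 quantiles of the piecewise-linear CDF.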
Figure 6: Same as Figure 2, but for the spread-interval-post-processed time series. See text for details. (Red oval in panel b discussed in text and Figure 7.)

Figure 7: Rank histograms of the July 24-hr lead-time 15-member (16-interval) post-processed ensemble using: a) logistic regression (LR) and 2-mo training periods; b) dispersion-selected quantile regression (QR) and 2-mo training periods; c) LR and 22-mo training periods; d) QR and 22-mo training periods. Red dotted lines show 95% confidence limits for a perfectly calibrated forecast (upper line in panels b and c not shown).

Figure 8: Kernel fits create PDFs from the original uncalibrated (black line) and calibrated (blue) 24-hr ensemble forecasts for one day (July 3, 1995), highlighted by the red oval in Figure 6 panel b. Comparison to the observation (red) shows the bias shift and increase in dispersion produced by calibration. Also shown is the tail of the climatological PDF (dashed), showing that the forecasts for this anomalously cold event are significantly sharper than climatology. Is the non-Gaussian behavior of the calibrated forecast PDF consistent across other forecasts?
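For reference, the rank histograms in Figures 3 and 7 count, for each forecast day, how many ensemble members fall below the observation. The following is a minimal sketch of that tally; the function name `rank_histogram` is hypothetical, and ties between a member and the observation are simply counted as "not below", which is an assumption rather than the procedure used for the figures.

```python
def rank_histogram(ensembles, observations):
    """Count observation ranks within each ensemble.

    ensembles: list of per-day member lists, all of equal length n.
    observations: list of verifying observations, one per day.
    Returns n + 1 counts, one per interval defined by the sorted members.
    """
    n_members = len(ensembles[0])
    counts = [0] * (n_members + 1)
    for members, obs in zip(ensembles, observations):
        # Rank = number of members strictly below the observation
        # (assumption: ties are not randomized).
        rank = sum(1 for m in members if m < obs)
        counts[rank] += 1
    return counts
```

A 15-member ensemble gives 16 bins. A flat histogram indicates good calibration, while counts piled into the end bins indicate bias or underdispersion.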
Figs:
-- add obs noise to raw-forecast rank histograms only (say, 0.5 deg), commenting on trying to inflate dispersion (don’t do for post-processed forecasts, since calibration already accounts for the need for additional obs spread)
-- convert rank histograms to skill scores w/ error bars, including error bars for perfect and no-skill forecasts; then define “potential signal to noise” as (perfect - no-skill) / 95% confidence of perfect forecast, and point out this should be >> 1 to be useful; then 1) generate SS w/ error bars; 2) compare PDFs of perfect forecasts w/ calibrated using ROC; and 3) KS tests
-- 2D skill plots with error bars – Jan only, 2-yr & 22-yr on same plot: RMSE, Brier (Jan, lower 10%), RPSS, ROC
-- 3D skill score plots (training window vs fcst lead vs SS => gray scale, white being 0%, black 100%) using persistence/climatology/raw (whichever is stricter) as reference (state when each is used): raw, LR, QR, SS-QR for winter/summer; for SS: RMSE, Brier 10% (Jan), 90% (July), RPSS, ROC, myscore (drop ones that are similar, and just describe)
-- redo, but using raw ensemble as reference
-- utility of usage for regressors (grouping all quantiles); bar plots by season and lead-time (1-day, 5-day, 10-day, 15-day), max window – then do same but using SS gray-scale plot format
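One possible way to write down the “potential signal to noise” from the notes above. The symbols are placeholders introduced here: S_perfect and S_noskill are the scores of the perfect and no-skill forecasts, and Δ95 is the 95% confidence interval of the perfect-forecast score. Whether Δ95 should be the full width or the half-width of that interval is not settled in the notes and is left open.

```latex
\[
  \mathrm{PSN}
  = \frac{S_{\mathrm{perfect}} - S_{\mathrm{noskill}}}
         {\Delta_{95}\left(S_{\mathrm{perfect}}\right)},
  \qquad
  \mathrm{PSN} \gg 1 \ \text{for the score to be useful.}
\]
```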