Journal of Hydrology 407 (2011) 58–72. Contents lists available at ScienceDirect. Journal homepage: www.elsevier.com/locate/jhydrol

Systematic evaluation of autoregressive error models as post-processors for a probabilistic streamflow forecast system

Martin Morawietz *, Chong-Yu Xu, Lars Gottschalk, Lena M. Tallaksen
Department of Geosciences, University of Oslo, P.O. Box 1047 Blindern, 0316 Oslo, Norway

Article history: Received 20 December 2010; received in revised form 2 May 2011; accepted 5 July 2011; available online 12 July 2011. This manuscript was handled by Andras Bardossy, Editor-in-Chief, with the assistance of Luis E. Samaniego, Associate Editor.

Keywords: Probabilistic forecast; Post-processor; Hydrologic uncertainty; Autoregressive error model; Ranked probability score; Bootstrap

Summary

In this study, different versions of autoregressive error models are evaluated as post-processors for probabilistic streamflow forecasts. The post-processors account for hydrologic uncertainties that are introduced by the precipitation–runoff model. The post-processors are evaluated with the discrete ranked probability score (DRPS), and a non-parametric bootstrap is applied to investigate the significance of differences in model performance. The results show that differences in performance between most model versions are significant. For these cases it is found that (1) error models with state dependent parameters perform better than those with constant parameters, (2) error models with an empirical distribution for the description of the standardized residuals perform better than those with a normal distribution, and (3) procedures that use a logarithmic transformation of the original streamflow values perform better than those that use a square root transformation.

© 2011 Elsevier B.V. All rights reserved.

1.
Introduction

In recent years, the topic of probabilistic flow forecasting has gained increased attention in hydrological research and operational applications. The traditional method of flow forecasting has been based on using a deterministic meteorological forecast and transforming it through a deterministic hydrological model to obtain a single deterministic flow value. However, it was recognised that such a forecast is often associated with considerable uncertainties, which need to be described as well in order to assist rational decision making. A catalyst in this development has been the advent of meteorological ensemble forecasts (e.g. Molteni et al., 1996) that aim to describe the uncertainties of the meteorological forecasts. Many hydrological studies have focused on treating the input uncertainties of precipitation and temperature by using meteorological ensemble forecasts as inputs to hydrological models (see for example the review on ensemble flood forecasting by Cloke and Pappenberger (2009)). However, in order to obtain a proper probabilistic flow forecast, all relevant sources of uncertainty in the hydrological modelling process should be addressed, not only the input uncertainties of forecast precipitation and temperature.

* Corresponding author. Tel.: +47 22854908; fax: +47 22854215. E-mail address: martin.morawietz@geo.uio.no (M. Morawietz). doi:10.1016/j.jhydrol.2011.07.007

Other uncertainties comprise the model uncertainty (model structure and parameters), uncertainty of the initial states of the hydrological model at the time of the forecast, as well as uncertainties of observed precipitation and temperature that drive the hydrological model up to the point where the forecast starts (these latter uncertainties influence the uncertainties of the initial states). Ideally, a probabilistic forecast system would treat all possible sources of uncertainty explicitly.
However, this seems both theoretically and practically impossible (Cloke and Pappenberger, 2009). The complex interactions between the different sources of uncertainty, and their partly unknown character, make an explicit treatment impossible. Following the argument of Krzysztofowicz (1999), ". . . for the purpose of real-time forecasting it is infeasible, and perhaps unnecessary, to explicitly quantify every single source of uncertainty". Similarly, with respect to parameter uncertainty, Todini (2004) states "that in flood forecasting problems one is definitely not interested in a parameter sensitivity analysis, but mainly focused at assessing the uncertainty conditional to the chosen model with its assumed parameter values".

A compromise between explicit and lumped treatment of the different sources of uncertainty is laid out in the framework for probabilistic forecasting described by Krzysztofowicz (1999). He proposes to treat those input variables that have the greatest impact on the uncertainty of the forecast explicitly; that means probability distributions of these variables are used as inputs to the deterministic hydrological model. In the case of flow forecasting, these input variables are forecasts of precipitation and, in addition, forecasts of temperature in catchments where snow plays an important role for runoff formation. All other uncertainties are then treated together in a lumped form as hydrologic uncertainty with a so-called hydrologic uncertainty processor, also called post-processor. When describing the hydrologic uncertainty through a hydrologic uncertainty processor for streamflow, the aim is to find the distribution of future observed streamflow at time t, Qobs(t), conditional on the simulated streamflow at time t, Qsim(t), that is attained when the true values of precipitation and temperature are used to drive the hydrological model.
The distribution can be conditioned on additional variables, and, analogously to the use of observed river stage at time t = 0 for a hydrologic uncertainty processor for river stage forecasting (Krzysztofowicz and Kelly, 2000), the observed streamflow at time t = 0 when the forecast starts, Qobs(0), is used in a hydrologic uncertainty processor for streamflow. The conditional distribution that is sought is then u(Qobs(t)|Qsim(t), Qobs(0)).

The approach investigated in this paper is a direct estimation of the distribution u(Qobs(t)|Qsim(t), Qobs(0)). The distribution is described with the help of a first-order autoregressive error model of the form

d_t = a d_{t-1} + r e_t

where d_t is the model error of the deterministic hydrological model (observed minus simulated streamflow) at time t, a and r are the parameters of the error model, and e_t is the residual error described through a probability distribution j. Based on this equation, the distribution u of the observed streamflow at time t, with time t − 1 = 0 as the time where the forecast starts (time of the last observed data), is given as:

u(Qobs(t) | Qsim(t), Qobs(0), Qsim(0)) = j( [(Qobs(t) − Qsim(t)) − a (Qobs(0) − Qsim(0))] / r )   (1)

Note that in this description the distribution is conditioned on an additional variable, Qsim(0). A similar approach of a direct estimation of the distribution u(Qobs(t)|Qsim(t), Qobs(0)) by using an autoregressive model was proposed by Seo et al. (2006). However, in their formulation the transformed observed streamflow itself is the autoregressive variable, not its error. Their autoregressive model then has the simulated streamflow at time t, Qsim(t), as exogenous variable, but the model does not contain the simulated streamflow at time 0, Qsim(0). In general, autoregressive error models have been used in hydrology in many different contexts.
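As a concrete illustration of Eq. (1), the following minimal sketch evaluates the predictive density of the observed streamflow; a standard normal residual distribution j is assumed here purely for illustration, the 1/r change-of-variables factor implicit in Eq. (1) is written out, and all function and variable names are hypothetical:

```python
import math

def predictive_pdf(q_obs_t, q_sim_t, q_obs_0, q_sim_0, a, r):
    """Density of Q_obs(t) given Q_sim(t), Q_obs(0) and Q_sim(0) from the
    AR(1) error model d_t = a*d_{t-1} + r*e_t (Eq. (1)), assuming a
    standard normal residual distribution j for illustration."""
    d_t = q_obs_t - q_sim_t              # model error at forecast time t
    d_0 = q_obs_0 - q_sim_0              # model error at the forecast start
    e = (d_t - a * d_0) / r              # standardized residual
    # standard normal pdf of e, divided by r (change of variables)
    return math.exp(-0.5 * e * e) / (r * math.sqrt(2.0 * math.pi))
```

The density peaks where the forecast error equals the persisted fraction a of the start-time error, i.e. at Q_obs(t) = Q_sim(t) + a (Q_obs(0) − Q_sim(0)).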
For example, Engeland and Gottschalk (2002) and Kuczera (1983) use them in the framework of Bayesian parameter estimation, and Xu (2001) uses them to study the residuals of a water balance model. In the context of flow forecasting, they are used for updating of model outputs (e.g. Toth et al., 1999). However, when used for updating of model outputs, only the deterministic part of the error model is applied, correcting a deterministic forecast to another deterministic forecast. This paper analyses the use of an autoregressive error model as hydrologic uncertainty processor, also called post-processor, where not only the deterministic component of the error model is used as for model updating, but the full distribution is applied.

When using autoregressive models for the description of streamflow or streamflow errors, three aspects are important. (1) In the simplest application, the parameters of the autoregressive model are assumed to be fixed. However, several authors propose the use of different parameters for different hydrological or meteorological states. Lundberg (1982) and Seo et al. (2006), for example, use different parameters for high and low flows, while Engeland and Gottschalk (2002) use a more detailed classification based on states of the variables temperature, precipitation and snow depth. (2) In order to make the residuals homoscedastic, a transformation is often applied to the original observed and simulated streamflow values. Common transformations are the logarithmic transformation (e.g. Engeland and Gottschalk, 2002) or the square root transformation (Xu, 2001), which are (apart from a linear shift) special cases of the Box–Cox transformation (Box and Cox, 1964). (3) The errors e_t of the autoregressive model are usually assumed to be normally distributed. However, when an autoregressive error model is used as a hydrologic uncertainty processor to generate probabilistic forecasts, a violation of the distributional assumptions will distort the results of such a forecast. A straightforward alternative to solve this problem is the use of an empirical distribution defined through the empirical standardized residuals of the calibration period. Application of this approach for a hydrologic post-processor has, to our knowledge, so far not been described in the literature.

Another important aspect for a proper evaluation of the results is the assessment of the significance of differences found in the evaluation measures. Such an evaluation does not always have to be carried out through a formal analysis if sufficient experience with a certain subject can ensure that a subjective evaluation leads to a reliable assessment. However, as there is so far relatively little experience in hydrological research with forecast evaluation measures such as the discrete ranked probability score (DRPS), an explicit treatment of the uncertainty of these evaluation measures seems more appropriate. The flexible approach of the bootstrap (Efron and Tibshirani, 1993) allows such an explicit evaluation of the uncertainty without the necessity of making distributional assumptions about the variable being evaluated.

Based on these considerations, the main objective of this study is an evaluation of different versions of autoregressive error models as hydrologic uncertainty processors. The following aspects are investigated in particular: (1) use of state dependent parameters versus state independent parameters; (2) use of a logarithmic transformation of the original streamflow values versus a square root transformation; (3) use of a standard normal distribution for the description of the standardized residuals versus an empirical distribution. In addition, the application of the bootstrap to the forecast evaluation measures to evaluate the significance of the results is demonstrated, and a discussion of the discrete ranked probability score for the evaluation of probabilistic streamflow forecasts is included.

The study was carried out by evaluating eight different autoregressive error models as hydrologic uncertainty processors. A well-known precipitation–runoff model, the HBV model, was chosen as deterministic hydrological model to which the uncertainty processors were applied. The uncertainty processors were calibrated for 55 catchments in Norway and evaluated using the discrete ranked probability score in combination with a non-parametric bootstrap.

2. Methods

2.1. Deterministic hydrological model: HBV model

The HBV model (Bergström, 1976, 1992) can be characterised as a semi-distributed conceptual precipitation–runoff model. It distinguishes different elevation zones based on the hypsographic curve, and for these elevation zones temperature and precipitation are adjusted according to temperature and precipitation gradients, with a temperature threshold to distinguish between rain and snow. Within each elevation zone, different land use zones are distinguished by allowing different parameters for certain model processes. Each zone runs individual snow and soil moisture routines. Since its original development in the early 1970s, the HBV model has been applied and modified in many different operational and research settings. The model version used in this study is the ''Nordic'' HBV model (Sælthun, 1996), which is used for operational flow forecasting at the Norwegian Water Resources and Energy Directorate (NVE). The model is run with daily time steps, with mean daily temperature and accumulated daily precipitation as model inputs and mean daily streamflow as model output. The model was calibrated by NVE for 117 catchments (Lawrence et al., 2009).
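As noted in the introduction, the two transformations compared in this study are, apart from a linear shift and scaling, special cases of the Box–Cox transformation. A minimal numerical sketch (function names hypothetical):

```python
import math

def box_cox(q, lam):
    """Box-Cox transform of a positive streamflow value q.

    lam = 0 is defined as the logarithmic limit; lam = 0.5 gives
    2*(sqrt(q) - 1), i.e. the square-root transform apart from a
    linear shift and scaling."""
    if lam == 0.0:
        return math.log(q)
    return (q ** lam - 1.0) / lam

q = 4.0
# lam = 0.5 reproduces the square-root transform up to the linear map 2*sqrt(q) - 2
assert abs(box_cox(q, 0.5) - (2.0 * math.sqrt(q) - 2.0)) < 1e-12
# small lam approaches the log transform
assert abs(box_cox(q, 1e-8) - math.log(q)) < 1e-6
```

Because the DRPS evaluation used later is invariant to such monotone rescalings of the transformed variable, only the shape of the transformation (logarithmic versus square root) matters for the comparison.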
The calibration was carried out as an automated model calibration using the parameter estimation software PEST (Doherty, 2004), which is based on an implementation of the Levenberg–Marquardt method (Levenberg, 1944; Marquardt, 1963). From the 117 catchments, 55 catchments together with the calibrated HBV models were selected for this study based on a sufficiently long period of common data. For these catchments, the HBV model calibration period was 1981–2000, and the validation period was the combined periods 1961–1980 and 2001–2006. The Nash–Sutcliffe efficiency coefficients NE (Nash and Sutcliffe, 1970) for the validation period range from 0.50 to 0.90. Twenty catchments have a good model performance (NE ≥ 0.80), 26 catchments have an intermediate model performance (0.65 ≤ NE < 0.80), and 9 catchments have a relatively weak model performance (0.50 ≤ NE < 0.65).

2.2. Versions of post-processors for the HBV model

The simulation errors of the deterministic precipitation–runoff model are described through an autoregressive error model:

d_t = a_t d_{t−1} + r_t e_t   (2)

The simulation error d_t is defined as the difference between the transformed observed streamflow, o_t, and the transformed simulated streamflow, s_t:

d_t = o_t − s_t   (3)

Parameters a_t and r_t are the parameters of the error model, and e_t is the standardized residual error described through a random variable with the probability density function j. Solving the error model for the observed streamflow yields:

o_t = s_t + a_t (o_{t−1} − s_{t−1}) + r_t e_t   (4)

The density of the observed streamflow o_t conditional on s_t, o_{t−1} and s_{t−1} is then equal to the density of the value e_t that corresponds to o_t through Eq. (4):

u(o_t | s_t, o_{t−1}, s_{t−1}) = j(e_t) with e_t = [(o_t − s_t) − a_t (o_{t−1} − s_{t−1})] / r_t   (5)

i.e.

u(o_t | s_t, o_{t−1}, s_{t−1}) = j( [(o_t − s_t) − a_t (o_{t−1} − s_{t−1})] / r_t )   (6)

Eq. (6) constitutes a post-processor. The aspects investigated are (1) parameter formulation, (2) transformation type and (3) distribution type (Table 1). Each of the three aspects has two possible realizations, and by combining these, in total eight model versions of post-processors are generated (Table 2).

Table 1. Aspects investigated.

(1) Parameters a_t and r_t:
  State dependent (SD):
    a_t = a_i(t) + b s_t for Qsim(t) ≥ q_thresh; a_t = a*_i(t) + b* s_t for Qsim(t) < q_thresh
    ln r_t = A_i(t) + B s_t for Qsim(t) ≥ q_thresh; ln r_t = A*_i(t) + B* s_t for Qsim(t) < q_thresh
  State independent (SI): a_t, r_t constant

(2) Transformation type:
  Log transformation (Log): o_t = ln Qobs(t), s_t = ln Qsim(t)
  Square root transformation (Sqrt): o_t = √Qobs(t), s_t = √Qsim(t)

(3) Distribution j of the standardized residuals e_t:
  Standard normal distribution (Norm)
  Empirical distribution (Emp)

Table 2. Model versions investigated in the study; abbreviations are defined in Table 1 (label: parameters.transformation.distribution).

Version 1: SD.Log.Norm | Version 2: SI.Log.Norm | Version 3: SD.Sqrt.Norm | Version 4: SI.Sqrt.Norm | Version 5: SD.Log.Emp | Version 6: SI.Log.Emp | Version 7: SD.Sqrt.Emp | Version 8: SI.Sqrt.Emp

In this study, the post-processors are applied to HBV-model outputs that are produced without additional updating procedures such as Kalman filtering (Kalman, 1960) or variational data assimilation (Seo et al., 2003). Thus, updating is only done through the autoregressive error model itself. According to the classification of updating procedures by the World Meteorological Organization (1992), this updating is classified as an updating of output variables. In principle, the post-processors may also be applied to model outputs that include other updating procedures such as updating of input variables, state variables or model parameters. In this case, the post-processors would need to be calibrated on model outputs that include these updating procedures.

2.2.1. State dependent parameter formulation

As state dependent parameter formulation, a parameter description used at the Norwegian Water Resources and Energy Directorate (NVE) is applied (Langsrud et al., 1998). State dependence of the parameters is realized in three ways:

(1) Firstly, the parameters a_t and r_t of the autoregressive error model are formulated to be linearly dependent on the transformed simulated streamflow s_t:

a_t = a_i(t) + b s_t   (7)
ln r_t = A_i(t) + B s_t   (8)

(2) Secondly, the parameters a_i(t) and A_i(t) of the linear relations can assume different values, depending on the states defined through the variables observed temperature T_t, observed precipitation P_t and simulated snow water equivalent SWE_t at time t. It is distinguished whether the temperature is below or above 0 °C, whether precipitation occurs or not, and whether the snow water equivalent is above or below a certain threshold value swe_thresh; if the amount of snow is below swe_thresh, the catchment is assumed to behave as a snow free catchment. Through the combination of the three variables, five different states i(t) ∈ {1, 2, . . . , 5} are distinguished (Table 3). The classification into the five states is based on conceptual considerations about the presence or absence of different processes which may result in different error behaviour. For temperatures below zero, precipitation is accumulated in the snow storage and streamflow comes mainly from base flow. For temperatures above zero, different processes take place both in the model and in the real catchment, depending on whether precipitation is present or not and whether a snow pack is present or not.

(3) Thirdly, two different sets of parameters a_j, b, A_j, B and a*_j, b*, A*_j, B*, j = 1, . . . , 5, are used, depending on whether the simulated streamflow at time t is above or below a flow threshold q_thresh (Table 3).

Table 3. States distinguished for different conditions of observed temperature T_t, observed precipitation P_t, simulated snow water equivalent SWE_t, and simulated streamflow Qsim(t) at time t, and the corresponding parameters of the error model.

Meteorological and snow states | State i(t) | Qsim(t) ≥ q_thresh | Qsim(t) < q_thresh
T_t ≤ 0 °C | 1 | a_1, A_1 | a*_1, A*_1
T_t > 0 °C and P_t = 0 mm and SWE_t ≤ swe_thresh | 2 | a_2, A_2 | a*_2, A*_2
T_t > 0 °C and P_t = 0 mm and SWE_t > swe_thresh | 3 | a_3, A_3 | a*_3, A*_3
T_t > 0 °C and P_t > 0 mm and SWE_t ≤ swe_thresh | 4 | a_4, A_4 | a*_4, A*_4
T_t > 0 °C and P_t > 0 mm and SWE_t > swe_thresh | 5 | a_5, A_5 | a*_5, A*_5
All states (slope parameters) | 1–5 | b, B | b*, B*

Summarizing points 1–3, the state dependent parameters a_t and r_t are formulated as

a_t = a_i(t) + b s_t for Qsim(t) ≥ q_thresh; a_t = a*_i(t) + b* s_t for Qsim(t) < q_thresh   (9)

ln r_t = A_i(t) + B s_t for Qsim(t) ≥ q_thresh; ln r_t = A*_i(t) + B* s_t for Qsim(t) < q_thresh   (10)

The threshold for snow water equivalent was chosen as follows. A catchment is assumed to behave as snow free when the simulated snow cover falls below 10%. The threshold swe_thresh is then determined as the average snow water equivalent that corresponds to a snow cover of 10%. As threshold q_thresh to distinguish between high and low flows, the 75th percentile of the observed streamflow of the calibration period is used.

2.2.2. Empirical distribution function

The empirical distribution function is based on the empirical standardized residuals

ê_t = (d_t − a_t d_{t−1}) / r_t   (11)

of the error model, obtained by solving Eq. (2) for e_t. The set of standardized empirical residuals ê_m, m ∈ {1, 2, . . . , M}, is calculated from all days m = 1, . . . , M of the calibration period after the parameters of the error model have been estimated. The empirical distribution function F(e) (Fig. 1) of the standardized residual e is then defined as the step function

F(e) = (1/M) Σ_{m=1}^{M} I(ê_m ≤ e)   (12)

where I(A) is an indicator variable that is 1 if A is true and 0 if A is false. For values of e that lie outside the observed range of residuals, the non-exceedance probability is 0 or 1. No further attempt was made to smooth the distribution function or to use some other plotting position formula to determine the points of the distribution function. Given the large number of points, it is not expected that such refinements would lead to any measurable changes in the results.

Fig. 1. Schematic diagram of the empirical distribution function F(e) of the standardized empirical residuals e.

2.3. Estimation of the parameters of the error model

2.3.1. Models with state independent parameters

For models with state independent parameters, Eq. (2) constitutes a simple linear regression model. Parameter a is estimated using ordinary least squares, and parameter r is estimated as the root mean square of the residuals of the regression model.

2.3.2. Models with state dependent parameters

For models with state dependent parameters (Eqs. (9) and (10)), Eq. (2) can be rewritten as:

d_t = a_i(t) d_{t−1} + b s_t d_{t−1} + exp(A_i(t) + B s_t) e_t for Qsim(t) ≥ q_thresh
d_t = a*_i(t) d_{t−1} + b* s_t d_{t−1} + exp(A*_i(t) + B* s_t) e_t for Qsim(t) < q_thresh   (13)

The first two terms on the right hand side of Eq. (13) comprise the deterministic component of the error model, while the third term constitutes the stochastic component. The parameter estimation for this model type follows an iterative two-step procedure (Langsrud et al., 1998). In the first step, the parameters of the deterministic part of the error model, a_j, b, a*_j, b*, are estimated in a weighted linear regression. In the second step, the parameters of the stochastic component, A_j, B, A*_j, B*, are estimated in a generalized linear regression with logarithmic link function and Gamma distribution.
Steps one and two are then repeated, using the results from the second step to update the weights of the linear regression of the first step; repetitions are continued until the parameter estimates converge. A detailed description of the estimation procedure is given in Appendix A.

2.3.3. Practical adjustments for the parameter estimation

2.3.3.1. Exclusion of the smallest streamflow values for 15 catchments. For several catchments, the smallest streamflow values were removed from the data series due to problems with the logarithmic transformation.

(1) For 13 catchments, the series of observed or simulated streamflow contained instances (days) with zero values. However, the logarithm is only defined for values larger than zero. Therefore, instances with zero values were removed from the data series of the respective catchments.

(2) For streamflow values Q that tend towards zero, Q → 0, the log transformed values tend towards minus infinity, ln Q → −∞. As a result, infinitesimal differences between Qobs(t) and Qsim(t) in the original scale can become arbitrarily large in the transformed variable space. For the residuals ε_t = r_t e_t of an autoregressive error model, this may then lead to the opposite of what the transformation should achieve; instead of homogenizing the residuals, the variance of the residuals becomes arbitrarily large for Qobs(t) → 0 or Qsim(t) → 0. While for the majority of catchments the logarithmic transformation worked well (see example plot of the squared residuals ε_t² versus ln Qsim(t) for catchment Losna, Fig. 2a), for 10 catchments an arbitrary increase of the squared residuals for the smallest streamflow values was found (e.g. catchment Reinsnosvatn, Fig. 2b). This led to a non-convergence of the calibration routine for the parameters A*_i and B* in these catchments.
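The blow-up described in point (2) can be reproduced with a few hypothetical numbers: a fixed small absolute error in the original scale grows without bound in log space as the flow tends towards zero, while it remains bounded under the square-root transform (all values below are illustrative, not from the study catchments):

```python
import math

ABS_ERR = 0.01  # hypothetical fixed error in the original scale (e.g. m3/s)

def log_space_error(q):
    """Difference in log space caused by a fixed error ABS_ERR at flow q."""
    return math.log(q + ABS_ERR) - math.log(q)

def sqrt_space_error(q):
    """Difference in sqrt space caused by the same fixed error."""
    return math.sqrt(q + ABS_ERR) - math.sqrt(q)

for q in (10.0, 1.0, 0.1, 0.01, 0.001):
    print(f"q={q:7.3f}  log-space error={log_space_error(q):8.4f}  "
          f"sqrt-space error={sqrt_space_error(q):8.4f}")
# The log-space error grows without bound as q -> 0, while the
# sqrt-space error stays bounded.
```

This is the mechanism behind the heteroscedastic residuals of Fig. 2b and the resulting non-convergence of the variance-model calibration.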
Based on visual inspection of the plots of the squared residuals ε_t² versus ln Qsim(t), instances with very small values of observed or simulated streamflow were therefore removed from the respective streamflow series.

Based on points (1) and (2) above, the streamflow series of 15 catchments were truncated by their smallest flow values. The catchments might also have been completely excluded from the study. However, this would have substantially reduced the number of catchments available. As the focus of this study is not on the smallest values of the flow spectrum but rather on the general range and high flows, it is considered that such a truncation is reasonable and that it is preferable to include these data series in the study.

2.3.3.2. Minimum number of data per class. For each of the 10 classes defined by the meteorological and snow states and the states of simulated streamflow (Table 3), a sufficient number of data should be available for estimation of the parameters of the respective class. Therefore, a minimum requirement of 60 instances per class was introduced. If the number is smaller, the respective class is merged with another class for estimation of common parameters. A merging scheme was developed based on the similarity of parameter estimates of the different classes; it was derived from 35 catchments that had sufficient data in all 10 classes. In the final parameter estimation for the models with state dependent parameters, the complete number of 24 parameters was estimated in 43% of the estimations. For another 43%, the number of estimated parameters was reduced to 22. Twenty parameters were estimated in 11% of the estimations, and in only 3% of the estimations was the number of estimated parameters less than 20.

2.4. Evaluation of the error model

2.4.1. The discrete ranked probability score (DRPS)

The discrete ranked probability score (DRPS; Murphy, 1971; Toth et al., 2003) was chosen as forecast evaluation measure.
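A minimal sketch of the DRPS for a single event, in the mean-of-squared-differences form adopted below in Eq. (14) (all names hypothetical):

```python
def drps_single_event(thresholds, forecast_cdf, observation):
    """DRPS for one event: mean of the squared differences between the
    forecast non-exceedance probabilities p_k = P(X <= x_k) and the
    binary indicators o_k = 1 if observation <= x_k, over K thresholds."""
    total = 0.0
    for x_k in thresholds:
        p_k = forecast_cdf(x_k)
        o_k = 1.0 if observation <= x_k else 0.0
        total += (p_k - o_k) ** 2
    return total / len(thresholds)

# A perfect sharp forecast (CDF stepping exactly at the observation)
# scores 0; a vaguer forecast scores higher (worse).
sharp = lambda x: 1.0 if x >= 5.0 else 0.0
vague = lambda x: min(1.0, max(0.0, x / 10.0))
thresholds = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
assert drps_single_event(thresholds, sharp, 5.0) == 0.0
assert drps_single_event(thresholds, vague, 5.0) > 0.0
```

Averaging this quantity over days and then over catchments yields the scores of Eqs. (15) and (16).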
It is a summary score that allows an evaluation of the forecast with respect to a number of user specified thresholds. There are several definitions of the discrete ranked probability score. The essence of all definitions is the same, i.e. the DRPS evaluates the squared differences (p_k − o_k)² between the cumulative distribution function of the forecast, p, and the cumulative distribution function of a perfect forecast (the observation), o, at some predefined thresholds x_k, k = 1, 2, . . . , K (Fig. 3). Table 4 gives an overview of different definitions found in the literature. The different formulations lead to differences in range and orientation of the score. The first definition defines the DRPS as the sum of the squared differences (Murphy, 1971; Wilks, 1995), leading to a range of [0, K]. The second definition uses the mean of the squared differences (e.g. Toth et al., 2003), resulting in a standardized range [0, 1]. For both definitions the orientation of the score is negative, i.e. the best score has the lowest value of zero. In the third definition, the score is inverted to a positive orientation. In addition, the range is adjusted by adding a constant of one. This third version was used in the original definition of the ranked probability score by Epstein (1969). Since the definitions of the DRPS given above are linear transformations of each other, the information conveyed through the different scores is essentially the same. So far, none of the above definitions seems to have become a definitive standard. For this publication, the definition according to the second formula is chosen. On the one hand, a standardized range seems preferable over a range that depends on the number of thresholds as in definition 1. On the other hand, definition 2 is more direct and intuitive than definition 3. The latter may be an important aspect when probabilistic forecasting and forecast verification are to be communicated to a wider audience.

Fig. 2. Example plots of squared residuals ε_t² = r_t² e_t² versus the logarithm of simulated streamflow ln Qsim(t) for selected catchments for models with state dependent parameters (SD) and logarithmic transformation (Log). (a) Catchment Losna; the squared residuals are relatively homoscedastic over the whole range of flow values. (b) Catchment Reinsnosvatn; strong deviation from homoscedastic behaviour for small streamflow values.

Table 4. Definitions of the discrete ranked probability score (DRPS) for a single event.

No | Formula(a) | Range | Orientation | Source
1 | Σ_{k=1}^{K} (p_k − o_k)² | [0, K] | Negative | Murphy (1971); Wilks (1995)
2 | (1/K) Σ_{k=1}^{K} (p_k − o_k)² | [0, 1] | Negative | Bougeault (2003)(b); Nurmi (2003); Toth et al. (2003); WWRP/WGNE Joint Working Group on Forecast Verification Research (2010)
3 | 1 − (1/(K−1)) Σ_{k=1}^{K} (p_k − o_k)² | [0, 1] | Positive | Epstein (1969)(c); Stanski et al. (1989)

(a) The formulas may use symbols or formulations different from the respective sources in the last column, but are mathematically equivalent.
(b) No explicit formula is given, but the definition is described in the text.
(c) Mathematical equivalence of the original definition by Epstein (1969) with the formula given here is demonstrated by Murphy (1971).

Fig. 3. Definition of the discrete ranked probability score: squared differences (p_k − o_k)² between the non-exceedance probabilities of the forecast, p, and of the observation, o, at thresholds x_1, x_2, . . . , x_K.

Following Toth et al. (2003), the DRPS for one event is calculated as follows. K thresholds x_1 < x_2 < . . . < x_K for the continuous variable X are selected. These thresholds define the events A_k = {X ≤ x_k}, k = 1, 2, . . . , K, with the corresponding forecast probabilities p_1, p_2, . . . , p_K. Analogously, for each event A_k a binary indicator variable o_k is defined that indicates whether the event A_k occurs or not, i.e. o_k = 1 if A_k occurs, and o_k = 0 otherwise. The DRPS for one event is then calculated as

DRPS_ind = (1/K) Σ_{k=1}^{K} (p_k − o_k)²   (14)

In this study, the continuous variable X is daily streamflow, and a single event is the daily streamflow occurring on a certain day in a certain catchment. To evaluate the performance of a probabilistic forecast model in one catchment over a certain time period, the average of the DRPS_ind values for the individual days n = 1, 2, . . . , N is calculated as

DRPS = (1/N) Σ_{n=1}^{N} DRPS_ind,n   (15)

Furthermore, to evaluate and compare the overall performance of different forecast models over a number of catchments, the average of the DRPS values for the catchments c = 1, 2, . . . , C is calculated as

\overline{DRPS} = (1/C) Σ_{c=1}^{C} DRPS_c   (16)

To actually calculate a DRPS for an event, the K thresholds x_1, x_2, . . . , x_K have to be specified. Jaun and Ahrens (2009), in their evaluation study of a probabilistic hydrometeorological forecast system, chose four thresholds (the 25th, 50th, 75th and 95th percentile of the historic streamflow record). However, when such a small number of thresholds is chosen, differences between the cumulative distribution functions of the forecast and the corresponding observation might not be captured adequately. Therefore, if the purpose of the study does not implicate the selection of certain specific thresholds, one should aim to select a relatively large number of thresholds that extend over the whole range of values. For this study, 99 thresholds were chosen as the 1st, 2nd, . . . , 99th percentile of the flow record of each catchment. Note that the selection of the thresholds on the basis of flow quantiles also allows for a direct comparison of DRPS values from different catchments and justifies the calculation of averages of DRPS values over several catchments as indicated by Eq. (16). The comparability over catchments is also reflected in the decomposition of the DRPS, where the uncertainty component assumes the same value in all catchments when flow quantiles are selected as thresholds (see Section 2.4.3).

2.4.2.
Significance of differences of the DRPS: bootstrap

The overall evaluation of the eight error models is done by calculating the average discrete ranked probability score over the 55 catchments, \overline{DRPS}, according to Eq. (16) for each model version. To assess whether differences between the scores of the different model versions are significant, confidence intervals are estimated using a non-parametric bootstrap (Efron and Tibshirani, 1993). Because the DRPS values of different model versions are correlated across the individual catchments (see Section 3.1, Figs. 5–7), constructing confidence intervals for the \overline{DRPS} values directly would not be helpful: the correlations would not allow significant differences to be distinguished on the basis of overlapping or non-overlapping confidence intervals. Instead, confidence intervals for the differences between \overline{DRPS} values of different model versions are calculated as described in the following.

Let x_c, c = 1, 2, ..., 55, be the DRPS values (Eq. (15)) for one version of the error model, say version 1, for the individual catchments 1, 2, ..., 55, and let y_c, c = 1, 2, ..., 55, be the corresponding DRPS values for another version of the error model, say version 2. The mean discrete ranked probability scores over the 55 catchments (Eq. (16)) are then \overline{DRPS}(1) = (1/55) \sum_{c=1}^{55} x_c and \overline{DRPS}(2) = (1/55) \sum_{c=1}^{55} y_c for model versions 1 and 2, respectively. Let now z_c = x_c - y_c, c = 1, 2, ..., 55, be the differences between the DRPS of the two model versions in the individual catchments. Then the difference of the mean values \overline{DRPS}(1) and \overline{DRPS}(2) is equal to the mean of the differences, \bar{z}:

\overline{DRPS}(1) - \overline{DRPS}(2) = (1/55) \sum_{c=1}^{55} z_c = \bar{z}   (17)

Thus, estimating a confidence interval for the difference of the mean values is equivalent to estimating a confidence interval for the mean of the differences. We now regard the differences z_c, c = 1, 2, . . .
, 55, as a sample from an unknown population Z, which reflects the distribution of differences that might occur in general. A confidence interval for the mean value of samples of size 55 from this unknown population is estimated with the non-parametric bootstrap, using the sample of differences z_c, c = 1, 2, ..., 55, as a surrogate for the unknown population Z. The steps are as follows.

Fig. 4. Overview of the 55 catchments and streamflow stations in Norway. Eight catchments are nested in a larger catchment.

(1) Generate a bootstrap sample by sampling 55 times with replacement from the original sample of z_c values. The bootstrap sample consists of the 55 values z*_i, i = 1, 2, ..., 55. (Note: the indices i of the bootstrap sample have no relation to the catchment numbers c; each value z*_i can be equal to any of the values z_c from the original sample.)

(2) Calculate the mean value of the bootstrap sample, \bar{z}*, as

\bar{z}* = (1/55) \sum_{i=1}^{55} z*_i   (18)

(3) Repeat steps 1 and 2 n times, n being the number of repetitions. For each new bootstrap sample, a new mean value of the differences, \bar{z}*_j, j = 1, 2, ..., n, is calculated according to Eq. (18).

From the sample of bootstrap replicates of the mean values \bar{z}*_j, j = 1, 2, ..., n, a confidence interval for the mean of the differences can now be derived. The most straightforward method, simple percentile mapping (Efron and Tibshirani, 1993), is used: the 95% confidence interval is estimated through the 2.5th and 97.5th percentiles of the distribution of \bar{z}*_j values as the lower and upper limits of the confidence interval, respectively. If the value of zero lies outside the confidence interval, the difference between the two model versions is regarded as significantly different from zero. A number of n = 100,000 bootstrap replications was used for this study.

2.4.3.
Decomposition of the DRPS

Analogously to the Brier score (Brier, 1950), the DRPS can be decomposed into three components. The Brier score decomposition (Murphy, 1973) into reliability (REL), resolution (RES) and uncertainty (UNC) is based on the calculation of mean observed frequencies conditioned (stratified) on different forecast probabilities. To avoid problems with sparseness and to make the estimated conditional mean values less uncertain, it is advisable to divide the interval of forecast probabilities [0, 1] into a finite set of non-overlapping bins l = 1, ..., L. When such stratification over bins of probabilities is used, the decomposition of the Brier score yields two extra components, the within-bin variance and the within-bin covariance (Stephenson et al., 2008). Including these extra components in the resolution term, which is then labelled generalized resolution (GRES), the decomposition of the Brier score can be written as (Stephenson et al., 2008)

BS_k = REL_k - GRES_k + UNC_k   (19)

with

REL_k = (1/N) \sum_{l=1}^{L} N_l (\bar{p}_l - \bar{o}_l)^2   (20)

GRES_k = (1/N) \sum_{l=1}^{L} N_l (\bar{o}_l - \bar{o})^2 - (1/N) \sum_{l=1}^{L} \sum_{j=1}^{N_l} (p_lj - \bar{p}_l)^2 + (2/N) \sum_{l=1}^{L} \sum_{j=1}^{N_l} (o_lj - \bar{o}_l)(p_lj - \bar{p}_l)   (21)

UNC_k = \bar{o}(1 - \bar{o})   (22)

BS_k is the Brier score for the forecast of the event A_k as defined in Section 2.4.1 for one catchment, averaged over a total number of N forecasts (days); N_l is the number of forecast probabilities that fall into the lth bin; p_lj, j = 1, ..., N_l, are the forecast probabilities falling into the lth bin, and o_lj denote the binary observations (0 or 1) corresponding to the p_lj; the average of the forecast probabilities of the lth bin is calculated as \bar{p}_l = (1/N_l) \sum_{j=1}^{N_l} p_lj, and \bar{o}_l = (1/N_l) \sum_{j=1}^{N_l} o_lj is the corresponding average of the binary observations; \bar{o} = (1/N) \sum_{l=1}^{L} \sum_{j=1}^{N_l} o_lj is the overall mean of the binary observations, i.e. the climatological base rate.

The DRPS from Eq. (15) can be formulated as the mean of the Brier scores over all thresholds k = 1, ..., K, i.e. DRPS = (1/K) \sum_{k=1}^{K} BS_k (Toth et al., 2003). Thus, a decomposition of the DRPS can be given as

DRPS = DREL - DGRES + DUNC   (23)

with DRPS reliability DREL = (1/K) \sum_{k=1}^{K} REL_k, DRPS generalized resolution DGRES = (1/K) \sum_{k=1}^{K} GRES_k, and DRPS uncertainty DUNC = (1/K) \sum_{k=1}^{K} UNC_k.

To calculate the DRPS decomposition components in this study, L = 10 equally spaced bins were chosen. Based on the selection of the thresholds x_k as quantiles of the flow distribution, the uncertainty component DUNC assumes the same value in all catchments. With x_k as the 1st, ..., 99th percentiles, the theoretical value of the uncertainty component in this study is given as

DUNC = (1/99) \sum_{\bar{o} \in \{0.01, ..., 0.99\}} \bar{o}(1 - \bar{o}) = 16.665/99 ≈ 0.168

2.5. Catchments and data

Fifty-five catchments distributed over the whole of Norway were selected (Fig. 4). The selection was based on a common period of data from 1.9.1961 to 31.12.2005. The catchments are spread relatively evenly over Norway, with some clustering in the southeast. Catchment sizes vary from 6 to 15,450 km². The majority of the catchments (45) are smaller than 1000 km², and of these, 21 are between 100 and 300 km². The mean catchment elevations range from 181 to 1460 m above mean sea level and are relatively evenly distributed over this range, with some predominance in the interval 400–900 m. The mean annual runoff ranges from 370 to 3236 mm/year and is likewise relatively evenly distributed over its range.

The data used in the study are daily values of mean air temperature, accumulated precipitation and mean streamflow for the 55 catchments. Station measurements of mean daily temperature and accumulated daily precipitation have been interpolated to a 1 × 1 km grid by the Norwegian Meteorological Institute (Mohr and Tveito, 2008). Catchment values of mean daily temperature (T_t) and accumulated daily precipitation (P_t) are then extracted from the grid as mean values of the grid cells that lie within the boundaries of the respective catchments. The streamflow data are series of mean daily streamflow (Q_obs(t)) from the Norwegian Water Resources and Energy Directorate.

2.6. Error model calibration and validation procedure

The HBV model was run for the complete period in the 55 catchments, generating time series of simulated streamflow Q_sim(t) and simulated snow water equivalent SWE_t. The first four months of the model run were discarded as spin-up period, and the remaining period (1.1.1962–31.12.2005) was kept to investigate the eight versions of the error model. For each version of the error model, three different parameter sets were estimated from three different periods in each of the 55 catchments:

Period 1: 1.1.1962–31.12.1983.
Period 2: 1.1.1984–31.12.2005.
Period 1 + 2: 1.1.1962–31.12.2005.

The eight versions were then evaluated in three dependent and two independent validations in each catchment (Table 5). In the dependent validations, the error model was applied to the same data that were used for the estimation of its parameters. In the independent validations, the error model was applied to independent data that had not been used in the parameter estimation.

Table 5
Overview of the five validations carried out for each model version in each catchment.

Label   | Validation type | Period for validation | Period for parameter estimation
Cal.1+2 | Dependent       | 1+2 (1962–2005)       | 1+2 (1962–2005)
Cal.1   | Dependent       | 1 (1962–1983)         | 1 (1962–1983)
Cal.2   | Dependent       | 2 (1984–2005)         | 2 (1984–2005)
Val.1   | Independent     | 1 (1962–1983)         | 2 (1984–2005)
Val.2   | Independent     | 2 (1984–2005)         | 1 (1962–1983)

For each day of a validation period, a probabilistic forecast was generated using Eq. (6). Based on the forecast distributions u(o_t|s_t, o_{t-1}, s_{t-1}) and the actual observations o_t, the discrete ranked probability score DRPS was calculated according to Eq. (15). This was done for all 55 catchments, all 8 model versions (Table 3) and all 5 validations (Table 5), i.e. altogether 2200 DRPS values were calculated.

For this study, a time step of one day was used as the interval between t - 1 and t, corresponding to the basic time step of the precipitation–runoff model. In general, the error models may also be applied as post-processors for longer lead times. This can then be done in two ways: either the error model is applied recursively with one parameter set estimated for the basic time step, or the error model is applied with different parameter sets that are estimated for different intervals between t - 1 and t.

3. Results

3.1. Model performances in the individual catchments

Fig. 5. DRPS values of models with state independent parameters (SI) versus DRPS values of models with state dependent parameters (SD) for the independent validation 1984–2005 (Val.2). Each point shows the relation of the scores of two different models in one catchment. (a) DRPS(SI.Log.Norm) versus DRPS(SD.Log.Norm). (b) DRPS(SI.Sqrt.Norm) versus DRPS(SD.Sqrt.Norm). (c) DRPS(SI.Log.Emp) versus DRPS(SD.Log.Emp). (d) DRPS(SI.Sqrt.Emp) versus DRPS(SD.Sqrt.Emp).

Fig. 6. DRPS values of models with square root transformation (Sqrt) versus DRPS values of models with log transformation (Log) for the independent validation 1984–2005 (Val.2). Each point shows the relation of the scores of two different models in one catchment. (a) DRPS(SD.Sqrt.Norm) versus DRPS(SD.Log.Norm). (b) DRPS(SI.Sqrt.Norm) versus DRPS(SI.Log.Norm). (c) DRPS(SD.Sqrt.Emp) versus DRPS(SD.Log.Emp). (d) DRPS(SI.Sqrt.Emp) versus DRPS(SI.Log.Emp).

Fig. 7. DRPS values of models with normal distribution (Norm) versus DRPS values of models with empirical distribution (Emp) for the independent validation 1984–2005 (Val.2). Each point shows the relation of the scores of two different models in one catchment. (a) DRPS(SD.Log.Norm) versus DRPS(SD.Log.Emp). (b) DRPS(SI.Log.Norm) versus DRPS(SI.Log.Emp). (c) DRPS(SD.Sqrt.Norm) versus DRPS(SD.Sqrt.Emp). (d) DRPS(SI.Sqrt.Norm) versus DRPS(SI.Sqrt.Emp).

Figs. 5–7 show plots of the DRPS for the 55 catchments for one model version versus the DRPS for another model version for selected model combinations. All plots are for the independent validation for the period 1984–2005 (Val.2). The plots for the other four validations in Table 5 are similar to the corresponding plots of the independent validation 1984–2005 and are therefore not shown. Fig. 5 shows the four plots of state independent (SI) models versus the corresponding state dependent (SD) models, Fig. 6 shows the four plots of square root transformed (Sqrt) models versus the corresponding log transformed (Log) models, and Fig.
7 shows the four plots of models with normal distribution (Norm) versus the corresponding models with empirical distribution (Emp). The following characteristics can be seen:

(a) A strong correlation between DRPS values of different model versions is visible in all plots. The correlation varies to some degree from plot to plot, but overall all plots show a distinct correlation.

(b) The differences between DRPS values of different model versions in the same catchment (vertical deviations of the points from the 1:1 line) are in all plots considerably smaller than the range of DRPS values for one model version over the 55 catchments (maximum extent of the points in the x or y direction).

These two characteristics (a) and (b) show that a main influence on the performance of a probabilistic forecast based on post-processing a deterministic forecast with some kind of autoregressive error model lies in the performance of the deterministic forecast model, in combination with the strength of the autoregressive behaviour of the errors of that deterministic model. In a catchment where the deterministic forecast model performs well or the autocorrelation of its errors is high, the performance of the probabilistic forecast in terms of the DRPS will in general be good, and the specific implementation of the post-processor is of minor importance compared with a catchment where the deterministic forecast model performs weakly and the autocorrelation of the model errors is weak. This illustrates that the error model cannot 'cure' a poor deterministic forecast model. Though it may produce a probabilistic forecast that is well calibrated, the resolution of the forecast, which has a main influence on the values of the DRPS (see Section 3.4.1), will always be poor.

(c) As a third characteristic, the point clouds show a systematic shift from the 1:1 line in many of the plots.
The shift is very distinct in some of the plots, for example Fig. 5b, while in other plots, for example Fig. 7b, it is less obvious. Still, on closer inspection, most of the plots seem to exhibit a systematic shift. Such a systematic shift reflects an on average better performance of one version of the error model over another version with respect to the evaluation measure DRPS.

3.2. Model performances averaged over all catchments

To further investigate and compare the average performance of the eight error models, the average discrete ranked probability score over all catchments, \overline{DRPS}, was calculated according to Eq. (16). Fig. 8 shows the values of \overline{DRPS} for all five validations. For the three dependent validations (white backgrounds), it is apparent that the scores in period 2, Cal.2, are consistently better than those of the corresponding models in period 1, Cal.1. The scores for the dependent validation for the complete period, Cal.1+2, lie between the scores of periods 1 and 2. Thus, the period of data has a clear impact on the performance of the error models in terms of the absolute values of \overline{DRPS}. The differences between the \overline{DRPS} values of period 1 and period 2 for the same model version can be much larger than the differences between different model versions in the same period.

The scores for the two independent validations (grey backgrounds) show a similar behaviour to the dependent validations of the corresponding periods, in that the scores of period 2 are better for all model versions than those of period 1 for the corresponding error models. However, the differences are less pronounced than for the dependent validations. The better performance of period 2 over period 1, found for both the dependent and the independent validations, may be explained by the correlation found between the performance of the post-processors in terms of DRPS and the Nash–Sutcliffe efficiency coefficients of the underlying HBV models (see Section 3.4.2). For most of the catchments (45 out of 55), the HBV model performance is better in period 2 than in period 1, and this is correlated with better performance of the post-processors in terms of DRPS, which is then reflected in the average \overline{DRPS} as well.

When comparing scores between independent and dependent validations for the same period, there is no consistent change of the scores from the dependent to the independent validation. Though for period 2 the scores of the independent validation, Val.2, are somewhat worse than the corresponding scores of the dependent validation, Cal.2, the performance of the independent validation for period 1, Val.1, is more or less the same as for the dependent validation, Cal.1. This indicates that none of the eight versions of the error model is clearly over-parameterized in the sense that the model performance would strongly deteriorate in periods with independent data.

Fig. 8. Values of \overline{DRPS} for the eight versions of the error model for the different periods with dependent (white background) and independent (grey background) validation.

3.3. Significance of differences in model performances

We will now look more specifically at the differences in model performance between the different model versions, namely the differences between (1) models with state dependent parameters (SD) and corresponding models with state independent parameters (SI), (2) models with square root transformation (Sqrt) and corresponding models with log transformation (Log), and (3) models with normal distribution (Norm) and corresponding models with empirical distribution (Emp). Figs. 9–11 show the differences between the average discrete ranked probability scores \overline{DRPS} (Eq. (16)) of corresponding model versions according to aspects 3, 1 and 2, respectively. Error bars indicate the 95% bootstrap confidence intervals.

The results for the differences between models with normal distribution (Norm) and models with empirical distribution (Emp) (Fig. 9) are very clear. In all five validations, the differences between model versions with normal distribution (Norm) and the corresponding model versions with empirical distribution (Emp) are significantly different from zero, and all are above zero. This means that on average, models that use an empirical distribution perform significantly better than models that use the normal distribution. The degree of improvement in absolute values is largest for the SI.Sqrt.Norm model (magenta/grey1 diamonds in Fig. 9), which shows the worst model performance with respect to \overline{DRPS} in all five validations compared with the other model versions (magenta/grey diamonds in Fig. 8). The second largest improvement is given for the SI.Log.Norm model (blue/black diamonds in Fig. 9), which scores second worst of all models with normal distribution (red/grey circles in Fig. 8).

The results for the comparison of state dependent (SD) versus state independent (SI) models are also clear (Fig. 10). All differences are larger than zero, and all differences except one (Log.Emp models in Val.2) are significantly different from zero.
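The significance tests behind these comparisons combine the score definitions of Section 2.4.1 (Eqs. (14)–(16)) with the percentile bootstrap of Section 2.4.2. The following is a minimal sketch, not the authors' implementation: all numbers, array names and toy forecasts are invented for illustration, and only the generic formulas from the text are reproduced.

```python
import numpy as np

def drps_single(p, o):
    """Eq. (14): DRPS_ind = (1/K) * sum_k (p_k - o_k)^2, with p_k the K
    cumulative forecast probabilities P(X <= x_k) and o_k the binary
    indicators of whether the observation lies at or below x_k."""
    p, o = np.asarray(p, float), np.asarray(o, float)
    return np.mean((p - o) ** 2)

def bootstrap_ci_mean(z, n_rep=100_000, alpha=0.05, seed=0):
    """Percentile bootstrap (Section 2.4.2, steps 1-3) for the mean of the
    per-catchment score differences z_c: resample with replacement, average
    each bootstrap sample (Eq. (18)), and take the 2.5th and 97.5th
    percentiles of the replicate means as the confidence limits."""
    z = np.asarray(z, float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(z), size=(n_rep, len(z)))
    means = z[idx].mean(axis=1)
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

# Single event with K = 5 thresholds: the observation exceeds x_1 and x_2,
# so the indicators o_k jump from 0 to 1 at k = 3.
o = np.array([0.0, 0.0, 1.0, 1.0, 1.0])
p_sharp = np.array([0.05, 0.10, 0.90, 0.95, 1.00])  # confident, well placed
p_vague = np.array([0.30, 0.40, 0.50, 0.60, 0.70])  # diffuse forecast
print(drps_single(p_sharp, o))   # ~0.005: the sharper forecast scores lower
print(drps_single(p_vague, o))   # ~0.15

# Synthetic per-catchment differences z_c = x_c - y_c for 55 "catchments";
# zero outside the interval marks the difference as significant.
z = np.random.default_rng(1).normal(loc=0.003, scale=0.002, size=55)
lo, hi = bootstrap_ci_mean(z, n_rep=10_000)
print(lo > 0.0 or hi < 0.0)
```

Resampling whole catchments (rather than days) preserves the within-catchment pairing of the two model versions, which is exactly why the paper works with differences z_c instead of comparing two separate confidence intervals.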
That means that model versions with state dependent parameters (SD) perform on average significantly better than the corresponding model versions with state independent parameters (SI). The largest improvement of \overline{DRPS} (blue/black diamonds in Fig. 10) is again given for the version with the worst model performance in terms of \overline{DRPS}, SI.Sqrt.Norm (magenta/grey diamonds in Fig. 8). The second largest improvement is given for SI.Sqrt.Emp (magenta/grey diamonds in Fig. 10), which has the second poorest model performance in the class of SI models (magenta/grey pluses in Fig. 8).

The results for models with log transformation (Log) versus models with square root transformation (Sqrt) are more complex. For the SI models (diamonds in Fig. 11), all differences are positive and all except one (SI.Emp in Val.1) are significantly different from zero. That means that for state independent models, the model versions that use the log transformation perform on average better than models with the square root transformation. Again, the improvement is largest for the SI.Sqrt.Norm model (blue/black diamonds in Fig. 11), the model with the worst performance in terms of \overline{DRPS} (magenta/grey diamonds in Fig. 8). For the SD models, however (circles in Fig. 11), none of the differences is significantly different from zero. That means that no significant difference in model performance can be detected for log versus square root transformation in the case of state dependent models.

1 Colour for web version / black-and-white for print version.

Fig. 9.
Differences of values of \overline{DRPS} between models with normal distribution (Norm) and models with empirical distribution (Emp) with 95% bootstrap confidence intervals; the x-axis distinguishes the different periods with dependent (white background) and independent (grey background) validation.

Fig. 10. Differences of values of \overline{DRPS} between models with state independent parameters (SI) and models with state dependent parameters (SD) with 95% bootstrap confidence intervals; the x-axis distinguishes the different periods with dependent (white background) and independent (grey background) validation.

Fig. 11. Differences of values of \overline{DRPS} between models with square root transformation (Sqrt) and models with log transformation (Log) with 95% bootstrap confidence intervals; the x-axis distinguishes the different periods with dependent (white background) and independent (grey background) validation.

3.4. Decomposition of the DRPS and correlations with catchment parameters

3.4.1. Decomposition of the DRPS

The decomposition of the DRPS in the individual catchments according to Eq. (23) is shown in Fig. 12 for the SD.Log.Norm model for the independent validation in period 1 (Val.1). The features highlighted below are representative of the behaviour of the equivalent plots for the other model versions and other validations. The theoretical value of the uncertainty DUNC is indicated as a horizontal line. The uncertainty values calculated for the individual catchments show some deviation from the theoretical value. This is an artefact of the sampling uncertainty of the flow distributions in the different periods. To have a consistent basis for all DRPS calculations, the same thresholds x_k were used in all DRPS calculations, based on the percentiles of the flow distribution of the complete period of data (period 1 + 2). As the flow distributions in the other periods are slightly different, the uncertainty values for validations in period 1 and period 2 show slight deviations from the theoretical uncertainty value. When comparing the influence of the reliability DREL and the generalized resolution DGRES on the final DRPS values, it is apparent that the main influence is given by the resolution component, while the reliability values DREL are all close to zero.

Fig. 12. Decomposition of the DRPS values for the 55 catchments for the SD.Log.Norm model in the independent validation for period 1 (Val.1).

3.4.2. Correlation between DRPS and catchment characteristics

Correlations of the DRPS values in the individual catchments with the catchment characteristics area and runoff coefficient, as well as with the Nash–Sutcliffe efficiency coefficients of the HBV models, were investigated. Correlations with the runoff coefficients were found to be very weak; the Pearson product-moment correlation coefficients for the different model versions and validations lie between 0.07 and 0.26. Moderate negative correlations were found between DRPS and the logarithm of the catchment area; the correlation coefficients lie between -0.41 and -0.51. The strongest (negative) correlations were found for DRPS with the Nash–Sutcliffe efficiency coefficients; the correlation coefficients lie between -0.64 and -0.80. This reflects the generally increasing performance of the probabilistic forecasts in the individual catchments with increasing performance of the underlying deterministic precipitation–runoff model.

4. Discussion

4.1. Normal distribution versus empirical distribution of the standardized residuals

The models using an empirical distribution function to describe the standardized residuals were found to perform significantly better than the corresponding models that use a standard normal distribution. This finding is not too surprising when looking at quantile–quantile plots of the standardized empirical residuals (Fig. 13). These plots show a clear deviation of the distribution of the standardized empirical residuals from the standard normal distribution. Thus, when using a standard normal distribution one cannot expect to obtain optimal results, and an empirical distribution function proved to be superior.

Still, some care has to be taken when assessing the approach of using an empirical distribution function as carried out in this study. When applying the same empirical distribution on each day, one implicitly assumes that the distribution of the standardized residuals is the same for each day. However, there is no theoretical basis that would warrant this assumption, and it seems plausible that, in the same way that the regression parameters of the autoregressive model may differ for different conditions, the standardized residuals might be described by different distributions for different conditions as well. In this case, the use of the same empirical distribution for each day, though superior to the use of the standard normal distribution, is still sub-optimal. Identifying different distributions for different conditions might be more difficult, as larger samples are necessary to clearly identify a distribution than to estimate regression parameters, and a further investigation of this aspect was beyond the scope of this study. Given the importance of using a correct distribution for generating a valuable (calibrated) probabilistic forecast, this is an aspect where further improvements of autoregressive hydrologic uncertainty processors could be made, if different distributions were present and could be identified.

Fig. 13. Quantile–quantile plots for the empirical standardized residuals assuming a standard normal distribution as theoretical distribution for the catchment Bulken (period 1.1.1962–31.12.2005 with parameters estimated from the same period). (a) Complete data. (b) The same plot as (a), but only displaying the central 95% of the data, leaving out the 2.5% smallest and 2.5% largest values.

4.2. State dependent parameters versus state independent parameters

Considering the simplicity of the autoregressive model of Eq. (2), it seems reasonable to assume that the parameters of the autoregressive model might not be constant but vary depending on the states of the model or the environment. Indeed, the results showed clearly that the simple autoregressive models with state independent parameters have a significantly poorer performance than the corresponding models with state dependent parameters. The state dependent parameterization chosen for this study is relatively detailed. It is possible that a simpler classification scheme might lead to an equivalent performance, or that alternative types of classification may lead to similar or even better performances. Further investigation of these aspects was beyond the scope of this study. However, it was clearly shown that for an autoregressive error model used as hydrologic uncertainty processor, a significant improvement of the performance is achieved through a state dependent parameterization compared with a simple model with state independent parameters.

4.3. Log transformation versus square root transformation

For state independent models, it was found that the models with logarithmic transformation perform significantly better than the models with square root transformation. An explanation for this behaviour can be found in plots of the standardized residuals versus the simulated streamflow Q_sim(t). For most catchments, plots of models with logarithmic transformation show a fairly homoscedastic behaviour, while plots of models with square root transformation reveal a systematic increase of the variance of the residuals with increasing streamflow values. Fig. 14 gives an example of the plots for the catchment Fustvatn. The x-axis in these plots is scaled according to the rank of the simulated streamflow. This is done to ensure a constant density of the points in the x-dimension. Otherwise, if the density in the x-direction is very inhomogeneous, as is usually the case for streamflow values on a linear or logarithmic scale, the visual impression might be distorted and a set of homoscedastic values may appear strongly non-homoscedastic.

Fig. 14. Plots of the standardized residuals versus the rank of the simulated streamflow Q_sim(t) for models with state independent parameters for the catchment Fustvatn. (a) Model with logarithmic transformation of the original streamflow values. (b) Model with square root transformation of the original streamflow values.

In Fig. 14a, the standardized residuals for the state independent model with logarithmic transformation show a fairly homoscedastic behaviour, while in Fig. 14b the residuals for the state independent model with square root transformation show an increasing variance with increasing streamflow values. Thus, the assumption of a state independent variance σ is less justified for the SI models with square root transformation, and this is reflected in their inferior DRPS compared with the SI models with logarithmic transformation. However, for models with state dependent parameters, there is no significant difference in the performance between models with logarithmic transformation and models with square root transformation. The formulation of σ_t as dependent on the simulated streamflow (Eq.
(10)), together with the other flexibilities introduced with the state dependent formulation, can account for the more non-homoscedastic behaviour and other deficiencies that the SI models with square root transformation may have compared to the models with logarithmic transformation. The similar performance of the Log and Sqrt models shows that the choice of transformation loses its importance for the models with state dependent parameters. It is likely that for a range of transformations of the Box–Cox type (Box and Cox, 1964) that lie between the special cases of the logarithmic and square root transformation, the differences introduced by the different transformations will be levelled out through the flexibility of the state dependent formulation.

5. Summary and conclusions

Eight different versions of autoregressive error models were investigated as hydrologic uncertainty processors for probabilistic streamflow forecasting. Evaluation with the discrete ranked probability score as forecast evaluation measure gave the following main findings.

(1) The variance of DRPS values for the same model version over different catchments is larger than the differences between different model versions in the same catchment. This reflects the strong dependence of the quality of the probabilistic forecast on the quality of the underlying (updated) deterministic forecast.

(2) Given a certain catchment with its deterministic precipitation–runoff model, significant differences in performance between different versions of the autoregressive hydrologic uncertainty processors could be detected:

(a) Models with state dependent parameters perform significantly better than corresponding models with state independent parameters.

(b) Models using an empirical distribution function to describe the standardized residuals perform significantly better than corresponding models using a standard normal distribution.
(c) For models with state independent parameters, those with a logarithmic transformation of the original streamflow values perform significantly better than those with a square root transformation. However, for models with state dependent parameters, this difference disappears and there is no significant difference in performance between the logarithmic and the square root transformation. The explanation lies in the flexibility introduced with the state dependent formulation, which can account for and alleviate the more non-homoscedastic behaviour found for the square root transformation.

The results give guidance for the use of an autoregressive error model as hydrologic uncertainty processor. If a simple model with constant parameters and the assumption of a standard normal distribution for the standardized residuals is chosen, the choice of transformation used to attain homoscedastic residuals is important; in this study, the logarithmic transformation was clearly superior to the square root transformation. If a more complex model is chosen, both the use of an empirical distribution function and a formulation with state dependent parameters lead to an improved performance of the uncertainty processor. The best performance is attained by models that use both an empirical distribution and state dependent parameters. For this type of model, and with the formulation of state dependence used in this study, the transformation type no longer has a significant influence on the model performance.

Aspects that might lead to further improvements but were not investigated in this study are:

(1) The use of state dependent empirical distributions for the standardized residuals, analogous to the use of state dependent parameters.

(2) The investigation of alternative state dependent parameterization schemes. The scheme presented is relatively detailed. There might be less complex schemes that exhibit a performance equivalent to the one used in this study, or alternative schemes with better performances.

In addition to these findings, the study discussed the use of the discrete ranked probability score as an evaluation measure that allows evaluation over the whole range of streamflow values (apart from a certain discretization) and a direct averaging of the scores over several catchments for an overall evaluation and comparison of different methods.

If the log-transformed expected values of the squared residuals ε_t² are described through a linear predictor and the squared residuals themselves follow a Gamma distribution, then Eq. (A5) constitutes a generalized linear model (Faraway, 2006) with response variable ε_t², predictor variable s_t, the natural logarithm as link function and a Gamma distribution of the response variable. The parameters of the generalized linear model are estimated using iteratively reweighted least squares (IRWLS), which is equivalent to maximum likelihood estimation (Faraway, 2006; McCullagh and Nelder, 1989).

Iteration: Once the parameters A_j and B are estimated, the variance σ_t² is calculated for each time step from Eq. (10). Then steps one and two are iterated, with the only modification that the parameters in step one are now estimated with a weighted linear regression using the reciprocals of the variances, 1/σ_t², as weights. The rationale is that cases with a larger variance in the linear regression of step one should receive less weight than cases with a smaller variance. The iterations are stopped once convergence of the parameters is obtained. In practice the algorithm converged very quickly; ten iterations were enough to assure convergence of the parameters in this study.
Acknowledgements

We thank the Norwegian Water Resources and Energy Directorate (NVE) for the provision of the data, the “Nordic” HBV model program and computing facilities. We further thank Thomas Skaugen and Elin Langsholt for their information on the “Nordic” HBV model code and the uncertainty procedures operational at NVE. We are also very grateful to Lukas Gudmundsson and Klaus Vormoor for their discussions and comments, and we want to thank two anonymous reviewers for their valuable comments, which helped to improve the paper.

Appendix A. Parameter estimation for models with state dependent parameters

The iterative two-step procedure for parameter estimation for model versions with state dependent parameters works as follows. Input data to the calibration procedure are time series of the variables d_t, d_{t-1}, s_t, T_t, P_t, SWE_t and Qsim(t), as defined in Section 2.2. First, the data set is divided into two subsets 1 and 2 for cases with Qsim(t) ≥ q_thresh and Qsim(t) < q_thresh, respectively. Parameters a_j, b, A_j, B are derived from subset 1, and parameters a*_j, b*, A*_j, B* are derived from subset 2. In the following paragraphs, the derivation of the parameters for subset 1 is described. The derivation of the parameters for subset 2 is analogous.

Step 1: The deterministic part of the error model is a linear model with response variable d_t and two predictor variables d_{t-1} and z_t = s_t d_{t-1}:

d_t = a_i(t) d_{t-1} + b z_t    (A1)

The parameters are estimated with ordinary least squares, which in the case of normally distributed residuals is equivalent to maximum likelihood estimation.

Step 2: With the parameters estimated in the first step, the residuals of the deterministic component are calculated as

ε_t = d_t − a_i(t) d_{t-1} − b z_t    (A2)

Under the assumption of normally distributed residuals ε_t, the squared residuals ε_t² follow a Chi-square distribution with one degree of freedom, which is a special case of the Gamma distribution. Also, the expected value of the squared residuals is equal to the variance:

E(ε_t²) = σ_t²    (A3)

With the relation for the standard deviation given in Eq. (10) it follows that

ln σ_t² = 2 ln σ_t = 2A_i(t) + 2B s_t    (A4)

Combining Eqs. (A3) and (A4) leads to

ln(E(ε_t²)) = 2A_i(t) + 2B s_t    (A5)

For the sake of clarity, regression Eqs. (A1) and (A5) have been written without individual appearance of the parameters a_1, …, a_5 and A_1, …, A_5. However, individual appearance of the parameters, as is necessary for a formal derivation of the regression parameters, can be achieved by introducing five indicator variables

I_{j,t} = 1 for i(t) = j, 0 for i(t) ≠ j,  j = 1, …, 5    (A6)

where i(t) is the number of the meteorological and snow state as defined in Table 1. Defining further the five variables d_{j,t-1} = I_{j,t} d_{t-1}, j = 1, …, 5, regression Eqs. (A1) and (A5) can be rewritten as

d_t = a_1 d_{1,t-1} + a_2 d_{2,t-1} + a_3 d_{3,t-1} + a_4 d_{4,t-1} + a_5 d_{5,t-1} + b z_t    (A7)

and

ln(E(ε_t²)) = 2A_1 I_{1,t} + 2A_2 I_{2,t} + 2A_3 I_{3,t} + 2A_4 I_{4,t} + 2A_5 I_{5,t} + 2B s_t    (A8)

respectively.

Under the assumption that the standardized residuals ε_t in Eq. (13) are standard normally distributed, the ordinary least squares estimates of step one and the iteratively reweighted least squares estimates of step two are equivalent to maximum likelihood estimates. If the distributional assumptions are violated, the estimates are no longer strict maximum likelihood estimates. However, it is assumed that moderate violations of the distributional assumptions do not have a strong negative effect on the parameter estimates.

References

Bergström, S., 1976. Development and Application of a Conceptual Runoff Model for Scandinavian Catchments. SMHI RHO 7. Swedish Meteorological and Hydrological Institute, Norrköping, Sweden.

Bergström, S., 1992. The HBV Model – its Structure and Applications. SMHI Reports Hydrology No.
4, Swedish Meteorological and Hydrological Institute, Norrköping, Sweden.

Bougeault, P., 2003. The WGNE Survey of Verification Methods for Numerical Prediction of Weather Elements and Severe Weather Events. Météo-France, Toulouse, France. <http://www.bom.gov.au/bmrc/wefor/staff/eee/verif/Bougeault/Bougeault_Verification-methods.htm> (accessed 26.11.10).

Box, G.E.P., Cox, D.R., 1964. An analysis of transformations. J. Roy. Stat. Soc. Ser. B 26 (2), 211–252.

Brier, G.W., 1950. Verification of forecasts expressed in terms of probability. Mon. Weather Rev. 78 (1), 1–3. doi:10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.

Cloke, H.L., Pappenberger, F., 2009. Ensemble flood forecasting: a review. J. Hydrol. 375 (3–4), 613–626. doi:10.1016/j.jhydrol.2009.06.005.

Doherty, J., 2004. PEST: Model Independent Parameter Estimation. User Manual, fifth ed. Watermark Numerical Computing, Brisbane, Australia.

Efron, B., Tibshirani, R., 1993. An Introduction to the Bootstrap. Chapman & Hall, New York, NY, US.

Engeland, K., Gottschalk, L., 2002. Bayesian estimation of parameters in a regional hydrological model. Hydrol. Earth Syst. Sci. 6 (5), 883–898. doi:10.5194/hess-6-883-2002.

Epstein, E.S., 1969. A scoring system for probability forecasts of ranked categories. J. Appl. Meteorol. 8 (6), 985–987. doi:10.1175/1520-0450(1969)008<0985:ASSFPF>2.0.CO;2.

Faraway, J.J., 2006. Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. Chapman & Hall/CRC, Boca Raton, FL, US.

Jaun, S., Ahrens, B., 2009. Evaluation of a probabilistic hydrometeorological forecast system. Hydrol. Earth Syst. Sci. 13 (7), 1031–1043. doi:10.5194/hess-13-1031-2009.

Kalman, R.E., 1960. A new approach to linear filtering and prediction problems. J. Basic Eng. 82 (1), 35–64.

Krzysztofowicz, R., 1999. Bayesian theory of probabilistic forecasting via deterministic hydrological model. Water Resour. Res.
35 (9), 2739–2750. doi:10.1029/1999WR900099.

Krzysztofowicz, R., Kelly, K.S., 2000. Hydrologic uncertainty processor for probabilistic river stage forecasting. Water Resour. Res. 36 (11), 3265–3277. doi:10.1029/2000WR900108.

Kuczera, G., 1983. Improved parameter inference in catchment models: 1. Evaluating parameter uncertainty. Water Resour. Res. 19 (5), 1151–1162. doi:10.1029/WR019i005p01151.

Langsrud, Ø., Frigessi, A., Høst, G., 1998. Pure Model Error of the HBV Model. HYDRA Note No. 4, Norwegian Water Resources and Energy Directorate, Oslo, Norway.

Lawrence, D., Haddeland, I., Langsholt, E., 2009. Calibration of HBV Hydrological Models Using PEST Parameter Estimation. Report No. 1 – 2009, Norwegian Water Resources and Energy Directorate, Oslo, Norway.

Levenberg, K., 1944. A method for the solution of certain problems in least squares. Quart. Appl. Math. 2, 164–168.

Lundberg, A., 1982. Combination of a conceptual model and an autoregressive model for improving short time forecasting. Nord. Hydrol. 13 (4), 233–246.

Marquardt, D., 1963. An algorithm for least-squares estimation of non-linear parameters. J. Soc. Ind. Appl. Math. 11 (2), 431–441. doi:10.1137/0111030.

McCullagh, P., Nelder, J.A., 1989. Generalized Linear Models, second ed. Chapman & Hall, London, UK.

Mohr, M., Tveito, O.E., 2008. Daily temperature and precipitation maps with 1 km resolution derived from Norwegian weather observations. In: Extended Abstract from 17th Conference on Applied Climatology, American Meteorological Society, 11–14 August, Whistler, BC, Canada. <http://ams.confex.com/ams/pdfpapers/141069.pdf> (accessed 26.11.10).

Molteni, F., Buizza, R., Palmer, T.N., Petroliagis, T., 1996. The ECMWF ensemble prediction system: methodology and validation. Quart. J. Roy. Meteorol. Soc. 122 (529), 73–119. doi:10.1002/qj.49712252905.

Murphy, A.H., 1971. A note on the ranked probability score. J. Appl. Meteorol. 10 (1), 155–156. doi:10.1175/1520-0450(1971)010<0155:ANOTRP>2.0.CO;2.
Murphy, A.H., 1973. A new vector partition of the probability score. J. Appl. Meteorol. 12 (4), 595–600.

Nash, J.E., Sutcliffe, J.V., 1970. River flow forecasting through conceptual models: Part I – A discussion of principles. J. Hydrol. 10 (3), 282–290. doi:10.1016/0022-1694(70)90255-6.

Nurmi, P., 2003. Recommendations on the Verification of Local Weather Forecasts (at ECMWF Member States). Consultancy Report, ECMWF Operations Department. <http://www.bom.gov.au/bmrc/wefor/staff/eee/verif/Rec_FIN_Oct.pdf> (accessed 26.11.10).

Seo, D.-J., Koren, V., Cajina, N., 2003. Real-time variational assimilation of hydrologic and hydrometeorological data into operational hydrologic forecasting. J. Hydrometeor. 4 (3), 627–641.

Seo, D.-J., Herr, H.D., Schaake, J.C., 2006. A statistical post-processor for accounting of hydrologic uncertainty in short-range ensemble streamflow prediction. Hydrol. Earth Syst. Sci. Discuss. 3 (4), 1987–2035. doi:10.5194/hessd-3-1987-2006.

Stanski, H.R., Wilson, L.J., Burrows, W.R., 1989. Survey of common verification methods in meteorology. WMO World Weather Watch Technical Report No. 8, WMO/TD No. 358, second ed. Atmospheric Environment Service, Downsview, Canada. <http://www.bom.gov.au/bmrc/wefor/staff/eee/verif/Stanski_et_al/Stanski_et_al.html> (accessed 26.11.10).

Stephenson, D.B., Coelho, C.A.S., Jolliffe, I.T., 2008. Two extra components in the Brier score decomposition. Weather Forecast. 23 (4), 752–757. doi:10.1175/2007WAF2006116.1.

Sælthun, N.R., 1996. The “Nordic” HBV model. Norwegian Water Resources and Energy Administration Publication No. 7, Oslo, Norway.

Todini, E., 2004. Role and treatment of uncertainty in real-time flood forecasting. Hydrol. Process. 18 (14), 2743–2746. doi:10.1002/hyp.5687.

Toth, E., Montanari, A., Brath, A., 1999. Real-time flood forecasting via combined use of conceptual and stochastic models. Phys. Chem. Earth, Part B 24 (7), 793–798. doi:10.1016/S1464-1909(99)00082-9.
Toth, Z., Talagrand, O., Candille, G., Zhu, Y., 2003. Probability and ensemble forecasts. In: Jolliffe, I.T., Stephenson, D.B. (Eds.), Forecast Verification: a Practitioner’s Guide in Atmospheric Science. John Wiley & Sons, Chichester, UK, pp. 137–164.

Wilks, D.S., 1995. Statistical Methods in the Atmospheric Sciences: an Introduction. International Geophysics Series, vol. 59. Academic Press, San Diego, CA, US.

World Meteorological Organization, 1992. Simulated Real-time Intercomparison of Hydrological Models. Operational Hydrology Report No. 38, WMO Publication No. 779, World Meteorological Organization, Geneva, Switzerland.

WWRP/WGNE Joint Working Group on Forecast Verification Research, 2010. Forecast verification: Issues, methods and FAQ. <http://www.cawcr.gov.au/projects/verification/> (accessed 26.11.10).

Xu, C.-Y., 2001. Statistical analysis of parameters and residuals of a conceptual water balance model – methodology and case study. Water Resour. Manage. 15 (2), 75–92. doi:10.1023/A:1012559608269.