BAYESIAN PROCESSOR OF OUTPUT: PROBABILITY OF PRECIPITATION OCCURRENCE

By Roman Krzysztofowicz, University of Virginia, Charlottesville, Virginia
and Coire J. Maranzano, Johns Hopkins University, Baltimore, Maryland

Research Paper RK–0601
http://www.faculty.virginia.edu/rk/
January 2006; Revised October 2006

Copyright © 2006 by R. Krzysztofowicz and C.J. Maranzano

Corresponding author address: Professor Roman Krzysztofowicz, University of Virginia, P.O. Box 400747, Charlottesville, VA 22904–4747. E-mail: rk@virginia.edu

ABSTRACT

The Bayesian Processor of Output (BPO) is a theoretically-based technique for probabilistic forecasting of weather variates. The first version of the BPO described herein is for a binary predictand; it is illustrated by producing the probability of precipitation (PoP) occurrence forecast. This PoP is a posterior probability obtained through Bayesian fusion of a prior (climatic) probability and a realization of predictors output from a numerical weather prediction (NWP) model. The strength of the BPO derives from (i) the theoretic structure of the forecasting equation (which is Bayes theorem), (ii) the flexibility of the meta-Gaussian family of likelihood functions (which allows any form of the marginal distribution functions of predictors, and a non-linear and heteroscedastic dependence structure between predictors), (iii) the simplicity of estimation, and (iv) the effective use of asymmetric samples (typically, a long climatic sample of the predictand and a short operational sample of the NWP model output). Modeling and estimation of the BPO are explained in a setup parallel to that of the Model Output Statistics (MOS) technique used operationally by the National Weather Service. The performance of the prototype BPO system is compared with the performance of the operational MOS system in terms of calibration and informativeness on two samples (estimation and validation).
These preliminary results highlight the advantages of the BPO in terms of (i) performance for a specific location (and hence a user), (ii) efficiency of extracting predictive information from the NWP model output (fewer predictors needed), and (iii) parsimony of the predictors (no need for experimentation to find suitable transformations of the NWP model output). Potential implications for operational forecasting and ensemble processing are discussed.

TABLE OF CONTENTS

ABSTRACT
1. INTRODUCTION
   1.1 Towards Bayesian Forecasting Techniques
   1.2 BPO for Binary Predictand
2. BAYESIAN THEORY
   2.1 Variates
   2.2 Samples
   2.3 Input Elements
   2.4 Theoretic Structure
3. META-GAUSSIAN MODEL
   3.1 Input Elements
   3.2 Forecasting Equation
   3.3 Basic Properties
   3.4 Model Validation
4. EXAMPLE WITH ONE PREDICTOR
   4.1 Prior Probability
   4.2 Conditional Density Functions
   4.3 Informativeness of Predictor
   4.4 Posterior Probability
   4.5 Another Predictor
   4.6 Binary-Continuous Predictor
   4.7 Monotonicity of Likelihood Ratio
5. EXAMPLE WITH TWO PREDICTORS
   5.1 Conditional Correlation Coefficients
   5.2 Conditional Dependence Measures
   5.3 Conditional Dependence Structures
   5.4 Second Example
   5.5 Predictors Selection
6. MOS SYSTEM
   6.1 Forecasting Equation
   6.2 Grid-Binary Transform
   6.3 Estimation
   6.4 Predictors Selection
7. COMPARISON OF BPO WITH MOS
   7.1 System Versus Technique
   7.2 Performance Measures
   7.3 Comparative Verifications
   7.4 Explanations
8. SUMMARY
   8.1 Bayesian Technique
   8.2 Preliminary Results
   8.3 Potential Implications
ACKNOWLEDGMENTS
APPENDIX A: NUMERICAL APPROXIMATION TO Q^(−1)
REFERENCES
TABLES
FIGURES

1. INTRODUCTION

1.1 Towards Bayesian Forecasting Techniques

Rational decision making by industries, agencies, and the public in anticipation of heavy precipitation, a snow storm, a flood, or another disruptive weather phenomenon requires information about the degree of certitude that the user can place in a weather forecast. It is vital, therefore, to advance the meteorologist's capability of quantifying forecast uncertainty to meet society's rising expectations for reliable information. Our objective is to develop and test a coherent set of theoretically-based techniques for probabilistic forecasting of weather variates.
The basic technique, called Bayesian Processor of Output (BPO), processes output from a numerical weather prediction (NWP) model and optimally fuses it with climatic data in order to quantify uncertainty about a predictand. As is well known, Bayes theorem provides the optimal theoretic framework for fusing information from different sources and for obtaining the probability distribution of a predictand, conditional on a realization of predictors, or conditional on an ensemble of realizations. The "optimality" of Bayes theorem for fusing information, or updating uncertainty, or revising probability, rests on logical and mathematical arguments (see, for example, Savage, 1954; DeGroot, 1970; de Finetti, 1974). These arguments were adopted long ago by engineers and decision theorists for information, signal, or forecast processing, and for decision making based on forecasts (see, for example, Edwards et al., 1968; Sage and Melsa, 1971; Krzysztofowicz, 1983; Alexandridis and Krzysztofowicz, 1985). Introducing what we would call today a Bayesian processor of forecast for a binary predictand, DeGroot (1988) explains:

    The argument in favor of the Bayesian approach proceeds in two steps: (1) The quantitative assessment of uncertainty is in itself a sterile exercise unless that assessment is to be used to make decisions. (2) The Bayesian approach provides the only coherent methodology for decision making under uncertainty.

Lindley (1987), defending "the inevitability of probability" as a measure of uncertainty, presents logical arguments and a succinct verdict: "Most intelligent behavior is simply obeying Bayes theorem. Any other procedure is incoherent." The challenge lying before us is to develop and test Bayesian procedures suitable for operational forecasting in meteorology.

1.2 BPO for Binary Predictand

The present article describes the BPO for a binary predictand. This BPO is illustrated by producing the probability of precipitation (PoP) occurrence forecast.
The overall setup for the illustration is parallel to the operational setup for the Model Output Statistics (MOS) technique (Glahn and Lowry, 1972) used in operational forecasting by the National Weather Service (NWS). In the currently deployed AVN-MOS system (Antolik, 2000), the predictors for the MOS forecasting equations are based on output fields from the Global Spectral Model run under the code name AVN. The performance of the operational AVN-MOS system is the primary benchmark for evaluation of the performance of the BPO.

The article is organized as follows. Section 2 presents the gist of the Bayesian theory of forecasting for a binary predictand. Section 3 details the input elements, the forecasting equation, and the basic properties of the BPO. Section 4 presents a tutorial example of the BPO for PoP using a single predictor. Section 5 presents a tutorial example of the BPO for PoP using two predictors. The prototype BPO system is compared and contrasted with the operational MOS system in terms of the structure of the forecasting equations in Section 6, and in terms of performance on matched verifications in Section 7. Section 8 summarizes implications of these comparisons and potential advantages of the BPO.

2. BAYESIAN THEORY

2.1 Variates

Let V be the predictand — a binary variate serving as the indicator of some future event, such that V = 1 if and only if the event occurs, and V = 0 otherwise; its realization is denoted v, where v ∈ {0, 1}. Let Xi be the predictor — a variate whose realization xi is used to forecast V. Let X = (X1, ..., XI) be the vector of I predictors; its realization is denoted x = (x1, ..., xI). Each Xi (i = 1, ..., I) is assumed to be a continuous variate — an assumption that simplifies the presentation but can be relaxed if necessary.

2.2 Samples

Suppose the forecasting problem has already been structured, and the task is to develop the forecasting equation in a setup similar to that of the MOS technique (Antolik, 2000).
In the examples throughout the article, the event to be forecasted is the occurrence of precipitation (accumulation of at least 0.254 mm of water) in Buffalo, New York, during the 6-h period 1200–1800 UTC, beginning 60 h after the run of the AVN model at 0000 UTC. The predictors are the variates whose realizations are output from the AVN model. Forecasts are to be made every day in the cool season (October–March).

Let {v} denote the climatic sample of the predictand. The climatic sample comes from the database of the National Climatic Data Center (NCDC). This database contains hourly precipitation observations in Buffalo from over 56 years; however, the record is heterogeneous and must be processed in order to obtain a homogeneous sample. To avoid this task, only observations recorded by the Automated Surface Observing System (ASOS) are included in the prior sample. In effect, it is a 7-year-long sample extending from 1 January 1997 through 31 December 2003. Each day provides one realization. The sample size for the cool season is M = 1132.

Let {(x, v)} denote the joint sample of the predictor vector and the predictand. The joint sample comes from the database that the Meteorological Development Laboratory (MDL) used to estimate the operational forecasting equations of the AVN-MOS system. It is a 4-year-long sample extending from 1 April 1997 through 31 March 2001. The sample size for the cool season is N = 698.

The point of the above example is that typically the joint sample is much shorter than the climatic sample: N ≪ M. Classical statistical methods, such as the MOS technique, deal with this sample asymmetry by simply ignoring the long climatic sample. In effect, these methods ignore vast amounts of information about the predictand. In contrast, the BPO uses both samples; it extracts information from each sample and then optimally fuses the information according to the laws of probability.
(Pooling of samples from different months and stations in order to increase the sample size is a separate issue.)

2.3 Input Elements

With P denoting the probability and p denoting a generic density function, define the following objects.

g = P(V = 1) is the prior probability of event V = 1; it is to be estimated from the climatic sample {v}. Probability g quantifies the uncertainty about the predictand V that exists before the NWP model output is available. Equivalently, it characterizes the natural variability of the predictand.

fv(x) = p(x|V = v) for v = 0, 1; function fv is the I-variate density function of the predictor vector X, conditional on the hypothesis that the event is V = v. The two conditional density functions, f0 and f1, are to be estimated from the joint sample {(x, v)}. For a fixed realization X = x, object fv(x) is the likelihood of event V = v. Thus (f0, f1) comprises the family of likelihood functions. This family quantifies the stochastic dependence between the predictor vector X and the predictand V. Equivalently, it characterizes the informativeness of the predictors with respect to the predictand. (The informativeness is defined in Section 4.3.)

2.4 Theoretic Structure

The probability g and the family of likelihood functions (f0, f1) carry information about the prior uncertainty and the informativeness of the predictors into the Bayesian revision procedure. The expected density function κ of the predictor vector X is given by the total probability law:

    κ(x) = f0(x)(1 − g) + f1(x)g,    (1)

and the posterior probability π = P(V = 1|X = x) of event V = 1, conditional on a realization of the predictor vector X = x, is given by Bayes theorem:

    π = f1(x)g / κ(x).    (2)

By inserting (1) into (2), one obtains an alternative expression:

    π = [1 + ((1 − g)/g) · (f0(x)/f1(x))]^(−1),    (3)

where (1 − g)/g is the prior odds against event V = 1, and f0(x)/f1(x) is the likelihood ratio against event V = 1.
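As a numerical illustration of Eqs. (2)–(3), the sketch below computes the posterior probability from a prior g and the two likelihoods. The Gaussian densities are hypothetical stand-ins, not the distributions estimated in this paper; only the prior g = 0.27 is taken from Section 4.1.

```python
# Sketch of Eqs. (2)-(3): posterior probability of event V = 1 from a
# prior probability g and the likelihoods f0(x), f1(x).
# The Gaussian densities below are illustrative assumptions only.
from math import exp, pi, sqrt

def normal_pdf(x, mean, sd):
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2.0 * pi))

def posterior(g, f0_x, f1_x):
    """Eq. (3): invert 1 + (prior odds against) * (likelihood ratio against)."""
    return 1.0 / (1.0 + ((1.0 - g) / g) * (f0_x / f1_x))

g = 0.27                          # cool-season prior (Section 4.1)
x = 85.0                          # hypothetical predictor realization
f0_x = normal_pdf(x, 60.0, 20.0)  # assumed density of X given V = 0
f1_x = normal_pdf(x, 85.0, 12.0)  # assumed density of X given V = 1
print(round(posterior(g, f0_x, f1_x), 3))
```

Note that a likelihood ratio f0(x)/f1(x) = 1 leaves the prior unchanged (π = g), while a ratio below 1 (evidence against V = 0) raises the posterior above the prior.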
Equation (3) defines the theoretic structure of the BPO for a binary predictand. Inasmuch as Eqs. (1) and (2) follow directly from the axioms of probability theory (sans additional assumptions), Eq. (3) is the most general solution for the conditional probability π. In that sense, it provides the optimal theoretic framework for fusing model output (which supplies a value of x) with climatic data (which supply a value of g).

3. META-GAUSSIAN MODEL

To implement the BPO, a flexible and convenient model is needed for each multivariate conditional density function, f0 and f1. We employ the meta-Gaussian model developed by Kelly and Krzysztofowicz (1994, 1995, 1997) and used successfully in probabilistic river stage forecasting (Krzysztofowicz and Herr, 2001; Krzysztofowicz, 2002) and probabilistic rainfall modeling (Herr and Krzysztofowicz, 2005).

3.1 Input Elements

A multivariate meta-Gaussian distribution is constructed from specified marginal distributions, a correlation matrix, and the Gaussian dependence structure. To obtain expressions for f0 and f1, this construction must be replicated twice; every element of the construction must be duplicated, with one copy being conditioned on event V = 0 and another copy being conditioned on event V = 1. Accordingly, the input elements are defined as follows.

fiv(xi) = p(xi|V = v) for i = 1, ..., I; v = 0, 1. For fixed i ∈ {1, ..., I} and v ∈ {0, 1}, function fiv is the marginal density function of the predictor Xi, conditional on the hypothesis that the event is V = v. For a fixed realization Xi = xi of the predictor, object fiv(xi) is the marginal likelihood of event V = v.

Fiv(xi) = P(Xi ≤ xi|V = v) for i = 1, ..., I; v = 0, 1. For fixed i ∈ {1, ..., I} and v ∈ {0, 1}, function Fiv is the marginal distribution function of the predictor Xi, conditional on the hypothesis that the event is V = v; this Fiv corresponds to fiv.

γijv = Cor(Zi, Zj|V = v) for i = 1, ..., I − 1; j = i + 1, ..., I; v = 0, 1.
This is the Pearson's product-moment correlation coefficient between the standard normal predictors Zi and Zj, conditional on the hypothesis that the event is V = v. The standard normal predictor Zi, conditional on event V = v, is obtained from the original predictor Xi, conditional on event V = v, through the normal quantile transform (NQT):

    Zi = Q^(−1)(Fiv(Xi)),    i = 1, ..., I; v = 0, 1;

where Q is the standard normal distribution function, and Q^(−1) is the inverse of Q. The conditional correlation coefficients are arranged in two conditional correlation matrices Γv = [γijv], v = 0, 1, whose elements have the following properties: γiiv = 1 for i = 1, ..., I; −1 < γijv < 1 for i ≠ j; and γijv = γjiv for i, j = 1, ..., I. It follows that matrix Γv has dimension I × I; is square, symmetric, and positive definite; and is uniquely determined by its I(I − 1)/2 upper diagonal elements.

3.2 Forecasting Equation

When each of the two multivariate conditional density functions, f0 and f1, is meta-Gaussian, the BPO defined by Eq. (3) takes the following form. Given a prior probability g of event V = 1, and given a realization x = (x1, ..., xI) of the predictor vector, the posterior probability of event V = 1 is specified by the equation

    π = [1 + ((1 − g)/g) · λ(x) · Π(i=1 to I) fi0(xi)/fi1(xi)]^(−1),    (4)

where λ is the likelihood ratio weighting function defined by the equation

    λ(x) = [det Γ1 / det Γ0]^(1/2) · exp{−(1/2)[z0^T Γ0^(−1) z0 − z0^T z0 − z1^T Γ1^(−1) z1 + z1^T z1]},    (5)

and where the mapping of the vector x = (x1, ..., xI) into two vectors z0 = (z10, ..., zI0) and z1 = (z11, ..., zI1) is defined by the NQT:

    ziv = Q^(−1)(Fiv(xi)),    i = 1, ..., I; v = 0, 1.    (6)

In numerical calculations, Q^(−1) is approximated by a rational function (Abramowitz and Stegun, 1972), which is reproduced in Appendix A.
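The forecasting equation can be sketched for I = 2 predictors as follows. The Gaussian marginals Fiv and the correlation values γ12v below are hypothetical stand-ins (the paper fits, e.g., log-Weibull and log-logistic marginals to data), and Q, Q^(−1) come from the Python standard library rather than the Abramowitz–Stegun approximation of Appendix A.

```python
# Sketch of the BPO forecasting equation, Eqs. (4)-(6), for I = 2 predictors.
# All marginals and correlations are assumed for illustration only.
from math import exp, sqrt
from statistics import NormalDist

STD = NormalDist()  # supplies Q (cdf) and Q^(-1) (inv_cdf)

# Assumed marginal conditional distributions F_iv: key v -> [F_1v, F_2v].
F = {0: [NormalDist(60, 20), NormalDist(2, 3)],
     1: [NormalDist(85, 12), NormalDist(5, 3)]}
gamma = {0: 0.3, 1: 0.2}  # assumed conditional correlations gamma_12v

def quad_form(z, rho):
    """z^T Gamma^(-1) z - z^T z for the 2x2 correlation matrix [[1, rho], [rho, 1]]."""
    z1, z2 = z
    zz = z1 * z1 + z2 * z2
    return (zz - 2.0 * rho * z1 * z2) / (1.0 - rho * rho) - zz

def bpo_posterior(g, x):
    ratio = 1.0           # running product of marginal likelihood ratios
    z = {0: [], 1: []}    # NQT images of x, Eq. (6)
    for i, xi in enumerate(x):
        ratio *= F[0][i].pdf(xi) / F[1][i].pdf(xi)
        for v in (0, 1):
            z[v].append(STD.inv_cdf(F[v][i].cdf(xi)))
    # Eq. (5); the determinant of a 2x2 correlation matrix is 1 - rho^2.
    lam = sqrt((1.0 - gamma[1] ** 2) / (1.0 - gamma[0] ** 2)) * exp(
        -0.5 * (quad_form(z[0], gamma[0]) - quad_form(z[1], gamma[1])))
    return 1.0 / (1.0 + ((1.0 - g) / g) * lam * ratio)  # Eq. (4)

p = bpo_posterior(0.27, [85.0, 4.0])
print(round(p, 3))
```

Setting gamma[0] = gamma[1] = 0 makes λ(x) = 1 and recovers the conditionally independent case discussed after Eq. (4).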
Equation (4) reveals that the posterior probability is determined by the product of the prior odds (1 − g)/g against event V = 1, the marginal likelihood ratios fi0(xi)/fi1(xi) against event V = 1, and the likelihood ratio weight λ(x). The marginal likelihood ratio function fi0/fi1 carries information from predictor Xi; the likelihood ratio weighting function λ accounts for the conditional dependence among the predictors X1, ..., XI. If the predictors X1, ..., XI are independent, conditional on event V = 0 and conditional on event V = 1, then each of the two conditional correlation matrices, Γ0 and Γ1, simplifies to the identity matrix; consequently λ(x) = 1 at every point x, and the multivariate likelihood ratio f0(x)/f1(x) simplifies to the product of the marginal likelihood ratios fi0(xi)/fi1(xi) for i = 1, ..., I.

3.3 Basic Properties

The meta-Gaussian model for f0 and f1, which is embedded in Eqs. (4)–(6), offers these properties.

1. The marginal conditional distribution function Fiv of predictor Xi may take any form; this form may be different for each i ∈ {1, ..., I} and each v ∈ {0, 1}; the marginal conditional density function fiv is simply derived from Fiv.

2. The two transforms (the NQTs) for each predictor Xi are uniquely specified once its marginal conditional distribution functions, Fi0 and Fi1, have been estimated.

3. The conditional dependence structure among the predictors X1, ..., XI is pairwise; the degree of dependence is quantified by the conditional correlation matrix Γv.

4. The conditional dependence structure between any two predictors Xi and Xj, i ≠ j, may be non-linear (in the conditional mean) and heteroscedastic (in the conditional variance).

5. The probabilistic forecast (the posterior probability π) is given by an analytic expression.

Properties 1 and 4 imply the flexibility in fitting the model to data — an attribute necessary to produce forecasts that are well calibrated and most informative.
Properties 2 and 3 imply the simplicity of estimation. Property 5 implies the computational efficiency — an important attribute for operational forecasting.

3.4 Model Validation

The meta-Gaussian model can be validated on a given joint sample — an advantage because one can gain additional insight into the BPO and the data. First, each marginal conditional distribution function Fiv should be tested for the goodness of fit to the empirical distribution function of Xi, conditional on V = v. Second, the two conditional dependence structures should be validated based on the following fact (Kelly and Krzysztofowicz, 1997): conditional on V = v (v = 0, 1), the joint distribution of X1, ..., XI is meta-Gaussian if and only if the joint distribution of Z1, ..., ZI is Gaussian. The NQT guarantees that the marginal distribution of each Zi is standard normal. Therefore, the validation amounts to testing the hypothesis that the distribution of each pair (Zi, Zj), for i = 1, ..., I − 1 and j = i + 1, ..., I, is bivariate standard normal. This test can be broken down into three tests of the following requirements:

1. Linearity — the regression of Zi on Zj must be linear.

2. Homoscedasticity — the variance of the residual Θij = Zi − γijv Zj must be independent of Zj.

3. Normality — the distribution of Θij must be normal with mean 0 and variance 1 − γijv².

Inasmuch as these are requirements of a linear model, testing procedures are well known.

4. EXAMPLE WITH ONE PREDICTOR

When there is only one predictor, the BPO is given by Eq. (3), with the vector x being replaced by the variable x, which denotes a realization of predictor X. In effect, three elements are needed for forecasting: a prior probability g, and two univariate conditional density functions, f0 and f1.

4.1 Prior Probability

The prior probability g is estimated for the cool season from the climatic sample. It is g = 0.27, the value to be used for forecasting every day during the cool season.
In general, when the climatic sample is large, the prior probability g could be estimated for a subseason, even for a day, by applying a moving window to the climatic sample. For instance, Table 1 shows that g varies from month to month. Thus, using a given g every day during a month and changing g from month to month would improve the calibration of the forecast probability within each month. (This statement is justified because the expectation of the posterior probability equals the prior probability.) Despite this potential advantage, all examples reported herein use g for the cool season because this parallels the setup of the operational MOS system (whose equations are estimated for the cool season) and because the available validation sample (from 2 1/2 years) is too short to verify the calibration of forecasts for each month.

Overall, the prior probability has four attributes important for application. (i) It may be location-specific and season-specific (or day-specific), and thereby can capture the "micro-climate". (ii) It may be estimated from a large climatic sample. (iii) It is independent of the choice of the predictors and the length of the NWP model output available for estimation (the size of the joint sample). (iv) It need not be re-estimated when the NWP model changes; thus it ensures a stable calibration of the forecast probabilities for as long as the climate remains stationary.

4.2 Conditional Density Functions

The single predictor X is the mean relative humidity of a variable-depth layer from sigma 1.0 to sigma 0.44 at 60 h after the 0000 UTC model run (for short, mean relative humidity at 60 h). Its two conditional density functions, f0 and f1, are for the cool season; they are derived from the corresponding conditional distribution functions, F0 and F1. The procedure for modeling and estimation of F0 and F1 is as follows. 1.
The joint sample {(x, v)} of 698 realizations is stratified into two subsamples: {(x, 0)} containing 518 realizations and {(x, 1)} containing 180 realizations. 2. From each subsample, an empirical distribution function of X is constructed (Fig. 1). 3. A parametric model for Fv is chosen, its parameters are estimated, and its goodness of fit to the empirical distribution function is evaluated. Here, both F0 and F1 are ratio type II log-Weibull; each is defined on the interval (0, 100) and is specified by two parameters (Fig. 1).

The above procedure has been automated by creating a catalog of parametric models and by developing algorithms for estimation of the parameters and choice of the best model. The catalog includes expressions for the distribution functions and for the density functions. In effect, once a parametric model for Fv is chosen, the expression for fv is known. Fig. 2 shows f0 and f1.

4.3 Informativeness of Predictor

A predictor X used in the BPO is characterized in terms of its informativeness. Intuitively, the informativeness of predictor X may be visualized by judging the degree of separation between the two conditional density functions, f0 and f1, shown in Fig. 2: the larger the separation, the more informative the predictor. Formally, the informativeness of predictor X is characterized by the Receiver Operating Characteristic (ROC) — a graph of the probability of detection P(D|x) versus the probability of false alarm P(F|x) for all x. When the likelihood ratio L(x) = f0(x)/f1(x) is a strictly monotonic function of x, the ROC may be constructed directly from the conditional distribution functions F0 and F1:

    if L = f0/f1 is strictly increasing, then
        P(D|x) = F1(x),    P(F|x) = F0(x);    (7a)

    if L = f0/f1 is strictly decreasing, then
        P(D|x) = 1 − F1(x),    P(F|x) = 1 − F0(x).    (7b)

Given f0 and f1 shown in Fig. 2, Eq. (7b) holds; the resultant ROC is shown in Fig. 3.
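The construction in Eq. (7b) can be sketched as follows. Equal-variance Gaussian distributions stand in for the fitted ratio type II log-Weibull functions F0, F1 (so the numbers are purely illustrative); with the distribution given V = 1 shifted toward higher humidity, L = f0/f1 is strictly decreasing and Eq. (7b) applies.

```python
# Sketch of Eq. (7b): tracing the ROC directly from the conditional
# distribution functions F0 and F1 when L = f0/f1 is strictly decreasing.
# Equal-variance Gaussian stand-ins guarantee that monotonicity here.
from statistics import NormalDist

F0 = NormalDist(60, 15)   # assumed distribution of X given V = 0
F1 = NormalDist(85, 15)   # assumed distribution of X given V = 1

def roc_point(x):
    pod = 1.0 - F1.cdf(x)   # probability of detection,  P(D|x)
    pofa = 1.0 - F0.cdf(x)  # probability of false alarm, P(F|x)
    return pofa, pod

curve = [roc_point(x) for x in range(0, 101, 5)]
# An informative predictor keeps the ROC on or above the 45-degree diagonal.
assert all(pod >= pofa for pofa, pod in curve)
```

Sweeping the threshold x from low to high traces the curve from the upper right corner (everything flagged) toward the origin (nothing flagged).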
Clearly, the mean relative humidity X is an informative predictor of the precipitation occurrence indicator V, as the ROC lies decisively above the diagonal line (which characterizes an uninformative predictor); but X is far from being a perfect predictor of V, as the ROC passes far from the upper left corner of the graph (which characterizes a perfect predictor).

When there are two or more alternative predictors, they can be compared (and possibly ranked) in terms of a binary relation of informativeness. This relation derives from the Bayesian theory of sufficient comparisons, the essence of which is as follows (Blackwell, 1951, 1953; Krzysztofowicz and Long, 1990, 1991). Let Xi and Xj be two alternative predictors of V. Suppose a rational decision maker will use the probabilistic forecast of V from the BPO in a Bayesian decision procedure with a prior probability g and a loss function l. Let VAi(g, l) denote the value of a probabilistic forecast generated by predictor Xi (as defined in the Bayesian theory).

Definition. Predictor Xi is said to be more informative than predictor Xj if and only if the value of a forecast generated by predictor Xi is at least as high as the value of a forecast generated by predictor Xj, for every prior probability g and every loss function l; formally, if and only if

    VAi(g, l) ≥ VAj(g, l),    for every g, l.

Inasmuch as any two rational decision makers may employ different prior probabilities and different loss functions, the condition "for every g, l" is synonymous with the statement "for every rational decision maker". Blackwell (1953) proved the following.

Theorem. Predictor Xi is more informative than predictor Xj if and only if the ROC of Xi is superior to the ROC of Xj.

The binary relation of informativeness establishes a quasi order on a set of predictors X1, ..., XI. The quasi order is reflexive and transitive, but is not complete.
That is, there may exist two predictors such that neither is more informative than the other, which is the case when one ROC crosses the other. Also, there may exist two predictors that are equally informative, which is the case when one ROC is identical to the other.

In summary, an advantage of the BPO is that its elements, F0 and F1, enable us to characterize the informativeness of a predictor for a given predictand. When two or more predictors are available, they can easily be compared (and possibly ranked) in terms of the informativeness relation.

4.4 Posterior Probability

Once the three elements (g, f0, f1) are specified, the posterior probability π of precipitation occurrence may be calculated from Eq. (3), given any value x of the mean relative humidity output from the AVN model. Figure 4 shows the plot of π versus x for three values of g. Regardless of the value of g, the posterior probability π is an increasing, non-linear, irreflexive function of x. The basic shape of this function is determined by the conditional density functions f0, f1. The prior probability g scales (nonlinearly) this basic shape.

This has two practical implications. First, it illustrates the assertion made in Section 4.1 that the role of the prior probability g is to calibrate the forecast probability. Thus, by estimating g from a large climatic sample (and by properly modeling and estimating f0 and f1), the meteorologist can ensure the necessary condition for the forecast probability to be well calibrated against the climatic probability of precipitation occurrence at a specific location and within a specific season. Second, even though the conditional density functions f0, f1 remain fixed during a season (here the cool season), the prior probability g may change from month to month (or some other subseason). Consequently, the posterior probability π can be calibrated against the climatic probability for each month (or subseason) rather than for the 6-month-long cool season.
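The calibration property invoked in Section 4.1 — that the expectation of the posterior probability equals the prior probability — can be checked by Monte Carlo under the model: draw X from its expected density κ of Eq. (1) and average π(X). The Gaussian conditional densities below are assumed purely for illustration.

```python
# Monte Carlo check that E[pi] = g when X is drawn from kappa, Eq. (1).
# The Gaussian conditional densities f0, f1 are illustrative assumptions.
import random
from statistics import NormalDist, mean

random.seed(1)
g = 0.27
f0d = NormalDist(60, 20)   # assumed density of X given V = 0
f1d = NormalDist(85, 12)   # assumed density of X given V = 1

def posterior(x):
    return 1.0 / (1.0 + ((1.0 - g) / g) * f0d.pdf(x) / f1d.pdf(x))  # Eq. (3)

# kappa is the mixture (1 - g) f0 + g f1: pick the component, then sample it.
draws = [(f1d if random.random() < g else f0d).samples(1)[0]
         for _ in range(100_000)]
m = mean(posterior(x) for x in draws)
print(round(m, 3))   # should be close to the prior g = 0.27
```

The identity is exact under the model, since the integral of π(x)κ(x) over all x equals the integral of f1(x)g, which is g; the simulation merely verifies it up to sampling error.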
4.5 Another Predictor

Different predictors behave differently. That is why each predictor should be modeled individually, and the catalog of parametric models from which the conditional distribution functions F0, F1 are drawn should be large enough to afford flexibility. To underscore this point, let us model another predictor: the relative vorticity on the isobaric surface of 850 hPa at 63 h after the 0000 UTC model run (for short, 850 hPa relative vorticity at 63 h). Figure 5 shows the empirical conditional distribution functions and the parametric conditional distribution functions F0, F1; here, F0 is Weibull and F1 is log-logistic; each is defined on the interval (−5, ∞) and is specified by two parameters. Figure 6 shows the conditional density functions f0, f1. Figure 7 shows the plot of π versus x for three values of g. Clearly, this predictor behaves differently from the previous one. A comparison of the ROCs (Fig. 3) reveals that the mean relative humidity at 60 h is approximately more informative than the 850 hPa relative vorticity at 63 h for predicting precipitation occurrence during the period 60–66 h. (The adverb "approximately" is inserted because one ROC crosses the other near the left end.) Would a combination of the two predictors be more informative than either predictor alone? This question is answered at the end of Section 5.

4.6 Binary-Continuous Predictor

Another informative predictor X of precipitation occurrence is an estimate of the total precipitation amount during a specified period, output from the NWP model with a specified lead time. Typically, X is a binary-continuous variate: it takes on the value zero on some days, and positive values on other days. Thus, the sample space of X is the interval [0, ∞), and the probability distribution of X assigns a nonzero probability to the event X = 0 and spreads the complementary probability over the interval (0, ∞) according to some density function.
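Such a mixed distribution can be sketched as a point mass at zero plus a continuous part. The Weibull form and its parameters below are illustrative assumptions, not the model fitted in the paper.

```python
import math

def binary_continuous_cdf(x, p0, scale=5.0, shape=0.8):
    """Distribution function of a binary-continuous variate X on [0, inf):
    a probability mass p0 on the event X = 0, with the complementary
    probability 1 - p0 spread over (0, inf) -- here by an (illustrative)
    Weibull distribution with the given scale and shape parameters."""
    if x < 0.0:
        return 0.0
    continuous_part = 1.0 - math.exp(-((x / scale) ** shape)) if x > 0.0 else 0.0
    return p0 + (1.0 - p0) * continuous_part

# Hypothetical probability of a dry model forecast, P(X = 0) = 0.4:
p0 = 0.4
print(binary_continuous_cdf(0.0, p0))   # 0.4 (the point mass at zero)
print(binary_continuous_cdf(10.0, p0))  # P(X <= 10 mm)
```

When p0 is large, ignoring the point mass and treating X as continuous discards exactly this structure, which is why the binary-continuous model extracts more information.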
When the probability of the event X = 0 is small, X may be modeled approximately as a continuous variate. When the probability of the event X = 0 is large, X should be modeled as a binary-continuous variate in order to extract all of its information. The BPO can be suitably modified to incorporate a binary-continuous predictor, alone or in combination with other continuous predictors. The case with a single binary-continuous predictor is described by Maranzano and Krzysztofowicz (2004).

4.7 Monotonicity of Likelihood Ratio

When there exists a physical or logical requirement for the posterior probability π to be a monotone function of the predictor value x, as is the case in Figs. 4 and 7, this requirement can be enforced via the likelihood ratio function L = f0/f1. As may be inferred from Eq. (3), if L(x) decreases with x, then π increases with x; if L(x) increases with x, then π decreases with x. A monotonicity requirement may not be satisfied automatically by L, simply because f0 and f1 are obtained without any constraint on their ratio f0/f1. Thus, when a monotonicity requirement exists, it is necessary to check that L satisfies it. Algorithms have been developed to perform this checking and to force a monotonicity requirement on L.

5. EXAMPLE WITH TWO PREDICTORS

5.1 Conditional Correlation Coefficients

Let X1 denote the mean relative humidity predictor analyzed in Section 4.2, and let X2 denote the relative vorticity predictor analyzed in Section 4.5. The analyses of individual predictors supply the conditional distribution functions (F10, F11; F20, F21) and the conditional density functions (f10, f11; f20, f21). In order to obtain the BPO with two predictors (X1, X2), it is necessary to estimate two conditional correlation coefficients (γ120, γ121), from which the two conditional correlation matrices, Γ0 and Γ1, are constructed. The estimation procedure, applicable to any number of predictors I ≥ 2, is as follows.
1. The joint sample {(x1, ..., xI, v)} is stratified into two conditional joint samples, {(x1, ..., xI, 0)} and {(x1, ..., xI, 1)}, according to the value of the precipitation indicator v. Every step that follows is performed twice, for v = 0 and for v = 1.

2. Each conditional joint realization (x1, ..., xI, v) is processed through the NQT, ziv = Q−1(Fiv(xi)), i = 1, ..., I, to obtain a transformed conditional joint realization (z1v, ..., zIv).

3. The transformed conditional joint sample {(z1v, ..., zIv)} is used to estimate the conditional Pearson's product-moment correlation coefficients γijv for i = 1, ..., I − 1; j = i + 1, ..., I.

When applied to the joint sample at hand, the above estimation procedure yields γ120 = 0.577 and γ121 = 0.596. Thereby all input elements have been estimated, and the BPO is ready for forecasting.

5.2 Conditional Dependence Measures

Under the I-variate meta-Gaussian density function fv, the parameter γijv characterizes the stochastic dependence between Xi and Xj, conditional on the hypothesized precipitation event V = v. For the purpose of interpretation, γijv may be transformed into Spearman's rank correlation coefficient ρijv between Xi and Xj, conditional on the hypothesized precipitation event V = v. The transformation is given by (Kelly and Krzysztofowicz, 1997)

ρijv = (6/π) arcsin(γijv / 2).   (8)

In the present example, ρ120 = 0.559 and ρ121 = 0.578. From the estimates of γ120 and γ121 (or ρ120 and ρ121), one can infer that the mean relative humidity X1 and the 850 hPa relative vorticity X2 are stochastically dependent, conditional on the predictand V, and that the degree of dependence is somewhat stronger when precipitation occurs (V = 1) than when it does not (V = 0).

5.3 Conditional Dependence Structures

The purpose of the NQT is to transform a given dependence structure of the predictors into the Gaussian dependence structure.
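The three-step procedure, together with the transformation in Eq. (8), can be sketched as follows for I = 2. The empirical distribution Fiv is approximated here by plotting positions rank/(n + 1), an assumption of this sketch (ties are not handled); the reported estimates γ120 = 0.577 and γ121 = 0.596 do map to ρ120 = 0.559 and ρ121 = 0.578 under Eq. (8).

```python
import math
from statistics import NormalDist, mean

def nqt(xs):
    """Normal quantile transform: z_i = Q^{-1}(F(x_i)), with F approximated
    by the empirical plotting positions rank/(n + 1)."""
    n = len(xs)
    rank = [0] * n
    for r, i in enumerate(sorted(range(n), key=lambda i: xs[i]), start=1):
        rank[i] = r
    return [NormalDist().inv_cdf(rank[i] / (n + 1)) for i in range(n)]

def pearson(a, b):
    """Pearson's product-moment correlation coefficient."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def conditional_gamma(joint_sample, v):
    """Steps 1-3 for two predictors: stratify the joint sample {(x1, x2, v)}
    on the precipitation indicator, apply the NQT to each predictor, and
    estimate gamma_12v on the transformed conditional sample."""
    x1 = [s[0] for s in joint_sample if s[2] == v]
    x2 = [s[1] for s in joint_sample if s[2] == v]
    return pearson(nqt(x1), nqt(x2))

def spearman_from_gamma(gamma):
    """Eq. (8): rho_ijv = (6 / pi) * arcsin(gamma_ijv / 2)."""
    return (6.0 / math.pi) * math.asin(gamma / 2.0)

# The estimates reported in the text map as follows:
print(round(spearman_from_gamma(0.577), 3))  # 0.559
print(round(spearman_from_gamma(0.596), 3))  # 0.578
```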
To learn the dependence structure and to judge the performance of the NQT, scatterplots of the conditional joint samples are examined. There are two scatterplots of the original sample points (x1, x2), conditional on V = 0 and V = 1 (Figs. 8a and 8b). Each exhibits a non-Gaussian dependence structure: the scatters are not elliptic, especially the one conditional on V = 1, and the right-most points form a vertical frontier, an implication of X1 being bounded above by 100%. Likewise, there are two scatterplots of the transformed sample points (z1v, z2v), for v = 0 and v = 1 (Figs. 8c and 8d). In each case, the scatter is elliptic, and the hypothesis of the Gaussian dependence structure cannot be refuted. Thus the NQT performs well. When the number of predictors I > 2, the analysis of the scatterplots should be performed for every pair of variates (Xi, Xj), i = 1, ..., I − 1; j = i + 1, ..., I. Pairwise analyses are sufficient to validate the I-variate meta-Gaussian dependence structure.

5.4 Second Example

The event to be forecasted is the occurrence of precipitation in Quillayute, Washington, during the 24-h period 1200–1200 UTC, beginning 36 h after the 0000 UTC model run in the warm season (April–September). Let X1 denote the relative humidity on the isobaric surface of 850 hPa at 36 h, estimated by the AVN model. Let X2 denote the relative vorticity on the isobaric surface of 850 hPa at 36 h, estimated by the AVN model. The scatterplots are shown in Fig. 9. As in the first example, the NQT performs well: each of the two non-Gaussian dependence structures of the original sample points (especially the one in Fig. 9b) is transformed into the Gaussian dependence structure.
What makes this example different from the previous one is the vastly different degrees of conditional dependence: X1 and X2 are (approximately) independent (ρ120 = 0.011), conditional on precipitation nonoccurrence, V = 0; X1 and X2 are positively dependent (ρ121 = 0.358), conditional on precipitation occurrence, V = 1. The BPO takes the two conditional correlation coefficients explicitly into account, whereas non-Bayesian techniques (such as MOS regression and logistic regression) fail to do so. When ρ120 and ρ121 are significantly different, this may be one of the reasons for the superior performance of the BPO.

5.5 Predictors Selection

For every predictand, 34 potential predictors are defined by appropriately concatenating five variables (total precipitation amount, mean relative humidity, relative vorticity, relative humidity, and vertical velocity), three lead times, and four isobaric surfaces. From this set, the best combination of no more than five predictors is selected. The selection is accomplished via an algorithm that (i) maximizes RS (the area under the ROC, defined in Section 7.1) subject to the constraint that an additional predictor must increase RS by at least a specified threshold, (ii) employs objective optimization and heuristic search, and (iii) estimates the parameters of the BPO and the performance score RS from a given joint sample (an estimation sample, here from 4 years). In the examples for Buffalo with one predictor, X1 (mean relative humidity at 60 h) or X2 (850 hPa relative vorticity at 63 h), and with two predictors, X1 and X2, the scores are as follows: RS(X1) = 0.818, RS(X2) = 0.742, RS(X1, X2) = 0.825. Although the combination of two predictors (X1, X2) outperforms each of the single predictors, X1 and X2, the gain is below a threshold of significance. Thus, given only these two potential predictors, it is best to select the single predictor X1.
6. MOS SYSTEM

6.1 Forecasting Equation

The primary benchmark for evaluation of the BPO is the MOS system (Glahn and Lowry, 1972; Antolik, 2000) currently used in operational forecasting by the NWS. For a binary predictand, the MOS forecasting equation has the general form

π = a0 + ∑_{i=1}^{I} ai ti(xi),   (9)

where ti is some transform determined experientially for each predictor Xi (i = 1, ..., I), and a0, a1, ..., aI are regression coefficients. The predictand and the predictors are defined at a station. For the predictand defined in Section 2.2, the MOS utilizes five predictors:

1. Total precipitation amount during the 6-h period 60–66 h; cutoff 2.54 mm.
2. Total precipitation amount during the 3-h period 60–63 h; cutoff 0.254 mm.
3. Relative humidity at the pressure level of 700 hPa at 66 h; cutoff 70%.
4. Relative humidity at the pressure level of 850 hPa at 60 h; cutoff 90%.
5. Vertical velocity at the pressure level of 850 hPa at 57 h; cutoff −0.2.

6.2 Grid-Binary Transform

In some cases, a predictor enters Eq. (9) untransformed, i.e., ti(xi) = xi. In the present case, each predictor is subjected to a grid-binary transformation, which is specified in terms of a heuristic algorithm (Jensenius, 1992). The algorithm takes the gridded field of predictor values and performs on it three operations: (i) mapping of each gridpoint value into "1" or "0", which indicates the exceedance or nonexceedance of a specified cutoff level; (ii) smoothing of the resultant binary field; and (iii) interpolation of the gridpoint values to the value ti(xi) at a station. It follows that the transformed predictor value ti(xi) at a station depends upon the original predictor values at all grid points in a vicinity. Thus, when viewed as a transform of the original predictor Xi into a grid-binary predictor ti(Xi) at a fixed station, the transform ti is nonlinear and nonstationary (from one forecast time to the next).
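The three operations can be sketched on a one-dimensional gridded field. The cutoff value, the 3-point smoothing kernel, and the linear interpolation below are illustrative choices, not the operational algorithm of Jensenius (1992).

```python
def grid_binary(field, cutoff, weights=(0.25, 0.5, 0.25)):
    """Operations (i) and (ii): map each gridpoint value to 1 or 0 by
    exceedance of the cutoff, then smooth the binary field with a
    3-point kernel (edges are padded by repetition)."""
    binary = [1.0 if value >= cutoff else 0.0 for value in field]
    padded = [binary[0]] + binary + [binary[-1]]
    return [sum(w * padded[i + k] for k, w in enumerate(weights))
            for i in range(len(binary))]

def to_station(smoothed, position):
    """Operation (iii): linear interpolation of the smoothed field to a
    station at fractional grid position `position`."""
    i = int(position)
    frac = position - i
    return (1.0 - frac) * smoothed[i] + frac * smoothed[i + 1]

# Relative humidity [%] on a toy 1-D grid, with a 70% cutoff:
smoothed = grid_binary([60.0, 80.0, 95.0, 50.0], cutoff=70.0)
print(smoothed)                    # [0.25, 0.75, 0.75, 0.25]
print(to_station(smoothed, 1.5))   # 0.75
```

Because of the smoothing, the value at the station depends on the original field at all neighboring grid points, which is why the transform cannot be reproduced from an isolated station or grid point.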
The grid-binary predictor ti(Xi) is dimensionless, and its sample space is the closed unit interval [0, 1].

6.3 Estimation

The regression coefficients in Eq. (9) are estimated from a joint sample {(t1(x1), ..., tI(xI), v)} of realizations of the transformed predictors and the predictand. Like the sample for the BPO, this sample includes all daily realizations in the cool season (October–March) in 4 years. Unlike the sample for the BPO, this sample includes not only the realizations at the Buffalo station, but the realizations at all stations within the region to which Buffalo belongs. The pooling of station samples into a regional sample is needed to ensure a "stable" estimation of the MOS regression coefficients (Antolik, 2000). The estimates obtained by the Meteorological Development Laboratory (MDL) are: a0 = 0.23806, a1 = 0.33791, a2 = 0.10016, a3 = 0.15049, a4 = 0.15344, a5 = −0.21371. These estimates are assumed to be valid for every station within the region.

6.4 Predictors Selection

For every predictand, there are about 176 potential predictors. The main reason this number is about five times larger than the 34 in the BPO is that the MOS employs grid-binary predictors: for each variable there are several cutoff levels, each of which generates a new predictor. The best predictors are selected sequentially according to the maximum-variance-reduction criterion of linear regression, with a stopping criterion whereby an additional predictor must reduce the variance by at least a specified threshold. Up to 15 predictors can be selected.

7. COMPARISON OF BPO WITH MOS

7.1 System Versus Technique

There is a fundamental distinction between a forecasting technique and a forecasting system, which for our purposes is this. A forecasting technique is essentially a forecasting equation with a generic statistical interpretation: Eqs. (4)–(6) for the BPO and Eq. (9) for the MOS.
A forecasting system is a conjunction of a forecasting technique and the processing software that an organization employs to process real-time data into operational forecasts. For instance, any comparison involving the MOS technique, as defined by Eq. (9) but outside its processing software, would be a sterile experiment, unrepresentative of the actual MOS system of the NWS. For, as explained in Section 6.2, the grid-binary transformations are an intrinsic, though often overlooked, part of that system: they require processing of the entire gridded fields of model outputs, they cannot be reproduced except through software, and they cannot be executed on data from an isolated station or an isolated grid point at which a comparison of techniques might be undertaken. Whereas it is of scientific interest to compare the BPO technique against the MOS technique and other traditional statistical techniques (several such comparisons have already been performed and will be reported in future publications), it is far more important to mission-oriented agencies to compare the prototype BPO system with the operational MOS system. In his review, C. Doswell concurred: "... it probably would be revealing to compare forecasts generated by the BPO method against the real operational MOS ... it would be a more convincing 'yardstick' for comparison and contrast".

7.2 Performance Measures

It is apparent that each system, BPO and MOS, processes information in a totally different manner. The objective of the following experiment is to compare the two systems with respect to the efficiency of extracting the predictive information from the same data record: the archive of the AVN model output.
Towards this end, two comparative verifications of forecasts are performed based on two input samples: (i) the estimation joint sample {(x, v)} from 4 years (April 1997 – March 2001), the same joint sample that was used for estimation of the family of likelihood functions (f0, f1) of the BPO; and (ii) the validation joint sample {(x, v)} from 2 1/2 years (April 2001 – September 2003), used solely for validation. Given an input sample (either the estimation sample or the validation sample), each system (BPO and MOS) is used to calculate the forecast probability π based on every realization of its predictors. (The MOS forecasts calculated from the validation sample are actually the operational AVN-MOS forecasts produced by the NWS during the 2 1/2 years; we simply re-calculated them.) Then the joint sample {(π, v)} of realizations of the forecast probability and the predictand is used to calculate the following performance measures.

The calibration function (CF): a graph of the conditional probability η(π) = P(V = 1 | Π = π) versus the forecast probability π.

The receiver operating characteristic (ROC): a graph of the probability of detection versus the probability of false alarm.

The calibration score (CS): the Euclidean distance (the square root of the expected quadratic difference) between the line of perfect calibration and the calibration function,

CS = {E([Π − η(Π)]²)}^(1/2);   0 ≤ CS ≤ 1.

The ROC score (RS): the area under the ROC (calculated from a piecewise linear estimate of the ROC using the trapezoidal rule); 1/2 ≤ RS ≤ 1. Some basic facts pertaining to this performance measure are as follows: (i) system A is more informative than system B if and only if the ROC of A is superior to the ROC of B; (ii) if system A is more informative than system B, then the RS of A is not smaller than the RS of B.

7.3 Comparative Verifications

Complete results are presented for the 6-h forecast period, 60–66 h after the model run.
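Given a joint sample {(π, v)}, the two scores CS and RS can be sketched as follows. Estimating η(π) by the relative frequency of V = 1 among forecasts with equal probability values is a simplification assumed here, not the paper's estimation of the calibration function.

```python
from collections import defaultdict

def roc_score(sample):
    """RS: area under the ROC (probability of detection versus probability
    of false alarm) via the trapezoidal rule; `sample` is a list of
    (forecast probability, indicator) pairs."""
    n1 = sum(v for _, v in sample)
    n0 = len(sample) - n1
    points = [(0.0, 0.0)]
    for t in sorted({p for p, _ in sample}, reverse=True):
        pod = sum(1 for p, v in sample if v == 1 and p >= t) / n1
        pofa = sum(1 for p, v in sample if v == 0 and p >= t) / n0
        points.append((pofa, pod))
    points.append((1.0, 1.0))
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

def calibration_score(sample):
    """CS = {E([Pi - eta(Pi)]^2)}^(1/2), with eta(pi) estimated as the
    relative frequency of V = 1 among forecasts with probability pi."""
    groups = defaultdict(list)
    for p, v in sample:
        groups[p].append(v)
    n = len(sample)
    return (sum(len(vs) * (p - sum(vs) / len(vs)) ** 2
                for p, vs in groups.items()) / n) ** 0.5

# A perfectly discriminating sample and a perfectly calibrated sample:
print(roc_score([(0.9, 1), (0.8, 1), (0.2, 0), (0.1, 0)]))  # 1.0
print(calibration_score([(0.5, 1), (0.5, 0)]))              # 0.0
```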
The BPO uses one predictor (mean relative humidity at 60 h, as detailed in Section 4); the MOS uses five predictors (as detailed in Section 6). Figure 10 shows the CF and the CS from every verification. Both BPO and MOS exhibit stable calibration across the two samples, the estimation sample and the validation sample. The MOS probabilities smaller than 0.4 are well calibrated, but those greater than 0.4 are poorly calibrated on both samples. The BPO probabilities are well calibrated on both samples. Based on the CS from the validation sample, the BPO is calibrated better than the MOS, by 0.034 on average (on the probability scale). Figure 11 shows the ROC and the RS from every verification. Both BPO and MOS exhibit stable informativeness across the two samples. For each sample, the two ROCs cross each other; thus neither system is more informative than the other. For each sample, the RS of the BPO is slightly higher than the RS of the MOS.

A summary of results is presented for three forecast periods, 6-h, 12-h, and 24-h, each beginning 60 h after the model run. Table 2 lists the predictors used by the BPO; Tables 3 and 4 report the scores from verifications on the estimation samples and on the validation samples. In all six cases, the BPO is calibrated significantly better than the MOS: the CS of the BPO is at least 50% smaller than the CS of the MOS. In five out of six cases, the RS of the BPO is slightly higher than the RS of the MOS. Finally, there is a consistent difference in terms of the number of "optimal predictors" selected for each system during its development: the BPO uses 1–2 predictors, which are always extracted directly from the output fields of the AVN model; the MOS uses 4–5 predictors, most of which are obtained through grid-binary transformations of the output fields of the AVN model (Section 6.2).

7.4 Explanations

Calibration. Why is it that the BPO is calibrated significantly better than the MOS?
Why is it that the MOS is poorly calibrated, contrary to the verification results of past studies? The explanation is twofold. First, as elaborated in Sections 4.1 and 4.4, the theoretic structure of the BPO forecasting equation (3) ensures the necessary condition for the forecast probability to be well calibrated against the prior (climatic) probability g input into the equation for a specific location and season. The ad hoc structure of the MOS forecasting equation (9) does not offer this property. Second, the good calibrations of the MOS reported in past studies (e.g., Murphy and Brown, 1984; Antolik, 2000) may have been an artifact of the analyses. For these studies did not verify the calibration of the MOS at any specific location (which is of import to the users of forecasts at that location), but instead pooled the verification samples from many locations into one national sample from which verification statistics were calculated. If the prior probability and the degree of calibration varied across locations, then the verification statistics obtained from the pooled sample did not pertain to any location and therefore would be misleading to users.

Informativeness. Why is it that the MOS needs two to four additional predictors to barely match the informativeness of the BPO? The explanation once again is twofold. First, the laws of probability theory, from which the BPO is derived, ensure the optimal structure of the BPO forecasting equation (3). The structure of the MOS forecasting equation (9) is different. Therefore, given any single predictor, the BPO system, if properly operationalized, can never be less informative than the MOS system (or any other non-Bayesian system, for that matter). To make up for the non-optimal theoretic structure, a non-Bayesian system needs additional predictors (which are conditionally informative in that system). Second, the grid-binary transform (Jensenius, 1992) was invented to improve the calibration of the MOS system.
But by mapping an original predictor (which is binary-continuous or continuous) into a binary predictor, this transform also removes part of the predictive information contained in the original predictor. In the examples reported herein, two to four additional predictors are needed to make up for the lost information and the non-optimal structure of the MOS forecasting equation. To dissect the predictive performance of the grid-binary transform, each system, MOS and BPO, was estimated and evaluated twice: first utilizing an original predictor, and next utilizing the grid-binary transformation of that predictor. There were two findings: (i) the use of the grid-binary transform in the MOS leads to a compromise, in that the transform improves the CS but deteriorates the RS; (ii) the use of the grid-binary transform in the BPO is unnecessary for calibration (because the BPO automatically calibrates the posterior probability against the specified prior probability) and is detrimental to informativeness (because it removes part of the predictive information contained in the original predictor).

8. SUMMARY

8.1 Bayesian Technique

1. The BPO for a binary predictand described herein is the first technique of its kind for probabilistic forecasting of weather variates: it produces the posterior probability of an event through Bayesian fusion of a prior (climatic) probability and a realization of predictors output from an NWP model.

2. The BPO implements Bayes theorem, which provides the correct theoretic structure of the forecasting equation, and employs the meta-Gaussian family of multivariate density functions, which provides a flexible and convenient parametric model. It can be estimated effectively from asymmetric samples: the climatic sample of the predictand (which is typically long), and the joint sample of the predictor vector and the predictand (which is typically short).

3. The development of the BPO has focused on quality of modeling and simplicity of estimation.
The BPO allows (i) the marginal conditional distribution functions of the predictors to be of any form (as typically they are non-Gaussian), and (ii) the conditional dependence structure between any two predictors to be non-linear and heteroscedastic (as typically is the case in meteorology). Despite this flexibility, the BPO requires the estimation of only distribution parameters and correlation coefficients. And the entire process of selecting predictors, choosing parametric distribution functions, and estimating parameters can be automated.

8.2 Preliminary Results

1. The PoP produced by the prototype BPO system is better calibrated than, and at least as informative as, the PoP produced by the operational MOS system for a specific location (and hence for a specific user).

2. The BPO utilizing one or two predictors performs, in terms of both calibration and informativeness, at least as well as the MOS utilizing four or five predictors. This suggests that the BPO is more efficient than the MOS in extracting predictive information from the output of an NWP model.

3. Every predictor in the BPO is a direct model output (interpolated to the station), whereas most predictors in the MOS are grid-binary predictors whose definitions require subjective experimentation (to set the cutoff levels and smoothing parameters) and algorithmic processing of the entire output fields (to calculate the predictor values). Thus, in terms of the definitions of predictors, the BPO is more parsimonious than the MOS.

8.3 Potential Implications

1. Inasmuch as the grid-binary predictors can be dispensed with (because only the basic and derived predictors need be considered by the BPO), the set of potential predictors for the BPO is about five times smaller than the set of potential predictors for the MOS. Consequently, the overall effort needed to select the most informative subset of predictors can be reduced substantially.
2. With fewer predictors (say, between one and four for the BPO, instead of between four and fifteen for the MOS), an extension of the BPO to processing an ensemble of the NWP model output will present a less demanding task (in terms of data storage and computing requirements) than would an extension of the MOS technique.

Acknowledgments. This material is based upon work supported by the National Science Foundation under Grant No. ATM-0135940, "New Statistical Techniques for Probabilistic Weather Forecasting". The Meteorological Development Laboratory of the National Weather Service provided the AVN-MOS database and the MOS forecasting equations for comparative verifications. The collaboration of Drs. Harry R. Glahn and Paul Dallavalle in this regard is much appreciated; the advice of Mark S. Antolik on accessing and interpreting the data is gratefully acknowledged.

APPENDIX A: NUMERICAL APPROXIMATION TO Q−1

Two approximations to the inverse of the standard normal distribution function, Q−1, can be found in Abramowitz and Stegun (1972, Chapter 26, p. 933). For operational forecasting, the 3-term rational-function approximation is sufficiently accurate, and we reproduce it below at the request of a reviewer. Given the probability p, the standard normal quantile z = Q−1(p) is approximated by

z ≈ −ẑ if 0 < p ≤ 0.5,   z ≈ ẑ if 0.5 ≤ p < 1,

where

ẑ = t − (a0 + a1 t) / (1 + b1 t + b2 t²),

with

t = [−2 ln p]^(1/2) if 0 < p ≤ 0.5,   t = [−2 ln(1 − p)]^(1/2) if 0.5 ≤ p < 1,

and

a0 = 2.30753,  a1 = 0.27061,  b1 = 0.99229,  b2 = 0.04481.

The error of this approximation is |ẑ − z| < 3 × 10−3.

REFERENCES

Abramowitz, M., and I.A. Stegun (eds.), 1972: Handbook of Mathematical Functions. Dover, Mineola, New York.

Alexandridis, M.G., and R. Krzysztofowicz, 1985: Decision models for categorical and probabilistic weather forecasts. Applied Mathematics and Computation, 17, 241–266.
Antolik, M.S., 2000: An overview of the National Weather Service's centralized statistical quantitative precipitation forecasts. Journal of Hydrology, 239(1–4), 306–337.

Blackwell, D., 1951: Comparison of experiments. Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, J. Neyman (ed.), University of California Press, Berkeley, pp. 93–102.

Blackwell, D., 1953: Equivalent comparisons of experiments. Annals of Mathematical Statistics, 24, 265–272.

de Finetti, B., 1974: Theory of Probability, vol. 1. John Wiley and Sons, New York.

DeGroot, M.H., 1970: Optimal Statistical Decisions. McGraw-Hill, New York.

DeGroot, M.H., 1988: A Bayesian view of assessing uncertainty and comparing expert opinion. Journal of Statistical Planning and Inference, 20, 295–306.

Edwards, W., L.D. Phillips, W.L. Hays, and B.C. Goodman, 1968: Probabilistic information processing systems: Design and evaluation. IEEE Transactions on Systems Science and Cybernetics, SSC-4(3), 248–265.

Glahn, H.R., and D.A. Lowry, 1972: The use of model output statistics (MOS) in objective weather forecasting. Journal of Applied Meteorology, 11(8), 1203–1211.

Herr, H.D., and R. Krzysztofowicz, 2005: Generic probability distribution of rainfall in space: The bivariate model. Journal of Hydrology, 306(1–4), 234–263.

Jensenius, J.S., Jr., 1992: The use of grid-binary variables as predictors for statistical weather forecasting. Preprints of the 12th Conference on Probability and Statistics in the Atmospheric Sciences, Toronto, Ontario, American Meteorological Society, pp. 225–230.

Kelly, K.S., and R. Krzysztofowicz, 1994: Probability distributions for flood warning systems. Water Resources Research, 30(4), 1145–1152.

Kelly, K.S., and R. Krzysztofowicz, 1995: Bayesian revision of an arbitrary prior density. Proceedings of the Section on Bayesian Statistical Science, American Statistical Association, 50–53.

Kelly, K.S., and R. Krzysztofowicz, 1997: A bivariate meta-Gaussian density for use in hydrology. Stochastic Hydrology and Hydraulics, 11(1), 17–31.

Krzysztofowicz, R., 1983: Why should a forecaster and a decision maker use Bayes theorem. Water Resources Research, 19(2), 327–336.

Krzysztofowicz, R., 1999: Bayesian forecasting via deterministic model. Risk Analysis, 19(4), 739–749.

Krzysztofowicz, R., 2002: Bayesian system for probabilistic river stage forecasting. Journal of Hydrology, 268(1–4), 16–40.

Krzysztofowicz, R., and H.D. Herr, 2001: Hydrologic uncertainty processor for probabilistic river stage forecasting: Precipitation-dependent model. Journal of Hydrology, 249(1–4), 46–68.

Krzysztofowicz, R., and D. Long, 1990: Fusion of detection probabilities and comparison of multisensor systems. IEEE Transactions on Systems, Man, and Cybernetics, 20(3), 665–677.

Krzysztofowicz, R., and D. Long, 1991: Forecast sufficiency characteristic: Construction and application. International Journal of Forecasting, 7(1), 39–45.

Lindley, D.V., 1987: The probability approach to the treatment of uncertainty in artificial intelligence and expert systems. Statistical Science, 2(1), 17–24.

Maranzano, C.J., and R. Krzysztofowicz, 2004: Bayesian processor of output for probabilistic forecasting of precipitation occurrence. Preprints of the 17th Conference on Probability and Statistics in the Atmospheric Sciences, Seattle, Washington, American Meteorological Society; paper 4.3.

McCullagh, P., and J.A. Nelder, 1989: Generalized Linear Models, 2nd ed. Chapman and Hall, Boca Raton.

Murphy, A.H., and B.G. Brown, 1984: A comparative evaluation of objective and subjective weather forecasts in the United States. Journal of Forecasting, 3, 369–393.

Sage, A.P., and J.L. Melsa, 1971: Estimation Theory with Applications to Communications and Control. McGraw-Hill, New York.

Savage, L.J., 1954: The Foundations of Statistics. John Wiley and Sons, New York.

Table 1.
Sample sizes and estimates of the prior probability g of precipitation occurrence; 6-h period 1200–1800 UTC; Buffalo, NY.

Month   Oct    Nov    Dec    Jan    Feb    Mar    Cool Season
Size    184    180    197    205    175    191    1132
g       0.23   0.29   0.28   0.27   0.25   0.29   0.27

Table 2. Predictors used in the BPO system for PoP forecasts; forecast periods beginning 60 h after the 0000 UTC model run; cool season; Buffalo, NY.

Period         Predictor(s)
6-h (60–66)    Mean relative humidity at 60 h
12-h (60–72)   Total precipitation during 60–72 h; mean relative humidity at 60 h
24-h (60–84)   Total precipitation during 60–72 h; mean relative humidity at 72 h

Table 3. Comparison of the BPO system with the MOS system for PoP forecasts; verification on the estimation sample (Apr. 1997 – Mar. 2001); forecast periods beginning 60 h after the 0000 UTC model run; cool season; Buffalo, NY.

Period         System   Number of    Calibration   ROC        Sample
                        Predictors   Score, CS     Score, RS  Size
6-h (60–66)    BPO      1            0.031         0.818      698
               MOS      5            0.085         0.815      713
12-h (60–72)   BPO      2            0.051         0.835      688
               MOS      4            0.105         0.828      703
24-h (60–84)   BPO      2            0.049         0.780      667
               MOS      5            0.112         0.781      682

Table 4. Comparison of the BPO system with the MOS system for PoP forecasts; verification on the validation sample (Apr. 2001 – Sept. 2003); forecast periods beginning 60 h after the 0000 UTC model run; cool season; Buffalo, NY.

Period         System   Number of    Calibration   ROC        Sample
                        Predictors   Score, CS     Score, RS  Size
6-h (60–66)    BPO      1            0.031         0.815      363
               MOS      5            0.065         0.808      363
12-h (60–72)   BPO      2            0.042         0.829      360
               MOS      4            0.092         0.814      360
24-h (60–84)   BPO      2            0.063         0.810      348
               MOS      5            0.133         0.807      348

FIGURE CAPTIONS

Figure 1. Empirical distribution functions and parametric distribution functions Fv of the mean relative humidity at 60 h, X, output from the AVN model, conditional on the precipitation event V = v (precipitation nonoccurrence, v = 0; precipitation occurrence, v = 1); 6-h forecast period 1200–1800 UTC, beginning 60 h after the 0000 UTC model run; cool season; Buffalo, NY.

Figure 2.
Conditional density functions fv corresponding to the parametric conditional distribution functions Fv (v = 0, 1) shown in Fig. 1.

Figure 3. Receiver operating characteristics of two predictors: mean relative humidity at 60 h, and 850 hPa relative vorticity at 63 h, for forecasting precipitation occurrence during 6-h period 1200–1800 UTC, beginning 60 h after the 0000 UTC model run; cool season; Buffalo, NY.

Figure 4. Posterior probability π of precipitation occurrence as a function of the mean relative humidity at 60 h, x, output from the AVN model, for three values of the prior probability g; 6-h forecast period 1200–1800 UTC, beginning 60 h after the 0000 UTC model run; cool season; Buffalo, NY.

Figure 5. Empirical distribution functions and parametric distribution functions Fv of the 850 hPa relative vorticity at 63 h, X, output from the AVN model, conditional on precipitation event V = v (precipitation nonoccurrence, v = 0; precipitation occurrence, v = 1); 6-h forecast period 1200–1800 UTC, beginning 60 h after the 0000 UTC model run; cool season; Buffalo, NY.

Figure 6. Conditional density functions fv corresponding to the parametric conditional distribution functions Fv (v = 0, 1) shown in Fig. 5.

Figure 7. Posterior probability π of precipitation occurrence as a function of the 850 hPa relative vorticity at 63 h, x, output from the AVN model, for three values of the prior probability g; 6-h forecast period 1200–1800 UTC, beginning 60 h after the 0000 UTC model run; cool season; Buffalo, NY.

Figure 8.
Scatterplots of the 850 hPa relative vorticity at 63 h, X2, versus the mean relative humidity at 60 h, X1, conditional on: (a) precipitation nonoccurrence, V = 0; (b) precipitation occurrence, V = 1; and scatterplots of the standard normal predictors Z2 and Z1, conditional on: (c) precipitation nonoccurrence, V = 0; (d) precipitation occurrence, V = 1; 6-h forecast period 1200–1800 UTC, beginning 60 h after the 0000 UTC model run; cool season; Buffalo, NY.

Figure 9. Scatterplots of the 850 hPa relative vorticity at 36 h, X2, versus the 850 hPa relative humidity at 36 h, X1, conditional on: (a) precipitation nonoccurrence, V = 0; (b) precipitation occurrence, V = 1; and scatterplots of the standard normal predictors Z2 and Z1, conditional on: (c) precipitation nonoccurrence, V = 0; (d) precipitation occurrence, V = 1; 24-h forecast period 1200–1200 UTC, beginning 36 h after the 0000 UTC model run; warm season; Quillayute, WA.

Figure 10. Calibration functions of the BPO using one predictor and of the MOS using five predictors; PoP forecasts for 6-h period 1200–1800 UTC, beginning 60 h after the 0000 UTC model run; cool season; Buffalo, NY: (a) BPO — estimation sample, (b) MOS — estimation sample, (c) BPO — validation sample, (d) MOS — validation sample. (Above each point is the number of forecasts used to estimate that point.)

Figure 11. Receiver operating characteristics of the BPO using one predictor and of the MOS using five predictors; PoP forecasts for 6-h period 1200–1800 UTC, beginning 60 h after the 0000 UTC model run; cool season; Buffalo, NY: (a) estimation sample, (b) validation sample.
[Figures 1–8 appear here as full-page plots; captions as listed above. The scatterplots of Fig. 8 are annotated with correlation coefficients: for v = 0, ρ12 = 0.559 (panel a) and γ12 = 0.577 (panel c); for v = 1, ρ12 = 0.578 (panel b) and γ12 = 0.596 (panel d).]
[Figure 9 appears here as a full-page plot; caption as listed above. Its scatterplots are annotated with correlation coefficients: for v = 0, ρ12 = 0.011 (panel a) and γ12 = 0.012 (panel c); for v = 1, ρ12 = 0.358 (panel b) and γ12 = 0.373 (panel d).]
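The standard normal predictors Z1 and Z2 in panels (c) and (d) of Figs. 8 and 9 are obtained via the normal quantile transform z = Q−1(Fv(x)), where Q is the standard normal distribution function. A minimal sketch, which for illustration approximates Fv by the empirical (plotting-position) distribution function rather than the fitted parametric one:

```python
from statistics import NormalDist

def normal_quantile_transform(sample):
    """Map a conditional sample of a predictor X to standard normal
    scores z = Q^-1(F(x)).  Here F is approximated by the Weibull
    plotting position rank/(n+1), so F never reaches 0 or 1."""
    n = len(sample)
    order = sorted(range(n), key=lambda i: sample[i])
    q_inv = NormalDist().inv_cdf  # Q^-1, inverse of the standard normal CDF
    z = [0.0] * n
    for rank, i in enumerate(order, start=1):
        z[i] = q_inv(rank / (n + 1))
    return z

# Example: mean relative humidity values (hypothetical, in percent)
z = normal_quantile_transform([62.0, 88.0, 71.0, 95.0, 80.0])
```

After this transform, the product-moment correlation of (Z1, Z2), the γ values annotated on panels (c) and (d), summarizes the dependence between predictors in the normal space used by the meta-Gaussian likelihood model.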
[Figures 10 and 11 appear here as full-page plots; captions as listed above. The CS and RS values annotated on the plots are those reported in Tables 3 and 4.]
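The posterior probability curves of Figs. 4 and 7 follow from Bayes theorem for the binary predictand, π = g f1(x) / [g f1(x) + (1 − g) f0(x)], which fuses the prior probability g with the conditional densities f0 and f1 of the predictor. A numeric sketch, using Gaussian conditional densities with made-up parameters purely for illustration (the BPO itself employs meta-Gaussian likelihoods fitted to the samples):

```python
from statistics import NormalDist

def posterior_pop(x, g, f0, f1):
    """Bayes theorem for a binary predictand:
    pi = g*f1(x) / (g*f1(x) + (1 - g)*f0(x))."""
    num = g * f1(x)
    return num / (num + (1.0 - g) * f0(x))

# Illustrative conditional densities of a humidity-like predictor
# (Gaussian with hypothetical parameters, not the fitted ones):
f0 = NormalDist(mu=55, sigma=15).pdf  # given precipitation nonoccurrence
f1 = NormalDist(mu=85, sigma=10).pdf  # given precipitation occurrence

g = 0.27  # cool-season prior probability of precipitation (Table 1)
pi = posterior_pop(90.0, g, f0, f1)  # high humidity raises pi well above g
```

Because the prior g enters only through this fusion, the same likelihood functions can serve any location-specific climatic prior, which is the behavior displayed by the three curves in Figs. 4 and 7.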