Delivering Integrated, Sustainable, Water Resources Solutions
Institute for Water Resources 2010
Choosing a Probability Distribution
Charles Yoe, Ph.D.
“Building Strong”

Probability x Consequence
• Quantitative risk assessment requires you to use probability
• Sometimes you will estimate the probability of an event
• Sometimes you will use distributions to
  – Describe data
  – Model variability
  – Represent our uncertainty
• What distribution do you use?

Probability: Language of Random Variables
• Constant
• Variables
• Some things vary predictably
• Some things vary unpredictably
• Random variables
• It can be something known but not known by us

Checklist for Choosing a Distribution From Some Data
1. Can you use your data?
2. Understand your variable
   a) Source of data
   b) Continuous/discrete
   c) Bounded/unbounded
   d) Meaningful parameters
      a) Do you know them? (1st or 2nd order)
   e) Univariate/multivariate
3. Look at your data: plot it
4. Use theory
5. Calculate statistics
6. Use previous experience
7. Distribution fitting
8. Expert opinion
9. Sensitivity analysis

First!
• Do you have data?
• If so, do you need a distribution or can you just use your data?
• Answer depends on the question(s) you’re trying to answer as well as your data

Use Data
• If your data are representative of the population germane to your problem, use them
• One problem could be bounding data
  – What are the true min & max?
• Any dataset can be converted into a
  – Cumulative distribution function
  – General density function

Fitting Empirical Distribution to Data
• If continuous & reasonably extensive
• May have to estimate minimum & maximum
• Rank data x(i) in ascending order
• Calculate the percentile for each value
• Use data and percentiles to create cumulative distribution function

Index i   Data Value x(i)   Cumulative Probability F(x) = i/19
0         0                 0
1         0.9               0.053
2         3.6               0.105
3         5.0               0.158
4         6.0               0.211
5         11.7              0.263
6         16.2              0.316
7         16.5              0.368
8         22.2              0.421
9         22.7              0.474
10        23.2              0.526
11        24.5              0.579
12        24.9              0.632
13        25.8              0.684
14        33.3              0.737
15        33.4              0.789
16        34.7              0.842
17        40.2              0.895
18        44.2              0.947
19        60.0              1

When You Can’t Use Your Data
• Given the wide variety of distributions, it is not always easy to select the most appropriate one
  – Results can be very sensitive to distribution choice
• Using a wrong assumption in a model can produce incorrect results → poor decisions → undesirable outcomes

Understand Your Data
(Understand your variable)
• What is the source of the data?
  – Experiments
  – Observation
  – Surveys
  – Computer databases
  – Literature searches
  – Simulations
  – Test case
• The source of the data may affect your decision to use it or not.

Type of Variable?
• Discrete examples
  – Barges in a tow
  – Houses in floodplain
  – People at a meeting
  – Results of a diagnostic test
  – Casualties per year
  – Relocations and acquisitions
• Continuous examples
  – Average number of barges per tow
  – Weight of an adult striped bass
  – Sensitivity or specificity of a diagnostic test
  – Transit time
  – Expected annual damages
  – Duration of a storm
  – Shoreline eroded
  – Sediment loads
• Is your variable discrete or continuous?
• Do not overlook this!
  – Discrete distributions: take one of a set of identifiable values, each of which has a calculable probability of occurrence
  – Continuous distributions: a variable that can take any value within a defined range
(Understand your variable)

What Values Are Possible?
(Understand your variable)
• Is your variable bounded or unbounded?
  – Bounded: value confined to lie between two determined values
  – Unbounded: value theoretically extends from minus infinity to plus infinity
  – Partially bounded: constrained at one end (truncated distributions)
• Use a distribution that matches

Continuous Distribution Examples
(Understand your variable)
• Unbounded
  – Normal
  – t
  – Logistic
• Left Bounded
  – Chi-square
  – Exponential
  – Gamma
  – Lognormal
  – Weibull
• Bounded
  – Beta
  – Cumulative
  – General/histogram
  – Pert
  – Uniform
  – Triangle

Discrete Distribution Examples
(Understand your variable)
• Unbounded
  – None
• Left Bounded
  – Poisson
  – Negative binomial
  – Geometric
• Bounded
  – Binomial
  – Hypergeometric
  – Discrete
  – Discrete Uniform

Are There Parameters?
• Does your variable have parameters that are meaningful?
  – Parametric: shape is determined by the mathematics describing a conceptual probability model
    • Require a greater knowledge of the underlying model
  – Non-parametric: empirical distributions for which the mathematics is defined by the shape required
    • Intuitively easy to understand
    • Flexible and therefore useful
(Understand your variable)

Choose Parametric Distribution If
(Understand your variable)
• Theory supports choice
• Distribution proven accurate for modelling your specific variable (without theory)
• Distribution matches any observed data well
• Need distribution with tail extending beyond the observed minimum or maximum

Choose Non-Parametric Distribution If
(Understand your variable)
• Theory is lacking
• There is no commonly used model
• Data are severely limited
• Knowledge is limited to general beliefs and some evidence

Parametric and Non-Parametric
(Understand your variable)
• Parametric
  – Normal
  – Lognormal
  – Exponential
  – Poisson
  – Binomial
  – Gamma
• Non-parametric
  – Uniform
  – Pert
  – Triangular
  – Cumulative

Do You Know the Parameters?
• A probability distribution with precisely known parameters, e.g. N(100,10), is called a 1st order distribution
• A probability distribution with some uncertainty about its parameters, e.g. N(m,s), is called a 2nd order distribution
• Example: Risknormal(risktriang(90,100,103),riskuniform(8,11))
(Understand your variable)

Is It Dependent on Other Variables?
(Understand your variable)
• Univariate and multivariate distributions
  – Univariate: describes a single parameter or variable that is not probabilistically linked to any other in the model
  – Multivariate: describes several parameters that are probabilistically linked in some way
• Engineering relationships are often multivariate

Continuing Checklist for Choosing a Distribution
3. Look at your data: plot it
4. Use theory
5. Calculate statistics
6. Use previous experience
7. Distribution fitting
8. Expert opinion
9. Sensitivity analysis

Plot: Old Faithful Eruptions
• What do your data look like?
• You could calculate the mean & SD and assume it's normal
• Beware, danger lurks
• Always plot your data

Which Distribution?
• Examine your plot
• Look for distinctive shapes of specific distributions
  – Single peaks
  – Symmetry
  – Positive skew
  – Negative values
• Gamma, Weibull, beta are useful and flexible forms

Theory-Based Choice
• Most compelling reason for choice
• Formal theory
  – Central limit theorem
• Theoretical knowledge of the variable
  – Behavior
  – Math: range
• Informal theory
  – Sums normal, products lognormal
  – Study specific
  – Your best documented thoughts on subject

Calculate Statistics
• Summary statistics may provide clues
• Normal
  – Low coefficient of variation
  – Equal mean and median
• Exponential has positive skew
  – Equal mean and standard deviation
• Consider outliers

Outliers
• Extreme observations can drastically influence a probability model
• No prescriptive method for addressing them
• If the observation is an error, remove it
• If not, what is the data point telling you?
  – What about your world-view is inconsistent with this result?
  – Should you reconsider your perspective?
  – What possible explanations have you not yet considered?

Outliers (cont)
• Your explanation must be correct, not merely plausible
  – Consensus is a poor measure of truth
• If you must keep it and can't explain it
  – Use conventional practices and live with skewed consequences
  – Choose methods less sensitive to such extreme observations (Gumbel, Weibull)

Previous Experience
• Have you dealt with this issue successfully before? Have others?
• What did other analyses or risk assessments use?
• What does the literature reveal?
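
Worked sketch for the Calculate Statistics step: the clues listed there (coefficient of variation, mean versus median, mean versus standard deviation, skew) can be computed in a few lines. This is only an illustration, not part of the original slides. It reuses the 20 data values from the empirical-distribution table as example input; the function name `summary_clues` and the moment-based skewness estimate are assumptions made for the sketch.

```python
import statistics

# Example input: the 20 values from the empirical-distribution table in this deck.
DATA = [0, 0.9, 3.6, 5.0, 6.0, 11.7, 16.2, 16.5, 22.2, 22.7,
        23.2, 24.5, 24.9, 25.8, 33.3, 33.4, 34.7, 40.2, 44.2, 60.0]


def summary_clues(data):
    """Return the summary statistics the slide suggests checking (hypothetical helper)."""
    n = len(data)
    mean = statistics.mean(data)
    median = statistics.median(data)
    sd = statistics.stdev(data)          # sample standard deviation
    cv = sd / mean                       # coefficient of variation
    # Simple moment-based skewness; a positive value means a longer right tail.
    pop_sd = statistics.pstdev(data)
    skew = sum((x - mean) ** 3 for x in data) / n / pop_sd ** 3
    return {"mean": mean, "median": median, "sd": sd, "cv": cv, "skew": skew}


if __name__ == "__main__":
    for name, value in summary_clues(DATA).items():
        print(f"{name:7s} {value:8.3f}")
```

Reading the result against the slide: a mean close to the median with a low coefficient of variation points toward a normal; a mean close to the standard deviation with positive skew points toward an exponential. These are clues to weigh with the plot and theory, not a decision rule.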

Goodness of Fit
• Provides statistical evidence to test the hypothesis that your data could have come from a specific distribution
• H0: these data come from an “x” distribution
• A small test statistic and a large p-value mean you fail to reject H0
• It is another piece of evidence, not a determining factor

GOF Tests
• Chi-Square Test
  – Most common: discrete & continuous
  – Data are divided into a number of cells, each cell with at least five observations
  – Usually 50 observations or more
• Kolmogorov-Smirnov Test
  – More suitable for small samples than Chi-Square
  – Better fit for means than tails
• Anderson-Darling Test
  – Weights differences between theoretical and empirical distributions at their tails greater than at their midranges
  – Desirable when a better fit at the extreme tails of the distribution is desired

Kolmogorov-Smirnov Statistic
[Figure: cumulative distribution of the data plotted against a fitted Normal(25.2290, 4.9645); markers at 17.06 and 33.39 bracket the central 90%, with 5% in each tail]
• Blue = data
• Red = true/hypothetical
• Find the biggest difference between the two
• Compare the K-S statistic with the largest difference consistent with your
  – n
  – α

Defining Distributions w/ Expert Opinion
• Data never collected
• Data too expensive or impossible
• Past data irrelevant
• Opinion needed to fill holes in sparse data
• New area of inquiry, unique situation that never existed

What Experts Estimate
• The distribution itself
  – Judgment about distribution of value in population
  – E.g. population is normal
• Parameters of the distribution
  – E.g.
mean is x and standard deviation is y

Modeling Techniques
• Disaggregation (Reduction)
• Subjective Probability Elicitation
• PDF or CDF
• Parametric or Non-parametric distributions

Elicitation Techniques Needed
• Literature shows we do not assess subjective probabilities well
• In part due to heuristics we use
  – Representativeness
  – Availability
  – Anchoring and adjustment
• There are methods to counteract our heuristics and to elicit our expert knowledge

Sensitivity Analysis
• Unsure which is the best distribution?
• Try several
  – If no difference you are free to use any one
  – Significant differences mean doing more work

Take Away Points
• Choosing the best distribution is where most new risk assessors feel least comfortable.
• Choice of distribution matters.
• Distributions come from data and expert opinion.
• Distribution fitting should never be the basis for distribution choice.

Questions?
Charles Yoe, Ph.D.
cyoe1@verizon.net
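
Supplementary sketch for the Sensitivity Analysis step ("try several"): one way to see whether the choice matters is to draw from each candidate and compare the summaries you care about. This is an illustration under stated assumptions, not the author's method. The three candidates (a normal using the sample mean and SD, a uniform over the observed min and max, and resampling the data directly), the helper names `percentile` and `compare_candidates`, the draw count, and the seed are all choices made for the example. The data are the 20 values from the empirical-distribution table.

```python
import random
import statistics

# Example input: the 20 values from the empirical-distribution table in this deck.
DATA = [0, 0.9, 3.6, 5.0, 6.0, 11.7, 16.2, 16.5, 22.2, 22.7,
        23.2, 24.5, 24.9, 25.8, 33.3, 33.4, 34.7, 40.2, 44.2, 60.0]


def percentile(values, p):
    """Return the p-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    k = (len(ordered) - 1) * p / 100
    lo = int(k)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (k - lo)


def compare_candidates(data, n_draws=20000, seed=1):
    """Draw from each candidate distribution and summarize mean and 95th percentile."""
    rng = random.Random(seed)
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    low, high = min(data), max(data)
    # Candidate distributions (assumed for illustration only).
    samplers = {
        "normal": lambda: rng.gauss(mean, sd),
        "uniform": lambda: rng.uniform(low, high),
        "empirical": lambda: rng.choice(data),
    }
    results = {}
    for name, draw in samplers.items():
        draws = [draw() for _ in range(n_draws)]
        results[name] = {"mean": statistics.mean(draws),
                         "p95": percentile(draws, 95)}
    return results


if __name__ == "__main__":
    for name, summary in compare_candidates(DATA).items():
        print(f"{name:10s} mean={summary['mean']:6.2f}  p95={summary['p95']:6.2f}")
```

If the summaries barely move across candidates, the slide's advice applies and any one will do; if they differ noticeably, that is the signal to do more work on the choice. In a real assessment the comparison would be made on the model output, not on the input draws alone.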