Choosing a Probability Distribution

Delivering Integrated, Sustainable,
Water Resources Solutions
Institute for Water Resources
2010
Choosing a Probability Distribution
Charles Yoe, Ph.D.
“Building Strong”
Probability x Consequence
• Quantitative risk assessment requires you to
use probability
• Sometimes you will estimate the probability
of an event
• Sometimes you will use distributions to
– Describe data
– Model variability
– Represent our uncertainty
• What distribution do you use?
Probability—Language of
Random Variables
• Constant
• Variables
• Some things vary predictably
• Some things vary unpredictably
• Random variables
• A random variable can be something that is known, just not known by us
Checklist for Choosing a Distribution
From Some Data
1. Can you use your data?
2. Understand your variable
a) Source of data
b) Continuous/discrete
c) Bounded/unbounded
d) Meaningful parameters
   – Do you know them? (1st or 2nd order)
e) Univariate/multivariate
3. Look at your data: plot it
4. Use theory
5. Calculate statistics
6. Use previous experience
7. Distribution fitting
8. Expert opinion
9. Sensitivity analysis
First!
• Do you have data?
• If so, do you need a distribution or can
you just use your data?
• Answer depends on the question(s) you’re
trying to answer as well as your data
Use Data
• If your data are representative of the population germane to your problem, use them
• One problem could be bounding data
– What are the true min & max?
• Any dataset can be converted into a
– Cumulative distribution function
– General density function
Fitting Empirical Distribution to Data
• If continuous & reasonably
extensive
• May have to estimate
minimum & maximum
• Rank data x(i) in
ascending order
• Calculate the percentile for
each value
• Use data and percentiles
to create cumulative
distribution function
Index   Data Value   Cumulative Probability F(x) = i/19
0       0            0
1       0.9          0.053
2       3.6          0.105
3       5.0          0.158
4       6.0          0.211
5       11.7         0.263
6       16.2         0.316
7       16.5         0.368
8       22.2         0.421
9       22.7         0.474
10      23.2         0.526
11      24.5         0.579
12      24.9         0.632
13      25.8         0.684
14      33.3         0.737
15      33.4         0.789
16      34.7         0.842
17      40.2         0.895
18      44.2         0.947
19      60.0         1
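The ranking and percentile steps for this slide can be sketched in Python. This is a minimal illustration, assuming the 20 tabulated values and straight-line interpolation between them; the function names are hypothetical and not part of the original slides.

```python
from bisect import bisect_right

# The 20 tabulated values, already ranked in ascending order.
data = [0, 0.9, 3.6, 5.0, 6.0, 11.7, 16.2, 16.5, 22.2, 22.7,
        23.2, 24.5, 24.9, 25.8, 33.3, 33.4, 34.7, 40.2, 44.2, 60.0]

def cumulative_probabilities(sorted_values):
    """Assign F(x_i) = i / (n - 1) to the i-th ranked value (i = 0 .. n-1)."""
    n = len(sorted_values)
    return [i / (n - 1) for i in range(n)]

def empirical_cdf(x, sorted_values, probs):
    """Linearly interpolate the cumulative probability at x."""
    if x <= sorted_values[0]:
        return 0.0
    if x >= sorted_values[-1]:
        return 1.0
    j = bisect_right(sorted_values, x)      # first index whose value exceeds x
    x0, x1 = sorted_values[j - 1], sorted_values[j]
    p0, p1 = probs[j - 1], probs[j]
    return p0 + (p1 - p0) * (x - x0) / (x1 - x0)

def empirical_quantile(p, sorted_values, probs):
    """Inverse of the CDF: the value at cumulative probability p (0 <= p <= 1)."""
    j = min(bisect_right(probs, p), len(probs) - 1)
    p0, p1 = probs[j - 1], probs[j]
    x0, x1 = sorted_values[j - 1], sorted_values[j]
    return x0 + (x1 - x0) * (p - p0) / (p1 - p0)

probs = cumulative_probabilities(data)
print(empirical_cdf(30.0, data, probs))
print(empirical_quantile(0.5, data, probs))
```

Passing uniform random numbers between 0 and 1 to empirical_quantile would draw samples from the empirical distribution, which is one way to use the data directly in a simulation.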
When You Can’t Use Your Data
• Given the wide variety of distributions, it is not always easy to select the most appropriate one
– Results can be very sensitive to distribution choice
• Using a wrong assumption in a model can produce incorrect results => poor decisions => undesirable outcomes
Understand Your Data
• What is the source of the data?
– Experiments
– Observation
– Surveys
– Computer databases
– Literature searches
– Simulations
– Test case
• The source of the data may affect your decision to use it or not.
Understand your variable
Type of Variable?
• Is your variable discrete or continuous?
• Do not overlook this!
– Discrete distributions: the variable takes one of a set of identifiable values, each of which has a calculable probability of occurrence
– Continuous distributions: the variable can take any value within a defined range
• Discrete examples
– Barges in a tow
– Houses in floodplain
– People at a meeting
– Results of a diagnostic test
– Casualties per year
– Relocations and acquisitions
• Continuous examples
– Average number of barges per tow
– Weight of an adult striped bass
– Sensitivity or specificity of a diagnostic test
– Transit time
– Expected annual damages
– Duration of a storm
– Shoreline eroded
– Sediment loads
Understand your variable
What Values Are Possible?
• Is your variable bounded or unbounded?
– Bounded: value confined to lie between two determined values
– Unbounded: value theoretically extends from minus infinity to plus infinity
– Partially bounded: constrained at one end (truncated distributions)
• Use a distribution that matches the possible values of your variable
Understand your variable
Continuous Distribution Examples
• Unbounded
– Normal
– t
– Logistic
• Left Bounded
– Chi-square
– Exponential
– Gamma
– Lognormal
– Weibull
• Bounded
– Beta
– Cumulative
– General/histogram
– Pert
– Uniform
– Triangle
Understand your variable
Discrete Distribution Examples
• Unbounded
– None
• Left Bounded
– Poisson
– Negative binomial
– Geometric
• Bounded
– Binomial
– Hypergeometric
– Discrete
– Discrete Uniform
Understand your variable
Are There Parameters?
• Does your variable have parameters that are
meaningful?
– Parametric--shape is determined by the
mathematics describing a conceptual probability
model
• Require a greater knowledge of the underlying
– Non-parametric—empirical distributions for which
the mathematics is defined by the shape required
• Intuitively easy to understand
• Flexible and therefore useful
Understand your variable
Choose Parametric Distribution If
• Theory supports choice
• Distribution proven accurate for modelling
your specific variable (without theory)
• Distribution matches any observed data
well
• Need distribution with tail extending
beyond the observed minimum or
maximum
Understand your variable
Choose Non-Parametric Distribution If
• Theory is lacking
• There is no commonly used model
• Data are severely limited
• Knowledge is limited to general beliefs and some evidence
Understand your variable
Parametric and Non-Parametric
• Parametric
– Normal
– Lognormal
– Exponential
– Poisson
– Binomial
– Gamma
• Non-parametric
– Uniform
– Pert
– Triangular
– Cumulative
Understand your variable
Do You Know the Parameters?
• Probability distribution with precisely
known parameters (N(100,10)) is called a
1st order distribution
• Probability distribution with some
uncertainty about its parameters (N(m,s))
is called a 2nd order distribution
• Risknormal(risktriang(90,100,103),riskuniform(8,11))
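The difference between a 1st order and a 2nd order distribution can be sketched with Python's standard library. This is a hedged illustration, not the spreadsheet add-in's implementation: it assumes the nested expression means a normal whose mean is triangular (min 90, most likely 100, max 103) and whose standard deviation is uniform between 8 and 11. The function name and the split into outer and inner loops are assumptions made for illustration.

```python
import random
import statistics

def sample_second_order_normal(n_outer, n_inner, rng):
    """Draw from N(m, s) where m and s are themselves uncertain.

    Outer loop: sample the parameters (uncertainty about the distribution).
    Inner loop: sample values given those parameters (variability).
    """
    draws = []
    for _ in range(n_outer):
        # random.triangular takes (low, high, mode), not (min, mode, max).
        mean = rng.triangular(90, 103, 100)
        sd = rng.uniform(8, 11)
        draws.extend(rng.gauss(mean, sd) for _ in range(n_inner))
    return draws

rng = random.Random(42)  # fixed seed so the sketch is repeatable
first_order = [rng.gauss(100, 10) for _ in range(20_000)]
second_order = sample_second_order_normal(200, 100, rng)

print("1st order mean, sd:", statistics.mean(first_order), statistics.stdev(first_order))
print("2nd order mean, sd:", statistics.mean(second_order), statistics.stdev(second_order))
```

Setting n_inner to 1 resamples the parameters for every value, which is the simplest reading of a nested expression; a larger n_inner keeps parameter uncertainty and variability visibly separate.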
Understand your variable
Is It Dependent on Other Variables?
• Univariate and multivariate distributions
– Univariate: describes a single parameter or variable that is not probabilistically linked to any other in the model
– Multivariate: describes several parameters that are probabilistically linked in some way
• Engineering relationships are often
multivariate
Understand your variable
Continuing Checklist for Choosing a
Distribution
3. Look at your data: plot it
4. Use theory
5. Calculate statistics
6. Use previous experience
7. Distribution fitting
8. Expert opinion
9. Sensitivity analysis
Plot--Old Faithful Eruptions
• What do your data
look like?
• You could calculate the mean & SD and assume it's normal
• Beware, danger lurks
• Always plot your
data
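Why a plot matters more than a mean and SD can be sketched with a quick text histogram. The sample below is synthetic and invented for illustration only (it is not the Old Faithful measurements); the two-cluster shape and the helper name text_histogram are assumptions.

```python
import random
import statistics

def text_histogram(values, bin_width):
    """Return (bin_start, count) pairs for equal-width bins."""
    low = min(values)
    counts = {}
    for v in values:
        k = int((v - low) // bin_width)
        counts[k] = counts.get(k, 0) + 1
    return [(low + k * bin_width, counts.get(k, 0))
            for k in range(max(counts) + 1)]

rng = random.Random(1)
# Synthetic two-cluster sample, invented for illustration only.
sample = ([rng.gauss(2.0, 0.3) for _ in range(100)] +
          [rng.gauss(4.5, 0.4) for _ in range(180)])

mean = statistics.mean(sample)
sd = statistics.stdev(sample)
print(f"mean = {mean:.2f}, sd = {sd:.2f}")
for start, count in text_histogram(sample, 0.25):
    print(f"{start:5.2f} | {'#' * count}")
```

In a sample like this the mean falls in the gap between the two clusters, so a normal distribution built from the mean and SD would put most of its weight where few observations occur.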
Which Distribution?
• Examine your plot
• Look for distinctive shapes of specific distributions
– Single peaks
– Symmetry
– Positive skew
– Negative values
– Gamma, Weibull, beta are useful and flexible forms
Theory-Based Choice
• Most compelling reason for choice
• Formal theory
– Central limit theorem
• Theoretical knowledge of the variable
– Behavior
– Math—range
• Informal theory
– Sums normal, products lognormal
– Study specific
– Your best documented thoughts on subject
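The informal rule "sums normal, products lognormal" can be checked with a small simulation. This is a sketch under stated assumptions: the inputs are arbitrary positive uniform factors chosen for illustration, and skewness near zero is used as a rough stand-in for "looks normal".

```python
import math
import random
import statistics

def skewness(values):
    """Sample skewness: mean cubed deviation over the cubed standard deviation."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)

rng = random.Random(7)
n_terms, n_trials = 30, 5000

sums, products = [], []
for _ in range(n_trials):
    # Arbitrary positive inputs, chosen only for illustration.
    factors = [rng.uniform(0.5, 1.5) for _ in range(n_terms)]
    sums.append(sum(factors))
    products.append(math.prod(factors))

# If the products are roughly lognormal, their logs are roughly normal.
log_products = [math.log(p) for p in products]

print("skew of sums:         ", round(skewness(sums), 2))
print("skew of products:     ", round(skewness(products), 2))
print("skew of log(products):", round(skewness(log_products), 2))
```

The sums and the logged products should come out close to symmetric while the raw products are strongly right-skewed, which is the pattern the central limit theorem predicts for sums of many independent terms.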
Calculate Statistics
• Summary statistics may provide clues
• Normal
– Low coefficient of variation
– Equal mean and median
• Exponential has positive skew
– Equal mean and standard deviation
• Consider outliers
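The clues listed on this slide can be computed in a few lines. The sketch below uses synthetic samples and a hypothetical helper, summary_clues, to show what "equal mean and median" and "equal mean and standard deviation" look like in practice.

```python
import random
import statistics

def summary_clues(values):
    """Statistics that hint at a distribution family."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "sd": sd,
        "cv": sd / mean,  # coefficient of variation; assumes mean is not zero
    }

rng = random.Random(3)
# Synthetic samples for illustration only.
normal_like = [rng.gauss(50, 5) for _ in range(5000)]
exponential_like = [rng.expovariate(1 / 50) for _ in range(5000)]

for name, values in [("normal", normal_like), ("exponential", exponential_like)]:
    clues = summary_clues(values)
    print(name, {k: round(v, 2) for k, v in clues.items()})
```

For the normal sample the mean and median nearly coincide and the coefficient of variation is low; for the exponential sample the mean and standard deviation are close (coefficient of variation near 1) and the median sits below the mean, reflecting positive skew.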
Outliers
• Extreme observations can drastically
influence a probability model
• No prescriptive method for addressing
them
• If the observation is an error, remove it
• If not, what is the data point telling you?
– What about your world-view is inconsistent
with this result?
– Should you reconsider your perspective?
– What possible explanations have you not yet
considered?
Outliers (cont)
• Your explanation must be correct, not
merely plausible
– Consensus is poor measure of truth
• If you must keep it and can't explain it
– Use conventional practices and live with
skewed consequences
– Choose methods less sensitive to such
extreme observations (Gumbel, Weibull)
Previous Experience
• Have you dealt with this issue
successfully before? Have others?
• What did other analyses or risk
assessments use?
• What does the literature reveal?
Goodness of Fit
• Provides statistical evidence to test
hypothesis that your data could have come
from a specific distribution
• H0: these data come from an “x” distribution
• A small test statistic and a large p-value mean you fail to reject H0
• It is another piece of evidence, not a determining factor
GOF Tests
• Chi-Square Test
– Most common: discrete & continuous
– Data are divided into a number of cells, each cell with at least five observations
– Usually 50 observations or more
• Kolmogorov-Smirnov Test
– More suitable for small samples than Chi-Square
– Detects differences near the middle of the distribution better than in the tails
• Anderson-Darling Test
– Weights differences between theoretical and empirical distributions at their tails greater than at their midranges
– Desirable when a better fit at the extreme tails of the distribution is needed
Kolmogorov-Smirnov Statistic
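[Figure: cumulative distribution of the data (blue) overlaid on a fitted Normal(25.2290, 4.9645) curve (red); x-axis from 5 to 40, y-axis from 0.0 to 1.0; 90.0% of the fitted distribution lies between 17.06 and 33.39, with 5.0% in each tail]
• Blue = data
• Red = true/hypothetical
• Find biggest difference between the two
• K-S statistic is largest difference consistent with your
– n
– α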
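Finding the biggest difference between the two curves can be sketched in plain Python. This is an illustration under stated assumptions: the sample reuses the 20 values from the empirical distribution table earlier in the deck (not the data behind this slide's figure), the hypothesized distribution is a normal fitted to that sample, and the 5% threshold uses a common large-sample approximation.

```python
import math
import statistics

def normal_cdf(x, mu, sigma):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(values, cdf):
    """Largest vertical gap between the empirical step CDF and a hypothesized CDF."""
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# Illustrative sample: the 20 values from the empirical distribution table.
sample = [0, 0.9, 3.6, 5.0, 6.0, 11.7, 16.2, 16.5, 22.2, 22.7,
          23.2, 24.5, 24.9, 25.8, 33.3, 33.4, 34.7, 40.2, 44.2, 60.0]

mu = statistics.mean(sample)
sigma = statistics.stdev(sample)
d = ks_statistic(sample, lambda x: normal_cdf(x, mu, sigma))

# Approximate 5% critical value. It depends on n and alpha, and is strictly
# valid only when the parameters were not estimated from the same data.
critical_5pct = 1.36 / math.sqrt(len(sample))

print(f"K-S statistic D = {d:.3f}, approximate 5% critical value = {critical_5pct:.3f}")
```

A statistic below the critical value means the data are consistent with the hypothesized distribution at that sample size and significance level; it does not prove the distribution is correct.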
Defining Distributions w/ Expert Opinion
• Data never collected
• Data too expensive or impossible
• Past data irrelevant
• Opinion needed to fill holes in sparse data
• New area of inquiry, unique situation that never existed
What Experts Estimate
• The distribution itself
– Judgment about distribution of value in
population
– E.g. population is normal
• Parameters of the distribution
– E.g. mean is x and standard deviation is y
Modeling Techniques
• Disaggregation (Reduction)
• Subjective Probability Elicitation
• PDF or CDF
• Parametric or Non-parametric
distributions
Elicitation Techniques Needed
• Literature shows we do not assess
subjective probabilities well
• In part due to heuristics we use
– Representativeness
– Availability
– Anchoring and adjustment
• There are methods to counteract our
heuristics and to elicit our expert
knowledge
Sensitivity Analysis
• Unsure which is the best distribution?
• Try several
– If there is no difference, you are free to use any one
– Significant differences mean doing more work
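Trying several distributions can be as simple as rerunning the model with each candidate and comparing the outputs that matter to the decision. The sketch below is hypothetical throughout: the toy model (output = 3 × input²), the three candidate distributions, and their parameters are all assumptions chosen for illustration.

```python
import math
import random
import statistics

def percentile(values, p):
    """Simple nearest-rank percentile for 0 < p <= 100."""
    ordered = sorted(values)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

def simulate(draw_input, n, rng):
    """Run a toy model: output = 3 * input ** 2 (hypothetical)."""
    return [3.0 * draw_input(rng) ** 2 for _ in range(n)]

# Candidate input distributions with similar centers but different shapes.
candidates = {
    "triangular": lambda rng: rng.triangular(5, 15, 10),  # (low, high, mode)
    "uniform": lambda rng: rng.uniform(5, 15),
    "normal": lambda rng: rng.gauss(10, 2),
}

results = {}
for name, draw in candidates.items():
    outputs = simulate(draw, 20_000, random.Random(11))
    results[name] = (statistics.mean(outputs), percentile(outputs, 95))
    print(f"{name:10s} mean = {results[name][0]:7.1f}   95th pct = {results[name][1]:7.1f}")
```

If the decision depends on the mean, the candidates may look interchangeable; if it depends on an upper percentile, the differences between them are more likely to matter and signal that more work is needed.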
Take Away Points
• Choosing the best distribution is where
most new risk assessors feel least
comfortable.
• Choice of distribution matters.
• Distributions come from data and expert
opinion.
• Distribution fitting should never be the
basis for distribution choice.
Questions?
Charles Yoe, Ph.D.
cyoe1@verizon.net