# Verified computation with probabilities Scott Ferson, Applied Biomathematics
```
Verified computation with probabilities
Scott Ferson, scott@ramas.com
Applied Biomathematics
Scientific Computing, Computer Arithmetic and Verified Numerical Computations
El Paso, Texas, 3 October 2008
Perspective
• Very elementary methods of interval analysis
– Low-dimensional, static
– Verified computing (but not roundoff error)
– Huge uncertainties
• Intervals combined with probability theory
– Total probabilities (events)
– Probability distributions (random variables)
Bounding probability is an old idea
• Boole and de Morgan
• Chebyshev and Markov
• Borel and Fréchet
• Kolmogorov and Keynes
• Berger and Walley
• Williamson and Downs
[Diagram relating five methods: deterministic calculation, probabilistic convolution, second-order probability, interval analysis, probability bounds analysis]
Terminology
• Dependence = stochastic dependence
  » More general than repeated variables
• Independence = stochastic independence
• Best possible = tight (almost)
  » Some elements in the set may not be possible
Incertitude
• Arises from incomplete knowledge
• Incertitude arises from
– limited sample size
– measurement uncertainty or surrogate data
• Reducible with empirical effort
Variability
• Arises from natural stochasticity
• Variability arises from
– scatter and variation
– spatial or temporal fluctuations
– manufacturing or genetic differences
• Not reducible by empirical effort
They must be treated differently
• Variability is randomness
Needs probability theory
• Incertitude is ignorance
Needs interval analysis
• Imprecise probabilities can do both at once
Risk assessment applications
• Environmental pollution
heavy metals, pesticides, PMx, ozone, PCBs, EMF, RF, etc.
• Engineered systems
traffic safety, bridge design, airplanes, spacecraft, nuclear plants
• Financial investments
portfolio planning, consultation, instrument evaluation
• Occupational hazards
manufacturing and factory workers, farm workers, hospital staff
• Food safety and consumer product safety
benzene in Perrier, E. coli in beef, salmonella in tomato, children’s toys
• Ecosystems and biological resources
endangered species, fisheries and reserve management
Probabilistic logic
Total probabilities (events)
• Fault or event trees
• Logical expressions (Hailperin 1986)
• Reliability analyses
– Nuclear power plants
– Aircraft safety system design
– Gene technology release assessments
– etc.
Interval arithmetic for probabilities
x + y = [x1 + y1, x2 + y2]
x − y = [x1 − y2, x2 − y1]
x × y = [x1 × y1, x2 × y2]
x ÷ y = [x1 ÷ y2, x2 ÷ y1]
min(x, y) = [min(x1, y1), min(x2, y2)]
max(x, y) = [max(x1, y1), max(x2, y2)]
Rules are simpler because intervals confined to [0,1]
Probabilistic logic
• Conjunctions (and)
• Disjunctions (or)
• Negations (not)
• Exclusive disjunctions (xor)
• Modus ponens (if-then)
• etc.
Conjunction (and)
P(A |&| B) = P(A) × P(B)
Example: P(A) = [0.3, 0.5]
         P(B) = [0.1, 0.2]
         A and B are independent
         P(A |&| B) = [0.03, 0.1]
Stochastic dependence
Independence
Probabilities are areas in the Venn diagrams
Fréchet case
P(A & B) = [max(0, P(A) + P(B) – 1), min(P(A), P(B))]
• Makes no assumption about the dependence
• Rigorous (guaranteed to enclose true value)
• Best possible (cannot be any tighter)
Fréchet examples
Examples: P(A) = [0.3, 0.5]
          P(B) = [0.1, 0.2]
          P(A & B) = [0, 0.2]
          P(C) = 0.29 (certain)
          P(D) = 0.22 (certain)
          P(C & D) = [0, 0.22] (uncertain)
Example: pump system
What’s the risk the tank ruptures under pumping?
[Schematic: switch S1, relay K1, relay K2, timer relay R, pressure switch S, motor, pump fed from reservoir, pressure tank T, outlet valve]
Fault tree
[Fault tree diagram: events E1 to E5 and components T, K2, S, S1, K1, R, joined by four "or" gates and one "and" gate]
Boolean expression of “tank rupturing under pressure” E1
Unsure about the probabilities and their dependencies
Results
[Figure: probability of tank rupturing under pumping, compared across four analyses: points with independence, mixed dependencies, Fréchet, intervals and Fréchet]
Interval probabilities
• Allow verified computing of reliabilities
• Distinguish two forms of uncertainty
• Rigorous bounds always easy to get
• Best possible bounds may need mathematical
programming because of repeated variables
Probabilistic arithmetic
Typical problems
• Sometimes little or even no actual data
– Updating is rarely used
• Very simple arithmetic calculations
– Occasionally, finite element meshes or differential equations
• Usually small in number of inputs
– Nuclear power plants are the exception
• Results are important and often high profile
– But the approach is being used ever more widely
Example: pesticides & farmworkers
• Total dose is decomposed by pathway
– Dermal exposure on hands
– Exposures to rest of body
– Inhalation
(concentration in air, exposure duration, breathing rate,
penetration factor, absorption efficiency)
• Takes account of related factors
– Acres, gallons per acre, mixing time
– Body mass, frequency of hand washing, etc.
Worst-case analysis
• Mix of deterministic and extreme values
• Actually a kind of interval analysis
• Says how bad it could be, but not how
unlikely that outcome is
Probabilistic analysis
• State-of-the-art method
• Usually via Monte Carlo simulation
• Requires the full joint distribution
– All the distributions for every input variable
– All their intervariable dependencies
• Often we need to guess about a lot of it
What’s needed
• Reliable, conservative assessments of tail risks
• Using available information but without forcing
analysts to make unjustified assumptions
• Neither computationally expensive nor
intellectually taxing
Probability bounds analysis
• Marries intervals with probability theory
• Distinguishes variability and incertitude
• Solves many problems in uncertainty analysis
– Input distributions unknown
– Imperfectly known correlation and dependency
– Large measurement error, censoring, small sample sizes
– Model uncertainty
Calculations
• All standard mathematical operations
– Arithmetic operations (+, −, ×, ÷, ^, min, max)
– Logical operations (and, or, not, if, etc.)
– Transformations (exp, ln, sin, tan, abs, sqrt, etc.)
– Backcalculation (tolerance solutions) (deconvolutions, updati
– Magnitude comparisons (<, ≤, >, ≥, ⊆)
– Other operations (envelope, mixture, etc.)
• Faster than Monte Carlo
• Good solutions often easy to compute
Probability box (p-box)
Interval bounds on a cumulative distribution function (CDF)
[Figure: a p-box; x-axis X from 0.0 to 3.0, y-axis cumulative probability from 0 to 1]
Duality
• Bounds on the probability at a value
Chance the value will be 15 or less is between 0 and 25%
• Bounds on the value at a probability
95th percentile is between 40 and 70
[Figure: a p-box; x-axis X from 0 to 80, y-axis cumulative probability from 0 to 1]
Uncertain numbers
[Figure: three kinds of uncertain numbers, each plotted as cumulative probability from 0 to 1: a probability distribution, a probability box, and an interval]
Probability bounds arithmetic
[Figure: p-boxes for A (x-axis 0 to 6) and B (x-axis 0 to 14); y-axis cumulative probability from 0 to 1]
What’s the sum of A+B?
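Cartesian product
A+B               | A ∈ [1,3]        | A ∈ [2,4]        | A ∈ [3,5]
(independence)    | p1 = 1/3         | p2 = 1/3         | p3 = 1/3
------------------+------------------+------------------+------------------
B ∈ [2,8]         | A+B ∈ [3,11]     | A+B ∈ [4,12]     | A+B ∈ [5,13]
q1 = 1/3          | prob = 1/9       | prob = 1/9       | prob = 1/9
------------------+------------------+------------------+------------------
B ∈ [6,10]        | A+B ∈ [7,13]     | A+B ∈ [8,14]     | A+B ∈ [9,15]
q2 = 1/3          | prob = 1/9       | prob = 1/9       | prob = 1/9
------------------+------------------+------------------+------------------
B ∈ [8,12]        | A+B ∈ [9,15]     | A+B ∈ [10,16]    | A+B ∈ [11,17]
q3 = 1/3          | prob = 1/9       | prob = 1/9       | prob = 1/9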
A+B under independence
[Figure: p-box for A+B; x-axis A+B from 0 to 18, y-axis cumulative probability from 0.00 to 1.00]
Generalization of methods
• When inputs are distributions, the answers
conform with probability theory
• When inputs are intervals, it agrees with
interval analysis
Where do we get p-boxes?
• Assumption
• Modeling
• Robust Bayes analysis
• Constraint propagation
• Data with incertitude
– Measurement error
– Sampling error
– Censoring
A tale of two data sets
[Figure: two interval data sets plotted on x from 0 to 10: Skinny (n = 6) and Puffy (n = 9)]
Empirical distributions
[Figure: empirical distributions for Skinny and Puffy; x-axis x from 0 to 10, y-axis cumulative probability from 0 to 1]
Fitted distributions
[Figure: fitted distributions for Skinny and Puffy; x-axis x from 0 to 20, y-axis cumulative probability from 0 to 1]
Statisticians often ignore the incertitude, or treat it as though it were (uniform) variability.
Constraint propagation
[Figure: p-boxes implied by different sets of constraints: {min, max}; {min, median, max}; {min, mode, max}; {min, mean, max}; {mean, sd}; and a lognormal with interval parameters]
Maximum entropy erases uncertainty
(lognormal with
interval parameters)
Example: PCBs and duck hunters
Location: Massachusetts and Connecticut
Receptor: Adult human hunters of waterfowl
Contaminant: PCBs (polychlorinated biphenyls)
Exposure route: dietary consumption of
contaminated waterfowl
Based on the assessment for non-cancer risks from PCB to adult hunters who consume
contaminated waterfowl described in Human Health Risk Assessment: GE/Housatonic River
Site: Rest of River, Volume IV, DCN: GE-031203-ABMP, April 2003, Weston Solutions (West
Chester, Pennsylvania), Avatar Environmental (Exton, Pennsylvania), and Applied
Biomathematics (Setauket, New York).
Hazard quotient
HQ = (EF × IR × C × (1 − LOSS)) / (AT × BW × RfD)

EF = mmms(1, 52, 5.4, 10) meals per year        // exposure frequency, censored data, n = 23
IR = mmms(1.5, 675, 188, 113) grams per meal    // poultry ingestion rate from EPA’s EFH
C = [7.1, 9.73] mg per kg                       // exposure point (mean) concentration
LOSS = 0                                        // loss due to cooking
AT = 365.25 days per year                       // averaging time (not just units conversion)
BW = mixture(BWfemale, BWmale)                  // Brainard and Burmaster (1992)
BWmale = lognormal(171, 30) pounds              // adult male n = 9,983
BWfemale = lognormal(145, 30) pounds            // adult female n = 10,339
RfD = 0.00002 mg per kg per day                 // reference dose considered tolerable
Exceedance risk = 1 - CDF
Inputs
[Figure: input p-boxes, each with cumulative probability from 0 to 1: EF (meals per year, 0 to 60), IR (grams per meal, 0 to 600), BW (pounds, 0 to 300; males and females), C (mg per kg, 0 to 20)]
Automatically verified results
[Figure: exceedance risk (0 to 1) versus HQ (0 to 1000)]
mean                 [3.8, 31]
standard deviation   [0, 186]
median               [0.6, 55]
95th percentile      [3.5, 384]
range                [0.01, 1230]
distribution shape comprehensively
• Neither sensitivity studies nor second-order Monte Carlo can really do this
• Maximum entropy hides uncertainty
Stochastic dependence
Dependence
• Not all variables are independent
– Body size and skin surface area
– Common-cause variables
– Default risks for mortgages
• Known dependencies should be modeled
• What can we do when we don’t know them?
• Sensitivity analyses usually used
– Vary correlation coefficient between −1 and +1
• But this underestimates the true uncertainty
– Example: suppose X, Y ~ uniform(0,25) but we
don’t know the dependence between X and Y
Unknown dependence
[Figure: Fréchet bounds on the CDF of X+Y for X,Y ~ uniform(0,25); x-axis X+Y from 0 to 50, y-axis cumulative probability from 0 to 1]
Varying correlation between −1 and +1
[Figure: CDFs of X+Y for X,Y ~ uniform(0,25) as the Pearson correlation varies; x-axis X+Y from 0 to 50, y-axis cumulative probability from 0 to 1]
Unknown but positive dependence
[Figure: bounds on the CDF of X+Y for X,Y ~ uniform(0,25) under positive dependence; x-axis X+Y from 0 to 50, y-axis cumulative probability from 0 to 1]
• Can’t be studied with sensitivity analysis since
it’s an infinite-dimensional problem
• Fréchet bounding lets you be sure
• Intermediate knowledge can be exploited
• Dependence can have large or small effect
Backcalculation
• Generalization of tolerance solutions
• E.g., from p-boxes for A and C, finds B such
that A+B ⊆ C
• Needed for planning environmental cleanups,
designing structures, etc.
Backcalculation with p-boxes
A = normal(5, 1)
C = {0 ≤ C, median ≤ 1.5, 90th %ile ≤ 35, max ≤ 50}
[Figure: p-boxes for A (x-axis 2 to 8) and C (x-axis 0 to 60); y-axis cumulative probability from 0 to 1]
• Basically reverses the forward convolution
• Any distribution totally inside B is sure to
satisfy the constraint
• Many possible B’s
[Figure: p-box for B; x-axis B from -10 to 50, y-axis cumulative probability from 0 to 1]
Check it by plugging it back in
A + B = C* ⊆ C
[Figure: C* plotted inside C; x-axis from -10 to 60, y-axis cumulative probability from 0 to 1]
Backcalculation
• Backcalculation wider under independence
(narrower under Fréchet)
• Monte Carlo methods don’t generally work
except in a trial-and-error approach
• Precise distributions can’t express the target
Conclusions
Probability vs. intervals
• Probability theory
– Handles likelihoods and dependence well
– Has an inadequate model of ignorance
– Lying: saying more than you really know
• Interval analysis
– Handles epistemic uncertainty (ignorance) well
– Inadequately models frequency and dependence
– Cowardice: saying less than you know
Probability bounds analysis
• Generalizes them to escape the limits of each
• Makes verified calculations about probabilities
– Using whatever knowledge is available
– Without requiring unjustified assumptions
• Well developed methodology
– But plenty of interesting and important questions
remain for study
Diverse applications
• Human health risk analyses
• Conservation biology extinction/reintroduction
• Wildlife contaminant exposure analyses
• Chemostat dynamics
• Global warming forecasts
• Design of spacecraft
• Safety of engineered systems (e.g., bridges)
What p-boxes can’t do
• Give best-possible bounds on non-tail risks
• Conveniently get best-possible bounds when
dependencies are subtle
• Show what’s most likely within the box
Acknowledgments
Lev Ginzburg
Rüdiger Kuhn
David Myers
Bill Oberkampf
Janos Hajagos
Dan Berleant
Chris Paredis
NIH
Sandia National Labs
NASA
Applied Biomathematics
Electric Power Research Institute
End
Research topics
• Incorporation of ancillary information
• Propagation through black boxes
• Handling subtle dependencies
• Computing non-tail risks
• Combination with fuzzy numbers
• Decision theory
• Info-gap models
• Finite-element models
• ODEs and PDEs
```
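
The interval rules for event probabilities in the slides (conjunction under independence and the Fréchet case) can be sketched in a few lines of Python. This is a minimal illustration and not the author's software: the function names and the tuple representation of an interval are assumptions. The disjunction and negation rules are the standard counterparts of the conjunction rules; the slides list those operations but do not spell out their formulas.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (lower, upper), both assumed to lie in [0, 1]

def and_independent(a: Interval, b: Interval) -> Interval:
    """P(A and B) when A and B are stochastically independent."""
    return (a[0] * b[0], a[1] * b[1])

def and_frechet(a: Interval, b: Interval) -> Interval:
    """P(A and B) with no assumption about dependence (Frechet case)."""
    return (max(0.0, a[0] + b[0] - 1.0), min(a[1], b[1]))

def or_independent(a: Interval, b: Interval) -> Interval:
    """P(A or B) under independence: 1 - (1 - P(A)) * (1 - P(B))."""
    return (1.0 - (1.0 - a[0]) * (1.0 - b[0]),
            1.0 - (1.0 - a[1]) * (1.0 - b[1]))

def or_frechet(a: Interval, b: Interval) -> Interval:
    """P(A or B) with no assumption about dependence."""
    return (max(a[0], b[0]), min(1.0, a[1] + b[1]))

def negate(a: Interval) -> Interval:
    """P(not A); the endpoints swap roles."""
    return (1.0 - a[1], 1.0 - a[0])

# Values taken from the slides' examples.
A = (0.3, 0.5)
B = (0.1, 0.2)
C = (0.29, 0.29)  # a point probability is a degenerate interval
D = (0.22, 0.22)

print(and_independent(A, B))  # approximately (0.03, 0.1)
print(and_frechet(A, B))      # (0.0, 0.2)
print(and_frechet(C, D))      # (0.0, 0.22)
```

Note that the Fréchet result for C and D is an interval even though both inputs are points: the width comes entirely from not knowing the dependence.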
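
The Cartesian product slide can be reproduced with a short sketch. It assumes each p-box has already been discretized into interval "focal elements" with probability masses, as in the slide (three intervals of mass 1/3 each for A and for B). The names `add_independent` and `cdf_bounds` are hypothetical, chosen for this illustration only.

```python
from fractions import Fraction
from itertools import product

def add_independent(A, B):
    """Sum of two discretized p-boxes under independence.

    Each input is a list of ((lo, hi), mass) pairs. Every pair of focal
    elements is added with interval arithmetic, and the masses multiply
    because the inputs are assumed independent.
    """
    return [((a_lo + b_lo, a_hi + b_hi), p * q)
            for ((a_lo, a_hi), p), ((b_lo, b_hi), q) in product(A, B)]

def cdf_bounds(focal, z):
    """Lower and upper bounds on P(Z <= z) implied by the focal elements."""
    # Upper bound: mass of every interval that could lie at or below z.
    upper = sum(m for (lo, hi), m in focal if lo <= z)
    # Lower bound: mass of every interval that must lie at or below z.
    lower = sum(m for (lo, hi), m in focal if hi <= z)
    return lower, upper

third = Fraction(1, 3)
A = [((1, 3), third), ((2, 4), third), ((3, 5), third)]
B = [((2, 8), third), ((6, 10), third), ((8, 12), third)]

S = add_independent(A, B)
for interval, mass in S:
    print(interval, mass)

print(cdf_bounds(S, 9))   # (0, Fraction(7, 9))
print(cdf_bounds(S, 13))  # (Fraction(4, 9), Fraction(1, 1))
```

The nine cells match the table in the slide, and stacking their left and right endpoints gives the two bounding step functions plotted as "A+B under independence".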
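
The "Unknown dependence" slide bounds the distribution of X+Y for X, Y ~ uniform(0,25) without any assumption about how X and Y are related. A rough numerical sketch follows. It uses the pointwise bounds usually attributed to Makarov and to Frank, Nelsen and Schweizer; the slides do not say which algorithm produced their figure, so treat this as one plausible way to obtain such bounds. The function names and the grid resolution are assumptions.

```python
def uniform_cdf(x, lo=0.0, hi=25.0):
    """CDF of a uniform(lo, hi) random variable."""
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))

def sum_cdf_bounds(F, G, z, xs):
    """Bounds on P(X + Y <= z) for marginal CDFs F and G, any dependence.

    lower = sup over x of max(F(x) + G(z - x) - 1, 0)
    upper = inf over x of min(F(x) + G(z - x), 1)
    The sup and inf are approximated on the finite grid xs, which is
    adequate here because both CDFs are continuous.
    """
    sums = [F(x) + G(z - x) for x in xs]
    lower = max(max(sums) - 1.0, 0.0)
    upper = min(min(sums), 1.0)
    return lower, upper

# Grid covering the support of X with some margin on both sides.
xs = [-5.0 + 0.05 * k for k in range(1201)]  # from -5 to 55

for z in (12.5, 25.0, 37.5):
    lo, hi = sum_cdf_bounds(uniform_cdf, uniform_cdf, z, xs)
    print(z, round(lo, 6), round(hi, 6))
```

At z = 25 the bounds are roughly 0 and 1, meaning the probability that X+Y is at most 25 could be anything. This is wider than the range obtained by sweeping a correlation coefficient from −1 to +1, which is the point the slides make about sensitivity analysis.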
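
Backcalculation is easiest to see with plain intervals before moving to p-boxes. The sketch below is a simplification of what the slides describe: it finds the interval B for which A+B stays inside C, and shows why ordinary interval subtraction C − A does not give that answer. The numbers and function names are made up for illustration and are not the slides' p-box example.

```python
def add(a, b):
    """Interval addition."""
    return (a[0] + b[0], a[1] + b[1])

def subtract(a, b):
    """Ordinary interval subtraction a - b."""
    return (a[0] - b[1], a[1] - b[0])

def contains(outer, inner):
    """True if inner lies entirely within outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def backcalc(a, c):
    """Largest interval b such that a + b is contained in c.

    Endpoints are matched like with like (lower with lower, upper with
    upper), the reverse of ordinary subtraction. No solution exists when
    a is wider than c.
    """
    lo, hi = c[0] - a[0], c[1] - a[1]
    if lo > hi:
        raise ValueError("no solution: A is wider than C")
    return (lo, hi)

A = (4, 6)
C = (0, 50)

B = backcalc(A, C)
print(B, add(A, B), contains(C, add(A, B)))              # (-4, 44) (0, 50) True

naive = subtract(C, A)
print(naive, add(A, naive), contains(C, add(A, naive)))  # (-6, 46) (-2, 52) False
```

Plugging B back in reproduces the check on the slide "Check it by plugging it back in": the forward sum lands inside the target, whereas the naive difference overshoots on both sides.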