Verified computation with probabilities Scott Ferson, Applied Biomathematics

Verified computation
with probabilities
Scott Ferson, scott@ramas.com
Applied Biomathematics
Scientific Computing, Computer Arithmetic and Verified Numerical Computations
El Paso, Texas, 3 October 2008
Perspective
• Very elementary methods of interval analysis
– Low-dimensional, static
– Verified computing (but not roundoff error)
– Huge uncertainties
• Intervals combined with probability theory
– Total probabilities (events)
– Probability distributions (random variables)
Bounding probability is an old idea
• Boole and de Morgan
• Chebyshev and Markov
• Borel and Fréchet
• Kolmogorov and Keynes
• Berger and Walley
• Williamson and Downs
[Figure: diagram relating deterministic calculation, probabilistic convolution, second-order probability, interval analysis, and probability bounds analysis]
Terminology
• Dependence = stochastic dependence
  » More general than repeated variables
• Independence = stochastic independence
• Best possible = tight (almost)
  » Some elements in the set may not be possible
Incertitude
• Arises from incomplete knowledge
• Incertitude arises from
– limited sample size
– measurement uncertainty or surrogate data
– doubt about the model
• Reducible with empirical effort
Variability
• Arises from natural stochasticity
• Variability arises from
– scatter and variation
– spatial or temporal fluctuations
– manufacturing or genetic differences
• Not reducible by empirical effort
They must be treated differently
• Variability is randomness
Needs probability theory
• Incertitude is ignorance
Needs interval analysis
• Imprecise probabilities can do both at once
Risk assessment applications
• Environmental pollution
heavy metals, pesticides, PMx, ozone, PCBs, EMF, RF, etc.
• Engineered systems
traffic safety, bridge design, airplanes, spacecraft, nuclear plants
• Financial investments
portfolio planning, consultation, instrument evaluation
• Occupational hazards
manufacturing and factory workers, farm workers, hospital staff
• Food safety and consumer product safety
benzene in Perrier, E. coli in beef, salmonella in tomato, children’s toys
• Ecosystems and biological resources
endangered species, fisheries and reserve management
Probabilistic logic
Total probabilities (events)
• Fault or event trees
• Logical expressions (Hailperin 1986)
• Reliability analyses
– Nuclear power plants
– Aircraft safety system design
– Gene technology release assessments
– etc.
Interval arithmetic for probabilities
x + y = [x1 + y1, x2 + y2]
x − y = [x1 − y2, x2 − y1]
x × y = [x1 × y1, x2 × y2]
x ÷ y = [x1 ÷ y2, x2 ÷ y1]
min(x, y) = [min(x1, y1), min(x2, y2)]
max(x, y) = [max(x1, y1), max(x2, y2)]
Rules are simpler because the intervals are confined to [0,1]
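As a minimal sketch of the rules above, the following Python functions apply them to intervals stored as (lower, upper) tuples. The function names are illustrative, not from the slides. The sketch assumes both operands lie within [0,1], does no outward rounding (so it does not guard against roundoff error), and does not clip results back to [0,1].

```python
from typing import Tuple

Interval = Tuple[float, float]  # (lower, upper), assumed to lie within [0, 1]


def add(x: Interval, y: Interval) -> Interval:
    return (x[0] + y[0], x[1] + y[1])


def sub(x: Interval, y: Interval) -> Interval:
    # Smallest difference pairs the low end of x with the high end of y.
    return (x[0] - y[1], x[1] - y[0])


def mul(x: Interval, y: Interval) -> Interval:
    # Endpoint products suffice because both intervals are nonnegative.
    return (x[0] * y[0], x[1] * y[1])


def div(x: Interval, y: Interval) -> Interval:
    if y[0] <= 0:
        raise ZeroDivisionError("divisor interval must be strictly positive")
    return (x[0] / y[1], x[1] / y[0])


def imin(x: Interval, y: Interval) -> Interval:
    return (min(x[0], y[0]), min(x[1], y[1]))


def imax(x: Interval, y: Interval) -> Interval:
    return (max(x[0], y[0]), max(x[1], y[1]))


a = (0.3, 0.5)
b = (0.1, 0.2)
print("a + b =", add(a, b))
print("a - b =", sub(a, b))
print("a * b =", mul(a, b))
print("b / a =", div(b, a))
```

The general interval product needs the minimum and maximum over all four endpoint products; the two-endpoint form used here is valid only because no endpoint is negative.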
Probabilistic logic
• Conjunctions (and)
• Disjunctions (or)
• Negations (not)
• Exclusive disjunctions (xor)
• Modus ponens (if-then)
• etc.
Conjunction (and)
P(A |&| B) = P(A) × P(B)
Example: P(A) = [0.3, 0.5]
P(B) = [0.1, 0.2]
A and B are independent
P(A |&| B) = [0.03, 0.1]
Stochastic dependence
Independence
Probabilities are areas in the Venn diagrams
Fréchet case
P(A & B)=[max(0, P(A) + P(B)–1), min(P(A), P(B))]
• Makes no assumption about the dependence
• Rigorous (guaranteed to enclose true value)
• Best possible (cannot be any tighter)
Fréchet examples
Examples: P(A) = [0.3, 0.5]
P(B) = [0.1, 0.2]
P(A & B) = [0, 0.2]
P(C) = 0.29 (certain)
P(D) = 0.22 (certain)
P(C & D) = [0, 0.22] (uncertain)
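The two conjunction rules can be compared side by side. This is a small sketch, with illustrative function names, that treats a point probability as a degenerate interval and reproduces the examples from the slides.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (lower, upper) bounds on a probability


def and_independent(a: Interval, b: Interval) -> Interval:
    # P(A & B) = P(A) * P(B), applied to the interval endpoints.
    return (a[0] * b[0], a[1] * b[1])


def and_frechet(a: Interval, b: Interval) -> Interval:
    # No assumption about dependence:
    # [max(0, P(A) + P(B) - 1), min(P(A), P(B))]
    return (max(0.0, a[0] + b[0] - 1.0), min(a[1], b[1]))


A = (0.3, 0.5)
B = (0.1, 0.2)
C = (0.29, 0.29)  # point value written as a degenerate interval
D = (0.22, 0.22)

print(and_independent(A, B))  # approximately (0.03, 0.1)
print(and_frechet(A, B))      # (0.0, 0.2)
print(and_frechet(C, D))      # (0.0, 0.22)
```

Even with precisely known inputs C and D, the Fréchet result is an interval, because the dependence between the events is itself unknown.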
Example: pump system
What’s the risk the tank ruptures under pumping?
[Figure: schematic of the pump system. Labelled components: switch S1, relay K1, relay K2, timer relay R, motor, pressure switch S, pump (fed from reservoir), pressure tank T, outlet valve]
Fault tree
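[Figure: fault tree for the top event E1, with intermediate events E2 to E5 and basic events T, K2, S, S1, K1 and R, joined by four “or” gates and one “and” gate]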
Boolean expression of “tank rupturing under pressure” E1
Unsure about the probabilities and their dependencies
Results
[Figure: probability of tank rupturing under pumping, compared across four analyses: points with independence; mixed dependencies; Fréchet; intervals and Fréchet]
Interval probabilities
• Allow verified computing of reliabilities
• Distinguish two forms of uncertainty
• Rigorous bounds always easy to get
• Best possible bounds may need mathematical
programming because of repeated variables
Probabilistic arithmetic
Typical problems
• Sometimes little or even no actual data
– Updating is rarely used
• Very simple arithmetic calculations
– Occasionally, finite element meshes or differential equations
• Usually small in number of inputs
– Nuclear power plants are the exception
• Results are important and often high profile
– But the approach is being used ever more widely
Example: pesticides & farmworkers
• Total dose is decomposed by pathway
– Dermal exposure on hands
– Exposures to rest of body
– Inhalation
(concentration in air, exposure duration, breathing rate,
penetration factor, absorption efficiency)
• Takes account of related factors
– Acres, gallons per acre, mixing time
– Body mass, frequency of hand washing, etc.
Worst-case analysis
• Traditional method
• Mix of deterministic and extreme values
• Actually a kind of interval analysis
• Says how bad it could be, but not how
unlikely that outcome is
Probabilistic analysis
• State-of-the-art method
• Usually via Monte Carlo simulation
• Requires the full joint distribution
– All the distributions for every input variable
– All their intervariable dependencies
• Often we need to guess about a lot of it
What’s needed
• Reliable, conservative assessments of tail risks
• Using available information but without forcing
analysts to make unjustified assumptions
• Neither computationally expensive nor
intellectually taxing
Probability bounds analysis
• Marries intervals with probability theory
• Distinguishes variability and incertitude
• Solves many problems in uncertainty analysis
– Input distributions unknown
– Imperfectly known correlation and dependency
– Large measurement error, censoring, small sample sizes
– Model uncertainty
Calculations
• All standard mathematical operations
– Arithmetic operations (+, −, ×, ÷, ^, min, max)
– Logical operations (and, or, not, if, etc.)
– Transformations (exp, ln, sin, tan, abs, sqrt, etc.)
– Backcalculation (tolerance solutions) (deconvolutions, updati
– Magnitude comparisons (<, ≤, >, ≥, )
– Other operations (envelope, mixture, etc.)
• Faster than Monte Carlo
• Guaranteed to bound the answer
• Good solutions often easy to compute
Probability box (p-box)
Interval bounds on a cumulative distribution function (CDF)
[Figure: p-box plotted as cumulative probability (0 to 1) against X (0.0 to 3.0)]
Duality
• Bounds on the probability at a value
Chance the value will be 15 or less is between 0 and 25%
• Bounds on the value at a probability
95th percentile is between 40 and 70
[Figure: p-box plotted as cumulative probability (0 to 1) against X (0 to 80)]
Uncertain numbers
[Figure: three kinds of uncertain number, each plotted as cumulative probability (0 to 1): a probability distribution, a probability box, and an interval]
Probability bounds arithmetic
[Figure: p-boxes for A (axis 0 to 6) and B (axis 0 to 14), plotted as cumulative probability (0 to 1)]
What’s the sum of A+B?
Cartesian product
A+B under independence; each cell has probability p_i × q_j = 1/9

                      A∈[1,3]        A∈[2,4]        A∈[3,5]
                      p1 = 1/3       p2 = 1/3       p3 = 1/3
B∈[2,8]   q1 = 1/3    A+B∈[3,11]     A+B∈[4,12]     A+B∈[5,13]
B∈[6,10]  q2 = 1/3    A+B∈[7,13]     A+B∈[8,14]     A+B∈[9,15]
B∈[8,12]  q3 = 1/3    A+B∈[9,15]     A+B∈[10,16]    A+B∈[11,17]
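The Cartesian product can be sketched in a few lines of Python. The representation below (a list of interval and probability-mass pairs) and the function names are assumptions made for illustration; the interval values and masses are the ones from the slide. Exact fractions are used so the cell masses stay at exactly 1/9.

```python
from fractions import Fraction
from itertools import product

third = Fraction(1, 3)

# Each uncertain number is a list of ((low, high), mass) pairs.
A = [((1, 3), third), ((2, 4), third), ((3, 5), third)]
B = [((2, 8), third), ((6, 10), third), ((8, 12), third)]


def add_independent(X, Y):
    """Sum every pair of intervals; under independence the masses multiply."""
    return [
        ((x_lo + y_lo, x_hi + y_hi), p * q)
        for ((x_lo, x_hi), p), ((y_lo, y_hi), q) in product(X, Y)
    ]


def cdf_bounds(cells, z):
    """Bounds on P(value <= z) implied by the interval masses."""
    lower = sum(m for (lo, hi), m in cells if hi <= z)  # surely <= z
    upper = sum(m for (lo, hi), m in cells if lo <= z)  # possibly <= z
    return lower, upper


S = add_independent(A, B)
for (lo, hi), m in S:
    print(f"A+B in [{lo}, {hi}]  prob = {m}")

print(cdf_bounds(S, 9))   # (0, Fraction(7, 9))
print(cdf_bounds(S, 13))  # (Fraction(4, 9), Fraction(1, 1))
```

Evaluating `cdf_bounds` over a range of z values traces out the left and right edges of the p-box for A+B.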
A+B under independence
[Figure: p-box for A+B, plotted as cumulative probability (0.00 to 1.00) against A+B (0 to 18)]
Generalization of methods
• When inputs are distributions, the answers
conform with probability theory
• When inputs are intervals, it agrees with
interval analysis
Where do we get p-boxes?
• Assumption
• Modeling
• Robust Bayes analysis
• Constraint propagation
• Data with incertitude
– Measurement error
– Sampling error
– Censoring
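For data with incertitude, one simple construction is an empirical p-box built from interval-valued measurements. The sketch below uses made-up intervals (hypothetical, not the Skinny or Puffy data) and accounts only for measurement incertitude. It does not add any allowance for sampling error, which would widen the bounds further.

```python
def empirical_pbox(data, x):
    """Bounds on the empirical CDF at x for interval-valued data.

    data is a list of (low, high) measurement intervals.
    """
    n = len(data)
    lower = sum(1 for lo, hi in data if hi <= x) / n  # surely at or below x
    upper = sum(1 for lo, hi in data if lo <= x) / n  # possibly at or below x
    return lower, upper


# Hypothetical interval measurements, for illustration only.
measurements = [(1.0, 2.0), (1.5, 4.0), (3.0, 3.5), (5.0, 7.0)]

for x in (0.0, 3.2, 7.0):
    print(x, empirical_pbox(measurements, x))
```

At x = 3.2 only the first interval lies entirely at or below x, while three intervals could, so the empirical CDF there is bounded by [0.25, 0.75].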
A tale of two data sets
[Figure: two interval data sets plotted on x from 0 to 10: Skinny (n = 6) and Puffy (n = 9)]
Empirical distributions
[Figure: empirical distributions for Skinny and Puffy, plotted as cumulative probability (0 to 1) against x (0 to 10)]
Fitted distributions
[Figure: fitted distributions for Skinny and Puffy, plotted as cumulative probability (0 to 1) against x (0 to 20)]
Statisticians often ignore the incertitude, or treat it as though it were (uniform) variability.
Constraint propagation
[Figure: p-boxes implied by different sets of constraints, plotted as cumulative probability (0 to 1). Panel labels: min, max; min, median, max; min, mode, max; min, mean, max; mean, sd; lognormal with interval parameters]
Maximum entropy erases uncertainty
[Figure: panel label: lognormal with interval parameters]
Example: PCBs and duck hunters
Location: Massachusetts and Connecticut
Receptor: Adult human hunters of waterfowl
Contaminant: PCBs (polychlorinated biphenyls)
Exposure route: dietary consumption of
contaminated waterfowl
Based on the assessment for non-cancer risks from PCB to adult hunters who consume
contaminated waterfowl described in Human Health Risk Assessment: GE/Housatonic River
Site: Rest of River, Volume IV, DCN: GE-031203-ABMP, April 2003, Weston Solutions (West
Chester, Pennsylvania), Avatar Environmental (Exton, Pennsylvania), and Applied
Biomathematics (Setauket, New York).
Hazard quotient
HQ = (EF × IR × C × (1 − LOSS)) / (AT × BW × RfD)

EF = mmms(1, 52, 5.4, 10) meals per year        // exposure frequency, censored data, n = 23
IR = mmms(1.5, 675, 188, 113) grams per meal    // poultry ingestion rate from EPA’s EFH
C = [7.1, 9.73] mg per kg                       // exposure point (mean) concentration
LOSS = 0                                        // loss due to cooking
AT = 365.25 days per year                       // averaging time (not just units conversion)
BW = mixture(BWfemale, BWmale)                  // Brainard and Burmaster (1992)
BWmale = lognormal(171, 30) pounds              // adult male, n = 9,983
BWfemale = lognormal(145, 30) pounds            // adult female, n = 10,339
RfD = 0.00002 mg per kg per day                 // reference dose considered tolerable
Exceedance risk = 1 - CDF
Inputs
[Figure: input p-boxes, each plotted as cumulative probability (0 to 1): EF (meals per year), IR (grams per meal), BW (pounds, males and females), C (mg per kg)]
Automatically verified results
[Figure: exceedance risk (0 to 1) against HQ (0 to 1000)]
mean                  [3.8, 31]
standard deviation    [0, 186]
median                [0.6, 55]
95th percentile       [3.5, 384]
range                 [0.01, 1230]
Uncertainty about distribution shape
• PBA propagates uncertainty about
distribution shape comprehensively
• Neither sensitivity studies nor second-order Monte Carlo can really do this
• Maximum entropy hides uncertainty
Stochastic dependence
Dependence
• Not all variables are independent
– Body size and skin surface area
– Common-cause variables
– Default risks for mortgages
• Known dependencies should be modeled
• What can we do when we don’t know them?
Uncertainty about dependence
• Sensitivity analyses usually used
– Vary correlation coefficient between −1 and +1
• But this underestimates the true uncertainty
– Example: suppose X, Y ~ uniform(0,25) but we
don’t know the dependence between X and Y
Unknown dependence
[Figure: Fréchet bounds on the CDF of X+Y for X,Y ~ uniform(0,25), plotted as cumulative probability (0 to 1) against X+Y (0 to 50)]
Varying correlation between −1 and +1
[Figure: CDFs of X+Y for X,Y ~ uniform(0,25) as the Pearson correlation is varied, plotted as cumulative probability (0 to 1) against X+Y (0 to 50)]
Unknown but positive dependence
[Figure: bounds on the CDF of X+Y for X,Y ~ uniform(0,25) under unknown but positive dependence, plotted as cumulative probability (0 to 1) against X+Y (0 to 50)]
Uncertainty about dependence
• Can’t be studied with sensitivity analysis since
it’s an infinite-dimensional problem
• Fréchet bounding lets you be sure
• Intermediate knowledge can be exploited
• Dependence can have large or small effect
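For the X, Y ~ uniform(0,25) example, bounds on the CDF of X+Y under unknown dependence can be sketched with the standard Fréchet-type formulas for a sum: the lower bound at z is the supremum over x of max(F(x) + G(z − x) − 1, 0), and the upper bound is the infimum over x of min(F(x) + G(z − x), 1). The code below approximates the supremum and infimum on a finite grid. The function names and the grid are assumptions for illustration.

```python
def uniform_cdf(x, lo=0.0, hi=25.0):
    """CDF of a uniform(lo, hi) random variable."""
    return min(max((x - lo) / (hi - lo), 0.0), 1.0)


def sum_cdf_bounds(F, G, z, xs):
    """Bounds on P(X + Y <= z) with no assumption about dependence.

    F and G are the marginal CDFs; xs is a grid of candidate x values
    that should cover the supports of both variables.
    """
    s = [F(x) + G(z - x) for x in xs]
    lower = max(max(s) - 1.0, 0.0)
    upper = min(min(s), 1.0)
    return lower, upper


# Grid from -30 to 80 in steps of 0.25.
xs = [-30.0 + 0.25 * i for i in range(441)]

for z in (10.0, 25.0, 40.0):
    lo, hi = sum_cdf_bounds(uniform_cdf, uniform_cdf, z, xs)
    print(f"P(X+Y <= {z}) is in [{lo:.3f}, {hi:.3f}]")
```

For these marginals the bounds work out to max(z/25 − 1, 0) and min(z/25, 1) for z ≥ 0. Restricting x to a grid can only widen the computed bounds, so the result still encloses the true value as long as the grid covers both supports.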
Backcalculation
Backcalculation
• Generalization of tolerance solutions
• E.g., from p-boxes for A and C, finds B such
that A+B ⊆ C
• Needed for planning environmental cleanups,
designing structures, etc.
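The interval special case shows why backcalculation is not ordinary subtraction. The sketch below, with an illustrative function name and made-up numbers, finds the widest interval B whose sum with A stays inside the target C, and contrasts it with the naive difference C − A.

```python
def backcalc_add(a, c):
    """Widest interval b such that a + b lies inside c (tolerance solution)."""
    b = (c[0] - a[0], c[1] - a[1])
    if b[0] > b[1]:
        raise ValueError("no solution: a is wider than the target c")
    return b


def add(x, y):
    return (x[0] + y[0], x[1] + y[1])


def sub(x, y):
    return (x[0] - y[1], x[1] - y[0])


# Hypothetical values, for illustration only.
A = (1.0, 3.0)
C = (4.0, 10.0)

B = backcalc_add(A, C)
print("backcalculated B:", B)          # (3.0, 7.0)
print("A + B:", add(A, B))             # (4.0, 10.0), inside C

naive = sub(C, A)
print("naive C - A:", naive)           # (1.0, 9.0)
print("A + (C - A):", add(A, naive))   # (2.0, 12.0), spills outside C
```

Plugging the naive difference back in overshoots the target on both sides, which is why the check of adding B back to A is worth doing.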
Backcalculation with p-boxes
A = normal(5, 1)
C = {0 ≤ C, median ≤ 1.5, 90th %ile ≤ 35, max ≤ 50}
[Figure: p-boxes for A (axis 2 to 8) and C (axis 0 to 60), plotted as cumulative probability (0 to 1)]
Getting the answer
• Basically reverses the forward convolution
• Any distribution totally inside B is sure to
satisfy the constraint
• Many possible B’s
[Figure: p-box for B, plotted as cumulative probability (0 to 1) against B (−10 to 50)]
Check it by plugging it back in
A + B = C* ⊆ C
[Figure: C* plotted inside C, cumulative probability (0 to 1) on an axis from −10 to 60]
Backcalculation
• The backcalculated result is wider under independence
(narrower under Fréchet)
• Monte Carlo methods don’t generally work
except in a trial-and-error approach
• Precise distributions can’t express the target
Conclusions
Probability vs. intervals
• Probability theory
– Handles likelihoods and dependence well
– Has an inadequate model of ignorance
– Lying: saying more than you really know
• Interval analysis
– Handles epistemic uncertainty (ignorance) well
– Inadequately models frequency and dependence
– Cowardice: saying less than you know
Probability bounds analysis
• Generalizes them to escape the limits of each
• Makes verified calculations about probabilities
– Using whatever knowledge is available
– Without requiring unjustified assumptions
• Well developed methodology
– But plenty of interesting and important questions
remain for study
Diverse applications
• Human health risk analyses
• Conservation biology extinction/reintroduction
• Wildlife contaminant exposure analyses
• Chemostat dynamics
• Global warming forecasts
• Design of spacecraft
• Safety of engineered systems (e.g., bridges)
What p-boxes can’t do
• Give best-possible bounds on non-tail risks
• Conveniently get best-possible bounds when
dependencies are subtle
• Show what’s most likely within the box
Acknowledgments
Lev Ginzburg
Rüdiger Kuhn
Vladik Kreinovich
David Myers
Bill Oberkampf
Janos Hajagos
Dan Berleant
Chris Paredis
Mark Stadtherr
NIH
Sandia National Labs
NASA
Applied Biomathematics
Electric Power Research Institute
End
Research topics
• Incorporation of ancillary information
• Propagation through black boxes
• Handling subtle dependencies
• Computing non-tail risks
• Combination with fuzzy numbers
• Decision theory
• Info-gap models
• Finite-element models
• ODEs and PDEs