Probabilistic model checking for the Grid and on the Grid Marta Kwiatkowska

advertisement
Probabilistic model checking for the
Grid and on the Grid
Marta Kwiatkowska
School of Computer Science
www.cs.bham.ac.uk/~mzk
www.cs.bham.ac.uk/~dxp/prism
ARW, 29-30th July 2005
Overview
 e-Science and the Grid
– The promises and the challenges
 Probabilistic model checking
– Why needed?
– What does it involve?
– The PRISM model checker
 … for the Grid
– Collaborative protocols: thinkteam example
– Performance: QoS analysis of workstation cluster
 … on the Grid
– Parallel symbolic solution for Gauss-Seidel
 Conclusion
With thanks to…
 Main collaborators on probabilistic model checking
– Gethin Norman, Dave Parker, Jeremy Sproston, Christel Baier,
Roberto Segala, Michael Huth, Luca de Alfaro, Joost-Pieter
Katoen, Antonio Pacheco
 PRISM model checker, implementation & Grid-enabling
– Dave Parker, Andrew Hinton, Rashid Mehmood, Yi Zhang, Hakan
Younes, Stephen Gilmore, Michael Goldsmith, Conrado Daws,
Fuzhi Wang
 Case studies
– Dave Parker, Gethin Norman, Maurice ter Beek, Mieke Massink,
Diego Latella, Boudewijn Haverkort, Holger Hermanns and
Joost-Pieter Katoen
 And many more…
The e-Scientist of the future
Will this work?
The Internet
Remote access to high-performance computers, via Internet
Remote access to visualisation facilities, via Internet
Computational steering from anywhere, via PDA
Visualisation, on your laptop
Fast, online, accurate, …
What enables e-Science?
 Data sharing: distribution, query searches, curation
– Distributed databases
– Web-services
– Ontologies
 Collaboration: coordination, concurrency control
– Collaborative protocols
– Workflow
– Web-service orchestration
 Compute power: scheduling, load-balancing, performance
– Computer clusters
– Campus and intra-Grids
– Large-scale grids
The role of probability
 Enables performance modelling
– Pioneered by Erlang, in telecommunications, ca 1910
– Models: typically continuous time Markov chains
– Emphasis on steady-state and transient probabilities
 Also used in stochastic planning, to model uncertainty
– cf Bellman equations, ca 1950s
– Models: Markov decision processes
– Emphasis on finding optimum policies
 Our focus, probabilistic model checking
–
–
–
–
Distinctive, on automated verification for probabilistic systems
Temporal logic specifications, automata-theoretic techniques
Shared models
Exchanging techniques with the other two areas
Verification via model checking…
or falsification?
The model
or
Model Checker
send  deliver
Temporal logic specification


Error trace
Line
Line
Line
…
Line
Line
5: …
21: …
15: …
27: …
45: ...
Probabilistic model checking…
in a nutshell
0.4
0.3
Probabilistic model
send  P¸ 0.9(deliver)
Probabilistic temporal
logic specification


or
Probabilistic
Model Checker
or
The probability
State
State
State
…
State
State
5: 0.6789
6: 0.9789
7: 1.0
12: 0
13: 0.1245
Probabilistic model checking inputs…
 Models
–
–
–
–
discrete time Markov chains (DTMCs)
continuous time Markov chains (CTMCs)
Markov decision processes (MDPs)
(currently indirectly) probabilistic timed automata (PTAs)
 (Yes/No) temporal logic specification languages
– Probabilistic temporal logic PCTL (for DTMCs/MDPs)
– Continuous Stochastic Logic CSL (for CTMCs)
– Probabilistic timed computation tree logic PTCTL (for PTAs)
 Quantitative specification language variants
– Probability values for logics PCTL/CSL/PTCTL (for all models)
– Extension with expectation operator (for all)
– Extension with costs/rewards (for all)
Probabilistic models
 Discrete time
– Probabilistic transitions, represented as
discrete probability distribution
– Model types
• Discrete time Markov chains (DTMCs):
probabilistic choice only
• Markov decision processes (MDPs):
probabilistic choice and nondeterminism
p1
p2
pn . . .
i pi = 1
 Continuous passage of time, randomly
distributed delays, space
– Model types
• Continuous time Markov chains (CTMCs):
exponentially delays, no nondeterminism
• Probabilistic Timed Automata (PTAs): dense
time, admit nondeterminism
• Labelled Markov Processes (LMPs):
continuous space/time
time
s0+1 f(x)dx = 1
Discrete-Time Markov Chains (DTMCs)
 Features:
– Only probabilistic choice
in each state
 Formally, (S,s0,P,L):
fail
1
init
0.02
try
s0
1
– S finite set of states
– s0 initial state
s1
0.01
s3
0.97
succ
s2
– P: S £ S ! [0,1] probability matrix, s.t. s’ P(s,s’) = 1, all s
– L: S ! 2AP atomic propositions
 Unfold into infinite paths s0s1s2s3s4… s.t. P(si,si+1) > 0, all i
 Probability for finite paths, multiply along path
e.g. s0 s1 s1 s2 is 1 ¢ 0.01 ¢ 0.97 = 0.0097
1
Probability space
 Intuitively:
– Sample space = infinite paths Paths from s
– Event = set of paths
ss1s2…sk
– Basic event = cone
 Formally, (Paths, W, Pr)
– For finite path w = ss1…sn, define probability
P(w) =
1 if w has length one
P(s,s1) ¢ … ¢ P(sn-1,sn) otherwise
– Take W least s-algebra containing cones
C(w) = { p 2 Paths | w is prefix of p}
– Define Pr(C(w)) = P(w), all w
– Pr extends uniquely to measure on Paths
The probabilistic operator
 New probabilistic operator, a quantitative analogue of 8, 9
– e.g. send  P¸ 0.9(deliver)
“whenever a message is sent, the probability that it is eventually
delivered is at least 0.9”
– Takes path formulas a: a (eventually), a U b (until), etc
 Formally,
s ² P» p(a)
,
Pr { p 2 Paths j p ² a } » p
< 1 - p
S
a-paths
threshold level p
¸ p
Continuous Time Markov Chains (CTMCs)
 Features:
– Discrete states and
real time
– Exponentially
distributed random delays
empty
0
3
3
4
1
4
3
2
4
full
3
 Formally:
– Set of states S plus rates R(s,s’) > 0 of moving from s to s’
– Probability of moving from s to s’ by time t > 0 is 1 - e-R(s,s’)¢ t
– Transition rate matrix S £ S ! R¸0
 Unfold into infinite paths s0t0s1t1s2t2s3…
– probs (s’), probability of being in s’ in the long-run, starting in s
– probs (s’,t), probability of being in s’ at time instant t
 Cone construction generalises via time intervals
The logic CSL: syntax

Continuous Stochastic Logic [ASSB96,BKH99]
–
For CTMCs, based on PCTL, for example
•
•

P< 0.85(}<15 full), probability operator
“the probability of queue becoming full within 15 secs is < 0.85”
S< 0.01(down), steady-state operator
“in the long run, the probability the system is down is less than 1%”
The syntax of state and path formulas of CSL is:
f ::= true | a | f Æ f | :f | S» p(f) | P» p(a)
a ::= X f | f U· t f | f U f
where p 2 [0,1] is a probability bound, t 2 R¸0 and » 2 { <, >, … }

Extension with time intervals for until, cost/rewards and
expectation operator E» c(f)
CSL semantics
 Semantics of bounded until:
p ² f1 U· t f2
,
iff f2 satisfied at time instant t
along p = s0L and f1 satisfied at
all preceding time instants
 The added operators:
s ² S» p(f)
,
Ss’ ² f probs (s’) » p
where probs (s’) is prob. of being in
s’ in the long-run, having started in s
s ² P» p(a)
,
Pr { p 2 Paths j p ² a } » p
where Pr is probability measure on
paths as before
 Also quantitative form of formulas, instead of threshold
The logic CSL: model checking
 By induction on structure of formula, except for
–
S» p(f) and P» p(f1 U· t f2)
 The steady-state operator
– Reduces to solution of a linear equation system
 The time-bounded until
– Numerical [Baier et al, CONCUR’00]
• Reduces to transient analysis using uniformisation
• Transform CTMC by removing all outgoing transitions from states
satisfying f2 or :f1
• Then Pr { p 2 Paths j p ² f U· t f } = Ss’ ² f2 probs (s’,t)
– Sampling-based, statistical hypothesis testing [Younes et al,
CAV’03, TACAS’04]
• Reduces to simulation and estimation of satisfaction of time bound
Probabilistic model checking involves…
 Construction of models:
– DTMCs/CTMCs, MDPs, and PTAs (with digital clocks)
 Implementation of probabilistic model checking algorithms
– graph-theoretical algorithms, combined with
• (probabilistic) reachability
• qualitative model checking (for 0/1 probability)
– numerical computation – iterative methods
• quantitative model checking (plot probability values, expectations,
rewards, steady-state, etc, for a range of parameters)
• exhaustive
– sampling-based – simulation
• quantitative model checking as above, based on many simulation runs
• approximate
The PRISM probabilistic model checker
 History
– Implemented at the University of Birmingham
– First public release September 2001, ~7 years development
– Open source, GPL licence, available freely for research an
teaching
– >2500 downloads, many users worldwide, >80 papers using
PRISM, connection to several tools, >40 case studies and 6
flaws, …
 Approach
– Based on symbolic, BDD-based techniques
– Multi-Terminal BDDs, first algorithm [ICALP’97]
 www.cs.bham.ac.uk/~dxp/prism/
– For software, publications, case studies, taught courses using
PRISM, etc
Overview of PRISM
 Functionality
– Implements temporal logic probabilistic model checking
– Construction of models: discrete and continuous Markov chains
(DTMCs/CTMCs), and Markov decision processes (MDPs)
– High-level model description language, state-based, also PEPA
– Probabilistic temporal logics: PCTL and CSL
– Extension with costs/rewards, expectation operator
 Underlying computation combines graph-theoretical algorithms
– Reachability, qualitative model checking, BDD-based
with numerical computation – iterative methods
– Linear equation system solution - Jacobi, Gauss-Seidel, ...
– Uniformisation (CTMCs)
– Dynamic programming (MDPs)
– Explicit and symbolic (MTBDDs, etc.)
PRISM technicalities
 (New) Simulator and sampling-based model checking
– allows to “excute” the model step-by-step or randomly
– avoids state-space explosion, trading off accuracy
 GUI implementation
– integrated editor for PRISM language
– automatic graph plotting
 Support for “experiments”
– e.g. P=? [true U<=T error] for N=1..5,T=1..100
– repeats model checking for a range of model/formula
parameters
– plots on single graph
PRISM property specifications
 PCTL/CSL (true/false) formula examples:
– P<0.001 [ true U100 error ]
“the probability of the system reaching an error state within 100
time units is less than 0.001”
 Can also write query formulae:
– P=? [ true U10 terminate ]
“what is the probability that the algorithm terminates successfully
within 10 time units?”
 Instantaneous rewards, state-based, e.g. “queue size”:
– R=? [ I=T ], expected reward at time instant T?
 Cumulative rewards, state/transition, e.g. “power consumed”:
– R=? [ C<=T ], expected reward by time T?
– R=? [ S ], expected long-run reward per unit time?
PRISM real-world case studies
 CTMCs
–
–
–
–
–
thinkteam (by ter Beek, Massink & Latella) [DSVIS’05]
QoS of workstation cluster (by Haverkort et al) [SRDS’00]
Dependability of embedded controller [INCOM’04]
Dynamic power management [HLDVT’02, FAC 2005]
ERK pathway (by Calder, Hillston et al) [CMSB’05]
–
–
–
–
–
Bluetooth device discovery [ISOLA’04]
Crowds anonymity protocol (by Shmatikov) [CSFW’02, JSC 2003]
Randomised consensus [CAV’01,FORTE’02]
Contract signing protocols (by Norman & Shmatikov) [FASEC’02]
NAND multiplexing for nano (with Shukla) [VLSI’04,TCAD 2005]
 MDPs/DTMCs
 PTAs
– IPv4 ZeroConf dynamic configuration [FORMATS’03]
– Root contention in IEEE 1394 FireWire [FAC 2003, STTT 2004]
– IEEE 802.11 WLAN protocol (and Gopinath) [PROBMIV’02,CAV’05]
Screenshot: Graphs
Screenshot: Graphical input language
Case Study: thinkteam
 Groupware systems, e-Business
– For collaborative working, geographically distributed
 thinkteam (TT) is think3’s Product Data Management (PDM)
application
– Helps to capture, organise, automate and share engineering
product information
– Dispersed and asynchronous groupware system
– Performs controlled storage and retrieval of data (vaulting)
 thinkteam analysis [ter Beek, Massink & Latella]
– Functional verification, e.g.
• concurrency control, detected problem with release of data locks
• denial of service
– Stochastic, quantitative analysis of user interface (here), to
appear in [DSVIS’05]
The thinkteam architecture
Operations:
get, import, checkOut, unCheckOut, checkIn, checkInOut
thinkteam user interface
 thinkteam’s access to documents
– Currently based on retrial principle (no queue)
– Want to consider reservation system
 Modelling and quantitative analysis
–
–
–
–
–
Study effect of document access policies on users
Developed stochastic (CTMC) models in PEPA and PRISM
Analysed the retrial policy and waiting-list policy
Properties expressible in CSL
Used PRISM to obtain quantitative results
 Observations
– Retrial approach less convenient from user’s perspective
• Rapidly deteriorates with increasing number of competing users
Results: thinkteam
The chance that users in retry mode reach checkOut in 5 hours
P=?[true U<=5 (mode=“checkOut”){mode=“retry”}]
Results: thinkteam policies
Comparing effect of the retrial and waiting-list policies on users
(mu is 5 and 10 users respectively)
Case Study: Workstation cluster
 Two sub-clusters (N workstations in each cluster)
–
–
–
–
By Haverkort, Hermanns and Katoen, Proc. IEEE SRDS'00
See www.cs.bham.ac.uk/~dxp/prism/ for PRISM analysis
Star topology with a central switch
Components can break down, single repair unit
left
switch
left
sub-cluster
backbone
right
switch
right
sub-cluster
Case Study: Workstation cluster
 CTMC model
– rates inverse of average durations
– for example:
•
•
•
•
Backbone fails on average every 5000 hours
Backbone repair on average takes 8 hours
Workstation fails on average every 500 hours
Workstation repair on average takes 0.5 hours
 Minimum QOS:
at least 3/4*N workstations operational and connected via switches
 Premium QOS:
all (N) workstations operational and connected via switches
Results: Workstation cluster
The worst case probability of going from minimum to premium
QoS within T hours
P=?[true U<=T ”premium” {“minimum”}{min}]
Results: Workstation cluster
The probability it takes more than T time units to recover from
insufficient QoS
P=?[!”minimum” U>=T ”minimum” {!”minimum”}{max}]
Simulator output: Workstation cluster
Scalability with sampling-based method
Cell cycle control
PRISM model size limit
What we have learnt from practice
 Probabilistic model checking
– Is capable of finding ‘corner cases’ and ‘unusual trends’
– Good for worst-case scenarios, for all initial states
– Benefits from quantitative-style analysis for a range of
parameters
– Is limited by state space size
– Useful for real-world protocol analysis, power management,
performance, biological processes, …
 Simulation and sampling-based techniques
– Limited by accuracy of the results, not state-space explosion
– May need to rerun experiments for each possible start state,
not always feasible
– Statistical methods in conjunction with sampling help
– Nested formulas may be difficult
New directions and challenges
 Often not practical to analyse the full system beforehand
– Adaptive methods
• Genetic algorithm-based methods [Jarvis et al, 2005]
– Online methods
• Continuous approximations using ODEs [Gilmore et al, 2005]
• Applicable for sufficiently large numbers of entities
• Work with PEPA, derivation of ODEs
 Scalability challenge
–
–
–
–
–
–
State-space explosion has not gone away…
Parallelisation
Distributed computation
Compositionality
Abstraction
Approximate methods
Why Grid-enable?
 Experience with PRISM indicates
–
–
–
–
Symbolic representation very compact, >1010 states for CTMCs
Extremely large models feasible depending on regularity
Numerical computation often slow
Sampling-based computation can be even slower
 Can parallelise the symbolic approach
–
–
–
–
Facilitates extraction of the dependency information
Compactness enables storage of the full matrix at each node
Focus on steady-state solution for CTMCs, can generalise
Use wavefront techniques to parallelise
 Easy to distribute sampling-based computation
– Individual simulation runs are independent
Numerical Solution for CTMCs
 Steady-state probability distribution can be obtained by
solving linear equation system:
–
pQ = 0 with constraint Si pi = 1
 We consider the more general problem of solving:
– Ax = b,
– where A is nn matrix, b vector of length n
 Numerical solution techniques
– Direct, not feasible for very large models
– Iterative stationary (Jacobi, Gauss-Seidel), memory efficient
– Projection methods (Krylov, CGS, …), fastest convergence, but
require several vectors
 Transient probabilities similar
– Computed via an iterative method (uniformisation)
Gauss-Seidel
 Computes one matrix row at a time
 Updates ith element using most up-to-date values
 Computation for a single iteration, nn matrix:
1. for (0 · i · n-1)
2.
xi := (bi - 0· j· n-1, j i Aij ¢ xj) / Aii
 Can be reformulated in block form, NN blocks, length M
1. for (0 · p · N-1)
2. v := b(p)
3. for each block A(pq) with qp
4.
v := v - A(pq) x(q)
5. for (0· i· M-1,ij)
6.
x(p)i := (vi - S0· j· M A(pp)ij ¢ x(p)j ) / A(pp)ii
 Computes one matrix block at a time
Parallelising Gauss-Seidel
 Inherently sequential for dense matrices
–
Uses results from current and previous iterations
 Permutation has no effect on correctness of the result
–
Can be exploited to achieve parallelisation for certain sparse
matrix problems, e.g. Koester, Ranka & Fox 1994
 The block formulation helps, although
–
–
Requires row-wise access to blocks and block entries
Need to respect computational dependencies, i.e. when
computing vector block x(p), use values from current iteration
for blocks q < p, and from previous iteration for q > p
 Idea: propose to use wavefront techniques
–
–
Work with symbolic, MTBDD-based data structures
Extract dependency information and form execution schedule
MTBDD data structures
 Recursive, based on Binary Decision Diagrams (BDDs)
–
–
Stored in reduced form (DAG), with isomorphic subtrees
stored only once
Exploit regularity to obtain compact matrix storage
Matrices as MTBDDs
 Representation
–
–
Root represents the whole matrix
Leaves store matrix entries, reachable by following paths from
the root node
Matrices as MTBDDs
 Recursively descending through the tree
–
–
Divides the matrix into submatrices
One level, divide into two submatrices
Matrices as MTBDDs
 Recursively descending through the tree
–
–
Provides a convenient block decomposition
Two levels, divide into four blocks
A two-layer structure from MTBDDs
 Can obtain block decomposition, store as two sparse matrices
–
Enables fast row-wise access to blocks and block entries
(Parker’02,
Mehmood’05)
Wavefront techniques
 An approach to parallel programming, e.g. Joubert et al ‘98
–
–
Divide a computation into many tasks
Form a schedule for these tasks
 A schedule contains several wavefronts
–
–
Each wavefront comprises tasks that are algorithmically
independent of each other
I.e. correctness is not affected by the order of execution
 The execution is carried out from one wavefront to another
–
–
Tasks assigned according to the dependency structure
Each wavefront contains tasks that can be executed in parallel
 Our approach
–
–
Tasks are determined by matrix blocks
Fast extraction of dependency information from MTBDD matrix
A two-layer structure from MTBDDs
 Can obtain block decomposition, store as two sparse matrices
–
Enables fast row-wise access to blocks and block entries
(Parker’02,
Mehmood’05)
Dependency graph from MTBDD
 By traversal of top levels of MTBDD, as for top layer
Generating a Wavefront Schedule
 By colouring the dependency graph
 Can generate a schedule to let the computation perform
from one colour to another
A Wavefront Parallel Gauss-Seidel
 The computation ordering of Gauss-Seidel method is
determined by the colouring scheme
–
–
–
–
Each processor is responsible for the computation and storage of
every vector block associated with tasks assigned to it
The computation of Gauss-Seidel method progresses from one
colour to another
Tasks with the same colour are algorithmically independent of
each other and can be executed in parallel
The completion of a task may enable tasks in the next colour to
be executed
 Note
–
–
Can store full matrix at each node, hence reduced communication
costs, since only vector blocks need to be exchanged
(With two-layer data structure), faster access to blocks and
block entries than conventional MTBDDs
Implementation
 Runs on Ethernet and Myrinet-enabled PC cluster [DSN’05]
–
–
–
Use MPI (the MPICH implementation)
Prototype extension for PRISM, uses PRISM numerical engines
and CUDD package for MTBDDs (Somenzi)
Various optimisations, non-blocking inter-processor communication,
load-balancing, etc
 Evaluated on a range of benchmarks
–
Kanban, FMS and Polling system
 Now running on the Grid [AHM’05]
–
–
–
Globus implementation using GT4
Uses Globus FTP service
Within PRISM, currently only steady-state
Experimental results: models
 Parameters and statistics of models
– Include Kanban 9,10 and FMS 13, previously intractable
– All compact, requiring less than 1GB
Experimental results: time
 Total execution times (in seconds) with 1 to 32 nodes
– Termination condition maximum relative difference 10-6
– Block numbers selected to minimise storage
Experimental results: FMS speed-up
Experimental results: Kanban speed-up
PRISM successes so far
 Fully automatic, no expert knowledge needed for
– Probabilistic reachability and temporal logic properties
– Expected time/cost
 Tangible results!
– Real flaws found in real-world protocols
– Greater level of detail, has exposed obscure dependencies
 PRISM tool robust
– Large, realistic models often possible
– Choice of engines
 Very broad range of application domains
– Communication protocols, distributed coordination, power
management, security protocols, quantum cryptographic
protocols, dependability, reliability, systems biology, …
But…
 Models monolithic and finite-state only
– Emphasis on efficiency
– No decomposition, abstraction
– No data reduction, yet
 State-space explosion has not gone away…
– Heuristics for MTBDDs/BDDs sometimes fail
– Parallelisation, sampling-based computation helps
 Limited expressiveness
–
–
–
–
–
Only PCTL plus extensions (LTL in progress)
Only exponential distributions
Direct support for PTAs in progress [FORMATS’04]
No continuous space models
No mobility yet…
Challenges for future
 Exploiting structure
– Abstraction, data/equivalence quotient, (de)compositionality…
– Parametric probabilistic verification?
 Proof assistant for probabilistic verification?
 Efficient methods for continuous models
– Continuous PTAs? Continuous time MDPs? LMPs?
 More expressive specifications
– Probabilistic LTL/PCTL*/mu-calculus?
 Real software, not models!
For more information…
J. Rutten, M. Kwiatkowska, G. Norman and
D. Parker
Mathematical Techniques for Analyzing
Concurrent and Probabilistic Systems
P. Panangaden and F. van Breugel (editors),
CRM Monograph Series, vol. 23, AMS
March 2004
www.cs.bham.ac.uk/~dxp/prism/




Case studies, statistics, group publications
Download, version 2.1 (2500 downloads)
Unix/Linux, Windows, Apple platforms
Publications by others and courses that
feature PRISM…
PRISM collaborators worldwide
Download