Probabilistic model checking for the Grid and on the Grid Marta Kwiatkowska School of Computer Science www.cs.bham.ac.uk/~mzk www.cs.bham.ac.uk/~dxp/prism ARW, 29-30th July 2005 Overview e-Science and the Grid – The promises and the challenges Probabilistic model checking – Why needed? – What does it involve? – The PRISM model checker … for the Grid – Collaborative protocols: thinkteam example – Performance: QoS analysis of workstation cluster … on the Grid – Parallel symbolic solution for Gauss-Seidel Conclusion With thanks to… Main collaborators on probabilistic model checking – Gethin Norman, Dave Parker, Jeremy Sproston, Christel Baier, Roberto Segala, Michael Huth, Luca de Alfaro, Joost-Pieter Katoen, Antonio Pacheco PRISM model checker, implementation & Grid-enabling – Dave Parker, Andrew Hinton, Rashid Mehmood, Yi Zhang, Hakan Younes, Stephen Gilmore, Michael Goldsmith, Conrado Daws, Fuzhi Wang Case studies – Dave Parker, Gethin Norman, Maurice ter Beek, Mieke Massink, Diego Latella, Boudewijn Haverkort, Holger Hermanns and Joost-Pieter Katoen And many more… The e-Scientist of the future Will this work? The Internet Remote access to high-performance computers, via Internet Remote access to visualisation facilities, via Internet Computational steering from anywhere, via PDA Visualisation, on your laptop Fast, online, accurate, … What enables e-Science? Data sharing: distribution, query searches, curation – Distributed databases – Web-services – Ontologies Collaboration: coordination, concurrency control – Collaborative protocols – Workflow – Web-service orchestration Compute power: scheduling, load-balancing, performance – Computer clusters – Campus and intra-Grids – Large-scale grids The role of probability Enables performance modelling – Pioneered by Erlang, in telecommunications, ca 1910 – Models: typically continuous time Markov chains – Emphasis on steady-state and transient probabilities Also used in stochastic planning, to model uncertainty – cf Bellman equations, ca 1950s – Models: Markov decision processes – Emphasis on finding optimum policies Our focus, probabilistic model checking – – – – Distinctive, on automated verification for probabilistic systems Temporal logic specifications, automata-theoretic techniques Shared models Exchanging techniques with the other two areas Verification via model checking… or falsification? The model or Model Checker send deliver Temporal logic specification Error trace Line Line Line … Line Line 5: … 21: … 15: … 27: … 45: ... Probabilistic model checking… in a nutshell 0.4 0.3 Probabilistic model send P¸ 0.9(deliver) Probabilistic temporal logic specification or Probabilistic Model Checker or The probability State State State … State State 5: 0.6789 6: 0.9789 7: 1.0 12: 0 13: 0.1245 Probabilistic model checking inputs… Models – – – – discrete time Markov chains (DTMCs) continuous time Markov chains (CTMCs) Markov decision processes (MDPs) (currently indirectly) probabilistic timed automata (PTAs) (Yes/No) temporal logic specification languages – Probabilistic temporal logic PCTL (for DTMCs/MDPs) – Continuous Stochastic Logic CSL (for CTMCs) – Probabilistic timed computation tree logic PTCTL (for PTAs) Quantitative specification language variants – Probability values for logics PCTL/CSL/PTCTL (for all models) – Extension with expectation operator (for all) – Extension with costs/rewards (for all) Probabilistic models Discrete time – Probabilistic transitions, represented as discrete probability distribution – Model types • Discrete time Markov chains (DTMCs): probabilistic choice only • Markov decision processes (MDPs): probabilistic choice and nondeterminism p1 p2 pn . . . i pi = 1 Continuous passage of time, randomly distributed delays, space – Model types • Continuous time Markov chains (CTMCs): exponentially delays, no nondeterminism • Probabilistic Timed Automata (PTAs): dense time, admit nondeterminism • Labelled Markov Processes (LMPs): continuous space/time time s0+1 f(x)dx = 1 Discrete-Time Markov Chains (DTMCs) Features: – Only probabilistic choice in each state Formally, (S,s0,P,L): fail 1 init 0.02 try s0 1 – S finite set of states – s0 initial state s1 0.01 s3 0.97 succ s2 – P: S £ S ! [0,1] probability matrix, s.t. s’ P(s,s’) = 1, all s – L: S ! 2AP atomic propositions Unfold into infinite paths s0s1s2s3s4… s.t. P(si,si+1) > 0, all i Probability for finite paths, multiply along path e.g. s0 s1 s1 s2 is 1 ¢ 0.01 ¢ 0.97 = 0.0097 1 Probability space Intuitively: – Sample space = infinite paths Paths from s – Event = set of paths ss1s2…sk – Basic event = cone Formally, (Paths, W, Pr) – For finite path w = ss1…sn, define probability P(w) = 1 if w has length one P(s,s1) ¢ … ¢ P(sn-1,sn) otherwise – Take W least s-algebra containing cones C(w) = { p 2 Paths | w is prefix of p} – Define Pr(C(w)) = P(w), all w – Pr extends uniquely to measure on Paths The probabilistic operator New probabilistic operator, a quantitative analogue of 8, 9 – e.g. send P¸ 0.9(deliver) “whenever a message is sent, the probability that it is eventually delivered is at least 0.9” – Takes path formulas a: a (eventually), a U b (until), etc Formally, s ² P» p(a) , Pr { p 2 Paths j p ² a } » p < 1 - p S a-paths threshold level p ¸ p Continuous Time Markov Chains (CTMCs) Features: – Discrete states and real time – Exponentially distributed random delays empty 0 3 3 4 1 4 3 2 4 full 3 Formally: – Set of states S plus rates R(s,s’) > 0 of moving from s to s’ – Probability of moving from s to s’ by time t > 0 is 1 - e-R(s,s’)¢ t – Transition rate matrix S £ S ! R¸0 Unfold into infinite paths s0t0s1t1s2t2s3… – probs (s’), probability of being in s’ in the long-run, starting in s – probs (s’,t), probability of being in s’ at time instant t Cone construction generalises via time intervals The logic CSL: syntax Continuous Stochastic Logic [ASSB96,BKH99] – For CTMCs, based on PCTL, for example • • P< 0.85(}<15 full), probability operator “the probability of queue becoming full within 15 secs is < 0.85” S< 0.01(down), steady-state operator “in the long run, the probability the system is down is less than 1%” The syntax of state and path formulas of CSL is: f ::= true | a | f Æ f | :f | S» p(f) | P» p(a) a ::= X f | f U· t f | f U f where p 2 [0,1] is a probability bound, t 2 R¸0 and » 2 { <, >, … } Extension with time intervals for until, cost/rewards and expectation operator E» c(f) CSL semantics Semantics of bounded until: p ² f1 U· t f2 , iff f2 satisfied at time instant t along p = s0L and f1 satisfied at all preceding time instants The added operators: s ² S» p(f) , Ss’ ² f probs (s’) » p where probs (s’) is prob. of being in s’ in the long-run, having started in s s ² P» p(a) , Pr { p 2 Paths j p ² a } » p where Pr is probability measure on paths as before Also quantitative form of formulas, instead of threshold The logic CSL: model checking By induction on structure of formula, except for – S» p(f) and P» p(f1 U· t f2) The steady-state operator – Reduces to solution of a linear equation system The time-bounded until – Numerical [Baier et al, CONCUR’00] • Reduces to transient analysis using uniformisation • Transform CTMC by removing all outgoing transitions from states satisfying f2 or :f1 • Then Pr { p 2 Paths j p ² f U· t f } = Ss’ ² f2 probs (s’,t) – Sampling-based, statistical hypothesis testing [Younes et al, CAV’03, TACAS’04] • Reduces to simulation and estimation of satisfaction of time bound Probabilistic model checking involves… Construction of models: – DTMCs/CTMCs, MDPs, and PTAs (with digital clocks) Implementation of probabilistic model checking algorithms – graph-theoretical algorithms, combined with • (probabilistic) reachability • qualitative model checking (for 0/1 probability) – numerical computation – iterative methods • quantitative model checking (plot probability values, expectations, rewards, steady-state, etc, for a range of parameters) • exhaustive – sampling-based – simulation • quantitative model checking as above, based on many simulation runs • approximate The PRISM probabilistic model checker History – Implemented at the University of Birmingham – First public release September 2001, ~7 years development – Open source, GPL licence, available freely for research an teaching – >2500 downloads, many users worldwide, >80 papers using PRISM, connection to several tools, >40 case studies and 6 flaws, … Approach – Based on symbolic, BDD-based techniques – Multi-Terminal BDDs, first algorithm [ICALP’97] www.cs.bham.ac.uk/~dxp/prism/ – For software, publications, case studies, taught courses using PRISM, etc Overview of PRISM Functionality – Implements temporal logic probabilistic model checking – Construction of models: discrete and continuous Markov chains (DTMCs/CTMCs), and Markov decision processes (MDPs) – High-level model description language, state-based, also PEPA – Probabilistic temporal logics: PCTL and CSL – Extension with costs/rewards, expectation operator Underlying computation combines graph-theoretical algorithms – Reachability, qualitative model checking, BDD-based with numerical computation – iterative methods – Linear equation system solution - Jacobi, Gauss-Seidel, ... – Uniformisation (CTMCs) – Dynamic programming (MDPs) – Explicit and symbolic (MTBDDs, etc.) PRISM technicalities (New) Simulator and sampling-based model checking – allows to “excute” the model step-by-step or randomly – avoids state-space explosion, trading off accuracy GUI implementation – integrated editor for PRISM language – automatic graph plotting Support for “experiments” – e.g. P=? [true U<=T error] for N=1..5,T=1..100 – repeats model checking for a range of model/formula parameters – plots on single graph PRISM property specifications PCTL/CSL (true/false) formula examples: – P<0.001 [ true U100 error ] “the probability of the system reaching an error state within 100 time units is less than 0.001” Can also write query formulae: – P=? [ true U10 terminate ] “what is the probability that the algorithm terminates successfully within 10 time units?” Instantaneous rewards, state-based, e.g. “queue size”: – R=? [ I=T ], expected reward at time instant T? Cumulative rewards, state/transition, e.g. “power consumed”: – R=? [ C<=T ], expected reward by time T? – R=? [ S ], expected long-run reward per unit time? PRISM real-world case studies CTMCs – – – – – thinkteam (by ter Beek, Massink & Latella) [DSVIS’05] QoS of workstation cluster (by Haverkort et al) [SRDS’00] Dependability of embedded controller [INCOM’04] Dynamic power management [HLDVT’02, FAC 2005] ERK pathway (by Calder, Hillston et al) [CMSB’05] – – – – – Bluetooth device discovery [ISOLA’04] Crowds anonymity protocol (by Shmatikov) [CSFW’02, JSC 2003] Randomised consensus [CAV’01,FORTE’02] Contract signing protocols (by Norman & Shmatikov) [FASEC’02] NAND multiplexing for nano (with Shukla) [VLSI’04,TCAD 2005] MDPs/DTMCs PTAs – IPv4 ZeroConf dynamic configuration [FORMATS’03] – Root contention in IEEE 1394 FireWire [FAC 2003, STTT 2004] – IEEE 802.11 WLAN protocol (and Gopinath) [PROBMIV’02,CAV’05] Screenshot: Graphs Screenshot: Graphical input language Case Study: thinkteam Groupware systems, e-Business – For collaborative working, geographically distributed thinkteam (TT) is think3’s Product Data Management (PDM) application – Helps to capture, organise, automate and share engineering product information – Dispersed and asynchronous groupware system – Performs controlled storage and retrieval of data (vaulting) thinkteam analysis [ter Beek, Massink & Latella] – Functional verification, e.g. • concurrency control, detected problem with release of data locks • denial of service – Stochastic, quantitative analysis of user interface (here), to appear in [DSVIS’05] The thinkteam architecture Operations: get, import, checkOut, unCheckOut, checkIn, checkInOut thinkteam user interface thinkteam’s access to documents – Currently based on retrial principle (no queue) – Want to consider reservation system Modelling and quantitative analysis – – – – – Study effect of document access policies on users Developed stochastic (CTMC) models in PEPA and PRISM Analysed the retrial policy and waiting-list policy Properties expressible in CSL Used PRISM to obtain quantitative results Observations – Retrial approach less convenient from user’s perspective • Rapidly deteriorates with increasing number of competing users Results: thinkteam The chance that users in retry mode reach checkOut in 5 hours P=?[true U<=5 (mode=“checkOut”){mode=“retry”}] Results: thinkteam policies Comparing effect of the retrial and waiting-list policies on users (mu is 5 and 10 users respectively) Case Study: Workstation cluster Two sub-clusters (N workstations in each cluster) – – – – By Haverkort, Hermanns and Katoen, Proc. IEEE SRDS'00 See www.cs.bham.ac.uk/~dxp/prism/ for PRISM analysis Star topology with a central switch Components can break down, single repair unit left switch left sub-cluster backbone right switch right sub-cluster Case Study: Workstation cluster CTMC model – rates inverse of average durations – for example: • • • • Backbone fails on average every 5000 hours Backbone repair on average takes 8 hours Workstation fails on average every 500 hours Workstation repair on average takes 0.5 hours Minimum QOS: at least 3/4*N workstations operational and connected via switches Premium QOS: all (N) workstations operational and connected via switches Results: Workstation cluster The worst case probability of going from minimum to premium QoS within T hours P=?[true U<=T ”premium” {“minimum”}{min}] Results: Workstation cluster The probability it takes more than T time units to recover from insufficient QoS P=?[!”minimum” U>=T ”minimum” {!”minimum”}{max}] Simulator output: Workstation cluster Scalability with sampling-based method Cell cycle control PRISM model size limit What we have learnt from practice Probabilistic model checking – Is capable of finding ‘corner cases’ and ‘unusual trends’ – Good for worst-case scenarios, for all initial states – Benefits from quantitative-style analysis for a range of parameters – Is limited by state space size – Useful for real-world protocol analysis, power management, performance, biological processes, … Simulation and sampling-based techniques – Limited by accuracy of the results, not state-space explosion – May need to rerun experiments for each possible start state, not always feasible – Statistical methods in conjunction with sampling help – Nested formulas may be difficult New directions and challenges Often not practical to analyse the full system beforehand – Adaptive methods • Genetic algorithm-based methods [Jarvis et al, 2005] – Online methods • Continuous approximations using ODEs [Gilmore et al, 2005] • Applicable for sufficiently large numbers of entities • Work with PEPA, derivation of ODEs Scalability challenge – – – – – – State-space explosion has not gone away… Parallelisation Distributed computation Compositionality Abstraction Approximate methods Why Grid-enable? Experience with PRISM indicates – – – – Symbolic representation very compact, >1010 states for CTMCs Extremely large models feasible depending on regularity Numerical computation often slow Sampling-based computation can be even slower Can parallelise the symbolic approach – – – – Facilitates extraction of the dependency information Compactness enables storage of the full matrix at each node Focus on steady-state solution for CTMCs, can generalise Use wavefront techniques to parallelise Easy to distribute sampling-based computation – Individual simulation runs are independent Numerical Solution for CTMCs Steady-state probability distribution can be obtained by solving linear equation system: – pQ = 0 with constraint Si pi = 1 We consider the more general problem of solving: – Ax = b, – where A is nn matrix, b vector of length n Numerical solution techniques – Direct, not feasible for very large models – Iterative stationary (Jacobi, Gauss-Seidel), memory efficient – Projection methods (Krylov, CGS, …), fastest convergence, but require several vectors Transient probabilities similar – Computed via an iterative method (uniformisation) Gauss-Seidel Computes one matrix row at a time Updates ith element using most up-to-date values Computation for a single iteration, nn matrix: 1. for (0 · i · n-1) 2. xi := (bi - 0· j· n-1, j i Aij ¢ xj) / Aii Can be reformulated in block form, NN blocks, length M 1. for (0 · p · N-1) 2. v := b(p) 3. for each block A(pq) with qp 4. v := v - A(pq) x(q) 5. for (0· i· M-1,ij) 6. x(p)i := (vi - S0· j· M A(pp)ij ¢ x(p)j ) / A(pp)ii Computes one matrix block at a time Parallelising Gauss-Seidel Inherently sequential for dense matrices – Uses results from current and previous iterations Permutation has no effect on correctness of the result – Can be exploited to achieve parallelisation for certain sparse matrix problems, e.g. Koester, Ranka & Fox 1994 The block formulation helps, although – – Requires row-wise access to blocks and block entries Need to respect computational dependencies, i.e. when computing vector block x(p), use values from current iteration for blocks q < p, and from previous iteration for q > p Idea: propose to use wavefront techniques – – Work with symbolic, MTBDD-based data structures Extract dependency information and form execution schedule MTBDD data structures Recursive, based on Binary Decision Diagrams (BDDs) – – Stored in reduced form (DAG), with isomorphic subtrees stored only once Exploit regularity to obtain compact matrix storage Matrices as MTBDDs Representation – – Root represents the whole matrix Leaves store matrix entries, reachable by following paths from the root node Matrices as MTBDDs Recursively descending through the tree – – Divides the matrix into submatrices One level, divide into two submatrices Matrices as MTBDDs Recursively descending through the tree – – Provides a convenient block decomposition Two levels, divide into four blocks A two-layer structure from MTBDDs Can obtain block decomposition, store as two sparse matrices – Enables fast row-wise access to blocks and block entries (Parker’02, Mehmood’05) Wavefront techniques An approach to parallel programming, e.g. Joubert et al ‘98 – – Divide a computation into many tasks Form a schedule for these tasks A schedule contains several wavefronts – – Each wavefront comprises tasks that are algorithmically independent of each other I.e. correctness is not affected by the order of execution The execution is carried out from one wavefront to another – – Tasks assigned according to the dependency structure Each wavefront contains tasks that can be executed in parallel Our approach – – Tasks are determined by matrix blocks Fast extraction of dependency information from MTBDD matrix A two-layer structure from MTBDDs Can obtain block decomposition, store as two sparse matrices – Enables fast row-wise access to blocks and block entries (Parker’02, Mehmood’05) Dependency graph from MTBDD By traversal of top levels of MTBDD, as for top layer Generating a Wavefront Schedule By colouring the dependency graph Can generate a schedule to let the computation perform from one colour to another A Wavefront Parallel Gauss-Seidel The computation ordering of Gauss-Seidel method is determined by the colouring scheme – – – – Each processor is responsible for the computation and storage of every vector block associated with tasks assigned to it The computation of Gauss-Seidel method progresses from one colour to another Tasks with the same colour are algorithmically independent of each other and can be executed in parallel The completion of a task may enable tasks in the next colour to be executed Note – – Can store full matrix at each node, hence reduced communication costs, since only vector blocks need to be exchanged (With two-layer data structure), faster access to blocks and block entries than conventional MTBDDs Implementation Runs on Ethernet and Myrinet-enabled PC cluster [DSN’05] – – – Use MPI (the MPICH implementation) Prototype extension for PRISM, uses PRISM numerical engines and CUDD package for MTBDDs (Somenzi) Various optimisations, non-blocking inter-processor communication, load-balancing, etc Evaluated on a range of benchmarks – Kanban, FMS and Polling system Now running on the Grid [AHM’05] – – – Globus implementation using GT4 Uses Globus FTP service Within PRISM, currently only steady-state Experimental results: models Parameters and statistics of models – Include Kanban 9,10 and FMS 13, previously intractable – All compact, requiring less than 1GB Experimental results: time Total execution times (in seconds) with 1 to 32 nodes – Termination condition maximum relative difference 10-6 – Block numbers selected to minimise storage Experimental results: FMS speed-up Experimental results: Kanban speed-up PRISM successes so far Fully automatic, no expert knowledge needed for – Probabilistic reachability and temporal logic properties – Expected time/cost Tangible results! – Real flaws found in real-world protocols – Greater level of detail, has exposed obscure dependencies PRISM tool robust – Large, realistic models often possible – Choice of engines Very broad range of application domains – Communication protocols, distributed coordination, power management, security protocols, quantum cryptographic protocols, dependability, reliability, systems biology, … But… Models monolithic and finite-state only – Emphasis on efficiency – No decomposition, abstraction – No data reduction, yet State-space explosion has not gone away… – Heuristics for MTBDDs/BDDs sometimes fail – Parallelisation, sampling-based computation helps Limited expressiveness – – – – – Only PCTL plus extensions (LTL in progress) Only exponential distributions Direct support for PTAs in progress [FORMATS’04] No continuous space models No mobility yet… Challenges for future Exploiting structure – Abstraction, data/equivalence quotient, (de)compositionality… – Parametric probabilistic verification? Proof assistant for probabilistic verification? Efficient methods for continuous models – Continuous PTAs? Continuous time MDPs? LMPs? More expressive specifications – Probabilistic LTL/PCTL*/mu-calculus? Real software, not models! For more information… J. Rutten, M. Kwiatkowska, G. Norman and D. Parker Mathematical Techniques for Analyzing Concurrent and Probabilistic Systems P. Panangaden and F. van Breugel (editors), CRM Monograph Series, vol. 23, AMS March 2004 www.cs.bham.ac.uk/~dxp/prism/ Case studies, statistics, group publications Download, version 2.1 (2500 downloads) Unix/Linux, Windows, Apple platforms Publications by others and courses that feature PRISM… PRISM collaborators worldwide