Concurrent Executions on Relaxed Memory Models Challenges & Opportunities for Software Model Checking Rajeev Alur University of Pennsylvania Joint work with Sebastian Burckhardt and Milo Martin SPIN Workshop, July 2007 Model Checking Concurrent Software Multi-threaded Software Shared-memory Multiprocessor Concurrent Executions Bugs Concurrency on Multiprocessors Initially x = y = 0 thread 1 x = 1 y = 1 thread 2 print y print x →1 →0 Output not consistent with any interleaved execution! can be the result of out-of-order stores can be the result of out-of-order loads improves performance Architectures with Weak Memory Models A modern multiprocessor does not enforce global ordering of all instructions for performance reasons Each processor has pipelined architecture and executes multiple instructions simultaneously Each processor has a local cache, and loads/stores to local cache become visible to other processors and shared memory at different times Lamport (1979): Sequential consistency semantics for correctness of multiprocessor shared memory (like interleaving) Considered too limiting, and many “relaxations” proposed In theory: TSO, RMO, Relaxed … In practice: Alpha, Intel IA-32, IBM 370, Sun SPARC, PowerPC … Active research area in computer architecture Programming with Weak Memory Models Concurrent programming is already hard, shouldn’t the effects of weaker models be hidden from the programmer? Mostly yes … Safe programming using extensive use of synchronization primitives Use locks for every access to shared data Compilers use memory fences to enforce ordering Not always … Non-blocking data structures Highly optimized library code for concurrency Code for lock/unlock instructions Non-blocking Queue (MS’96) boolean_t dequeue(queue_t *queue, value_t *pvalue) { node_t *head; node_t *tail; node_t *next; 2 3 1 while (true) { head = queue->head; tail = queue->tail; head tail next = head->next; if (head == queue->head) { if (head == tail) { if (next == 0) return false; cas(&queue->tail, (uint32) tail, (uint32) next); } else { *pvalue = next->value; if (cas(&queue->head, (uint32) head, (uint32) next)) break; } } } delete_node(head); return true; } Software Model Checking for Concurrent Code on Multiprocessors Why?: Real bugs in real code Opportunities 10s—100s lines of low-level library C code Hard to design and verify -> buggy Effects of weak memory models, fences … Challenges Lots of behaviors possible: high level of concurrency How to formalize and reason about weak memory models? Talk Outline Motivation Relaxed Memory Models CheckFence: Analysis tool for Concurrent Data Types Hierarchy of Abstractions Programs (multi-threaded) -- Synchronization primitives: locks, atomic blocks -- Library primitives (e.g. shared queues) Language level memory model Assembly code -- Synchronization primitives: compare&swap, LL/SC -- Loads/stores from shared memory Architecture level memory model Hardware -- multiple processors -- write buffers, caches, communication bus … Shared Memory Consistency Models Specifies restrictions on what values a read from shared memory can return Program Order: x <p y if x and y are instructions belonging to the same thread and x appears before y Sequential Consistency (Lamport 79): There exists a global order < of all accesses such that < is consistent with the program order <p Each read returns value of most recent, according to <, write to the same location (or initial value, if no such write exists) Clean abstraction for programmers, but high implementation cost Effect of Memory Model Initially flag1 = flag2 = 0 thread 1 thread 2 1. 2. 1. 2. flag1 = 1; if (flag2 == 0) crit. sect. flag2 = 1; if (flag1 == 0) crit. sect. Ensures mutual exclusion if architecture supports SC memory Most architectures do not enforce ordering of accesses to different memory locations Does not ensure mutual exclusion under weaker models Ordering can be enforced using “fence” instructions Insert MEMBAR between lines 1 and 2 to ensure mutual exclusion Relaxed Memory Models A large variety of models exist; a good starting point: Shared Memory Consistency Models: A tutorial IEEE Computer 96, Adve & Gharachorloo How to relax memory order requirement? Operations of same thread to different locations need not be globally ordered How to relax write atomicity requirement? Read may return value of a write not yet globally visible Uniprocessor semantics preserved Typically defined in architecture manuals (e.g. SPARC manual) Unusual Effects of Memory Models Initially A = flag1 = flag2 = 0 thread 1 flag1 = 1; A = 1; reg1 = A; reg2 = flag2; thread 2 flag2 = 1; A = 2; reg3 = A; reg4 = flag1; Result reg1 = 1; reg3 = 2; reg2 = reg4 = 0 Possible on TSO/SPARC Write to A propagated only to local reads to A Reads to flags can occur before writes to flags Not allowed on IBM 370 Read of A on a processor waits till write to A is complete Which Memory Model? Memory models are platform dependent RMO We use a conservative approximation “Relaxed” to capture common effects PSO TSO 390 SC Alpha IA-32 Relaxed Once code is correct for “Relaxed”, it is correct for many models Formalization of Relaxed A set of addresses X set of memory accesses a(x) memory address of x V set of values S are stores L are loads v(x) value loaded or stored by x <p is a partial order over X (program order) Correctness wrt Relaxed demands existence of a total order < over X s.t. For a load l in L, define the following set of stores that are “visible to l”: S(l) = { s in S | a(s) = a(l) and (s < l or s <p l ) } 1. If x <p y and a(x) = a(y) and y in S, then x < y 2. For l in L and s in S(l), always either v(l) = v(s) or there exists another store s’ in S(l) such that s < s’ Talk Outline Motivation Relaxed memory models CheckFence: Analysis tool for Concurrent Data Types CheckFence Focus concurrency libraries with lock-free synchronization ... are simple, fast, and safe to use concurrent versions of queues, sets, maps, etc. more concurrency, less waiting fewer deadlocks ... are notoriously hard to design and verify tricky interleavings routinely escape reasoning and testing exposed to relaxed memory models code needs to contain memory fences for correct operation! Concurrent Algorithms: + Lock-free queues, sets, lists CheckFence Tool Architecture: + Multiprocessors + Relaxed memory models References [CAV 2006], [PLDI 2007] Burckhardt thesis Computer-Aided Verification: + Model checking C code + Counterexamples Non-blocking Queue Processor 1 .... ... enqueue(1) ... enqueue(2) .... .... .... Processor 2 .... ... ... a = dequeue() b = dequeue() void enqueue(int val) { ... } int dequeue() { ... } The client program The implementation on multiple processors calls operations optimized: no locks. not race-free exposed to memory model Non-blocking Queue (MS’96) boolean_t dequeue(queue_t *queue, value_t *pvalue) { node_t *head; node_t *tail; node_t *next; 2 3 1 while (true) { head = queue->head; tail = queue->tail; head tail next = head->next; if (head == queue->head) { if (head == tail) { if (next == 0) return false; cas(&queue->tail, (uint32) tail, (uint32) next); } else { *pvalue = next->value; if (cas(&queue->head, (uint32) head, (uint32) next)) break; } } } delete_node(head); return true; } Correctness Condition Data type implementations must appear sequentially consistent to the client program: the observed argument and return values must be consistent with some interleaved, atomic execution of the operations. Observation enqueue(1) dequeue() -> 2 Witness Interleaving enqueue(2) dequeue() -> 1 enqueue(1) enqueue(2) dequeue() -> 1 dequeue() -> 2 How To Bound Executions Verify individual “symbolic tests” finite number of concurrent threads finite number of operations/thread nondeterministic input values Example thread 1 enqueue(X) thread 2 dequeue() → Y User creates suite of tests of increasing size Why symbolic test programs? 1) Make everything finite State is unbounded (dynamic memory allocation) ... is bounded for individual test Checking sequential consistency is undecidable (AMP 96) ... is decidable for individual test 2) Gives us finite instruction sequence to work with State space too large for interleaved system model .... can directly encode value flow between instructions Memory model specified by axioms .... can directly encode ordering axioms on instructions Bounded Model Checker Pass: all executions of the test are observationally equivalent to a serial execution CheckFence Fail: Inconclusive: runs out of time or memory Memory Model Axioms Tool Architecture Trace C code Memory model Symbolic Test Symbolic test gives exponentially many executions (symbolic inputs, dynamic memory allocation, ordering of instructions). CheckFence solves for “bad” executions. construct CNF formula whose solutions correspond precisely to the concurrent executions Trace C code Memory model Symbolic Test automatic, lazy loop unrolling automatic specification mining (enumerate correct observations) Specification Mining thread 1 thread 2 dequeue() → Z enqueue(X); enqueue(Y) Possible Operation-level Interleavings enqueue(X) enqueue(X) dequeue() -> Z enqueue(Y) dequeue() -> Z enqueue(X) dequeue() -> Z enqueue(Y) enqueue(Y) For each interleaving, obtain symbolic constraint by encoding corresponding executions in SAT solver Spec is disjunction of all possibilities: Spec: (Z=X) | (Z=null) To find bugs, check satisfiability of Phi & ~ Spec where Phi encodes all possible concurrent executions Encoding Memory Order thread 1 s1 s2 Example: thread 2 store store l1 l2 load load Encoding variables Use bool vars for relative order (x<y) of memory accesses Use bitvector variables Ax and Dx for address and data values associated with memory access x Encode constraints encode transitivity of memory order encode ordering axioms of the memory model Example (for SC): (s1<s2) & (l1<l2) encode value flow “Loaded value must match last value stored to same address” Example: value must flow from s1 to l1 under following conditions: ((s1<l1)&(A s1 ) -> (D = Al1)|((s2<s1)|(l1<s2)|(As2 != Al1)) s1= Dl1) Example: Memory Model Bug 1 head Processor 1 links new node into list ... 3 node->value = 2; ... 1 head = node; ... 2 3 Processor 2 reads value at head of list ... 2 value = head->value; ... Processor 1 reorders the stores! memory accesses happen in order 1 2 3 --> Processor 2 loads uninitialized value adding a fence between lines on left side prevents reordering Studied Algorithms Type Description LOC Source Queue Two-lock queue 80 M. Michael and L. Scott Queue Non-blocking queue 98 (PODC 1996) Set Lazy list-based set 141 Heller et al. (OPODIS 2005) Set Nonblocking list 174 T. Harris (DISC 2001) Deque “snark” algorithm 159 D. Detlefs et al. (DISC 2000) LL/VL/SC CAS-based 74 M. Moir (PODC 1997) LL/VL/SC Bounded Tags 198 Results snark algorithm has 2 known bugs lazy list-based set had a unknown bug (missing initialization; missed by formal correctness proof [CAV 2006] because of hand-translation of pseudocode) Type Description Queue Two-lock queue Queue Non-blocking queue Set Lazy list-based set Set Nonblocking list Deque original “snark” Deque fixed “snark” LL/VL/SC CAS-based LL/VL/SC Bounded Tags regular bugs 1 unknown 2 known Results snark algorithm has 2 known bugs lazy list-based set had a unknown bug (missing initialization; missed by formal correctness proof [CAV 2006] because of hand-translation of pseudocode) Many failures on relaxed memory model • inserted fences by hand to fix them • small testcases sufficient for this purpose Type Description regular bugs # Fences inserted Store Load Dependent Aliased Store Load Loads Loads Queue Two-lock queue 1 Queue Non-blocking queue 2 Set Lazy list-based set Set Nonblocking list Deque original “snark” Deque fixed “snark” 1 unknown 1 4 1 2 1 3 1 2 3 4 6 2 known 4 2 LL/VL/SC CAS-based 3 LL/VL/SC Bounded Tags 4 Typical Tool Performance Very efficient on small testcases (< 100 memory accesses) Example (nonblocking queue): T0 = i (e | d) T1 = i (e | e | d | d ) - find counterexamples within a few seconds - verify within a few minutes - enough to cover all 9 fences in nonblocking queue Slows down with increasing number of memory accesses in test Example (snark deque): Dq = pop_l | pop_l | pop_r | pop_r | push_l | push_l | push_r | push_r - has 134 memory accesses (77 loads, 57 stores) - Dq finds second snark bug within ~1 hour Does not scale past ~300 memory accesses Conclusions Software model checking of low-level concurrent software requires encoding of memory models Challenge for model checking due to high level of concurrency and axiomatic specifications Opportunity to find bugs in library code that’s hard to design and verify CheckFence project at Penn SAT-based bounded model checking for concurrent data types Real bugs in real code with fences Research Challenges What’s the best way to verify C code (on relaxed memory models)? SAT-based encoding seems suitable to capture specifications of memory models, but many opportunities for improvement Can one develop explicit-state methods for these applications? Language-level memory models: Can we verify Java concurrency libraries using the new new Java memory model? Hardware support for transactional memory Current interest in industry and architecture research Can formal verification influence designs/standards?