Concurrency Testing: Challenges, Algorithms, and Tools. Madan Musuvathi, Microsoft Research.

Concurrency is HARD. A concurrent program should function correctly, maximize throughput (finish as many tasks as possible), and minimize latency (respond to requests as soon as possible), all while handling nondeterminism in the environment.

Concurrency is Pervasive. Concurrency is an age-old problem of computer science. Most programs are concurrent: at least the ones that you expect to get paid for, anyway.

Solving the Concurrency Problem. We need better programming abstractions, better analysis and verification techniques, and better testing methodologies.

Weakest Link. Testing is more important than you think. My first-ever computer program: wrote it in Basic (not the world's best programming language), with no idea about program correctness; I didn't know first-order logic, loop invariants, ...; I hadn't heard about Hoare, Dijkstra, ... But I still managed to write correct programs, using the write, test, [debug, write, test]+ cycle.

How many of you have ... written a program > 10,000 lines? Written a program, compiled it, called it done without testing the program on a single input? Written a program, compiled it, called it done without testing the program on a few interesting inputs?
Imagine a world where you can't pick the inputs during testing ... You write the program. You check its correctness by staring at it:

  int factorial (int x) {
    int ret = 1;
    while (x > 1) {
      ret *= x;
      x--;
    }
    return ret;
  }

You give the program to the computer, and the computer tests it on inputs of its choice: factorial(5) = 120; factorial(5) = 120 the next 100 times; factorial(7) = 5040. The computer runs this program again and again on these inputs for a week. The program didn't crash, and therefore it is correct.

This is the world of concurrency testing. You write the program:

  Parent_thread() {
    p = new P();
    Set (initEvent);
  }

  Child_thread() {
    if (p != null) {
      Wait (initEvent);
    }
  }

You check its correctness by staring at it. You give the program to the computer. The computer generates some interleavings and runs this program again and again on these interleavings. The program didn't crash, and therefore it is correct.

Demo: how do we test concurrent software today?

CHESS Proposition. Capture and expose nondeterminism to a scheduler: threads can run at different speeds; asynchronous tasks can start at an arbitrary time in the future; hardware and compilers can reorder instructions. Explore the nondeterminism using several algorithms, and tackle the astronomically large number of interleavings. Remember: any algorithm is better than no control at all.

CHESS in a nutshell. CHESS is a user-mode scheduler. It controls all scheduling nondeterminism and replaces the OS scheduler. Guarantees: every program run takes a different thread interleaving, and the interleaving can be reproduced for every run. Download the CHESS source from http://chesstool.codeplex.com/

CHESS architecture. An unmanaged program runs against Win32 wrappers over Windows; a managed program runs against .NET wrappers over the CLR. The CHESS exploration engine drives the CHESS scheduler. Every run takes a different interleaving; the interleaving is reproduced for every run.

Running Example. Thread 1: Lock (l); bal += x; Unlock(l); Thread 2: Lock (l); t = bal; Unlock(l); Lock (l); bal = t - y; Unlock(l);

Introduce Schedule() points
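The effect of these Schedule() points can be sketched in miniature: model each thread as a Python generator that pauses at every Schedule() call, and let a driver replay a chosen sequence of thread ids. This is only an illustration of the idea (the thread bodies and the run() driver are invented here, not CHESS code); it uses the running bank example with bal = 100, x = 10, y = 20. Running Thread 1 to completion first gives the expected balance of 90, while preempting Thread 2 between its two lock regions loses the deposit and gives 80.

```python
def thread1(state, x):
    # Thread 1: Lock(l); bal += x; Unlock(l);
    yield              # Schedule() before Lock(l)
    state["bal"] += x  # bal += x (inside the lock)
    yield              # Schedule() before Unlock(l)

def thread2(state, y):
    # Thread 2: Lock(l); t = bal; Unlock(l); Lock(l); bal = t - y; Unlock(l);
    yield                 # Schedule() before the first Lock(l)
    t = state["bal"]      # t = bal
    yield                 # Schedule() before Unlock(l)
    yield                 # Schedule() before the second Lock(l);
                          # Thread 1 may be scheduled here!
    state["bal"] = t - y  # bal = t - y
    yield                 # Schedule() before the final Unlock(l)

def run(schedule, bal=100, x=10, y=20):
    """Replay one interleaving: each entry of `schedule` names the
    thread that runs until its next Schedule() point."""
    state = {"bal": bal}
    threads = {1: thread1(state, x), 2: thread2(state, y)}
    for tid in schedule:
        try:
            next(threads[tid])
        except StopIteration:
            pass
    for g in threads.values():  # run any unfinished thread to completion
        for _ in g:
            pass
    return state["bal"]

print(run([1, 1, 1, 2, 2, 2, 2, 2]))  # Thread 1 runs first: 90
print(run([2, 2, 1, 1, 1, 2, 2, 2]))  # deposit between t = bal and bal = t - y: 80
```

Replaying the same schedule list always produces the same outcome; the interleaving, not the clock, determines the result, which is what gives CHESS-style reproducibility.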
Thread 1: Schedule(); Lock (l); bal += x; Schedule(); Unlock(l); Thread 2: Schedule(); Lock (l); t = bal; Schedule(); Unlock(l); Schedule(); Lock (l); bal = t - y; Schedule(); Unlock(l); Instrument calls to the CHESS scheduler: each call is a potential preemption point.

First-cut solution: random sleeps. Thread 1: Sleep(rand()); Lock (l); bal += x; Sleep(rand()); Unlock(l); Thread 2: Sleep(rand()); Lock (l); t = bal; Sleep(rand()); Unlock(l); Sleep(rand()); Lock (l); bal = t - y; Sleep(rand()); Unlock(l); Introduce a random sleep at each schedule point. This does not introduce new behaviors: a sleep models a possible preemption at each location, and sleeping for a finite amount guarantees starvation-freedom.

Improvement 1: capture the "happens-before" graph. (Example: a Sleep(5) after Thread 1's bal += x and a Sleep(5) after Thread 2's bal = t - y.) Delays that result in the same "happens-before" graph are equivalent; avoid exploring equivalent interleavings.

Improvement 2: understand synchronization semantics. Avoid exploring delays that are impossible, and identify when threads can make progress. CHESS maintains a run queue and a wait queue, mimicking the OS scheduler state.

CHESS modes: speed vs coverage. Fast-mode introduces schedule points before synchronizations, volatile accesses, and interlocked operations; it finds many bugs in practice. Data-race mode introduces schedule points before memory accesses; it finds race conditions due to data races and captures all sequentially consistent (SC) executions.

CHESS Design
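Improvement 1 rests on comparing executions by their happens-before graphs. A small sketch of that comparison (the encoding of operations as (thread, variable, R/W) triples is invented for illustration, not CHESS's internal representation): two executions of the same operations are equivalent exactly when their graphs of edges between dependent operations coincide, so only one of them needs to be explored.

```python
def hb_graph(execution):
    """Happens-before edges of an execution.

    `execution` is a list of (tid, var, kind) triples in schedule order,
    kind being "R" or "W". Each operation is identified by (tid, its
    index within that thread) so that reordered runs of the same
    operations can be compared. An edge is kept only between dependent
    operations: same thread, or same variable with at least one write.
    """
    ids, per_thread = [], {}
    for tid, _, _ in execution:
        per_thread[tid] = per_thread.get(tid, 0) + 1
        ids.append((tid, per_thread[tid]))
    edges = set()
    for i in range(len(execution)):
        for j in range(i + 1, len(execution)):
            ti, vi, ki = execution[i]
            tj, vj, kj = execution[j]
            if ti == tj or (vi == vj and not (ki == "R" and kj == "R")):
                edges.add((ids[i], ids[j]))
    return frozenset(edges)

# Thread 1: W x; R y    Thread 2: R y; R x
e1 = [(1, "x", "W"), (1, "y", "R"), (2, "y", "R"), (2, "x", "R")]
e2 = [(1, "x", "W"), (2, "y", "R"), (1, "y", "R"), (2, "x", "R")]  # reads of y swapped
e3 = [(2, "y", "R"), (2, "x", "R"), (1, "x", "W"), (1, "y", "R")]  # R x before W x

print(hb_graph(e1) == hb_graph(e2))  # True: equivalent, explore only one
print(hb_graph(e1) == hb_graph(e3))  # False: the accesses to x are ordered differently
```

Swapping the two independent reads of y leaves the graph unchanged, while moving the read of x before the write changes a dependence edge, so the two delays in e1 and e2 need not both be explored.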
Choices. Soundness: any bug found by CHESS should be possible in the field; CHESS should not introduce false errors (both safety and liveness). Completeness: any bug found in the field should be found by CHESS. In theory, we need to capture all sources of nondeterminism. In practice, we need to effectively explore the astronomically large state space.

Capture all sources of nondeterminism? No. Scheduling nondeterminism? Yes. Timing nondeterminism? Yes: CHESS controls when and in what order the timers fire. Nondeterministic system calls? Mostly: CHESS uses precise abstractions for many system calls. Input nondeterminism? No: rely on users to provide inputs (program inputs, return values of system calls, files read, packets received, ...). A good tradeoff in the short term, but it can't find race conditions in error-handling code.

Capture all sources of nondeterminism? No. Hardware relaxations? Yes: hardware can reorder instructions, and non-SC executions are possible in programs with data races; Sober [CAV '08] can detect and explore such non-SC executions. Compiler relaxations?
No: very few people understand what compilers can do to programs with data races (far fewer than those who understand the general theory of relativity).

Schedule Exploration Algorithms. Two kinds. Reduction algorithms: explore one out of a large number of equivalent interleavings. Prioritization algorithms: pick "interesting" interleavings before you run out of resources. Remember: anything is better than nothing.

Schedule Exploration Algorithms: Reduction Algorithms.

Enumerating Thread Interleavings Using Depth-First Search. Thread 1: x = 1; y = 1; Thread 2: x = 2; y = 2;

  Explore (State s) {
    T = set of threads enabled in s;
    foreach t in T {
      s' = schedule t in s;
      Explore(s');
    }
  }

[State-space diagram: the DFS tree of interleavings over states (x, y), from (0,0) down to the final states (1,1), (1,2), (2,1), (2,2).]

Behaviorally equivalent interleavings. Thread 1: x = 1; y = 2; Thread 2: if (x == 1) { y = 3; }. Running Thread 1 completely first is equivalent to interleaving y = 2 between Thread 2's test and its y = 3: both reach the same final state (x = 1, y = 3).

Behaviorally inequivalent interleavings. Same threads: if y = 2 instead runs after Thread 2's y = 3, the executions reach different final states, (1, 3) vs (1, 2).

Behaviorally inequivalent interleavings don't necessarily have to reach different states. If Thread 2's test runs before x = 1, the branch is not taken and the final state is (1, 2), the same as when the branch is taken and y = 3 is later overwritten by y = 2; yet the two interleavings behave differently.

Execution Equivalence. Two executions are equivalent if they can be obtained by commuting independent operations:

  x = 1; r1 = y; r2 = y; r3 = x
  x = 1; r2 = y; r1 = y; r3 = x
  r2 = y; x = 1; r1 = y; r3 = x
  r2 = y; x = 1; r3 = x; r1 = y

Formalism. An execution is a sequence of transitions. Each transition is of the form <tid, var, op>: tid is the thread performing the transition; var is the memory location accessed in the transition; op is READ | WRITE | READWRITE. Two steps are independent if they are executed by different threads, and either they access different variables or both READ the same variable.

Equivalence makes the schedule space a directed acyclic graph. Thread 1: x = 1; y = 1; Thread 2: x = 2; y = 2; [State-space diagram: the interleavings of the example above, with equivalent paths converging into a DAG over states (x, y).]

DFS in a DAG (CS 101). HashTable visited;
  Explore (Sequence s) {
    T = set of threads enabled in s;
    foreach t in T {
      s' = s . <t,v,o>;
      s'' = canon(s');
      if (s'' in visited) continue;
      visited.Add(s'');
      Explore(s');
    }
  }

The sleep sets algorithm explores a DAG without maintaining the visited table.

Sleep Set Algorithm. Thread 1: x = 1; y = 1; Thread 2: x = 2; y = 2; Identify transitions that take you to visited states. [State-space diagram: the DAG from above, with branches leading to already-visited states pruned.]

Sleep Set Algorithm:

  Explore (Sequence s, sleep C) {
    T = set of transitions enabled in s;
    T' = T - C;
    foreach t in T' {
      C = C + t;
      s' = s . t;
      C' = C - {transitions dependent on t};
      Explore(s', C');
    }
  }

Summary: sleep sets ensure that a stateless execution does not explode a DAG into a tree.

Persistent Set Reduction. Thread 1: x = 1; y = 1; Thread 2: x = 2; y = 2; [Diagrams: the full exploration, the exploration with sleep sets, and the exploration with persistent sets, which explores only one interleaving.] Assumption: we are only interested in the reachability of final states (for instance, no global assertions).

Persistent Sets. A set of transitions P is persistent in a state s if, in the state space X reachable from s by only exploring transitions not in P, every transition in X is independent with P: P "persists" in X. It is sound to explore only P from s.

Dynamic Partial-Order Reduction [Flanagan & Godefroid] identifies persistent sets dynamically: after executing a transition, insert a schedule point before the most recent conflict. (Example: Thread 1: y = 1; x = 1; Thread 2: x = 2; z = 3; after executing x = 2, a schedule point is inserted before the conflicting x = 1.)

Schedule Exploration Algorithms: Prioritization Algorithms.

Schedule Prioritization. Preemption bounding: few preemptions are sufficient for finding lots of bugs. Preemption sealing: insert preemptions where you think bugs are. Random: if you don't have additional information about the state space, random is the best. Still do partial-order reduction.

Concurrency Correctness Criterion. CHESS checks for various correctness criteria: Assertion
failures, deadlocks, livelocks, data races, atomicity violations, and (deterministic) linearizability violations.

Concurrency Correctness Criterion: Linearizability Checking in CHESS.

Motivation: writing good test oracles is hard. Thread 1: Bank.Add($20); Thread 2: Bank.Withdraw($20); Assert(Bank.Balance() == ?)

Motivation: writing good test oracles is hard. Thread 1: q.AddFirst(10); q.RemoveLast(); Thread 2: q.AddLast(20); q.RemoveFirst(); Assert(q.IsEmpty()). Is this a correct assertion to check for? Now what if there are 5 threads, each performing 5 queue operations?

We want to magically check if a Bank behaves like a Bank should, and check if a queue behaves like a queue. Answer: check for linearizability.

Linearizability: the correctness notion closest to "thread safety". A concurrent component behaves as if it is protected by a single global lock: each operation appears to take effect instantaneously at some point between the call and the return.

The Problem with Linearizability Checking: you need a sequential specification. Imagine writing a sequential specification for your operating system. Instead, check if a component is linearizable with respect to some deterministic specification. This can be done automatically: generate the sequential specification by "inserting a global lock".

Line-Up: a two-phase method. For a given test: first, generate the sequential specification (enumerate serial executions of the test, record all observed histories, and assume the generated histories are the intended behaviors of the component). Second, check linearizability with respect to the generated specification (enumerate fully concurrent executions and test each history for compatibility with the serial executions).

Line-Up on the Bank Example. Thread 1: Bank.Add($20); Thread 2: Bank.Withdraw($20); Assert(Bank.Balance() == 20 || Bank.Balance() == 0). Serial executions imply that the final balance can be 20 or 0. Concurrent executions should satisfy the assertion.

Line-Up guarantees. Full Completeness: if Line-Up reports a violation, the
implementation is not linearizable with respect to any deterministic specification. Restricted Soundness: if the implementation is not linearizable with respect to any deterministic specification, there exists a test on which Line-Up will report a violation.

Linearizability Violations. Non-linearizable histories can reveal implementation errors (e.g., incorrect synchronization). The non-linearizable behavior below was caused by a bug in .NET 4.0 (an accidental lock timeout): Thread 1: Add 200 ... return; TryTake ... returns 200. Thread 2: Add 200 ... return; TryTake ... returns empty.

Generalizing Linearizability. Some operations may block, e.g., semaphore.acquire(). Blocking can be "good" (expected behavior) or "bad" (a bug). The original definition of linearizability does not make this distinction: blocking is always o.k. We generalized the definition to be able to catch "bad blocking".

A buggy counter implementation:

  class Counter {
    int count = 0;
    bool b = false;
    Lock lock = new Lock();

    void inc() {
      b = true;
      lock.acquire();
      count = count + 1;
      lock.release();
      b = false;
    }

    int get() {
      lock.acquire();
      t = count;
      if (!b) lock.release();   // bug: the lock is never released when b is true
      return t;
    }
  }

Stuck history: inc is called; get is called, observes b == true, and returns 1 without releasing the lock; inc returns; a second call to inc then blocks forever on lock.acquire().

Results. Each letter is a separate root cause. (A) Incorrect use of CAS causes state corruption. (B) RemoveLast() uses an incorrect lock-free optimization. (C) A call to SemaphoreSlim includes a timeout parameter by mistake. (D) ToArray() can livelock when crossing segment boundaries; note that the harness for this class performs a particular pre-test sequence (add 31 elements, remove 31 elements). (E) Insufficient locking: a thread can get preempted while trying to set an exception. (F) Barrier is not a linearizable data type: barriers block each thread until all threads have entered the barrier, a behavior that is not equivalent to any serial execution.
(G) Cancel is not a linearizable method: the effect of the cancellation can be delayed past the operation's return, and in fact even past subsequent operations on the same thread. (H) Count() may release a lock it does not own if interleaved with Add(). (I) Bag is nondeterministic by design to improve performance: the returned value can depend on the specific interleaving. (J) Count may return 0 even if the collection is not empty; the specification of the Count method was weakened after Line-Up detected this behavior. (K) TryTake may fail even if the collection is not empty; the specification of the TryTake method was weakened after Line-Up detected this behavior. (L) SetResult() throws the wrong exception if the task is already reserved for completion by somebody else, but not completed yet.

Results: Phase 1 / Phase 2.

Outline. Preemption bounding: makes CHESS effective on deep state spaces. Fair stateless model checking. Sober. FeatherLite. Concurrency Explorer.

Outline. Preemption bounding: makes CHESS effective on deep state spaces. Fair stateless model checking: makes CHESS effective on cyclic state spaces; enables CHESS to find liveness violations (livelocks). Sober. FeatherLite. Concurrency Explorer.

Concurrent programs have cyclic state spaces. Thread 1: L1: while (!done) { L2: Sleep(); } Thread 2: M1: done = 1; [State-space diagram over the four states (!done, L1), (!done, L2), (done, L1), (done, L2).] Examples: spinlocks, non-blocking algorithms, implementations of synchronization primitives, periodic timers, ...

A demonic scheduler unrolls any cycle ad infinitum. while (!done) { Sleep(); } versus done = 1; [Diagram: the !done cycle unrolled forever, with the done states never reached.]

Depth bounding: prune executions beyond a bounded number of steps. [Diagram: the unrolled cycle cut off at the depth bound.]

Problem 1: ineffective state coverage. The bound has to be large enough to reach the deepest bug: typically, greater than 100 synchronization operations. Every unrolling of a cycle redundantly explores reachable state space.

Problem 2: cannot find livelocks. Livelocks: lack of progress in a program. temp = done; while (!temp) { Sleep(); } versus done = 1;

Key idea. while (!done) { Sleep(); } versus done = 1; This test terminates only when the scheduler is fair. Fairness is assumed by programmers: all cycles in correct programs are unfair, and a fair cycle is a livelock. We need a fair demonic scheduler.

[Architecture diagram: a test harness drives the concurrent program over the Win32 API against a fair demonic scheduler.] Avoid unrolling unfair cycles (effective state coverage); detect fair cycles (find livelocks).

What notion of "fairness" do we use? Weak fairness: forall t :: GF (enabled(t) => scheduled(t)). A thread that remains enabled should eventually be scheduled. A weakly-fair scheduler will eventually schedule Thread 2 in: while (!done) { Sleep(); } versus done = 1; Example: round-robin.

Weak fairness does not suffice. Thread 1: Lock(l); while (!done) { Unlock(l); Sleep(); Lock(l); } Unlock(l); Thread 2: Lock(l); done = 1; Unlock(l); [Round-robin trace: en = {T1, T2}: T1: Sleep(), T2: Lock(l); en = {T1, T2}: T1: Lock(l), T2: Lock(l); en = {T1}: T1: Unlock(l), T2: Lock(l); en = {T1, T2}: T1: Sleep(), T2: Lock(l); ...] Thread 2 always attempts Lock(l) while Thread 1 happens to hold the lock, so it never makes progress.

Strong Fairness: forall t :: GF enabled(t) => GF scheduled(t). A thread that is enabled infinitely often is scheduled infinitely often. In the example above, Thread 2 is enabled, and competes for the lock, infinitely often.

Good Samaritan violation: threads should yield the processor when not making progress. Forall threads t : GF scheduled(t) => GF yield(t). Example: while (!done) { ; } spins without yielding, against done = 1; Found many such violations, including one in the Singularity boot process, which results in "sluggish I/O" behavior during bootup.

Results: achieves more coverage faster. Work stealing queue with one stealer:

                    With fairness   Without fairness, with depth bound
                                    20     30     40     50      60
  States explored   1726            871    1505   1726   1307    683
  Coverage          100%            50%    87%    100%   76%     40%
  Time (secs)       143             97     763    2531   >5000   >5000

Finding livelocks and finding (not missing) safety violations:

  Program           Lines of code   Safety bugs / Livelocks
  Work Stealing Q   4K              4
  CDS               6K              1
  CCR               9K              1 / 2
  ConcRT            16K             2 / 2
  Dryad             18K             7
  APE               19K             4
  STM               20K             2
  TPL               24K             4
  PLINQ             24K             1
  Singularity       175K            5 / 2
  Total                             26 safety / 11 livelocks

Acknowledgement: testers from the PCP team.

Outline. Preemption bounding: makes CHESS effective on deep state spaces. Fair stateless model checking: makes CHESS effective on cyclic state spaces; enables CHESS to find liveness violations (livelocks). Sober: detect relaxed-memory-model errors; do not miss behaviors only possible in a relaxed memory model. FeatherLite. Concurrency Explorer.

Single slide on Sober. The relaxed-memory verification problem: is P correct on a relaxed memory model? Sober splits the problem into two parts: is P correct on a sequentially consistent (SC) machine, and is P sequentially consistent on a relaxed memory model? The latter can be checked while only exploring SC executions. CAV '08 solves the problem for a memory model with store buffers (TSO); EC2 '08 extends this approach to a general class of memory models.

Outline. Preemption bounding: makes CHESS effective on deep state spaces. Fair stateless model checking: makes CHESS effective on cyclic state spaces; enables CHESS to find liveness violations (livelocks). Sober: detect relaxed-memory-model errors; do not miss behaviors only possible in a relaxed memory model. FeatherLite: a light-weight data-race detection engine (<20% overhead). Concurrency Explorer.

Single slide on FeatherLite. Current data-race detection tools are slow: they process every memory access done by the program, one in 5 instructions accesses memory, and that is on the order of 1 billion accesses/sec. Key idea: do smart adaptive sampling of memory accesses. Naive sampling does not work; you need to sample both racing instructions. The cold-path hypothesis: at least one of the racing instructions occurs in a cold path, and races between fast paths are most probably benign. FeatherLite adaptively samples cold paths at a 100% rate and hot paths at a 0.1% rate. It finds 70% of the data races with <20% runtime overhead; existing data-race detection tools have >10X overhead.

Outline. Preemption bounding: makes CHESS effective on deep state spaces. Fair stateless model checking: makes CHESS effective on cyclic state spaces; enables CHESS to find liveness violations (livelocks). Sober: detect relaxed-memory-model errors; do not miss behaviors only possible in a relaxed memory model. FeatherLite: a light-weight data-race detection engine (<20% overhead). Concurrency Explorer: first-class concurrency debugging.

Concurrency Explorer. Single-step over a thread interleaving; inspect program states at each step (program state = stacks of all threads + globals). Limited bi-directional debugging. Interleaving slices for better understanding. Working on: closer integration with the Visual Studio debugger; exploring neighborhood interleavings.

Conclusion: don't stress, use CHESS. The CHESS binary and papers are available at http://research.microsoft.com/CHESS

Points to get across: capturing nondeterminism (sync orders, data races, hardware interleavings); adding elastic delay; soundness & completeness; scoping preemptions.

Questions. Did you find new bugs? How is this different from your previous papers? How is this different from previous mc efforts? How is this different from Are these behaviors "expected"?
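The last question, whether a set of behaviors is "expected", is exactly what a linearizability check makes precise: is there a single serial order of the operations, consistent with real time, that a FIFO queue could have produced? A brute-force checker is easy to sketch (the encoding of histories as (op, arg, result, call-time, return-time) tuples is invented here for illustration; Line-Up itself is far more efficient):

```python
from itertools import permutations
from collections import deque

def linearizable(history):
    """Brute-force linearizability check against a FIFO-queue spec.

    `history` is a list of (op, arg, result, call, ret) tuples with
    call/return timestamps. We search for a total order that (1)
    respects real time: if a returned before b was called, a precedes
    b; and (2) replays correctly on a sequential FIFO queue.
    """
    n = len(history)
    for order in permutations(range(n)):
        ok = all(not (history[b][4] < history[a][3])  # b must not have returned
                 for pos, a in enumerate(order)       # before a was even called
                 for b in order[pos + 1:])
        if not ok:
            continue
        q, good = deque(), True
        for i in order:
            op, arg, result, _, _ = history[i]
            if op == "Add":
                q.append(arg)
            else:  # TryTake: returns the head, or "empty"
                if (q.popleft() if q else "empty") != result:
                    good = False
                    break
        if good:
            return True
    return False

# Add 10 and Add 20 overlap with a TryTake that returns 10; a final
# TryTake, called after all three returned, claims "empty". Since 20
# was added before that last call, no linearization explains "empty".
bad = [("Add", 10, None, 0, 3), ("Add", 20, None, 1, 4),
       ("TryTake", None, 10, 2, 5), ("TryTake", None, "empty", 6, 7)]
good = [("Add", 10, None, 0, 1), ("TryTake", None, 10, 2, 3)]
print(linearizable(good))  # True
print(linearizable(bad))   # False
```

The factorial-time search over permutations is fine for tiny tests; the point is only that "expected" reduces to the existence of a valid linearization.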
[History diagrams: Thread 1: Add 10 ... return; Thread 2: TryTake ... returns 10; Thread 3: Add 20 ... return, then TryTake ... returns "empty".]

Linearizability. A component is linearizable if all operations appear to take effect at a single temporal point, and that point is between the call and the return: "as if the component was protected by a single global lock".

This behavior is not linearizable: Thread 2 getting a 10 means that Thread 1's Add got to the queue before Thread 3's Add. So, when Thread 3 does a TryTake, 20 should still be in the queue, yet the TryTake returns "empty". [History: Thread 1: Add 10 ... return; Thread 2: TryTake ... returns 10; Thread 3: Add 20 ... return, then TryTake ... returns "empty".]

Linearizable? [History: Thread 1: Add 10 ... return; Thread 2: TryTake ... returns 20; Thread 3: Add 20 ... return, then TryTake ... returns "empty".]

How is Linearizability different from Serializability? Serializability: all operations happen atomically in some serial order. Linearizability: all operations happen at a single instant, and that instant is between the call and return.

Serializable behavior that is not linearizable. Thread 1: Add 10 ... return; then Thread 2: TryTake ... returns "empty". Linearizability assumes that there is a global observer who can observe that Thread 1 finished before Thread 2 started. This is what makes linearizability composable.

Serializability does not compose. Two queues (blue and green), each seeing Add 10 ... return and TryTake ... returns "empty" across the two threads. The behavior of the blue queue and the green queue are individually serializable; together, the behavior is not serializable.

To make this all the more confusing: database concurrency control ensures that transactions are
linearizable, even though the literature only talks about serializability. Quote from Jim Gray: "When a transaction finishes, the state of the database immediately reflects the updates of the transaction." The commit point of a transaction is guaranteed to lie between the transaction's begin and end (when using a two-phase locking protocol, for instance).

The "standard" definition of linearizability is a little more general than my interpretation, "as if protected by a single global lock". Sometimes a concurrent implementation can have more behaviors than a sequential implementation. Example: a set implemented as a queue. A sequential version will be FIFO, even though order does not matter for a set; for performance, a concurrent version can break the FIFO ordering but still maintain the set abstraction. Hence: define a "sequential specification".

A Sequential Specification (a fancy word for something you already know but don't usually think about). Each object has a state: for the queue, the sequence of elements in it. Each operation has a precondition and a postcondition. Example: precondition: the queue is not empty; postcondition: Remove will remove the first element in the queue. Another example: precondition: true; postcondition: TryTake will return false if the queue is empty and leave the state unchanged; otherwise, it returns true and removes the first element in the queue.
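These pre/postconditions translate directly into an executable sequential specification; here is a sketch (the class and method names are invented for illustration, not a real .NET or Line-Up API):

```python
class QueueSpec:
    """Sequential specification of a FIFO queue: explicit state plus
    pre/postconditions written as executable checks."""

    def __init__(self):
        self.items = []  # the state: the sequence of queued elements

    def add(self, x):
        # Precondition: true. Postcondition: x is appended at the tail.
        self.items.append(x)

    def remove(self):
        # Precondition: the queue is not empty.
        assert self.items, "precondition violated: Remove on an empty queue"
        # Postcondition: removes and returns the first element.
        return self.items.pop(0)

    def try_take(self):
        # Precondition: true.
        # Postcondition: returns (False, None) on an empty queue, leaving
        # the state unchanged; otherwise (True, first element), removing it.
        if not self.items:
            return (False, None)
        return (True, self.items.pop(0))

q = QueueSpec()
print(q.try_take())   # (False, None): empty, state unchanged
q.add(10)
q.add(20)
print(q.remove())     # 10: the first element
print(q.try_take())   # (True, 20)
```

In Line-Up's terms, phase 1 effectively infers such a specification from serial runs of the test instead of asking the programmer to write it down.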