Systematic Stress Testing of Concurrent Programs

Concurrency Testing: Challenges, Algorithms, and Tools
Madan Musuvathi
Microsoft Research

Concurrency is HARD
• A concurrent program should
  • Function correctly
  • Maximize throughput
    • Finish as many tasks as possible
  • Minimize latency
    • Respond to requests as soon as possible
  • While handling nondeterminism in the environment

Concurrency is Pervasive
• Concurrency is an age-old problem of computer science
• Most programs are concurrent
  • At least the ones that you expect to get paid for, anyway

Solving the Concurrency Problem
• We need
  • Better programming abstractions
  • Better analysis and verification techniques
  • Better testing methodologies (the weakest link)

Testing is more important than you think
• My first-ever computer program:
  • Wrote it in Basic
    • Not the world’s best programming language
  • With no idea about program correctness
    • I didn’t know first-order logic, loop-invariants, …
    • I hadn’t heard about Hoare, Dijkstra, …
  • But still managed to write correct programs, using the write, test, [debug, write, test]+ cycle

How many of you have …
• written a program > 10,000 lines?
• written a program, compiled it, called it done without testing the program on a single input?
• written a program, compiled it, called it done without testing the program on a few interesting inputs?

Imagine a world where you can’t pick the inputs during testing …
• You write the program
• Check its correctness by staring at it

    int factorial(int x) {
        int ret = 1;
        while (x > 1) {
            ret *= x;
            x--;
        }
        return ret;
    }

• Give the program to the computer
• The computer tests on inputs of its choice
  • factorial(5) = 120
  • factorial(5) = 120 the next 100 times
  • factorial(7) = 5040
• The computer runs this program again and again on these inputs for a week
• The program didn’t crash and therefore it is correct

This is the world of concurrency testing
• You write the program

    Parent_thread() {
        if (p != null) {
            p = new P();
            Set(initEvent);
        }
    }

    Child_thread() {
        if (p != null) {
            Wait(initEvent);
        }
    }

• Check its correctness by staring at it
• Give the program to the computer
• The computer generates some interleavings
• The computer runs this program again and again on these interleavings
• The program didn’t crash and therefore it is correct

Demo
How do we test concurrent software today

CHESS Proposition
• Capture and expose nondeterminism to a scheduler
  • Threads can run at different speeds
  • Asynchronous tasks can start at an arbitrary time in the future
  • Hardware/compiler can reorder instructions
• Explore the nondeterminism using several algorithms
  • Tackle the astronomically large number of interleavings
  • Remember: Any algorithm is better than no control at all

CHESS in a nutshell
• CHESS is a user-mode scheduler
• Controls all scheduling nondeterminism
  • Replace the OS scheduler
• Guarantees:
  • Every program run takes a different thread interleaving
  • Reproduce the interleaving for every run
• Download CHESS source from http://chesstool.codeplex.com/

CHESS architecture
[Figure: architecture diagram with the components Unmanaged Program, Win32 Wrappers, Windows, CHESS Exploration Engine, CHESS Scheduler, Managed Program, .NET Wrappers, CLR]
• Every run takes a different interleaving
• Reproduce the interleaving for every run

Running Example
  Thread 1
    Lock (l);
    bal += x;
    Unlock(l);
  Thread 2
    Lock (l);
    t = bal;
    Unlock(l);
    Lock (l);
    bal = t - y;
    Unlock(l);

Introduce Schedule() points
  Thread 1
    Schedule(); Lock (l);
    bal += x;
    Schedule(); Unlock(l);
  Thread 2
    Schedule(); Lock (l);
    t = bal;
    Schedule(); Unlock(l);
    Schedule(); Lock (l);
    bal = t - y;
    Schedule(); Unlock(l);
• Instrument calls to the CHESS scheduler
• Each call is a potential preemption point

First-cut solution: Random sleeps
  Thread 1
    Sleep(rand()); Lock (l);
    bal += x;
    Sleep(rand()); Unlock(l);
  Thread 2
    Sleep(rand()); Lock (l);
    t = bal;
    Sleep(rand()); Unlock(l);
    Sleep(rand()); Lock (l);
    bal = t - y;
    Sleep(rand()); Unlock(l);
• Introduce random sleep at schedule points
• Does not introduce new behaviors
  • Sleep models a possible preemption at each location
  • Sleeping for a finite amount guarantees starvation-freedom

Improvement 1: Capture the “happens-before” graph
  Thread 1
    Schedule(); Lock (l);
    bal += x;                  Sleep(5)
    Schedule(); Unlock(l);
  Thread 2
    Schedule(); Lock (l);
    t = bal;
    Schedule(); Unlock(l);
    Schedule(); Lock (l);
    bal = t - y;               Sleep(5)
    Schedule(); Unlock(l);
• Delays that result in the same “happens-before” graph are equivalent
• Avoid exploring equivalent interleavings

Improvement 2: Understand synchronization semantics
  Thread 1
    Schedule(); Lock (l);
    bal += x;
    Schedule(); Unlock(l);
  Thread 2
    Schedule(); Lock (l);
    t = bal;
    Schedule(); Unlock(l);
    Schedule(); Lock (l);
    bal = t - y;
    Schedule(); Unlock(l);
• Avoid exploring delays that are impossible
• Identify when threads can make progress
• CHESS maintains a run queue and a wait queue
  • Mimics OS scheduler state

CHESS modes: speed vs coverage
• Fast-mode
  • Introduce schedule points before synchronizations, volatile accesses, and interlocked operations
  • Finds many bugs in practice
• Data-race mode
  • Introduce schedule points before memory accesses
  • Finds
    race-conditions due to data races
  • Captures all sequentially consistent (SC) executions

CHESS Design Choices
• Soundness
  • Any bug found by CHESS should be possible in the field
  • Should not introduce false errors (both safety and liveness)
• Completeness
  • Any bug found in the field should be found by CHESS
  • In theory, we need to capture all sources of nondeterminism
  • In practice, we need to effectively explore the astronomically large state space

Capture all sources of nondeterminism? No.
• Scheduling nondeterminism? Yes
• Timing nondeterminism? Yes
  • Controls when and in what order the timers fire
• Nondeterministic system calls? Mostly
  • CHESS uses precise abstractions for many system calls
• Input nondeterminism? No
  • Rely on users to provide inputs
    • Program inputs, return values of system calls, files read, packets received, …
  • Good tradeoff in the short term
    • But can’t find race-conditions on error handling code

Capture all sources of nondeterminism? No.
• Hardware relaxations? Yes
  • Hardware can reorder instructions
  • Non-SC executions possible in programs with data races
  • Sober [CAV ’08] can detect and explore such non-SC executions
• Compiler relaxations?
  No
  • Very few people understand what compilers can do to programs with data races
  • Far fewer than those who understand the general theory of relativity

Schedule Exploration Algorithms
Two kinds
• Reduction algorithms
  • Explore one out of a large number of equivalent interleavings
• Prioritization algorithms
  • Pick “interesting” interleavings before you run out of resources
  • Remember: anything is better than nothing

Schedule Exploration Algorithms
Reduction Algorithms

Enumerating Thread Interleavings Using Depth-First Search
  Two threads:   x = 1; y = 1;      and      x = 2; y = 2;

    Explore(State s) {
        T = set of threads in s;
        foreach t in T {
            s’ = schedule t in s;
            Explore(s’);
        }
    }
[Figure: tree of (x, y) states reached from (0,0) by every interleaving of the two threads]

Behaviorally equivalent interleavings
    x = 1;   y = 2;   if (x == 1) { y = 3; }
      equiv
    x = 1;   if (x == 1) {   y = 2;   y = 3; }
• Reach the same final state (x = 1, y = 3)

Behaviorally inequivalent interleavings
    x = 1;   y = 2;   if (x == 1) { y = 3; }
      not equiv
    x = 1;   if (x == 1) { y = 3; }   y = 2;
• Reach different final states (1, 3) vs (1, 2)

Behaviorally inequivalent interleavings
    if (x == 1) { /* not taken */ }   x = 1;   y = 2;
      not equiv
    x = 1;   if (x == 1) { y = 3; }   y = 2;
• Don’t necessarily have to reach different states (both end in (1, 2), but the test of x reads a different value)

Execution Equivalence
• Two executions are equivalent if they can be obtained by commuting independent operations
    x = 1     r1 = y    r2 = y    r3 = x
    x = 1     r2 = y    r1 = y    r3 = x
    r2 = y    x = 1     r1 = y    r3 = x
    r2 = y    x = 1     r3 = x    r1 = y

Formalism
• Execution is a sequence of transitions
• Each transition is of the form <tid, var, op>
  • tid: thread performing the transition
  • var: the memory location accessed in the transition
  • op: READ | WRITE | READWRITE
• Two steps are independent if
  • They are executed by different threads and
  • Either they access different variables or READ the same variable

Equivalence makes the schedule space a Directed Acyclic Graph
  Two threads:   x = 1; y = 1;      and      x = 2; y = 2;
[Figure: the (x, y) state tree from (0,0), with equivalent interleavings merged into a DAG]

DFS in a
DAG (CS 101)

    HashTable visited;
    Explore(Sequence s) {
        T = set of threads enabled in s;
        foreach t in T {
            s’ = s . <t,v,o>;
            s’’ = canon(s’);
            if (s’’ in visited) continue;
            visited.Add(s’’);
            Explore(s’);
        }
    }
• Sleep sets algorithm explores a DAG without maintaining the table

Sleep Set Algorithm
  Two threads:   x = 1; y = 1;      and      x = 2; y = 2;
• Identify transitions that take you to visited states
[Figure: the (x, y) state tree from (0,0), with branches that lead to already visited states pruned]

Sleep Set Algorithm

    Explore(Sequence s, sleep C) {
        T = set of transitions enabled in s;
        T’ = T - C;
        foreach t in T’ {
            C = C + t;
            s’ = s . t;
            C’ = C - {transitions dependent on t};
            Explore(s’, C’);
        }
    }

Summary
• Sleep sets ensure that a stateless execution does not explode a DAG into a tree

Persistent Set Reduction
  Two threads:   x = 1; x = 2;      and      y = 1; y = 2;
[Figures: the explored schedule space for this program without reduction, with sleep sets, and with persistent sets]
• Assumption: we are only interested in the reachability of final states (for instance, no global assertions)

Persistent Sets
• A set of transitions P is persistent in a state s, if
  • In the state space X reachable from s by only exploring transitions not in P
  • Every transition in X is independent with P
  • P “persists” in X
• It is sound to only explore P from s
[Figure: state s and the region X reachable from it; the persistent-set exploration of x = 1; x = 2; and y = 1; y = 2;]

Dynamic Partial-Order Reduction Algorithm [Flanagan & Godefroid]
• Identifies persistent sets dynamically
• After executing a transition, insert a schedule point before the most recent conflict
  Two threads:   y = 1; x = 1;      and      x = 2; z = 3;
[Figure: the executions explored over the transitions y=1, x=1, x=2, z=3]

Schedule Exploration Algorithms
Prioritization Algorithms

Schedule Prioritization
• Preemption bounding
  • Few preemptions are sufficient for finding lots of bugs
• Preemption sealing
  • Insert preemptions where you think bugs are
• Random
  • If you don’t have additional information about the state space, random is the best
• Still do partial-order reduction

Concurrency Correctness
Criterion

CHESS checks for various correctness criteria
• Assertion failures
• Deadlocks
• Livelocks
• Data races
• Atomicity violations
• (Deterministic) Linearizability violations

Concurrency Correctness Criterion
Linearizability Checking in CHESS

Motivation
• Writing good test oracles is hard
    Thread 1: Bank.Add($20)
    Thread 2: Bank.Withdraw($20);
    Assert(Bank.Balance() == ?)

Motivation
• Writing good test oracles is hard
    Thread 1: q.AddFirst(10)   q.RemoveLast()
    Thread 2: q.AddLast(20)    q.RemoveFirst()
    Assert(q.IsEmpty())
• Is this a correct assertion to check for?
• Now what if there are 5 threads each performing 5 queue operations

We want to magically
• Check if a Bank behaves like a Bank should
• Check if a queue behaves like a queue
• Answer: Check for linearizability

Linearizability
• The correctness notion closest to “thread safety”
• A concurrent component behaves as if it is protected by a single global lock
• Each operation appears to take effect instantaneously at some point between the call and return

The Problem with Linearizability Checking
• Need a sequential specification
  • Imagine writing a sequential specification for your operating system
• Instead, check if a component is linearizable with respect to some deterministic specification
  • This can be done automatically
  • Generate the sequential specification by “inserting a global lock”

Line-Up: Two-Phase method
• For a given test:
• First, generate the sequential specification
  • Enumerate serial executions of the test
  • Record all observed histories
  • Assume the generated histories are the intended behaviors of the component
• Second, check linearizability with respect to the generated specification
  • Enumerate fully concurrent executions
  • Test each history for compatibility with the serial executions

Line-Up on the Bank Example
    Thread 1: Bank.Add($20)
    Thread 2: Bank.Withdraw($20);
    Assert( Bank.Balance() == 20 || Bank.Balance() == 0 )
• Serial executions imply that the final balance can be 20 or 0
• Concurrent executions should satisfy the assertion

Line-Up guarantees
• Full Completeness: If Line-Up reports a violation, the implementation is not linearizable with respect to any deterministic specification.
• Restricted Soundness: If the implementation is not linearizable with respect to any deterministic specification, there exists a test on which Line-Up will report a violation.

Linearizability Violations
• Non-linearizable histories can reveal implementation errors (e.g. incorrect synchronization)
• The non-linearizable behavior below was caused by a bug in .NET 4.0 (accidental lock timeout).
[Figure: two-thread history containing the operations Add 200, Add 200, TryTake (returns 200), and TryTake (returns empty)]

Generalizing Linearizability
• Some operations may block.
  • e.g. semaphore.acquire()
• Blocking can be “good” (expected behavior) or “bad” (bug).
• The original definition of linearizability does not make this distinction.
  • Blocking is always o.k.
• We generalized the definition to be able to catch “bad blocking”

A buggy counter implementation

    class Counter {
        int count = 0;
        bool b = false;
        Lock lock = new Lock();

        void inc() {
            b = true;
            lock.acquire();
            count = count + 1;
            lock.release();
            b = false;
        }

        int get() {
            lock.acquire();
            int t = count;
            if (!b) lock.release();   // the bug: the lock stays held when b is true
            return t;
        }
    }

Stuck history: inc call, get call, get 1, inc ret, inc call

Results
• Each letter is a separate root cause

Questions

(A) Incorrect use of CAS causes state corruption.
(B) RemoveLast() uses an incorrect lock-free optimization.
(C) Call to SemaphoreSlim includes a timeout parameter by mistake.
(D) ToArray() can livelock when crossing segment boundaries. Note that the harness for this class performs a particular pre-test sequence (add 31 elements, remove 31 elements).
(E) Insufficient locking: thread can get preempted while trying to set an exception.
(F) Barrier is not a linearizable data type. Barriers block each thread until all threads have entered the barrier, a behavior that is not equivalent to any serial execution.
(G) Cancel is not a linearizable method: The effect of the cancellation can be delayed past the operation return, and in fact even past subsequent operations on the same thread.
(H) Count() may release a lock it does not own if interleaved with Add().
(I) Bag is nondeterministic by design to improve performance: the returned value can depend on the specific interleaving.
(J) Count may return 0 even if the collection is not empty. The specification of the Count method was weakened after Line-Up detected this behavior.
(K) TryTake may fail even if the collection is not empty. The specification of the TryTake method was weakened after Line-Up detected this behavior.
(L) SetResult() throws the wrong exception if the task is already reserved for completion by somebody else, but not completed yet.

Results: Phase 1 / Phase 2

Outline
• Preemption bounding
  • Makes CHESS effective on deep state spaces
• Fair stateless model checking
• Sober
• FeatherLite
• Concurrency Explorer

Outline
• Preemption bounding
  • Makes CHESS effective on deep state spaces
• Fair stateless model checking
  • Makes CHESS effective on cyclic state spaces
  • Enables CHESS to find liveness violations (livelocks)
• Sober
• FeatherLite
• Concurrency Explorer

Concurrent programs have cyclic state spaces
    Thread 1:  L1: while( ! done) {
               L2:   Sleep();
               }
    Thread 2:  M1: done = 1;
[Figure: state graph over the states (! done, L1), (! done, L2), (done, L1), (done, L2)]
• Spinlocks
• Non-blocking algorithms
• Implementations of synchronization primitives
• Periodic timers
• …

A demonic scheduler unrolls any cycle ad-infinitum
    Thread 1:  while( ! done) { Sleep(); }
    Thread 2:  done = 1;
[Figure: an unbounded chain of "! done" states, each with a branch to a "done" state]

Depth bounding
• Prune executions beyond a bounded number of steps
[Figure: the unrolled chain of "! done" states cut off at the depth bound]

Problem 1: Ineffective state coverage
• Bound has to be large enough to reach the deepest bug
  • Typically, greater than 100 synchronization operations
• Every unrolling of a cycle redundantly explores reachable state space
[Figure: the unrolled chain of "! done" states cut off at the depth bound]

Problem 2: Cannot find livelocks
• Livelocks: lack of progress in a program
    Thread 1:  temp = done;
               while( ! temp) { Sleep(); }
    Thread 2:  done = 1;

Key idea
    Thread 1:  while( ! done) { Sleep(); }
    Thread 2:  done = 1;
[Figure: state graph with a cycle on "! done" and a transition to "done"]
• This test terminates only when the scheduler is fair
• Fairness is assumed by programmers
All cycles in correct programs are unfair
A fair cycle is a livelock

We need a fair demonic scheduler
• Avoid unrolling unfair cycles
  • Effective state coverage
• Detect fair cycles
  • Find livelocks
[Figure: Test Harness and Concurrent Program on top of the Win32 API, with the Demonic Scheduler replaced by a Fair Demonic Scheduler]
• What notion of “fairness” do we use?

Weak fairness
• Forall t :: GF ( enabled(t) ⇒ scheduled(t) )
• A thread that remains enabled should eventually be scheduled
    Thread 1:  while( ! done) { Sleep(); }
    Thread 2:  done = 1;
• A weakly-fair scheduler will eventually schedule Thread 2
• Example: round-robin

Weak fairness does not suffice
    Thread 1:  Lock( l );
               while( ! done) {
                 Unlock( l );
                 Sleep();
                 Lock( l );
               }
               Unlock( l );
    Thread 2:  Lock( l );
               done = 1;
               Unlock( l );
[Figure: cycle of states
   en = {T1, T2}   T1: Sleep()       T2: Lock( l )
   en = {T1, T2}   T1: Lock( l )     T2: Lock( l )
   en = { T1 }     T1: Unlock( l )   T2: Lock( l )
   en = {T1, T2}   T1: Sleep()       T2: Lock( l )]

Strong Fairness
• Forall t :: GF enabled(t) ⇒ GF scheduled(t)
• A thread that is enabled infinitely often is scheduled infinitely often
    Thread 1:  Lock( l );
               while( ! done) {
                 Unlock( l );
                 Sleep();
                 Lock( l );
               }
               Unlock( l );
    Thread 2:  Lock( l );
               done = 1;
               Unlock( l );
• Thread 2 is enabled and competes for the lock infinitely often

Good Samaritan violation
• Threads should yield the processor when not making progress
• Forall threads t : GF scheduled(t) ⇒ GF yield(t)
    Thread 1:  while( ! done) { ; }
    Thread 2:  done = 1;
• Found many such violations, including one in the Singularity boot process
  • Results in “sluggish I/O” behavior during bootup

Results: Achieves more coverage faster
Work stealing queue with one stealer

                         With        Without fairness, with depth bound
                         fairness    20      30      40      50       60
  States Explored        1726        871     1505    1726    1307     683
  Percentage Coverage    100%        50%     87%     100%    76%      40%
  Time (secs)            143         97      763     2531    >5000    >5000

Finding livelocks and finding (not missing) safety violations

  Program            Lines of code   Bug counts (safety bugs, livelocks)
  Work Stealing Q    4K              4
  CDS                6K              1
  CCR                9K              1, 2
  ConcRT             16K             2, 2
  Dryad              18K             7
  APE                19K             4
  STM                20K             2
  TPL                24K             4
  PLINQ              24K             1
  Singularity        175K            5, 2
  Total                              26 safety bugs, 11 livelocks
  [Where a row lists a single count, the extracted text does not show which of the two columns it belongs to.]

Acknowledgement: testers from PCP team

Outline
• Preemption bounding
  • Makes CHESS effective on deep state spaces
• Fair stateless model checking
  • Makes CHESS effective on cyclic state spaces
  • Enables CHESS to find liveness violations (livelocks)
• Sober
  • Detect relaxed-memory model errors
  • Do not miss behaviors only possible in a relaxed memory model
• FeatherLite
• Concurrency Explorer

Single slide on Sober
• Relaxed memory verification problem
  • Is P correct on a relaxed memory model
• Sober: split the problem into two parts
  • Is P correct on a sequentially consistent (SC) machine
  • Is P sequentially consistent on a relaxed memory model
    • Check this while only exploring SC executions
• CAV ’08 solves the problem for a memory model with store buffers (TSO)
• EC2 ’08 extends this approach to a general class of memory models

Outline
• Preemption bounding
  • Makes CHESS effective on deep state spaces
• Fair stateless model checking
  • Makes CHESS effective on cyclic state spaces
  • Enables CHESS to find liveness violations (livelocks)
• Sober
  • Detect relaxed-memory model errors
  • Do not miss behaviors only possible in a relaxed memory model
• FeatherLite
  • A light-weight data-race detection engine (<20% overhead)
• Concurrency Explorer

Single slide on FeatherLite
• Current data-race detection tools
  are slow
  • Process every memory access done by the program
  • One in 5 instructions access memory
  • 1 billion accesses/sec
• Key idea: Do smart adaptive sampling of memory accesses
  • Naïve sampling does not work, need to sample both racing instructions
• Cold-path hypothesis: At least one of the racing instructions occurs in a cold path
  • Races between fast-paths are most probably benign
• FeatherLite adaptively samples cold-paths at 100% rate and hot-paths at 0.1% rate
  • Finds 70% of the data-races with <20% runtime overhead
  • Existing data-race detection tools >10X overhead

Outline
• Preemption bounding
  • Makes CHESS effective on deep state spaces
• Fair stateless model checking
  • Makes CHESS effective on cyclic state spaces
  • Enables CHESS to find liveness violations (livelocks)
• Sober
  • Detect relaxed-memory model errors
  • Do not miss behaviors only possible in a relaxed memory model
• FeatherLite
  • A light-weight data-race detection engine (<20% overhead)
• Concurrency Explorer
  • First-class concurrency debugging

Concurrency explorer
• Single-step over a thread interleaving
• Inspect program states at each step
  • Program state = Stack of all threads + globals
• Limited bi-directional debugging
• Interleaving slices for better understanding
• Working on:
  • Closer integration with the Visual Studio debugger
  • Explore neighborhood interleavings

Conclusion
• Don’t stress, use CHESS
• CHESS binary and papers available at http://research.microsoft.com/CHESS

Points to get across
• Capturing non-determinism
  • Sync-orders, data-races, hardware interleavings
  • Adding elastic delay
  • Soundness & completeness
• Scoping Preemptions

Questions
• Did you find new bugs
• How is this different from your previous papers
• How is this different from previous mc efforts
• How is this different from

Are these behaviors “expected”?
[Figure: two three-thread timelines, each containing the operations Add 10 (return), TryTake (return 10), Add 20 (return), and TryTake (return “empty”)]

Linearizability
• Component is linearizable if all operations
  • Appear to take effect at a single temporal point
  • And that point is between the call and the return
• “As if the component was protected by a single global lock”
[Figure: three-thread timeline with the operations Add 10 (return), TryTake (return 10), Add 20 (return), and TryTake (return “empty”)]

This behavior is not linearizable
• Thread 2 getting a 10 means that Thread 1’s Add got to the queue before Thread 3’s Add
• So, when Thread 3 does a TryTake, 20 should still be in the queue
[Figure: Thread 1: Add 10, return. Thread 2: TryTake, return 10. Thread 3: Add 20, return, then TryTake, return “empty”]

Linearizable?
[Figure: three-thread timeline with the operations Add 10 (return), TryTake (return 20), Add 20 (return), and TryTake (return “empty”)]

How is Linearizability different from Serializability?
• Serializability
  • All operations happen atomically in some serial order
• Linearizability
  • All operations happen at a single instant
  • That instant is between the call and return

Serializable behavior that is not Linearizable
[Figure: Thread 1: Add 10, return. Thread 2 (afterwards): TryTake, return “empty”]
• Linearizability assumes that there is a global observer that can observe that Thread 1 finished before Thread 2 started
• This is what makes linearizability composable

Serializability does not compose
[Figure: two queues (blue and green); on each, one thread performs Add 10 (return) and the other performs TryTake (return “empty”)]
• The behaviors of the blue queue and the green queue are individually serializable
• But, together, the behavior is not serializable

To make this all the more confusing
• Database concurrency
  control ensures that transactions are linearizable
  • Even though the literature only talks about serializability
• Quote from Jim Gray:
  • “When a transaction finishes, the state of the database immediately reflects the updates of the transaction”
• The commit point of a transaction is guaranteed to lie between the transaction begin and end
  • When using a two-phase locking protocol, for instance

“Standard” definition of Linearizability
• Is a little more general than my interpretation
  • “as if protected by a single global lock”
• Sometimes, a concurrent implementation can have more behaviors than a sequential implementation
• Example: a set implemented as a queue
  • A sequential version will be FIFO even though order does not matter for a set
  • For performance, a concurrent version can break the FIFO ordering but still maintain the set abstraction
• Define a “sequential specification”

A Sequential Specification
• (A fancy word for something you already know but don’t usually think about)
• Each object has a state
  • The sequence of elements in the queue
• Each operation has a precondition and a postcondition
  • Precondition: the queue is not empty
  • Postcondition: Remove will remove the first element in the queue
• Another example
  • Precondition: True
  • Postcondition: TryTake will
    • Return false if the queue is empty and leave the state unchanged
    • Otherwise, return true and remove the first element in the queue
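
The queue specification above, together with the histories from the linearizability slides, can be sketched as a brute-force checker. This is only an illustration under stated assumptions, not how Line-Up or CHESS is implemented: the `Op` record, the function names, and every call/return timestamp are made up for the example, per-thread program order is assumed to be part of serializability, and `TryTake` returns the element or "empty" as in the history diagrams rather than true/false.

```python
from itertools import permutations
from typing import NamedTuple, Optional

EMPTY = "empty"

class Op(NamedTuple):
    thread: int
    name: str             # "Add" or "TryTake"
    arg: Optional[int]    # value passed to Add, None for TryTake
    result: object        # observed return value (None for Add)
    call: int             # hypothetical timestamp of the call
    ret: int              # hypothetical timestamp of the return

def step(queue, op):
    """Sequential specification: returns (next state, specified result)."""
    if op.name == "Add":              # precondition: True
        return queue + (op.arg,), None
    if not queue:                     # TryTake on an empty queue
        return queue, EMPTY           # state is left unchanged
    return queue[1:], queue[0]        # remove and return the first element

def matches_spec(order):
    """Replay a serial order against the specification."""
    queue = ()
    for op in order:
        queue, expected = step(queue, op)
        if expected != op.result:
            return False
    return True

def keeps_order(order, same_thread_only):
    """False if an operation is placed after one that was only called after it returned."""
    for i, later in enumerate(order):
        for earlier in order[:i]:
            if same_thread_only and earlier.thread != later.thread:
                continue
            if later.ret < earlier.call:
                return False
    return True

def is_linearizable(history):
    # The serial order must respect real time across all threads.
    return any(keeps_order(p, False) and matches_spec(p) for p in permutations(history))

def is_serializable(history):
    # Assumption: only per-thread program order has to be respected.
    return any(keeps_order(p, True) and matches_spec(p) for p in permutations(history))

# Thread 1: Add 10. Thread 2: TryTake -> 10. Thread 3: Add 20, then TryTake -> "empty".
not_linearizable = [
    Op(1, "Add", 10, None, 0, 3),
    Op(2, "TryTake", None, 10, 1, 5),
    Op(3, "Add", 20, None, 2, 4),
    Op(3, "TryTake", None, EMPTY, 6, 7),
]
# Same history, but Thread 3's TryTake finds the 20 that is still in the queue.
fixed = not_linearizable[:3] + [Op(3, "TryTake", None, 20, 6, 7)]
# Thread 1's Add returns before Thread 2's TryTake is called.
serial_only = [Op(1, "Add", 10, None, 0, 1), Op(2, "TryTake", None, EMPTY, 2, 3)]

print(is_linearizable(not_linearizable))                           # False
print(is_linearizable(fixed))                                      # True
print(is_linearizable(serial_only), is_serializable(serial_only))  # False True
```

Trying every permutation is exponential in the number of operations, so this only works for tiny histories; it is meant to make the two definitions concrete, not to be an efficient check.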

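
Returning to the running bank-balance example and the depth-first Explore procedure from the reduction slides, the enumeration of interleavings can be sketched as follows. This is a simplified model rather than the CHESS scheduler: each lock-protected critical section is treated as one atomic step, and the values of `X`, `Y`, and `INITIAL_BAL` as well as all function names are hypothetical.

```python
from collections import Counter

X, Y, INITIAL_BAL = 20, 5, 100   # hypothetical values for x, y, and the starting balance

def add_x(s):        # Thread 1: Lock(l); bal += x; Unlock(l);
    s["bal"] += X

def read_bal(s):     # Thread 2: Lock(l); t = bal; Unlock(l);
    s["t"] = s["bal"]

def write_bal(s):    # Thread 2: Lock(l); bal = t - y; Unlock(l);
    s["bal"] = s["t"] - Y

THREADS = [
    [add_x],                 # Thread 1
    [read_bal, write_bal],   # Thread 2
]

def interleavings(threads):
    """Depth-first enumeration: at each point, try every thread that still has a step."""
    def explore(pcs, schedule):
        enabled = [t for t, pc in enumerate(pcs) if pc < len(threads[t])]
        if not enabled:
            yield tuple(schedule)
            return
        for t in enabled:
            pcs[t] += 1
            schedule.append(t)
            yield from explore(pcs, schedule)
            schedule.pop()
            pcs[t] -= 1
    yield from explore([0] * len(threads), [])

def run(threads, schedule):
    """Replay one schedule from the initial state and return the final balance."""
    state = {"bal": INITIAL_BAL, "t": 0}
    pcs = [0] * len(threads)
    for t in schedule:
        threads[t][pcs[t]](state)
        pcs[t] += 1
    return state["bal"]

schedules = list(interleavings(THREADS))
outcomes = Counter(run(THREADS, s) for s in schedules)
print(schedules)   # [(0, 1, 1), (1, 0, 1), (1, 1, 0)]
print(outcomes)    # Counter({115: 2, 95: 1})
```

The schedule (1, 0, 1) is the interesting one: Thread 1's deposit lands between Thread 2's read and write, so the update is lost even though every access is protected by the lock.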