# Systematic Stress Testing of Concurrent Programs

```
Concurrency Testing
Challenges, Algorithms, and Tools
Microsoft Research
Concurrency is HARD
 A concurrent program should
 Function correctly
 Maximize throughput
 Finish as many tasks as possible
 Minimize latency
 Respond to requests as soon as possible
 While handling nondeterminism in the environment
Concurrency is Pervasive
 Concurrency is an age-old problem of computer
science
 Most programs are concurrent
 At least the ones that you expect to get paid for, anyway
Solving the Concurrency Problem
 We need
 Better programming abstractions
 Better analysis and verification techniques
 Better testing methodologies
Testing is more important than you think
 My first-ever computer program:
 Wrote it in Basic
 Not the world’s best programming language
 With no idea about program correctness
 I didn’t know first-order logic, loop-invariants, …
 I hadn’t heard about Hoare, Dijkstra, …
 But still managed to write correct programs, using the
write, test, [debug, write, test]+ cycle
How many of you have …
 written a program > 10,000 lines?
 written a program, compiled it, called it done without
testing the program on a single input?
 written a program, compiled it, called it done without
testing the program on a few interesting inputs?
Imagine a world where you can’t pick the inputs
during testing …
 You write the program
 Check its correctness by staring at it
int factorial ( int x ) {
  int ret = 1;
  while (x > 1) {
    ret *= x;
    x--;
  }
  return ret;
}
 Give the program to the computer
 The computer tests on inputs of its
choice
 factorial(5) = 120
 factorial(5) = 120 the next 100 times
 factorial(7) = 5040
 The computer runs this program again
and again on these inputs for a week
 The program didn’t crash and therefore it
is correct
This is the world of concurrency testing
 You write the program
if (p != null) {
p = new P();
Set (initEvent);
}
}
if (p != null) {
Wait (initEvent);
}
}
 Check its correctness by staring at it
 Give the program to the computer
 The computer generates some
interleavings
 The computer runs this program again
and again on these interleavings
 The program didn’t crash and
therefore it is correct
Demo
How do we test concurrent software today
CHESS Proposition
 Capture and expose nondeterminism to a scheduler
 Threads can run at different speeds
 Asynchronous tasks can start at arbitrary time in the
future
 Hardware/compiler can reorder instructions
 Explore the nondeterminism using several algorithms
 Tackle the astronomically large number of interleavings
 Remember: Any algorithm is better than no control at all
CHESS in a nutshell
 CHESS is a user-mode scheduler
 Controls all scheduling nondeterminism
 Replaces the OS scheduler
 Guarantees:
 Every program run takes a different thread interleaving
 Reproduce the interleaving for every run
http://chesstool.codeplex.com/
CHESS architecture
[Architecture diagram with these components: Unmanaged Program, Win32 Wrappers, Windows; Managed Program, .NET Wrappers, CLR; CHESS Exploration Engine, CHESS Scheduler]
• Every run takes a different interleaving
• Reproduce the interleaving for every run
Running Example
Lock (l);
bal += x;
Unlock(l);
Lock (l);
t = bal;
Unlock(l);
Lock (l);
bal = t - y;
Unlock(l);
Introduce Schedule() points
Schedule();
Lock (l);
bal += x;
Schedule();
Unlock(l);
Schedule();
Lock (l);
t = bal;
Schedule();
Unlock(l);
Schedule();
Lock (l);
bal = t - y;
Schedule();
Unlock(l);
 Instrument calls to the
CHESS scheduler
 Each call is a potential
preemption point
First-cut solution: Random sleeps
Sleep(rand());
Lock (l);
bal += x;
Sleep(rand());
Unlock(l);
Sleep(rand());
Lock (l);
t = bal;
Sleep(rand());
Unlock(l);
Sleep(rand());
Lock (l);
bal = t - y;
Sleep(rand());
Unlock(l);
 Introduce random sleep at
schedule points
 Does not introduce new
behaviors
 Sleep models a possible
preemption at each location
 Sleeping for a finite amount
guarantees starvation-freedom
Improvement 1:
Capture the “happens-before” graph
Schedule();
Lock (l);
bal += x;        Sleep(5)
Schedule();
Unlock(l);
Schedule();
Lock (l);
t = bal;
Schedule();
Unlock(l);
Schedule();
Lock (l);
bal = t - y;     Sleep(5)
Schedule();
Unlock(l);
 Delays that result in the
same “happens-before”
graph are equivalent
 Avoid exploring equivalent
interleavings
Improvement 2:
Understand synchronization semantics
Schedule();
Lock (l);
bal += x;
Schedule();
Unlock(l);
Schedule();
Lock (l);
t = bal;
Schedule();
Unlock(l);
Schedule();
Lock (l);
bal = t - y;
Schedule();
Unlock(l);
 Avoid exploring delays that
are impossible
 Identify when threads can
make progress
 CHESS maintains a run
queue and a wait queue
 Mimics OS scheduler state
CHESS modes: speed vs coverage
 Fast-mode
 Introduce schedule points before synchronizations,
volatile accesses, and interlocked operations
 Finds many bugs in practice
 Data-race mode
 Introduce schedule points before memory accesses
 Finds race-conditions due to data races
 Captures all sequentially consistent (SC) executions
CHESS Design Choices
 Soundness
 Any bug found by CHESS should be possible in the field
 Should not introduce false errors (both safety and
liveness)
 Completeness
 Any bug found in the field should be found by CHESS
 In theory, we need to capture all sources of
nondeterminism
 In practice, we need to effectively explore the
astronomically large state space
Capture all sources of nondeterminism?
No.
 Scheduling nondeterminism? Yes
 Timing nondeterminism? Yes
 Controls when and in what order the timers fire
 Nondeterministic system calls? Mostly
 CHESS uses precise abstractions for many system calls
 Input nondeterminism? No
 Rely on users to provide inputs
 Program inputs, return values of system calls, files read, packets
 Good tradeoff in the short term
 But can’t find race-conditions on error handling code
Capture all sources of nondeterminism?
No.
 Hardware relaxations? Yes
 Hardware can reorder instructions
 Non-SC executions possible in programs with data races
 Sober [CAV ‘08] can detect and explore such non-SC
executions
 Compiler relaxations? No
 Very few people understand what compilers can do to
programs with data races
 Far fewer than those who understand the general theory
of relativity
Schedule Exploration Algorithms
Two kinds
 Reduction algorithms
 Explore one out of a large number of equivalent
interleavings
 Prioritization algorithms
 Pick “interesting” interleavings before you run out of
resources
 Remember: anything is better than nothing
Schedule Exploration Algorithms
Reduction Algorithms
Using Depth-First Search
Thread 1: x = 1; y = 1;
Thread 2: x = 2; y = 2;
Explore (State s) {
  T = set of threads in s;
  foreach t in T {
    s’ = schedule t in s
    Explore(s’);
  }
}
[Figure: tree of all interleavings; each node is labeled with the values x,y, starting from 0,0]
Behaviorally equivalent interleavings
x = 1;                 x = 1;
y = 2;                 if(x == 1) {
if(x == 1) {    equiv  y = 2;
y = 3; }               y = 3; }
 Reach the same final state (x = 1, y = 3)
Behaviorally inequivalent interleavings
x = 1;                     x = 1;
y = 2;                     if(x == 1) {
if(x == 1) {    not equiv  y = 3; }
y = 3; }                   y = 2;
 Reach different final states (1, 3) vs (1, 2)
Behaviorally inequivalent interleavings
if(x == 1) {               x = 1;
x = 1;          not equiv  if(x == 1) {
y = 2;                     y = 3; }
                           y = 2;
 Don’t necessarily have to reach different states
Execution Equivalence
 Two executions are equivalent if they can be obtained
by commuting independent operations
x = 1;   r1 = y;  r2 = y;  r3 = x
x = 1;   r2 = y;  r1 = y;  r3 = x
r2 = y;  x = 1;   r1 = y;  r3 = x
r2 = y;  x = 1;   r3 = x;  r1 = y
Formalism
 Execution is a sequence of transitions
 Each transition is of the form <tid, var, op>
 tid: thread performing the transition
 var: the memory location accessed in the transition
 op: READ | WRITE | READWRITE
 Two steps are independent if
 They are executed by different threads and
 Either they access different variables or both READ the same
variable
Equivalence makes the schedule space a
Directed Acyclic Graph
Thread 1: x = 1; y = 1;
Thread 2: x = 2; y = 2;
[Figure: the space of interleavings with equivalent executions merged; each node is labeled with the values x,y, starting from 0,0]
DFS in a DAG (CS 101)
HashTable visited;
Explore (Sequence s) {
  T = set of threads enabled in s;
  foreach t in T {
    s’ = s . <t,v,o> ;
    s’’ = canon(s’);
    if (s’’ in visited) continue;
    add s’’ to visited;
    Explore(s’);
  }
}
Sleep sets algorithm
explores a DAG without
maintaining the table
Sleep Set Algorithm
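Thread 1: x = 1; y = 1;
Thread 2: x = 2; y = 2;
Identify transitions that take you to visited states
[Figure: the interleaving tree with the branches that lead to already visited states pruned; each node is labeled with the values x,y, starting from 0,0]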
Sleep Set Algorithm
Explore (Sequence s, sleep C) {
T = set of transitions enabled in s;
T’ = T – C;
foreach t in T’ {
C=C+t
s’ = s . t ;
C’ = C – {transitions dependent on t}
Explore(s’, C’);
}
}
Summary
 Sleep sets ensure that a stateless execution
does not explode a DAG into a tree
Persistent Set Reduction
x = 1;
x = 2;
y = 1;
y = 2;
With Sleep Sets
x = 1;
x = 2;
y = 1;
y = 2;
With Persistent Sets
x = 1;
x = 2;
y = 1;
y = 2;
 Assumption: we are only interested in the reachability
of final states (for instance, no global assertions)
Persistent Sets
 A set of transitions P is
persistent in a state s, if
 In the state space X reachable
from s by only exploring
transitions not in P
 Every transition in X is
independent of P
 P “persists” in X
 It is sound to only explore P from s
Dynamic Partial-Order Reduction
Algorithm [Flanagan & Godefroid]
 Identifies persistent sets dynamically
 After executing a transition, insert a schedule point
before the most recent conflict
Thread 1: y = 1; x = 1;
Thread 2: x = 2; z = 3;
[Figure: explored executions, with edges labeled y=1, x=1, x=2, z=3]
Schedule Exploration Algorithms
Prioritization Algorithms
Schedule Prioritization
 Preemption bounding
 Few preemptions are sufficient for finding lots of bugs
 Preemption sealing
 Insert preemptions where you think bugs are
 Random
 If you don’t have additional information about the state
space, random is the best
 Still do partial-order reduction
Concurrency Correctness Criterion
CHESS checks for various correctness
criteria
 Assertion failures
 Livelocks
 Data races
 Atomicity violations
 (Deterministic) Linearizability violations
Concurrency Correctness Criterion
Linearizability Checking in CHESS
Motivation
 Writing good test oracles is hard
Bank.Withdraw($20);
Assert(Bank.Balance() == ?)
Motivation
 Writing good test oracles is hard
q.RemoveLast()
q.RemoveFirst()
Assert(q.IsEmpty())
 Is this a correct assertion to check for?
 Now what if there are 5 threads each performing 5
queue operations
We want to magically
 Check if a Bank behaves like a Bank should do
 Check if a queue behaves like a queue
 Answer: Check for linearizability
Linearizability
 The correctness notion closest to “thread safety”
 A concurrent component behaves as if it is protected
by a single global lock
 Each operation appears to take effect instantaneously
at some point between the call and return
The Problem with Linearizability
Checking
 Need a sequential specification
 Imagine writing a sequential specification for your
operating system
 Instead, check if a component is linearizable with
respect to some deterministic specification
 This can be done automatically
 Generate the sequential specification by “inserting a
global lock”
LineUp: Two-Phase method
 For a given test:
 First, generate the sequential specification
 Enumerate serial executions of the test
 Record all observed histories
 Assume the generated histories are the intended
behaviors of the component
 Second, check linearizability with respect to the
generated specification
 Enumerate fully concurrent executions
 Test each observed history for compatibility with the serial executions
Line-Up on the Bank Example
Bank.Withdraw($20);
Assert( Bank.Balance() == 20 ||
Bank.Balance() == 0 )
 Serial executions imply that the final balance can be 20
or 0
 Concurrent executions should satisfy the assertion
Line-Up guarantees
 Full Completeness:
If Line-Up reports a violation, the
implementation is not linearizable with respect
to any deterministic specification.
 Restricted Soundness:
If the implementation is not linearizable with
respect to any deterministic specification, there
exists a test on which Line-Up will report a
violation.
Linearizability Violations
 Non-linearizable histories can reveal
implementation errors (e.g. incorrect
synchronization)
 The nonlinearizable behavior below was caused by a
bug in .NET 4.0 (accidental lock timeout).
Add 200 return TryTake
return TryTake
Return 200
return empty
Generalizing Linearizability
 Some operations may block.
 e.g. semaphore.acquire()
 Blocking can be “good” (expected behavior) or “bad”
 Original definition of linearizability does not make
this distinction.
 Blocking is always o.k.
 We generalized the definition to be able to catch “bad
blocking”
A buggy counter implementation
class Counter {
  int count = 0; bool b = false;
  Lock lock = new Lock();
  void inc() {
    b = true;
    lock.acquire();
    count = count + 1;
    lock.release();
    b = false;
  }
  int get() {
    lock.acquire();
    int t = count;
    if (!b)               // bug: the lock is not released while b is true
      lock.release();
    return t;
  }
}
Stuck History:
inc call
get call
get 1
inc ret
inc call
Results
 Each letter is a separate root cause
(A) Incorrect use of CAS causes state corruption.
(B) RemoveLast() uses an incorrect lock-free optimization.
(C) Call to SemaphoreSlim includes a timeout parameter by mistake.
(D) ToArray() can livelock when crossing segment boundaries. Note that the harness for this class performs a particular pre-test sequence (add 31 elements, remove 31 elements).
(E) Insufficient locking: thread can get preempted while trying to set an exception.
(F) Barrier is not a linearizable data type. Barriers block each thread until all threads have entered the barrier, a behavior that is not equivalent to any serial execution.
(G) Cancel is not a linearizable method: the effect of the cancellation can be delayed past the operation return, and in fact even past subsequent operations on the same thread.
(H) Count() may release a lock it does not own if interleaved with Add().
(I) Bag is nondeterministic by design to improve performance: the returned value can depend on the specific interleaving.
(J) Count may return 0 even if the collection is not empty. The specification of the Count method was weakened after Line-Up detected this behavior.
(K) TryTake may fail even if the collection is not empty. The specification of the TryTake method was weakened after Line-Up detected this behavior.
(L) SetResult() throws the wrong exception if the task is already reserved for completion by somebody else, but not completed yet.
Questions
Results: Phase 1 / Phase 2
Outline
 Preemption bounding
 Makes CHESS effective on deep state spaces
 Fair stateless model checking
 Sober
 FeatherLite
 Concurrency Explorer
Outline
 Preemption bounding
 Makes CHESS effective on deep state spaces
 Fair stateless model checking
 Makes CHESS effective on cyclic state spaces
 Enables CHESS to find liveness violations (livelocks)
 Sober
 FeatherLite
 Concurrency Explorer
Concurrent programs have cyclic state spaces
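Thread 1:
L1: while( ! done) {
L2:   Sleep();
}
Thread 2:
M1: done = 1;
[State diagram over the states (! done, L1), (! done, L2), (done, L1), (done, L2): Thread 1 cycles between L1 and L2 while ! done holds]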
Spinlocks
Non-blocking algorithms
Implementations of synchronization primitives
Periodic timers
…
A demonic scheduler unrolls any cycle
Thread 1:
while( ! done)
{
Sleep();
}
Thread 2:
done = 1;
[Diagram: the cycle unrolled into a chain of ! done states, each with a done successor]
Depth bounding
 Prune executions beyond a bounded number of steps
[Diagram: the unrolled chain of ! done states, cut off at the depth bound]
Problem 1: Ineffective state coverage
 Bound has to be large enough to
reach the deepest bug
 Typically, greater than 100
synchronization operations
 Every unrolling of a cycle
redundantly explores reachable
state space
[Diagram: the unrolled chain of ! done states, cut off at the depth bound]
Problem 2: Cannot find livelocks
 Livelocks : lack of progress in a program
temp = done;
while( ! temp)
{
Sleep();
}
done = 1;
Key idea
Thread 1:
while( ! done)
{
Sleep();
}
Thread 2:
done = 1;
[State diagram: a cycle through the ! done states; done = 1 leads to the done states]
 This test terminates only when the scheduler is fair
 Fairness is assumed by programmers
All cycles in correct programs are unfair
A fair cycle is a livelock
We need a fair demonic scheduler
 Avoid unrolling unfair cycles
 Effective state coverage
 Detect fair cycles
 Find livelocks
 What notion of “fairness” do we use?
[Diagram: Test Harness and Concurrent Program on top of the Win32 API, with the Demonic Scheduler replaced by a Fair Demonic Scheduler]
Weak fairness
 Forall t :: GF ( enabled(t) ⇒ scheduled(t) )
 A thread that remains enabled should eventually be
scheduled
while( ! done)
{
Sleep();
}
done = 1;
 A weakly-fair scheduler will eventually schedule Thread 2
 Example: round-robin
Weak fairness does not suffice
Thread 1:
Lock( l );
while( ! done)
{
Unlock( l );
Sleep();
Lock( l );
}
Unlock( l );
Thread 2:
Lock( l );
done = 1;
Unlock( l );
[Execution cycle in which Thread 2 is never scheduled; en is the set of enabled threads]
en = {T1, T2}   T1: Sleep()       T2: Lock( l )
en = {T1, T2}   T1: Lock( l )     T2: Lock( l )
en = { T1 }     T1: Unlock( l )   T2: Lock( l )
en = {T1, T2}   T1: Sleep()       T2: Lock( l )
Strong Fairness
 Forall t :: GF enabled(t) ⇒ GF scheduled(t)
 A thread that is enabled infinitely often is scheduled
infinitely often
Lock( l );
While( ! done)
{
Unlock( l );
Sleep();
Lock( l );
}
Unlock( l );
Lock( l );
done = 1;
Unlock( l );
 Thread 2 is enabled and competes for the lock infinitely
often
Good Samaritan violation
 Threads should yield the processor when not making progress
 Forall threads t : GF scheduled(t) ⇒ GF yield(t)
while( ! done)
{
;
}
done = 1;
 Found many such violations, including one in the
Singularity boot process
 Results in “sluggish I/O” behavior during bootup
Results: Achieves more coverage faster
Work stealing queue with one stealer
                     With      Without fairness, with depth bound
                     fairness  20     30     40     50     60
States explored      1726      871    1505   1726   1307   683
Percentage coverage  100%      50%    87%    100%   76%    40%
Time (secs)          143       97     763    2531   >5000  >5000
Finding livelocks and
finding (not missing) safety violations
Program
Lines of code
Safety Bugs
Livelocks
Work Stealing Q
4K
4
CDS
6K
1
CCR
9K
1
2
ConcRT
16K
2
2
18K
7
APE
19K
4
STM
20K
2
TPL
24K
4
PLINQ
24K
1
Singularity
175K
5
2
26 (total)
11 (total)
Acknowledgement: testers from PCP team
Outline
 Preemption bounding
 Makes CHESS effective on deep state spaces
 Fair stateless model checking
 Makes CHESS effective on cyclic state spaces
 Enables CHESS to find liveness violations (livelocks)
 Sober
 Detect relaxed-memory model errors
 Do not miss behaviors only possible in a relaxed memory
model
 FeatherLite
 Concurrency Explorer
Single slide on Sober
 Relaxed memory verification problem
 Is P correct on a relaxed memory model
 Sober: split the problem into two parts
 Is P correct on a sequentially consistent (SC) machine
 Is P sequentially consistent on a relaxed memory model
 Check this while only exploring SC executions
 CAV ‘08 solves the problem for a memory model with
store buffers (TSO)
 EC2 ‘08 extends this approach to a general class of
memory models
Outline
 Preemption bounding
 Makes CHESS effective on deep state spaces
 Fair stateless model checking
 Makes CHESS effective on cyclic state spaces
 Enables CHESS to find liveness violations (livelocks)
 Sober
 Detect relaxed-memory model errors
 Do not miss behaviors only possible in a relaxed memory model
 FeatherLite
 A light-weight data-race detection engine (<20% overhead)
 Concurrency Explorer
Single slide on FeatherLite
 Current data-race detection tools are slow
 Process every memory access done by the program
 One in 5 instructions accesses memory → 1 billion accesses/sec
 Key idea: Do smart adaptive sampling of memory accesses
 Naïve sampling does not work, need to sample both racing instructions
 Cold-path hypothesis: At least one of the racing instructions occurs in a
cold path
 Races between fast-paths are most probably benign
 FeatherLite adaptively samples cold-paths at 100% rate and hot-paths
at 0.1% rate
 Finds 70% of the data-races with <20% runtime overhead
 Existing data-race detection tools have >10X overhead
Outline
 Preemption bounding
 Makes CHESS effective on deep state spaces
 Fair stateless model checking
 Makes CHESS effective on cyclic state spaces
 Enables CHESS to find liveness violations (livelocks)
 Sober
 Detect relaxed-memory model errors
 Do not miss behaviors only possible in a relaxed memory model
 FeatherLite
 A light-weight data-race detection engine (<20% overhead)
 Concurrency Explorer
 First-class concurrency debugging
Concurrency explorer
 Single-step over a thread interleaving
 Inspect program states at each step
 Program state = Stack of all threads + globals
 Limited bi-directional debugging
 Interleaving slices for better understanding
 Working on:
 Closer integration with the Visual Studio debugger
 Explore neighborhood interleavings
Conclusion
 Don’t stress, use CHESS
 CHESS binary and papers available at
http://research.microsoft.com/CHESS
Points to get across
 Capturing non-determinism
 Sync-orders, data-races, hardware interleavings
 Adding elastic delay
 Soundness & completeness
 Scoping Preemptions
Questions
 Did you find new bugs
 How is this different from your previous papers
 How is this different from previous mc efforts
 How is this different from
Are these behaviors “expected” ?
TryTake
return “empty”
return
TryTake
return
return 10
TryTake
return
return 10
return
TryTake
return “empty”
Linearizability
 Component is linearizable if all operations
 Appear to take effect at a single temporal point
 And that point is between the call and the return
 “As if the component was protected by a single
global lock”
TryTake
TryTake
return
return
return 10
return “empty”
This behavior is not linearizable
 Thread 2 getting a 10 means that Thread 1’s Add got
 So, when Thread 3 does a TryTake, 20 should be still
in the queue
return
TryTake
return 10
return
TryTake
return “empty”
Linearizable?
return
TryTake
return 20
return
TryTake
return “empty”
How is Linearizability different from
Serializability?
 Serializability
 All operations happen atomically in some serial order
 Linearizability
 All operations happen at a single instant
 That instant is between the call and return
Serializable behavior that is not
Linearizable
return
TryTake
return “empty”
 Linearizability assumes that there is a global observer
that can observe that Thread 1 finished before Thread
2 started
 This is what makes linearizability composable
Serializability does not compose
return TryTake
return “empty”
return TryTake
return “empty”
 The behavior of the blue queue and green queue are
individually serializable
 But, together, the behavior is not serializable
To make this all the more confusing
 Database concurrency control ensures that
transactions are linearizable
 Even though the literature only talks about serializability
 Quote from Jim Gray:
 “When a transaction finishes, the state of the database
immediately reflects the updates of the transaction”
 The commit point of a transaction is guaranteed
between the transaction begin and end
 When using a two-phase locking protocol, for instance
“Standard” definition of Linearizability
 Is a little more general than my interpretation
 “as if protected by a single global lock”
 Sometimes, a concurrent implementation can have
more behaviors than a sequential implementation
 Example: a set implemented as a queue
 A sequential version will be FIFO even though order does not
matter for a set
 For performance, a concurrent version can break the
FIFO ordering but still maintain the set abstraction
 Define a “sequential specification”
A Sequential Specification
 (A fancy word for something you already know but don’t