Sebastian Burckhardt
Rajeev Alur
Milo M. K. Martin
Department of Computer and Information Science
University of Pennsylvania
CAV 2006, Seattle
Software running as concurrent executions on a multiprocessor is prone to bugs.
Concurrency libraries can help (e.g. Java JSR-166), but how do we debug the libraries?
Case study: use a SAT solver to find bugs in optimized implementations of concurrent data types, whose concurrent executions run on a shared-memory multiprocessor with a relaxed memory model.
Algorithm published by M. Michael and M. Scott [PODC 1996]
[Figure: queue nodes 1, 2, 3 in a linked list; the head pointer and the tail pointer are each protected by their own lock]
Singly linked list with head and tail pointers
Dummy node at front
Independent head and tail locks
→ allows for concurrent enqueue() and dequeue()
Race condition if queue is empty
The client program observes
  the ordering of operation calls within each thread
  the argument and return values of each operation
Example:
  thread 1: enqueue( 1 ), enqueue( 2 )
  thread 2: enqueue( 3 ), dequeue() → 1
  thread 3: dequeue() → 3, dequeue() → 2
The code is correct if and only if all executions are observationally equivalent to some serial execution
(def. serial: interleaved at operation boundaries only)
We assume serial executions are correct
(can be verified by conventional sequential methods)
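The correctness criterion can be illustrated with a small brute-force check. The sketch below is not the method of the talk (which uses a SAT encoding); it simply enumerates all serial executions of an observation and replays them on a sequential FIFO queue. The function names and the convention that dequeue on an empty queue returns `None` are assumptions made for this illustration.

```python
from collections import deque

def serial_orders(threads):
    """Yield every interleaving of whole operations that keeps each thread's order."""
    if all(not t for t in threads):
        yield []
        return
    for i, t in enumerate(threads):
        if t:
            rest = threads[:i] + [t[1:]] + threads[i + 1:]
            for tail in serial_orders(rest):
                yield [t[0]] + tail

def replay_matches(order):
    """Replay a serial order on a sequential queue and compare return values."""
    queue = deque()
    for kind, value in order:
        if kind == "enq":
            queue.append(value)
        else:
            # Assumption for this sketch: dequeue on an empty queue returns None.
            got = queue.popleft() if queue else None
            if got != value:
                return False
    return True

def observationally_serial(threads):
    """True if some serial execution explains the observed arguments and results."""
    return any(replay_matches(order) for order in serial_orders(threads))

# The observation from the example: ("enq", argument) or ("deq", returned value).
observed = [
    [("enq", 1), ("enq", 2)],
    [("enq", 3), ("deq", 1)],
    [("deq", 3), ("deq", 2)],
]
# A hypothetical observation that violates FIFO order.
broken = [
    [("enq", 1), ("enq", 2)],
    [("deq", 2), ("deq", 1)],
]
print(observationally_serial(observed), observationally_serial(broken))
```

The first observation is explained by the serial order enqueue(1), enqueue(3), dequeue()→1, dequeue()→3, enqueue(2), dequeue()→2; the second has no serial explanation.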
Serial executions: threads interleave the operations
  (operations are atomic)
  (operations are in-order)
Sequentially consistent (SC) executions: threads interleave the instructions
  (instructions are atomic)
  (instructions are in-order)
Relaxed executions: hardware makes performance-motivated compromises
  (stores may be non-atomic)
  (loads/stores may be out-of-order)
Example:
  thread 1: x = 1; y = 2
  thread 2: print y → 2; print x → 0
The output is not consistent with any interleaved execution!
  can be the result of out-of-order stores
  can be the result of out-of-order loads
Reordering improves performance (more choices for processor)
Q: Why doesn’t everything break?
A: Relaxations are transparent to “normal” programs
  uniprocessor semantics are preserved
  library code for lock/unlock contains memory ordering fences
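The claim that the output (2, 0) matches no interleaved execution can be checked by enumeration. The sketch below is a deliberately crude model, not the "Relaxed" model of the talk: it treats reordering as allowing any global order of the four instructions, and it assumes x and y are initially 0. All names are chosen for this illustration.

```python
from itertools import permutations

# Instructions as (thread, index_in_thread, kind, variable, value).
PROGRAM = [
    (1, 0, "store", "x", 1),
    (1, 1, "store", "y", 2),
    (2, 0, "load", "y", None),
    (2, 1, "load", "x", None),
]

def in_program_order(order):
    """True if each thread's instructions appear in their original order."""
    last = {}
    for thread, index, *_ in order:
        if last.get(thread, -1) > index:
            return False
        last[thread] = index
    return True

def outcomes(allow_reordering):
    """Collect every (printed y, printed x) pair over the allowed global orders."""
    seen = set()
    for order in permutations(PROGRAM):
        if not allow_reordering and not in_program_order(order):
            continue
        memory = {"x": 0, "y": 0}  # assumed initial values
        printed = {}
        for _thread, _index, kind, var, value in order:
            if kind == "store":
                memory[var] = value
            else:
                printed[var] = memory[var]
        seen.add((printed["y"], printed["x"]))
    return seen

sc = outcomes(allow_reordering=False)
relaxed = outcomes(allow_reordering=True)
print(sorted(sc))
print((2, 0) in sc, (2, 0) in relaxed)
```

Under interleaving only, the pair (2, 0) never appears; once instructions within a thread may be reordered, it does.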
[Figure: memory models SC, 390, TSO, PSO, RMO, Alpha, and PPC, with “Relaxed” as the weakest model covering all of them]
Memory models are platform dependent
We use a conservative approximation
“Relaxed” to capture common effects
Once code is correct for “Relaxed”, it is correct for all models
See paper for formal spec of “Relaxed”
General motivation
Case study parameters
Two-lock queue implementation
Correctness criterion
Relaxed memory models
Our verification method
Symbolic tests
SAT encoding
Results
Bugs found
Evaluation & Conclusion
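[Figure: tool overview. The encoder (5) takes a symbolic test (1) and the implementation code (2) with commit points (3), and produces a formula for the SAT solver, which reports either “pass” or a counterexample (4)]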
(1) Verify individual “symbolic tests”
  finite number of operations
  nondeterministic instruction order
  nondeterministic input values
Example (this is the smallest one in our test suite):
  thread 1: enqueue( X )
  thread 2: dequeue() → Y
User creates suite of tests of increasing size
1) Avoid undecidability by making everything finite:
State is unbounded (dynamic memory allocation)
... but is bounded for an individual test
Sequential consistency is undecidable
... but is decidable for an individual test
2) Gives us a finite instruction sequence to work with
State space is too large for an interleaved system model
... but we can directly encode value flow between instructions
Memory model is specified by axioms
... so we can directly encode ordering axioms on instructions
(2) We hand-translated Michael & Scott’s code (above) into a low-level representation that uses explicit loads and stores.
We added code for dynamic memory allocation and locks.
(3) Commit points designate where the operation commits logically
given the order of commit points, we can construct a serial witness execution
this eliminates the ∃ in “∀ executions ∃ equivalent serial execution”
(4) Counterexample
  thread 1: enqueue( 1 )
  thread 2: dequeue() → 0
[Figure: the instructions of both threads, annotated with the numbers 1 to 14]
The commit point order ( 3 < 6 ) indicates that enqueue precedes dequeue, so we would expect dequeue() → 1.
The incorrect value ( 0 ) of the queue element gets read ( 7 ) before the correct value ( 1 ) is written ( 11 ).
(5) Given
  symbolic test T( A , B ):
    thread 1: enqueue( A )
    thread 2: dequeue() → B
  memory model Y
  implementation code & commit point specifications
Encoding
First step: encode concurrent executions of T on Y as solutions to a CNF formula
  $\Phi_Y(A, B, X)$ (aux vars $X$)
Second step: encode counterexamples as solutions to
  $\Phi_Y(A, B, X) \wedge \Phi_{\mathit{Atomic}}(A', B', X') \wedge (A = A') \wedge (\text{commit point orders match}) \wedge \big( (B \neq B') \vee (\text{some operations commit out of order}) \big)$
Finite instruction sequence for each thread
Only loads, stores, moves, and fences
Each register is assigned exactly once
Control flow represented by predicates
Example: two threads
  thread 1: s1, s2 (stores)
  thread 2: l1, l2 (loads)
Encoding variables
  Use bool vars for relative order ( x<y ) of memory accesses
  Use bitvector variables $A_x$ and $D_x$ associated with memory access x for address and data values
Encode constraints
  encode transitivity of memory order
  encode ordering axioms of the memory model
    Example (for SC): $(s_1 < s_2) \wedge (l_1 < l_2)$
  encode value flow
    “Loaded value must match last value stored to same address”
    Example: value must flow from s1 to l1 under the following conditions:
    $\big( (s_1 < l_1) \wedge (A_{s_1} = A_{l_1}) \wedge ((s_2 < s_1) \vee (l_1 < s_2) \vee (A_{s_2} \neq A_{l_1})) \big) \rightarrow (D_{s_1} = D_{l_1})$
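The value-flow condition for s1 and l1 can be sanity-checked against an operational reading of "last value stored to same address". The sketch below enumerates every memory order of s1, s2, l1 and every assignment of two possible addresses, and compares the condition with a reference that picks the latest earlier store to the loaded address. The second load l2 is left out, and the function names and the two-address domain are assumptions of this illustration.

```python
from itertools import permutations, product

def flow_condition(pos, addr):
    """The condition under which l1 must read the value stored by s1."""
    def before(a, b):
        return pos[a] < pos[b]
    return (before("s1", "l1")
            and addr["s1"] == addr["l1"]
            and (before("s2", "s1") or before("l1", "s2")
                 or addr["s2"] != addr["l1"]))

def last_store_seen_by_l1(pos, addr):
    """Operational reference: latest earlier store to l1's address, or None."""
    earlier = [s for s in ("s1", "s2")
               if pos[s] < pos["l1"] and addr[s] == addr["l1"]]
    return max(earlier, key=lambda s: pos[s]) if earlier else None

def mismatches():
    """List every memory order / address assignment where the two views disagree."""
    bad = []
    for order in permutations(["s1", "s2", "l1"]):
        pos = {name: i for i, name in enumerate(order)}
        for a1, a2, al in product([0, 1], repeat=3):
            addr = {"s1": a1, "s2": a2, "l1": al}
            expected = last_store_seen_by_l1(pos, addr) == "s1"
            if flow_condition(pos, addr) != expected:
                bad.append((order, addr))
    return bad

# An empty list means the condition agrees with the reference in all cases.
print(mismatches())
```

In the SAT encoding the same condition is stated over boolean order variables and bitvector addresses instead of being enumerated.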
[Figure: structure of the formula: input values, output values, intermediate values, and memory order variables, connected by thread-local formulas and a communication formula]
Bugs found: 5 problems in total
3 were mistakes we made
  first commit point guess was wrong
  incorrect/insufficient fences in lock/unlock and alloc/free
2 were caused by missing fences in the queue implementation
  (not the fault of the authors, who were assuming an SC multiprocessor)
  missing fences: a load-load fence and a store-store fence
[Figure: runtime in seconds (y-axis, 0 to 1000) versus number of memory accesses (loads/stores) in the test (x-axis, 0 to 150), for the tests in our suite (unsatisfiable instances only)]
Fast on small tests, slow on long tests
Not sensitive to # threads
All 5 problems were found on the smallest 2 tests, all in under 1 sec
We would recommend this method to designers and implementors of concurrent data types.
PROs
  quickly finds subtle bugs
  supports relaxed memory models
  counterexample traces
  catches broad range of bugs (not limited to deadlocks or data races)
  is more automatic than deductive methods
CHALLENGES
  not truly scalable (though scalable enough to be useful)
  not fully automatic
  does not solve full problem (bounded instances, commit points)
Ordering/Atomicity Relaxations
The following 2 examples illustrate the main effects
(1. ordering relaxation / 2. atomicity relaxation)
Where necessary, a programmer can prevent these effects by inserting fence instructions
EXAMPLE 1: store, load may execute out of order
  (initially A=B=0; bracketed numbers = memory order)
  processor 1: [3] store A, 1   [1] load B, 0
  processor 2: [4] store B, 1   [2] load A, 0
EXAMPLE 2: stores are buffered locally before effect is global
  (initially A=B=0; the store to A is split into local / remote components)
  processor 1: [1/6] store A, 1   [2] load A, reg   [3] store reg, B
  processor 2: [4] load B, 1   [5] load A, 0
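Both examples can be reproduced with a toy operational model. The sketch below gives each processor a store buffer whose entries may become globally visible in any order, with loads reading the processor's own buffered stores first. This is an assumption-laden illustration, not the "Relaxed" model of the talk; in particular, it would also reorder two stores by one processor to the same address, which does not occur in these examples. The instruction format `(op, address, register_or_value)` and all names are hypothetical.

```python
def freeze(d):
    return tuple(sorted(d.items()))

def replace(t, i, v):
    return t[:i] + (v,) + t[i + 1:]

def reachable_registers(programs, buffered):
    """Explore all schedules; return the set of final register contents."""
    start = ((0,) * len(programs), ((),) * len(programs),
             (("A", 0), ("B", 0)), ())
    seen, stack, finals = {start}, [start], set()
    while stack:
        pcs, bufs, mem, regs = stack.pop()
        successors = []
        for p, prog in enumerate(programs):
            # Option 1: one buffered store of processor p becomes globally visible.
            for n, (addr, val) in enumerate(bufs[p]):
                new_mem = dict(mem)
                new_mem[addr] = val
                new_buf = bufs[p][:n] + bufs[p][n + 1:]
                successors.append((pcs, replace(bufs, p, new_buf), freeze(new_mem), regs))
            # Option 2: processor p executes its next instruction.
            if pcs[p] == len(prog):
                continue
            op, addr, x = prog[pcs[p]]
            new_pcs = replace(pcs, p, pcs[p] + 1)
            if op == "store":
                val = dict(regs)[x] if isinstance(x, str) else x
                if buffered:
                    new_buf = bufs[p] + ((addr, val),)
                    successors.append((new_pcs, replace(bufs, p, new_buf), mem, regs))
                else:
                    new_mem = dict(mem)
                    new_mem[addr] = val
                    successors.append((new_pcs, bufs, freeze(new_mem), regs))
            else:  # load: the processor's own buffered stores take priority
                own = [v for a, v in bufs[p] if a == addr]
                new_regs = dict(regs)
                new_regs[x] = own[-1] if own else dict(mem)[addr]
                successors.append((new_pcs, bufs, mem, freeze(new_regs)))
        if not successors:  # all programs finished and all buffers drained
            finals.add(regs)
        for s in successors:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return finals

EX1 = [[("store", "A", 1), ("load", "B", "r1")],
       [("store", "B", 1), ("load", "A", "r2")]]
EX2 = [[("store", "A", 1), ("load", "A", "reg"), ("store", "B", "reg")],
       [("load", "B", "r1"), ("load", "A", "r2")]]

both_zero = (("r1", 0), ("r2", 0))
stale_a = (("r1", 1), ("r2", 0), ("reg", 1))
print(both_zero in reachable_registers(EX1, buffered=True),
      both_zero in reachable_registers(EX1, buffered=False))
print(stale_a in reachable_registers(EX2, buffered=True),
      stale_a in reachable_registers(EX2, buffered=False))
```

With `buffered=False` every store takes effect immediately, so the search covers only interleaved executions and neither outcome is reachable.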
What code?
Data type implementations optimized for concurrent execution
(Concurrency libraries)
What machines?
Common shared-memory multiprocessors
(e.g. PPC, Sparc, Alpha)
What bugs?
Bugs caused by concurrency
(We assume code runs fine if single-threaded)
Encoding Concurrent Executions
Example:
  label   instruction
  x1      load a[0], R1
  x2      store R1, y

  label   instruction
  x3      load y, R2
          move R2+1, R3
  x4      store 1, a[R3]
Variables
  bitvectors R1, R2, R3 for intermediate values
  boolean variables $M_{ij}$ to represent memory order $x_i < x_j$ (for $i < j$)
Constraints
  memory order is transitive: $\bigwedge_{i<j<k} (M_{ij} \wedge M_{jk}) \rightarrow M_{ik}$
  loads get latest value stored to same address
  memory order must respect memory model axioms and fences
  (e.g. sequential consistency requires $M_{12} \wedge M_{34}$)
  thread-local computations connect values (e.g. R3 = R2 + 1)
Size: O(n²) variables, O(n³) constraints
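The memory order variables and the transitivity constraint can be sketched as explicit clauses. The code below is an illustration, not the encoder from the talk: it generates one clause per ordered triple of distinct accesses (the slide shows only the representative case i<j<k), represents "x_i < x_j" for i > j as the negation of M_ji, and counts satisfying assignments by brute force instead of calling a SAT solver. All names are assumptions.

```python
from itertools import permutations, product

def lit(i, j):
    """Literal for 'x_i < x_j' as (variable, polarity); only M_ij with i < j exist."""
    return ((i, j), True) if i < j else ((j, i), False)

def negate(literal):
    variable, polarity = literal
    return (variable, not polarity)

def transitivity_clauses(n):
    """One clause (not a) or (not b) or c per triple: a=x_i<x_j, b=x_j<x_k, c=x_i<x_k."""
    clauses = []
    for i, j, k in permutations(range(1, n + 1), 3):
        clauses.append([negate(lit(i, j)), negate(lit(j, k)), lit(i, k)])
    return clauses

def count_models(n, extra_clauses=()):
    """Count assignments to the M_ij that satisfy all clauses (brute force)."""
    variables = [(i, j) for i in range(1, n + 1) for j in range(i + 1, n + 1)]
    clauses = transitivity_clauses(n) + list(extra_clauses)
    count = 0
    for bits in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, bits))
        if all(any(assignment[v] == pol for v, pol in clause) for clause in clauses):
            count += 1
    return count

# Sequential consistency for the example: M_12 and M_34 as unit clauses.
sc_axioms = [[((1, 2), True)], [((3, 4), True)]]
print(len(transitivity_clauses(4)))
print(count_models(4), count_models(4, sc_axioms))
```

Each satisfying assignment corresponds to one total memory order of the n accesses, so the transitivity clauses alone admit n! models, and the SC axioms cut the 24 orders of the four accesses down to the 6 interleavings of x1, x2 with x3, x4.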