Concurrent Executions on Relaxed Memory Models
Challenges & Opportunities for Software Model Checking
Rajeev Alur
University of Pennsylvania
Joint work with Sebastian Burckhardt and Milo Martin
SPIN Workshop, July 2007
Model Checking Concurrent Software
Multi-threaded software  +  shared-memory multiprocessor  →  concurrent executions  →  bugs
Concurrency on Multiprocessors

Initially x = y = 0

thread 1          thread 2
  x = 1             print y   → 1
  y = 1             print x   → 0

Output not consistent with any interleaved execution!
 can be the result of out-of-order stores
 can be the result of out-of-order loads
 reordering improves performance
Architectures with Weak Memory Models
 A modern multiprocessor does not enforce global ordering of
all instructions for performance reasons
 Each processor has pipelined architecture and executes multiple
instructions simultaneously
 Each processor has a local cache, and loads/stores to local cache
become visible to other processors and shared memory at
different times
 Lamport (1979): Sequential consistency semantics for correctness of
multiprocessor shared memory (like interleaving)
 Considered too limiting, and many “relaxations” proposed
 In theory: TSO, RMO, Relaxed …
 In practice: Alpha, Intel IA-32, IBM 370, Sun SPARC, PowerPC …
 Active research area in computer architecture
Programming with Weak Memory Models
 Concurrent programming is already hard, shouldn’t the effects
of weaker models be hidden from the programmer?
 Mostly yes …
 Safe programming through extensive use of synchronization primitives
 Use locks for every access to shared data
 Compilers use memory fences to enforce ordering
 Not always …
 Non-blocking data structures
 Highly optimized library code for concurrency
 Code for lock/unlock instructions
Non-blocking Queue (MS’96)

boolean_t dequeue(queue_t *queue, value_t *pvalue)
{
  node_t *head;
  node_t *tail;
  node_t *next;

  while (true) {
    head = queue->head;
    tail = queue->tail;
    next = head->next;
    if (head == queue->head) {       /* head, tail, and next still consistent? */
      if (head == tail) {            /* queue empty, or tail lagging behind    */
        if (next == 0)
          return false;              /* queue is empty                         */
        cas(&queue->tail, (uint32) tail, (uint32) next);   /* help advance tail */
      } else {
        *pvalue = next->value;       /* read the value before the CAS          */
        if (cas(&queue->head, (uint32) head, (uint32) next))
          break;                     /* dequeue succeeded                      */
      }
    }
  }
  delete_node(head);                 /* free the old dummy node                */
  return true;
}
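The code above relies on an atomic compare-and-swap primitive cas(). One possible wrapper, assuming a 32-bit platform and the GCC __sync builtin (an assumption; the original code targets an architecture-specific compare&swap instruction):

    typedef unsigned int uint32;

    /* Sketch only: assumes pointers fit in 32 bits.  Atomically sets *addr to
       newv if *addr still equals oldv; returns nonzero iff the swap happened. */
    static inline int cas(void *addr, uint32 oldv, uint32 newv)
    {
        return __sync_bool_compare_and_swap((volatile uint32 *) addr, oldv, newv);
    }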
Software Model Checking for
Concurrent Code on Multiprocessors
Why?: Real bugs in real code
 Opportunities
 10s—100s lines of low-level library C code
 Hard to design and verify -> buggy
 Effects of weak memory models, fences …
 Challenges
 Lots of behaviors possible: high level of concurrency
 How to formalize and reason about weak memory models?
Talk Outline
 Motivation
 Relaxed Memory Models
 CheckFence: Analysis tool for Concurrent Data Types
Hierarchy of Abstractions

Programs (multi-threaded)
  -- Synchronization primitives: locks, atomic blocks
  -- Library primitives (e.g. shared queues)
(language-level memory model)
Assembly code
  -- Synchronization primitives: compare&swap, LL/SC
  -- Loads/stores from shared memory
(architecture-level memory model)
Hardware
  -- multiple processors
  -- write buffers, caches, communication bus ...
Shared Memory Consistency Models
 Specifies restrictions on what values a read from shared
memory can return
 Program Order: x <p y if x and y are instructions belonging to
the same thread and x appears before y
 Sequential Consistency (Lamport 79): There exists a global
order < of all accesses such that
 < is consistent with the program order <p
 Each read returns the value of the most recent (according to <) write to
the same location (or the initial value, if no such write exists)
 Clean abstraction for programmers, but high implementation cost
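Stated compactly, using the notation of the Relaxed formalization given later in the talk (X = memory accesses, S = stores, L = loads, a(x) = address, v(x) = value of access x; this restatement is mine, not a slide):

    \exists \text{ a total order } < \text{ on } X \text{ with } {<_p} \subseteq {<} \text{ such that }
    \forall l \in L:\; v(l) = v\Big(\max_{<}\,\{\, s \in S \mid a(s) = a(l),\ s < l \,\}\Big)

(taking the initial value of a(l) when no such store exists).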
Effect of Memory Model

Initially flag1 = flag2 = 0

thread 1                    thread 2
1. flag1 = 1;               1. flag2 = 1;
2. if (flag2 == 0)          2. if (flag1 == 0)
     crit. sect.                 crit. sect.

Ensures mutual exclusion if the architecture supports SC memory
Most architectures do not enforce ordering of accesses to different memory locations
 Does not ensure mutual exclusion under weaker models
Ordering can be enforced using “fence” instructions
 Insert MEMBAR between lines 1 and 2 to ensure mutual exclusion
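For concreteness, a minimal sketch of the fenced version of thread 1, written with C11 atomics (an assumption: the talk predates C11 and refers to the architecture-level MEMBAR instruction; a seq_cst fence plays the same role here). Thread 2 is symmetric.

    #include <stdatomic.h>

    atomic_int flag1, flag2;   /* both initially 0 */

    void thread1(void)
    {
        atomic_store_explicit(&flag1, 1, memory_order_relaxed);       /* line 1            */
        atomic_thread_fence(memory_order_seq_cst);                    /* MEMBAR-like fence */
        if (atomic_load_explicit(&flag2, memory_order_relaxed) == 0)  /* line 2            */
        {
            /* critical section */
        }
    }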
Relaxed Memory Models
 A large variety of models exists; a good starting point:
Adve & Gharachorloo, “Shared Memory Consistency Models: A Tutorial”, IEEE Computer, 1996
 How to relax memory order requirement?
 Operations of same thread to different locations need not be
globally ordered
 How to relax write atomicity requirement?
 Read may return value of a write not yet globally visible
 Uniprocessor semantics preserved
 Typically defined in architecture manuals (e.g. SPARC manual)
Unusual Effects of Memory Models

Initially A = flag1 = flag2 = 0

thread 1             thread 2
  flag1 = 1;           flag2 = 1;
  A = 1;               A = 2;
  reg1 = A;            reg3 = A;
  reg2 = flag2;        reg4 = flag1;

Result: reg1 = 1; reg3 = 2; reg2 = reg4 = 0

Possible on TSO/SPARC
 Write to A propagated only to local reads of A
 Reads of flags can occur before writes to flags
Not allowed on IBM 370
 Read of A on a processor waits till the write to A is complete
Which Memory Model?

 Memory models are platform dependent
 We use a conservative approximation “Relaxed” to capture common effects
 Once code is correct for “Relaxed”, it is correct for many models

(Figure: a hierarchy of memory models, with SC as the strongest and Relaxed as the
weakest; the models in between include 390, TSO, IA-32, PSO, RMO, and Alpha.)
Formalization of Relaxed

A     set of addresses                   V     set of values
X     set of memory accesses             a(x)  memory address of x
S     set of stores, L set of loads      v(x)  value loaded or stored by x
<p    partial order over X (program order)

Correctness wrt Relaxed demands the existence of a total order < over X such that:

For a load l in L, define the set of stores “visible to l”:
    S(l) = { s in S | a(s) = a(l) and (s < l or s <p l) }

1. If x <p y and a(x) = a(y) and y in S, then x < y
2. For l in L and s in S(l), always either v(l) = v(s) or there exists another store s’ in
   S(l) such that s < s’
Talk Outline
 Motivation
 Relaxed memory models
 CheckFence: Analysis tool for Concurrent Data Types
CheckFence Focus
concurrency libraries with lock-free synchronization
... are simple, fast, and safe to use
 concurrent versions of queues, sets, maps, etc.
 more concurrency, less waiting
 fewer deadlocks
... are notoriously hard to design and verify
 tricky interleavings routinely escape reasoning and testing
 exposed to relaxed memory models
code needs to contain memory fences for correct operation!
CheckFence Tool

Concurrent algorithms:
+ Lock-free queues, sets, lists
Architecture:
+ Multiprocessors
+ Relaxed memory models
Computer-aided verification:
+ Model checking C code
+ Counterexamples
References: [CAV 2006], [PLDI 2007], Burckhardt thesis
Non-blocking Queue

The client program
  runs on multiple processors
  calls operations
    Processor 1:  ... enqueue(1) ... enqueue(2) ...
    Processor 2:  ... a = dequeue() ... b = dequeue() ...

The implementation
  optimized: no locks
  not race-free
  exposed to memory model
    void enqueue(int val) { ... }
    int dequeue() { ... }
Non-blocking Queue (MS’96)

boolean_t dequeue(queue_t *queue, value_t *pvalue)
{
  node_t *head;
  node_t *tail;
  node_t *next;

  while (true) {
    head = queue->head;
    tail = queue->tail;
    next = head->next;
    if (head == queue->head) {       /* head, tail, and next still consistent? */
      if (head == tail) {            /* queue empty, or tail lagging behind    */
        if (next == 0)
          return false;              /* queue is empty                         */
        cas(&queue->tail, (uint32) tail, (uint32) next);   /* help advance tail */
      } else {
        *pvalue = next->value;       /* read the value before the CAS          */
        if (cas(&queue->head, (uint32) head, (uint32) next))
          break;                     /* dequeue succeeded                      */
      }
    }
  }
  delete_node(head);                 /* free the old dummy node                */
  return true;
}
Correctness Condition

Data type implementations must appear sequentially consistent to the client program:
the observed argument and return values must be consistent with some
interleaved, atomic execution of the operations.

 Observation:
    thread 1:  enqueue(1); dequeue() -> 2
    thread 2:  enqueue(2); dequeue() -> 1
 Witness interleaving:
    enqueue(1); enqueue(2); dequeue() -> 1; dequeue() -> 2
How To Bound Executions

 Verify individual “symbolic tests”
  finite number of concurrent threads
  finite number of operations/thread
  nondeterministic input values
 Example:
    thread 1:  enqueue(X)
    thread 2:  dequeue() → Y
 User creates suite of tests of increasing size
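As a rough illustration only (the harness shape and nondet_int() are assumptions, not CheckFence's actual input format), such a symbolic test might be written as:

    extern int nondet_int(void);   /* a symbolic (nondeterministic) input value   */

    void test_thread_1(void)
    {
        int X = nondet_int();
        enqueue(X);                /* operation under test                        */
    }

    void test_thread_2(void)
    {
        int Y = dequeue();         /* observed return value, checked by the tool  */
        (void) Y;
    }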
Why symbolic test programs?
1) Make everything finite
 State is unbounded (dynamic memory allocation)
... is bounded for individual test
 Checking sequential consistency is undecidable (AMP 96)
... is decidable for individual test
2) Gives us finite instruction sequence to work with
 State space too large for interleaved system model
.... can directly encode value flow between instructions
 Memory model specified by axioms
.... can directly encode ordering axioms on instructions
Bounded Model Checker

CheckFence (given the memory model axioms) reports one of:
 Pass: all executions of the test are observationally equivalent to a serial execution
 Fail: a counterexample is found
 Inconclusive: runs out of time or memory
Tool Architecture

Inputs: C code, memory model, symbolic test.  Output: a counterexample trace.

A symbolic test gives exponentially many executions
(symbolic inputs, dynamic memory allocation, ordering of instructions).
CheckFence solves for “bad” executions:
 constructs a CNF formula whose solutions correspond precisely to the concurrent executions
 automatic, lazy loop unrolling
 automatic specification mining (enumerate correct observations)
Specification Mining

Symbolic test:
    thread 1:  enqueue(X); enqueue(Y)
    thread 2:  dequeue() → Z

Possible operation-level interleavings:
    enqueue(X); enqueue(Y); dequeue() -> Z
    enqueue(X); dequeue() -> Z; enqueue(Y)
    dequeue() -> Z; enqueue(X); enqueue(Y)

For each interleaving, obtain a symbolic constraint by encoding the
corresponding executions in the SAT solver.

Spec is the disjunction of all possibilities:
    Spec: (Z=X) | (Z=null)

To find bugs, check satisfiability of Phi & ~Spec,
where Phi encodes all possible concurrent executions.
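In general (my notation, not from the slides): if Obs_pi is the constraint describing the observations allowed by operation-level interleaving pi, then

    \mathit{Spec} \;=\; \bigvee_{\pi} \mathit{Obs}_{\pi},
    \qquad
    \text{report a bug} \iff \Phi \wedge \neg \mathit{Spec} \text{ is satisfiable.}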
Encoding Memory Order

 Example:
    thread 1:  s1: store   s2: store
    thread 2:  l1: load    l2: load
 Encoding variables
 Use bool vars for relative order (x<y) of memory accesses
 Use bitvector variables Ax and Dx for address and data values
associated with memory access x
 Encode constraints
 encode transitivity of memory order
 encode ordering axioms of the memory model
Example (for SC): (s1<s2) & (l1<l2)
 encode value flow
“Loaded value must match last value stored to same address”
Example: value must flow from s1 to l1 under the following conditions:
    ((s1 < l1) & (As1 = Al1) & ((s2 < s1) | (l1 < s2) | (As2 != Al1)))  ->  (Ds1 = Dl1)
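The general form of this value-flow constraint for an arbitrary store s and load l (my rendering of the rule quoted above, in LaTeX notation):

    \Big( (s < l) \,\wedge\, (A_s = A_l) \,\wedge \bigwedge_{s' \in S,\, s' \neq s}
    \big( (s' < s) \vee (l < s') \vee (A_{s'} \neq A_l) \big) \Big)
    \;\rightarrow\; (D_s = D_l)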
Example: Memory Model Bug

Processor 1 links a new node into the list:
    ...
    node->value = 2;       (3)
    ...
    head = node;           (1)
    ...

Processor 2 reads the value at the head of the list:
    ...
    value = head->value;   (2)
    ...

Processor 1 reorders the stores!  The memory accesses happen in the order (1) (2) (3)
--> Processor 2 loads an uninitialized value

Adding a fence between the two lines on Processor 1 prevents the reordering.
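A minimal sketch of such a fence fix, written with C11 atomics (an assumption; the talk's actual fix is an architecture-level store fence). The release fence keeps the initializing store before the publishing store, and the acquire load on the reader side pairs with it.

    #include <stdatomic.h>
    #include <stddef.h>

    typedef struct node { int value; struct node *next; } node_t;
    static _Atomic(node_t *) head;

    void link_node(node_t *node)                    /* Processor 1 */
    {
        node->value = 2;                            /* initialize the node first    */
        atomic_thread_fence(memory_order_release);  /* fence: order the two stores  */
        atomic_store_explicit(&head, node, memory_order_relaxed);  /* then publish  */
    }

    int read_head_value(void)                       /* Processor 2 */
    {
        node_t *n = atomic_load_explicit(&head, memory_order_acquire);
        return n ? n->value : -1;                   /* sees the initialized value   */
    }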
Studied Algorithms

Type        Description            LOC   Source
Queue       Two-lock queue          80   M. Michael and L. Scott (PODC 1996)
Queue       Non-blocking queue      98   M. Michael and L. Scott (PODC 1996)
Set         Lazy list-based set    141   Heller et al. (OPODIS 2005)
Set         Nonblocking list       174   T. Harris (DISC 2001)
Deque       “snark” algorithm      159   D. Detlefs et al. (DISC 2000)
LL/VL/SC    CAS-based               74   M. Moir (PODC 1997)
LL/VL/SC    Bounded Tags           198   M. Moir (PODC 1997)
Results

 snark algorithm has 2 known bugs
 lazy list-based set had an unknown bug
  (missing initialization; missed by the formal correctness proof
  [CAV 2006] because of hand-translation of the pseudocode)

Type        Description            Regular bugs
Queue       Two-lock queue         -
Queue       Non-blocking queue     -
Set         Lazy list-based set    1 unknown
Set         Nonblocking list       -
Deque       original “snark”       2 known
Deque       fixed “snark”          -
LL/VL/SC    CAS-based              -
LL/VL/SC    Bounded Tags           -
Results

 snark algorithm has 2 known bugs
 lazy list-based set had an unknown bug
  (missing initialization; missed by the formal correctness proof
  [CAV 2006] because of hand-translation of the pseudocode)
 Many failures on the relaxed memory model
  • inserted fences by hand to fix them
  • small testcases sufficient for this purpose

(Table: for each of the algorithms above, the regular bugs found and the number of
fences inserted by hand, broken down into Store-Store, Load-Load, Dependent-Load,
and Aliased-Load fences.)
Typical Tool Performance
 Very efficient on small testcases (< 100 memory accesses)
Example (nonblocking queue): T0 = i (e | d) T1 = i (e | e | d | d )
- find counterexamples within a few seconds
- verify within a few minutes
- enough to cover all 9 fences in nonblocking queue
 Slows down with increasing number of memory accesses in test
Example (snark deque):
Dq = pop_l | pop_l | pop_r | pop_r | push_l | push_l | push_r | push_r
- has 134 memory accesses (77 loads, 57 stores)
- Dq finds second snark bug within ~1 hour
 Does not scale past ~300 memory accesses
Conclusions
 Software model checking of low-level concurrent
software requires encoding of memory models
 Challenge for model checking due to high level of
concurrency and axiomatic specifications
 Opportunity to find bugs in library code that’s hard to
design and verify
 CheckFence project at Penn
 SAT-based bounded model checking for concurrent data types
 Real bugs in real code with fences
Research Challenges
 What’s the best way to verify C code (on relaxed memory
models)?
 SAT-based encoding seems suitable to capture specifications of
memory models, but many opportunities for improvement
 Can one develop explicit-state methods for these applications?
 Language-level memory models: Can we verify Java concurrency libraries using the new Java memory model?
 Hardware support for transactional memory
 Current interest in industry and architecture research
 Can formal verification influence designs/standards?