Part 2: Incomplete techniques and bug finding
Race detection
Context bounding and Sequentialization
Odds and ends
“Two threads simultaneously access the same memory location, with at least one access being a write”
Most concurrent software is written to avoid data-races
Extremely tricky to write racy code that is correct
For racy code, correctness on one architecture and compiler does not imply correctness on all
Race detection is a very efficient, light-weight bug detection technique
Dynamic techniques: Work on actual executions
Lockset assumption: Whenever two different threads access a location (with one of them being a write), they will both hold a common lock.
Possible to write non-racy code violating this assumption
False positives from code that violates this assumption
for each tid: LocksHeld[tid] = ∅
for each v: Cands[v] = set of all locks
for each instruction in trace:
  // Maintain lockset
  if (instruction = lock(l))
    LocksHeld[tid] = LocksHeld[tid] ∪ { l }
  if (instruction = unlock(l))
    LocksHeld[tid] = LocksHeld[tid] \ { l }
  // Update candidate locks
  if (access to v by thread tid)
    Cands[v] := Cands[v] ∩ LocksHeld[tid]
    if (Cands[v] = ∅)
      Output “Potential race on v”
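To make the bookkeeping concrete, here is a minimal Python sketch of this lockset algorithm over a recorded trace; the (tid, op, target) event encoding is an assumption made for the example, not a standard trace format:

# Lockset (Eraser-style) race detection over a recorded trace.
# Events are (tid, op, target): op is 'lock'/'unlock' with a lock as target,
# or 'read'/'write' with a variable as target.

def lockset_races(trace, all_locks):
    locks_held = {}                      # LocksHeld[tid]
    cands = {}                           # Cands[v]
    races = set()
    for tid, op, target in trace:
        held = locks_held.setdefault(tid, set())
        if op == 'lock':
            held.add(target)
        elif op == 'unlock':
            held.discard(target)
        else:                            # read or write access to variable
            c = cands.setdefault(target, set(all_locks))
            c &= held                    # Cands[v] := Cands[v] ∩ LocksHeld[tid]
            if not c:
                races.add(target)        # "Potential race on v"
    return races

trace = [
    (1, 'lock', 'l'), (1, 'write', 'x'), (1, 'unlock', 'l'),
    (2, 'write', 'x'),                   # access to x without holding l
]
print(lockset_races(trace, {'l'}))       # {'x'}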
Based on a simple locking discipline
Each variable is protected by some lock
Thread 1:
  lock(l1)
  x := x + 4
  unlock(l1)
  lock(l2)
  y := y + 5
  unlock(l2)
  x := x - 3

Thread 2:
  lock(l1)
  lock(l2)
  x := x - y
  y := y + 4
  unlock(l2)
  unlock(l1)
  x := x - 3
Thread1:
  // Initialize
  g := 1729
  // Share
  lock(l)
  published := true
  unlock(l)

Thread2:
  lock(l)
  assume(published)
  unlock(l)
  y := g
  y := f(l, y);
  output(y)

Thread3:
  lock(l)
  assume(published)
  unlock(l)
  z := g
  z := g(l, z)
  output(z)
Simple locking discipline leads to false positives in many common cases
Lazy initialization
Initialize followed by read-only access
(State machine per variable:)
Virgin → Exclusive on the first write
Exclusive → Exclusive on r/w by the first thread
Exclusive → Shared on a read by a new thread
Exclusive → Shared-Modified on a write by a new thread
Shared → Shared-Modified on a write
for each access to v by thread tid:
  update State[v]
  if State[v] = Exclusive
    // Do nothing
  else if State[v] = Shared
    // Update lockset, don’t report any races
    Cands[v] := Cands[v] ∩ LocksHeld[tid]
  else if State[v] = Shared-Modified
    // Update lockset, report races
    Cands[v] := Cands[v] ∩ LocksHeld[tid]
    if (Cands[v] = ∅)
      Output “Potential race on v”
Works well for initialize-and-publish, lazy initialization, etc
Doesn’t work with ownership-transfer, etc
Thread1:
  // Initialize
  g := 1729
  // Share
  lock(l)
  published := true
  unlock(l)

Thread2:
  lock(l)
  assume(published)
  unlock(l)
  y := g
  y := f(l, y);
  output(y)

Thread3:
  lock(l)
  assume(published)
  unlock(l)
  z := g
  z := g(l, z)
  output(z)
Thread1:
  …
  await (Obj.owner == 1);
  Obj.foo()
  Obj.bar := baz();
  …
  Obj.owner := 2
  …

Thread2:
  …
  await (Obj.owner == 2);
  Obj.bar()
  Obj.foo := baz();
  …
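Putting the state machine and the lockset together, here is a minimal Python sketch (same illustrative trace encoding as before; treating any first access as the transition out of Virgin is a simplification of the diagram above):

# Refining lockset with a per-variable state machine (Eraser-style).

def refined_lockset(trace, all_locks):
    locks_held, cands, races = {}, {}, set()
    state, first = {}, {}                # per-variable state and first thread
    for tid, op, target in trace:
        held = locks_held.setdefault(tid, set())
        if op == 'lock':
            held.add(target)
            continue
        if op == 'unlock':
            held.discard(target)
            continue
        v = target
        st = state.get(v, 'Virgin')
        if st == 'Virgin':               # first access: owned by one thread
            st, first[v] = 'Exclusive', tid
        elif st == 'Exclusive' and tid != first[v]:
            st = 'Shared' if op == 'read' else 'Shared-Modified'
        elif st == 'Shared' and op == 'write':
            st = 'Shared-Modified'
        state[v] = st
        if st in ('Shared', 'Shared-Modified'):
            c = cands.setdefault(v, set(all_locks))
            c &= held                    # update lockset in both states
            if not c and st == 'Shared-Modified':
                races.add(v)             # report only in Shared-Modified
    return races

On the initialize-and-publish example above, g only reaches the Shared state (reads by new threads), so no race is reported; in the ownership-transfer example, the write to Obj.foo by the second thread drives the state to Shared-Modified with an empty lockset, producing the false positive.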
Proposed by Lamport in 1978 for distributed systems
Many race detection algorithms try to approximate the happens-before relation
Basic idea: two events are related if and only if communication allows information-flow between them
We write e_i → e_j for “event e_i happens-before event e_j”
Informally, if e_i → e_j, then e_i happens-before e_j in all variations of the trace
A race: two events e_i and e_j access the same location, and neither e_i → e_j holds nor e_j → e_i holds
If two events are from the same thread, the earlier one happens-before the later:
thread(e_i) = thread(e_j) ∧ i < j ⟹ e_i → e_j
Happens-before is transitive:
(e_i → e_j) ∧ (e_j → e_k) ⟹ (e_i → e_k)
Every synchronization gives some happens-befores:
LOCK: if e_i is an unlock and e_j is a later lock of the same lock, then e_i → e_j
WAIT/NOTIFY: if e_i is a notify and e_j is a corresponding wait, then e_i → e_j
…
Thread1:
  obj := new Foo()
  Notify(obj)
  Wait(obj)
  lock(l)
  obj.data = obj.data + 4
  unlock(l)

Thread2:
  data = readFile()
  Wait(obj)
  obj.data = data
  Notify(obj)
  lock(l)
  obj.data = obj.data - 4
  unlock(l)
…
The happens-before relation is usually very expensive to compute
Few dynamic techniques actually compute the full relation
Classical method proposed by Lamport himself
Vector clocks : with each event, associate a “vector clock” storing the last event from the other threads that affects it
e_i → e_j if and only if VC[e_j][thread(e_i)] >= e_i
VC[e] = [e_1, e_2, e_3, …, e_n]
e_1: last relevant event from thread 1
e_2: last relevant event from thread 2
…
VC[e][thread(e)] := e
// regular events
VC[e][tid] := VC[prev(e)][tid] (for tid ≠ thread(e))
Can be extended to other synchronization primitives!
// Acquire locks (LVC is the lock’s vector clock)
VC[e][tid] := max(VC[prev(e)][tid], LVC[tid])
// Release locks
VC[e][tid] := VC[prev(e)][tid]
LVC[tid] := max(VC[e][tid], LVC[tid])
…
Thread1:
T1_0: obj := new Foo()
T1_1: Notify(obj)
Thread2:
T2_0: data = readFile()
T2_1: Wait(obj)
T2_2: obj.data = data
T2_3: Notify(obj)
T1_2: Wait(obj)
T1_3: lock(l)
T1_4: obj.data = obj.data + 4
T1_5: unlock(l)
T2_4: lock(l)
T2_5: obj.data = obj.data - 4
T2_6: unlock(l)
…
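A small Python sketch of the vector-clock rules above, handling only lock/unlock edges (the Wait/Notify edges from the example would be handled analogously; the trace encoding is illustrative):

# Vector-clock maintenance for a trace with lock/unlock events.
# Each event gets a snapshot (tid, step, VC) so happens-before can be queried.

def vector_clocks(trace, threads):
    vc = {t: {u: 0 for u in threads} for t in threads}   # VC per thread
    lvc = {}                                             # LVC per lock
    step = {t: 0 for t in threads}
    events = []
    for tid, op, target in trace:
        step[tid] += 1
        vc[tid][tid] = step[tid]              # VC[e][thread(e)] := e
        if op == 'lock':                      # acquire: join with lock's clock
            l = lvc.setdefault(target, {u: 0 for u in threads})
            for u in threads:
                vc[tid][u] = max(vc[tid][u], l[u])
        elif op == 'unlock':                  # release: lock's clock joins ours
            l = lvc.setdefault(target, {u: 0 for u in threads})
            for u in threads:
                l[u] = max(l[u], vc[tid][u])
        events.append((tid, step[tid], dict(vc[tid])))
    return events

def happens_before(e1, e2):
    """e_i -> e_j iff VC[e_j][thread(e_i)] >= e_i."""
    tid1, i, _ = e1
    _, _, vc2 = e2
    return vc2[tid1] >= i

trace = [
    (1, 'lock', 'l'), (1, 'write', 'x'), (1, 'unlock', 'l'),
    (2, 'lock', 'l'), (2, 'write', 'x'), (2, 'unlock', 'l'),
]
events = vector_clocks(trace, [1, 2])
print(happens_before(events[1], events[4]))   # True: writes ordered via l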
Every race (or false race) reported by happens-before-based methods is also reported by lockset-based methods
Fewer false positives, potential false negatives
Why? Not every happens-before edge is a true synchronization
Thread1:
  y = y + 1
  lock(l)
  x = x + 1
  unlock(l)

Thread2:
  lock(l)
  x = x + 1
  unlock(l)
  y = y + 1
First line of defence for most concurrent programs
Many bugs just show up as race conditions
Lockset is fast
Lots of false positives
Happens-before is slow
Reports only true data races
Potential false negatives
There are hybrid techniques
Compute approximations of LockSet and HB
Folk knowledge : Most concurrency bugs are shallow in terms of required context-switches
Most bugs need only small fixes
Most concurrency bugs are atomicity violations or order violations
For an empirical study, see Shan Lu et al. 2006…2008
Why not check concurrent programs only up to a few context switches?
Much more efficient
Culmination of techniques proposed by Qadeer et al. in 2004
Correctness primarily given by assertions in the code
Can also use monitors
Can detect data-races, deadlocks, etc
Main idea: Use a scheduler that explores traces of the program deterministically, prioritizing traces having few context-switches
Sources of non-determinism:
Input
Scheduling
Timing and library
Input non-determinism controlled by specifying fixed inputs
Scheduling non-determinism controlled by writing deterministic scheduler
Library non-determinism: model library code
(Diagram: n threads Thread1, …, Threadn, each executing the same k steps from x = 1 to y = k)
Additionally, scheduler can use polynomial amount of space
• Remember c spots for context switches
• Permutations of the n+c atomic blocks
Exploring k steps in each of the n threads:
Number of executions is O(n^(nk))
Exploring k steps in each thread, but only c context-switches:
Number of executions is O((n²k)^c · n!)
Not exponential in k
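These counts can be sanity-checked by brute force. A Python sketch, under the assumption that switching away from a finished thread is free (so c = 0 leaves exactly the n! run-to-completion orders):

from functools import lru_cache

def count_schedules(n, k, c):
    """Complete schedules of n threads (k steps each) using at most c
    pre-emptive context switches."""
    @lru_cache(maxsize=None)
    def go(remaining, cur, switches):
        if sum(remaining) == 0:
            return 1
        total = 0
        for t in range(n):
            if remaining[t] == 0:
                continue
            # Staying on the current thread, starting up, or leaving a
            # finished thread costs nothing; a pre-emption costs 1.
            free = cur is None or t == cur or remaining[cur] == 0
            if not free and switches == c:
                continue                  # would exceed the bound
            nxt = list(remaining)
            nxt[t] -= 1
            total += go(tuple(nxt), t, switches + (0 if free else 1))
        return total
    return go(tuple([k] * n), None, 0)

for c in range(4):
    print(c, count_schedules(3, 4, c))    # grows with c, not with k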
void Deposit100() {
  ChessSchedule();
  EnterCriticalSection(&cs);
  balance += 100;
  ChessSchedule();
  LeaveCriticalSection(&cs);
}

void Withdraw100() {
  int t;
  ChessSchedule();
  EnterCriticalSection(&cs);
  t = balance;
  ChessSchedule();
  LeaveCriticalSection(&cs);
  ChessSchedule();
  EnterCriticalSection(&cs);
  balance = t - 100;
  ChessSchedule();
  LeaveCriticalSection(&cs);
}
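To see what the inserted ChessSchedule() calls buy, here is a minimal Python analogue (not the CHESS implementation): generators stand in for threads, each yield marks a potential pre-emption point, and the harness enumerates schedules deterministically. Locking is elided for brevity:

from itertools import product

balance = 0

def deposit100():
    global balance
    t = balance
    yield                    # potential pre-emption point
    balance = t + 100

def withdraw100():
    global balance
    t = balance
    yield                    # potential pre-emption point
    balance = t - 100

def run(schedule):
    """Run both 'threads' step by step under a fixed schedule."""
    global balance
    balance = 0
    threads = [deposit100(), withdraw100()]
    for tid in schedule:
        try:
            next(threads[tid])
        except StopIteration:
            pass             # raised once the thread runs past its last yield
    return balance

# Each thread takes exactly two steps; enumerate all complete schedules.
schedules = [s for s in product([0, 1], repeat=4) if s.count(0) == 2]
print({run(s) for s in schedules})   # {0, 100, -100}: lost updates exposed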
Heuristics: More pre-emption points in critical code, etc
Coverage guarantee: When 2 context-switches are explored, every remaining bug requires at least 3 context-switches
Build a deterministic scheduler
Complications: fairness and livelocks, weak memory models
Advantages:
Runs real code on real systems
Only scheduler has been replaced
Disadvantages:
Is mostly program agnostic
Exhaustive testing
CHESS approach: concurrent program + bound on context switches → explore all interleavings
General sequentialization approach: concurrent program + bound on context switches → sequential program
Then, verify sequential program using your favourite verification technique
Many flavours of context-bounded analysis:
PDS based (Qadeer et al.)
Transformation based sequentialization: Eager, Lazy (Lal et al.)
BMC based (Parlato et al.)
What is hard about sequentialization?
Have to remember local variables across phases (though they don’t change)
If exploring T1 T2 T1, have to remember locals of T1 across the phase of T2
Lal-Reps 2008: Instead, do a source-to-source transformation
Copy each statement and global variable c times
Now, we can explore T1 T1 T2 instead of T1 T2 T1
Only one thread’s local variables are relevant at each stage
Replace each global variable X by X[tid][0..K]
X[tid][i] represents the value of the global variable X the i-th time thread tid is scheduled

X := X + 1 becomes:

if (phase = 0)
  X[tid][0] := X[tid][0] + 1
else if (phase = 1)
  X[tid][1] := X[tid][1] + 1
…
else if (phase = K)
  X[tid][K] := X[tid][K] + 1

// After each statement, non-deterministically move to the next phase
if (phase < K && *)
  phase++

// At the end of a thread’s code, schedule the next thread
if (phase == K + 1)
  phase = 1
  Thread[tid+1]()
A program (T1 || T2) is rewritten into Seq(T1); Seq(T2); check(), where check() is:

for phase = 0 to K
  if (phase > 0)
    assume (X[0][phase] == X[N][phase – 1])
  for tid = 1 to N
    assume (X[tid][phase] == X[tid-1][phase])
Roughly,
Execute each thread sequentially
But, at random points, guess new values for global variables
In the end, check the guessed new values are consistent
Each green arrow in the diagram below (connecting the end of one phase to the guessed start of the next) is one part of the check!
Thread 0:
…
…
X[0][0] := X[0][0] + 1
…
…
…
X[0][1] := X[0][1] + 1
…
…
X[0][2] := X[0][2] + 1
…
Thread 1:
…
…
X[1][0] := X[1][0] + 1
…
…
…
X[1][1] := X[1][1] + 1
…
…
X[1][2] := X[1][2] + 1
…
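The guess-and-check recipe can be prototyped by brute force. In this Python sketch (a toy stand-in for the real transformation, which targets a verifier rather than enumeration), the guessed start values play the role of the X[tid][phase] copies and the final consistency checks are the green arrows:

from itertools import combinations, product

K = 1                     # context-switch bound: K+1 phases per thread
INIT = 0                  # initial value of the shared global x
DOMAIN = range(8)         # finite domain for the brute-forced guesses

# Toy threads over one shared global x; statements are functions x -> x.
threads = [
    [lambda x: x + 1, lambda x: x + 1],    # thread 0: x := x+1; x := x+1
    [lambda x: x * 2],                     # thread 1: x := 2*x
]
N = len(threads)

def splits(stmts):
    """All ways to cut a thread's statements into K+1 consecutive phases."""
    n = len(stmts)
    for cuts in combinations(range(n + 1), K):
        b = [0, *cuts, n]
        yield [stmts[b[i]:b[i + 1]] for i in range(K + 1)]

reachable = set()
for split in product(*(list(splits(t)) for t in threads)):
    for guess in product(DOMAIN, repeat=N * (K + 1)):
        # start[t][p]: guessed value of x when thread t starts phase p.
        start = [list(guess[t*(K+1):(t+1)*(K+1)]) for t in range(N)]
        final = [[0] * (K + 1) for _ in range(N)]
        for t in range(N):                   # run each thread sequentially
            for p in range(K + 1):
                x = start[t][p]
                for stmt in split[t][p]:
                    x = stmt(x)
                final[t][p] = x
        # Consistency: every guess equals the previous phase's final value.
        ok = start[0][0] == INIT
        for p in range(K + 1):
            for t in range(1, N):
                ok = ok and start[t][p] == final[t - 1][p]
            if p > 0:
                ok = ok and start[0][p] == final[N - 1][p - 1]
        if ok:
            reachable.add(final[N - 1][K])

print(sorted(reachable))   # [2, 3, 4]: values of x reachable within the bound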
The original Lal/Reps technique uses summarization for verification of the sequential program
Compute summaries for the relation of initial and final values of global variables
Extremely powerful idea
Advantage: Reduces the need to reason about locals of different threads
No need to reason explicitly about interleavings
Interleavings encoded into data (variables)
Scales linearly with number of threads
Currently, the best tools in the concurrency verification competitions use “sequentialization + BMC”
The previous sequentialization technique is better suited for analysis techniques, not model checking
No additional advantage to introducing extra copies of globals and then checking them for consistency
Instead, just explicitly use non-determinism
First, rewrite threads by unrolling loops and inlining function calls
No loops
No function calls
Forward only control flow
Write a driver “main” function to schedule the threads one by one
Main driver:

pc0 = 0, …, pcn = 0
main() {
  for (r = 0; r < K; r++)
    for (i = 0; i < n; i++)
      threadi();
}
• The resume mechanism jumps to the “right” spot in the thread
• There is a potential context switch before each statement
• What’s the problem? Lots of jumps in the control flow
• Bad for SMT encoding

threadi():
  switch(pci) {
    case 0: goto 0;
    case 1: goto 1;
    …
  }
  0: CS(0); stmt0;
  1: CS(1); stmt1;
  …
  M: CS(M); stmtM;

CS(j) := if (*) { pci = j; return }
Main driver:

pc0 = 0, …, pcn = 0
main() {
  for (r = 0; r < K; r++)
    for (i = 0; i < n; i++) {
      nextCS = *
      assume (nextCS >= pci)
      threadi();
      pci = nextCS
    }
}

• Avoids the multiple control-flow-breaking jumps
• Restricts non-determinism to one spot

threadi():
  0: CS(0); stmt0;
  1: CS(1); stmt1;
  …
  M: CS(M); stmtM;

CS(j) := if (j < pci || j >= nextCS) { goto j+1; }
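A runnable Python rendering of this window-based driver on a toy pair of unrolled threads (the statement encoding is an assumption for illustration); unlike the guess-and-check sketch earlier, no values are guessed, and the shared state flows through the rounds directly:

from itertools import combinations_with_replacement, product

K = 2                                    # number of rounds

def read(var, tmp):
    def stmt(s): s[tmp] = s[var]
    return stmt

def write_inc(var, tmp):
    def stmt(s): s[var] = s[tmp] + 1
    return stmt

# Two unrolled, loop-free threads, each doing a non-atomic x := x + 1.
threads = [
    [read('x', 't0'), write_inc('x', 't0')],
    [read('x', 't1'), write_inc('x', 't1')],
]

def windows(thread):
    """Nondecreasing nextCS choices, one per round; finish by round K."""
    n = len(thread)
    for w in combinations_with_replacement(range(n + 1), K):
        if w[-1] == n:
            yield w

def run(ws):
    store = {'x': 0, 't0': 0, 't1': 0}
    pcs = [0] * len(threads)
    for r in range(K):                        # main driver: rounds
        for i, thread in enumerate(threads):  # threads one by one
            next_cs = ws[i][r]                # the single guess per turn
            for j in range(pcs[i], next_cs):  # CS(j) skips j outside window
                thread[j](store)
            pcs[i] = next_cs
    return store['x']

results = {run(ws) for ws in product(*(list(windows(t)) for t in threads))}
print(results)   # {1, 2}: the lost update shows up within the round bound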
Host of related techniques
Can be adapted for analysis, model checking, testing, etc
Different techniques need different kinds of tuning
Basic idea: Most bugs require few context switches to turn up
Can leverage standard sequential program analysis techniques
Things we didn’t cover
In many cases we don’t want to write assertions
Just want the concurrent program to do the same thing as a sequential program does
(Diagram: a concurrent execution, “Conc. Exec”, checked against a sequential execution, “Seq. Exec”)
Standard correctness conditions
Linearizability [Herlihy/Wing 91]
Serializability [Papadimitriou and others, 70s]
(Diagram: overlapping concurrent calls to Method 0, Method 1, Method 2, Method 3, linearized into the sequential order Method 0, Method 3, Method 1, Method 2)
Root cause of bugs
Ordering violations
Atomicity violations
Data races
Coverage metrics and coverage guided search
Define-use pairs [Tasiran et al]
Find ordering violations based on define-use orderings
HaPSet [Wang et al]
Find interesting interleavings by trying to cover all “immediate histories” of events
CUTE/jCUTE [Sen et al]
Concolic testing: accumulate constraints along a test run to guide future test runs
Analyze variations of the given concurrent trace
Run a test and record information
Build a predictive model by relaxing scheduling constraints
Analyze predictive model for alternate interleavings
Can flag false bugs
Symbolic predictive analysis
From a trace, build a precise predictive model (as an SMT formula)
No false bugs
Brief overview of concurrent verification techniques
Lecture 1: Full proof techniques
Lecture 2: Incomplete techniques / Bug finding
What did we learn?
Full verification is hard; not many techniques for weak-memory architectures
Use light-weight and incomplete techniques to detect shallow bugs
Code using a strict concurrency discipline is more likely to be correct, easier to verify