CFix - NCSU COE People

Diagnosing and Fixing Concurrency Bugs
Presented by Tao Wang
Credits to Dr. Guoliang Jin, Computer Science, NC STATE
We need reliable software
 People’s daily life now depends on reliable software
 Software companies spend lots of resources on debugging
 More than 50% effort on finding and fixing bugs
 Around $300 billion per year
Concurrency bugs hurt
 It is an increasingly parallel world
 Concurrency bugs in history
Multi-threaded program
 Concurrent programs under the shared-memory model
 Programs execute multiple interacting threads in parallel
 Threads communicate via shared memory
 Shared-memory accesses should be well-synchronized
[Figure: thread1 to thread4 run on core1 to core4 of a multicore chip; each core has its own cache, and all cores share memory]
An example of a concurrency bug
Thread 1
if (ptr != NULL) {
ptr->field = 1;
}
Thread 2
ptr = NULL;
The interleaving space is huge
A good interleaving: Thread 1 finishes before Thread 2 runs
Thread 1: if (ptr != NULL) {
Thread 1: ptr->field = 1;
Thread 1: }
Thread 2: ptr = NULL;
A bad interleaving: Thread 2 runs between the check and the use
Thread 1: if (ptr != NULL) {
Thread 2: ptr = NULL;
Thread 1: ptr->field = 1; (segmentation fault)
Thread 1: }
Previous research focuses on finding bugs
Bug fixing
 Software quality does not improve until bugs are fixed
 Manual concurrency bug fixing is
 time-consuming: 73 days on average
 error-prone: 39% of patches are buggy in the first release
 CFix: automated concurrency-bug fixing [PLDI’11*, OSDI’12]
 Program behaves correctly if bad interleavings do not occur
 Fix concurrency bugs by disabling bad interleavings
*SIGPLAN: “one of the first papers to attack the problem of automated bug fixing”
The interleaving space (again)
[Figure: a huge interleaving space containing several regions of bad interleavings]
 Bad interleavings lead to production-run failures
Failure diagnosis
 Failures still happen in production runs
 The reason behind failure needs to be understood
 Tools dealing with production runs demand low overhead
 Diagnostic information needs to be informative
 Production-run concurrency-bug failure diagnosis
 Design new monitoring schemes and sampling strategies
 CCI: a pure software solution [OOPSLA’10]
 PBI, LXR: hardware-assisted solutions [ASPLOS’13 & 14]
My work on concurrency bugs
 Bug Detection and Software Testing: ConSeq [ASPLOS’11]
 Production-Run Failure Diagnosis: CCI/PBI/LXR [OOPSLA’10, ASPLOS’13 & 14]
 Automated Concurrency-Bug Fixing: CFix [PLDI’11*, OSDI’12]
*Received a SIGPLAN CACM nomination
Outline
 Motivation and Overview
 Automated Concurrency-Bug Fixing
 The problem and idea
 Overview
 Internals of CFix
 Evaluation and summary
Automated fixing is difficult
Description (Symptom, Triggering condition, …) → ? → Patch (Correctness, Performance, Simplicity)
 What is the correct behavior?
 Usually requires developers’ knowledge
 How to get the correct behavior?
 Correct program states under bug-triggering inputs
 No change to program states under other inputs
CFix’s insights
Description (Symptom, Triggering condition, …) → ? → Patch (Correctness, Performance, Simplicity)
 What is the correct behavior?
 The program state is correct
as long as the buggy interleaving does not occur
 How to get the correct behavior?
 Only need to disable failure-inducing interleavings
 Can leverage well-defined synchronization operations
Description: interleavings that lead to software failure (Symptom, Triggering condition, …)
 atomicity violation detectors (p, c, r): ParkASPLOS’09, FlanaganPOPL’04, LuASPLOS’06, ChewEuroSys’10
 order violation detectors (A, B): ZhangASPLOS’10, LuciaMICRO’09, YuISCA’09, GaoASPLOS’11
 data race detectors (I1, I2): SenPLDI’08, SavageTOCS’97, YuSOSP’05, EricksonOSDI’10, KasikciASPLOS’10
 abnormal data flow detectors (Wb, R, Wg): ZhangASPLOS’11, ShiOOPSLA’10
Patch: Correctness, Performance, Simplicity
How to get a general solution that generates good patches?
CFix
Input: bug reports (descriptions of interleavings that lead to software failure) and source code
Fix-Strategy Design → Synchronization Enforcement → Patch Testing & Selection → Patch Merging → Run-time Support
Fix strategies (mutual exclusion, order, ...) → patched binaries → selected binaries → merged binary → final patched binary
Patch goals: Correctness, Performance, Simplicity
Fix-strategy design: what to fix
Challenges:
 Huge variety of bugs
Two types of Concurrency bugs
Atomicity violation
Order violation
 Why these two?
 Real-world concurrency bug characteristics study [SHAN ASPLOS’08]: 97% are either atomicity violations or order violations
 Either can be fixed by mutual exclusion or order enforcement
Fix-strategy design: how to fix
Challenges:
 Inaccurate root cause
Atomicity violation
Thread 1
if (ptr != NULL) { /* p */
ptr->field = 1; /* c */
}
Thread 2
ptr = NULL; /* r */
Fix-strategy for atomicity violation
Thread 1: if (ptr != NULL) {
Thread 2: ptr = NULL;
Thread 1: ptr->field = 1;
Thread 1: }
CFix: fix-strategy design
Challenges:
 Inaccurate root cause
 Huge variety of bugs
Solution:
 A combination of mutual exclusion & order relationship enforcement
Fix-strategies Overview
 AV Detector: p, c, r
 OV Detector: A, B
 Race Detector: I1, I2
 DU Detector: Wb, R, Wg
CFix: synchronization enforcement
Challenges:
 Correctness
 Performance
 Simplicity
Solution:
 Mutual exclusion enforcement: AFix [PLDI’11]
 Order relationship enforcement: OFix [OSDI’12]
Atomicity violation fixing
 Input: three statements (p, c, r) with contexts
Thread 1
if (ptr != NULL) { /* p */
ptr->field = 1; /* c */
}
Thread 2
ptr = NULL; /* r */
 Idea: make the code region from p to c mutually exclusive with r
Mutual exclusion enforcement: AFix
 Approach: lock (protect the p–c region and r with the same lock)
 Goal:
 Correctness: paired lock acquisition and release operations
 Performance: Make the critical section as small as possible
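A minimal sketch of what such a mutual-exclusion patch could look like for the running example, assuming POSIX threads. The names `afix_lock`, `struct node`, `thread1`, `thread2`, and `run_patched_example` are hypothetical and are not taken from AFix itself; the point is only that the p–c region and r acquire the same lock, with lock and unlock paired on every path.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct node { int field; };

static struct node storage;
static struct node *ptr = &storage;  /* the shared pointer from the example */
static pthread_mutex_t afix_lock = PTHREAD_MUTEX_INITIALIZER; /* hypothetical patch lock */

/* Thread 1: the region from p to c runs under the lock. */
static void *thread1(void *arg) {
    (void)arg;
    pthread_mutex_lock(&afix_lock);
    if (ptr != NULL) {      /* p */
        ptr->field = 1;     /* c */
    }
    pthread_mutex_unlock(&afix_lock);
    return NULL;
}

/* Thread 2: r takes the same lock, so it cannot run between p and c. */
static void *thread2(void *arg) {
    (void)arg;
    pthread_mutex_lock(&afix_lock);
    ptr = NULL;             /* r */
    pthread_mutex_unlock(&afix_lock);
    return NULL;
}

/* Runs both threads once; returns 0 if thread creation succeeded. */
int run_patched_example(void) {
    pthread_t t1, t2;
    if (pthread_create(&t1, NULL, thread1, NULL) != 0)
        return -1;
    if (pthread_create(&t2, NULL, thread2, NULL) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```

Whichever thread wins the lock first, the store at c can no longer observe a pointer that was nulled after the check at p. Calling `run_patched_example()` from a `main` and linking with the thread library is enough to try it.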
A naïve solution
 Add lock on edges reaching p
 Add unlock on edges leaving c
 Potential new bugs
 Could lock without unlock
 Could unlock without lock
 etc.
The AFix solution
 Assume p and c are in the same function f
 Step 1: find protected nodes in critical section
 Step 2: add lock operations
 lock on each edge from an unprotected node to a protected node
 unlock on each edge from a protected node to an unprotected node
 Avoid those potential bugs mentioned
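The two steps can be sketched on a tiny control-flow graph. This is an illustration under assumptions, not AFix's implementation: the adjacency-matrix `struct cfg`, the helper names, and the rule "a node is protected if it is reachable from p and can reach c" are choices made here for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_NODES 16

/* A small control-flow graph as an adjacency matrix (hypothetical representation). */
struct cfg {
    int n;
    bool edge[MAX_NODES][MAX_NODES];
};

/* Marks every node reachable from start, following edges forward,
 * or backward when reverse is true. */
static void reach(const struct cfg *g, int start, bool reverse, bool seen[MAX_NODES]) {
    int stack[MAX_NODES], top = 0;
    memset(seen, 0, sizeof(bool) * MAX_NODES);
    seen[start] = true;
    stack[top++] = start;
    while (top > 0) {
        int u = stack[--top];
        for (int v = 0; v < g->n; v++) {
            bool e = reverse ? g->edge[v][u] : g->edge[u][v];
            if (e && !seen[v]) { seen[v] = true; stack[top++] = v; }
        }
    }
}

/* Step 1: a node is protected if it is reachable from p and can reach c. */
void find_protected(const struct cfg *g, int p, int c, bool prot[MAX_NODES]) {
    bool from_p[MAX_NODES], to_c[MAX_NODES];
    assert(p >= 0 && p < g->n && c >= 0 && c < g->n);
    reach(g, p, false, from_p);
    reach(g, c, true, to_c);
    for (int v = 0; v < MAX_NODES; v++)
        prot[v] = v < g->n && from_p[v] && to_c[v];
}

/* Step 2: +1 means "insert lock on this edge", -1 means "insert unlock",
 * 0 means no operation (or no such edge). */
int edge_action(const struct cfg *g, const bool prot[MAX_NODES], int u, int v) {
    if (!g->edge[u][v]) return 0;
    if (!prot[u] && prot[v]) return +1;
    if (prot[u] && !prot[v]) return -1;
    return 0;
}

/* CFG of Thread 1 in the running example:
 * 0 = entry, 1 = p (the check), 2 = c (the store), 3 = exit. */
struct cfg example_cfg(void) {
    struct cfg g;
    memset(&g, 0, sizeof g);
    g.n = 4;
    g.edge[0][1] = g.edge[1][2] = g.edge[1][3] = g.edge[2][3] = true;
    return g;
}
```

On this graph the protected set is {p, c}. The edge p → exit (the check fails) gets an unlock, which is exactly the path on which the naïve "unlock on edges leaving c" rule would leave the lock held.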
Subtle details
 p and c adjustment when they are in different functions
 Observation: people put lock and unlock in one function
 Find the longest common prefix of p’s and c’s stack traces
 Adjust p and c accordingly
 Put r into a critical section
 Do nothing if we can reach r from the p–c critical section
 Lock type:
 Lock with timeout: if critical section has blocking operations
 Reentrant lock: if recursion is possible within critical section
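The stack-trace adjustment can be illustrated with a small helper. This is a simplified sketch: it compares frames by function name, outermost frame first, whereas a real tool would more likely compare call sites. The function name `common_prefix_len` and the frame names in the usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns how many leading frames (outermost first) two call stacks share.
 * The last shared frame is the function in which both the lock and the
 * unlock can be placed. */
size_t common_prefix_len(const char *const *a, size_t na,
                         const char *const *b, size_t nb) {
    size_t i = 0;
    while (i < na && i < nb && strcmp(a[i], b[i]) == 0)
        i++;
    return i;
}
```

For example, if p is reached through `main → handle → check_ptr` and c through `main → handle → store_field`, the common prefix has length 2, so p and c would be adjusted to the two call sites inside `handle`.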
OFix: two order relationships
 allA-B: B must wait for all instances of A (A1 … An), e.g., A = use, B = destroy
 firstA-B: B must wait for the first instance of A, e.g., A = initialization, B = read
OFix allA-B enforcement
 Approach: condition variable and flag
 Insert signal operations in A-threads
 Insert wait operation before B
 Rules
 A-thread signals exactly once when it will not execute more A
 A-thread signals as soon as possible
 B proceeds when each A-thread has signaled
OFix allA-B enforcement: A side
How to identify the last A instance in one thread
. . .;
for (. . .)
. . . ; // A
. . .;
 Each thread that executes A signals exactly once, as soon as it can execute no more A
OFix allA-B enforcement: A side
How to identify the last thread that executes A
void main() {
for (. . .)
thread_create(thr_main); /* counter++ */
. . .;
}
void thr_main() {
for (. . .)
. . . ; // A
. . .;
}
counter (counter for signal threads): starts at 1, incremented at each thread_create, decremented in ofix_signal
void ofix_signal() {
mutex_lock(L);
counter--;
if (counter == 0)
cond_broadcast(con);
mutex_unlock(L);
}
OFix allA-B enforcement: B side
 B is safe to execute only when the counter for signal threads is 0
void ofix_wait() {
mutex_lock(L);
if (counter != 0)
cond_timedwait(con, L, t);
mutex_unlock(L);
}
 Give up if OFix knows that it introduces new deadlock
 Timed wait-operation to mask potential deadlocks
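A runnable sketch of the allA-B scheme, assuming POSIX threads. The variable name `counter`, the helper `ofix_thread_created`, the demo variable `a_done`, the driver `run_all_a_before_b`, and the 5-second timeout are assumptions made for illustration. Unlike the slide's `if`, the wait uses a `while` loop so that a spurious wakeup does not let B proceed early.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t con = PTHREAD_COND_INITIALIZER;
static int counter = 1;  /* assumed name; 1 accounts for the thread that creates A-threads */
static int a_done = 0;   /* demo only: number of A instances executed so far */

/* Called right before each creation of a thread that may execute A. */
static void ofix_thread_created(void) {
    pthread_mutex_lock(&L);
    counter++;
    pthread_mutex_unlock(&L);
}

/* Called exactly once by a thread when it can execute no more A. */
static void ofix_signal(void) {
    pthread_mutex_lock(&L);
    counter--;
    if (counter == 0)
        pthread_cond_broadcast(&con);
    pthread_mutex_unlock(&L);
}

/* Called before B. Returns 0 if every A-thread has signaled, 1 on timeout. */
static int ofix_wait(long timeout_sec) {
    struct timespec deadline;
    int rc = 0, pending;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_sec;
    pthread_mutex_lock(&L);
    while (counter != 0 && rc == 0)  /* loop guards against spurious wakeups */
        rc = pthread_cond_timedwait(&con, &L, &deadline);
    pending = (counter != 0);
    pthread_mutex_unlock(&L);
    return pending;
}

static void *thr_main(void *arg) {
    (void)arg;
    for (int i = 0; i < 3; i++) {
        pthread_mutex_lock(&L);
        a_done++;                /* A */
        pthread_mutex_unlock(&L);
    }
    ofix_signal();               /* this thread executes no more A */
    return NULL;
}

/* Creates two A-threads, then runs B. Returns the number of A instances
 * that B observed, or -1 if the wait timed out. Intended to be called once. */
int run_all_a_before_b(void) {
    pthread_t t[2];
    for (int i = 0; i < 2; i++) {
        ofix_thread_created();
        pthread_create(&t[i], NULL, thr_main, NULL);
    }
    ofix_signal();               /* the creator will create no more A-threads */
    int timed_out = ofix_wait(5);
    pthread_mutex_lock(&L);
    int seen = a_done;           /* B */
    pthread_mutex_unlock(&L);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return timed_out ? -1 : seen;
}
```

Starting the counter at 1 and having the creating thread signal after its creation loop keeps the counter from reaching 0 while more A-threads may still be created. The timed wait means that a patch which introduces a deadlock shows up as a time-out rather than a hang.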
OFix firstA-B
 Basic enforcement: signal after A, wait before B
 When A may not execute
 Add a safety-net signal using the allA-B algorithm
CFix: patch testing & selection
Challenge:
 Multi-threaded software testing
Solution:
 CFix-patch oriented testing
Patch testing principles
 Two ideas:
 No exhaustive testing, but patch oriented testing
 Leverage existing testing techniques, with extra heuristics
 The work-flow
 Step 1 Prune incorrect patches
• Patches causing failures due to wrong fix strategies, etc
 Step 2 Prune slow patches
 Step 3 Prune complicated patches
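The three pruning steps can be read as a filter over candidate patches. The sketch below is only one way to arrange them: the `struct patch_result` fields, the overhead threshold, and the "fewest synchronization operations" tie-break are assumptions for illustration, and CFix's actual selection criteria may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of testing one candidate patch. */
struct patch_result {
    int failed;           /* nonzero if a test run failed or timed out */
    double overhead_pct;  /* measured slowdown relative to the original */
    int sync_ops;         /* number of inserted synchronization operations */
};

/* Returns the index of the selected patch, or -1 if all candidates are pruned. */
int select_patch(const struct patch_result *c, size_t n, double max_overhead_pct) {
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (c[i].failed)
            continue;                                   /* step 1: incorrect */
        if (c[i].overhead_pct > max_overhead_pct)
            continue;                                   /* step 2: slow */
        if (best < 0 || c[i].sync_ops < c[best].sync_ops)
            best = (int)i;                              /* step 3: prefer simpler */
    }
    return best;
}
```

With candidates `{failed, 40% overhead, 6 ops, 4 ops}` and a 10% threshold, the first is pruned as incorrect, the second as slow, and the 4-operation patch is preferred over the 6-operation one.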
Run once without external perturbation
 Reject if there is a time-out or failure
 Patches fixing the wrong root cause
 Make the software fail deterministically
Thread 1
Thread 2
ptr->field = 1;
ptr = NULL;
ptr->field = 1;
Implicit bad patch
 A failure in patch_b implies a failure in patch_a
 If patch_a is less restrictive than patch_b
a
Mutual Exclusion
b
c
Order Relationships
 Helpful to prune patch_a
 Traditional testing may not find the failure in patch_a
CFix: patch merging
Challenge:
 One single programming mistake usually leads to multiple bug reports
Solution:
 Heuristics to merge patches
An example with multiple reports
void buf_write() {
int tmp = buf_len + str_len; /* p1 */
if (tmp > MAX)
return;
memcpy(&buf[buf_len], str, str_len); /* c1, p2 */
buf_len = tmp; /* r1, c2, r2 */
}
 Too many lock/unlock operations
 Potential new deadlocks
 May hurt performance and simplicity
Related patch: a case of AFix
 Merge if p, c, or r is in some other patch’s critical sections
lock(L1)
p1
lock(L2)
p2
c1
unlock(L1)
c2
unlock(L2)
unlock(L1)
lock(L1)
r1
unlock(L1)
lock(L1)
lock(L2)
r2
unlock(L2)
unlock(L1)
The merged patch for the example
void buf_write() {
int tmp = buf_len + str_len; /* p1 */
if (tmp > MAX) {
return;
}
memcpy(&buf[buf_len], str, str_len); /* c1, p2 */
buf_len = tmp; /* r1, c2, r2 */
}
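A hedged sketch of how a merged patch for this function might look, assuming POSIX threads and a single lock covering both reported regions. The lock name `merged_lock`, the buffer size, and the changed signature (parameters and a return code, added so the function can be exercised on its own) are assumptions and do not come from the slides.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define MAX 16

static char buf[MAX];
static int buf_len = 0;
static pthread_mutex_t merged_lock = PTHREAD_MUTEX_INITIALIZER; /* hypothetical single lock */

/* Appends str_len bytes of str to buf.
 * Returns 0 on success, -1 if the data would not fit. */
int buf_write(const char *str, int str_len) {
    assert(str_len >= 0);
    pthread_mutex_lock(&merged_lock);
    int tmp = buf_len + str_len;                 /* p1 */
    if (tmp > MAX) {
        pthread_mutex_unlock(&merged_lock);      /* unlock on the early-return path */
        return -1;
    }
    memcpy(&buf[buf_len], str, (size_t)str_len); /* c1, p2 */
    buf_len = tmp;                               /* r1, c2, r2 */
    pthread_mutex_unlock(&merged_lock);
    return 0;
}
```

One lock with one acquisition per call replaces the nested L1/L2 pairs, which is the kind of reduction in lock and unlock operations that merging aims for. Since the conflicting accesses r1 and r2 come from other threads running the same function, they are covered by the same lock.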
CFix: run-time support
 To understand whether there is a deadlock underlying a time-out
 Low overhead, and suitable for production runs
Evaluation methodology
 Applications: PBZIP2, x264, FFT, HTTrack, Mozilla-1, transmission, ZSNES, Apache, MySQL-1, MySQL-2, Mozilla-2, Cherokee, Mozilla-3
 Bug detectors: AV Detector, OV Detector, RA Detector, DU Detector
Evaluation result
[Table: one row per application (PBZIP2, x264, FFT, HTTrack, Mozilla-1, transmission, ZSNES, Apache, MySQL-1, MySQL-2, Mozilla-2, Cherokee, Mozilla-3); columns: AV Detector, OV Detector, RA Detector, DU Detector, # of Ops]
Summary
 Software reliability is critical
 Fixing concurrency bugs is costly and error-prone
 CFix uses some heuristics, with good results in practice
 A combination of mutual exclusion and order enforcement
 Use testing to select the best patch
 Fix root cause without requiring detectors to report it
 Small overhead and good simplicity
Questions ?
Thank you