Diagnosing and Fixing Concurrency Bugs Presented by Tao Wang Credits to Dr. Guoliang Jin, Computer Science, NC STATE We need reliable software People’s daily life now depends on reliable software Software companies spend lots of resources on debugging More than 50% effort on finding and fixing bugs Around $300 billion per year 2 Concurrency bugs hurt It is an increasingly parallel world Concurrency bugs in history 3 Multi-threaded program Concurrent programs under the shared-memory model Programs execute multiple interacting threads in parallel Threads communicate via shared memory Shared-memory accesses should be well-synchronized thread1 thread2 thread3 thread4 core1 core2 core3 core4 cache cache cache cache Multicore chip shared memory 4 An example The interleaving of concurrency space bug Thread 1 Thread 2 if (ptr != NULL) { ptr->field = 1; } Huge ptr = NULL; Interleaving space Thread 1 if (ptr != NULL) { ptr->field = 1; } 5 Thread 2 ptr = NULL; Bad Thread 1 Thread 2 interleavings if (ptr != NULL) { ptr = NULL; ptr->field = 1; } Segmentation Fault Previous research focuses on finding Bug fixing Software quality does not improve until bugs are fixed Manual concurrency bug fixing is time-consuming: 73 days on average error-prone: 39% patches are buggy in the first release CFix: automated concurrency-bug fixing [PLDI’11*, OSDI’12] *SIGPLAN: Program behaves correctly if bad interleavings do not occur of the first papers “one Fix concurrency bugs by disabling bad interleavings to attack the problem of automated bug fixing” 6 The interleaving space (again) Bad interleavings Huge Interleaving space Bad interleavings Bad interleavings Bad interleavings lead to production-run failures 7 Failure diagnosis Failures still happen in production runs The reason behind failure needs to be understood Tools dealing with production runs demand low overhead Diagnostic information needs to be informative Production-run concurrency-bug failure diagnosis Design new monitoring schemes and sampling strategies CCI: a pure software solution [OOPSLA’10] PBI, LXR: hardware-assisted solutions [ASPLOS’13 & 14] 8 My work on concurrency bugs Bug Detection and software testing: ConSeq [ASPLOS’11] Production-Run Failure Diagnosis: CCI/PBI/LXR [OOPSLA’10, ASPLOS’13 & 14] 9 Automated Concurrency-Bug Fixing: CFix [PLDI’11*, OSDI’12] *Received a SIGPLAN CACM nomination Outline Motivation and Overview Automated Concurrency-Bug Fixing 10 The problem and idea Overview Internals of CFix Evaluation and summary Automated fixing is difficult Description: Symptom Triggering condition … ? Patch: Correctness Performance Simplicity What is the correct behavior? Usually requires developers’ knowledge How to get the correct behavior? Correct program states under bug-triggering inputs No change to program states under other inputs 11 CFix’ insights Description: Symptom Triggering condition … ? Patch: Correctness Performance Simplicity What is the correct behavior? The program state is correct as long as the buggy interleaving does not occur How to get the correct behavior? Only need to disable failure-inducing interleavings Can leverage well-defined synchronization operations 12 Description: ? Interleavings that Symptom Triggering lead to software condition … failure atomicity violation detectors A r c data race detectors SenPLDI’08, SavageTOCS’97, YuSOSP’05, EricksonOSDI’10, KasikciASPLOS’10 13 Correctness Performance Simplicity order violation detectors p ParkASPLOS’09, FlanaganPOPL’04, LuASPLOS’06, ChewEuroSys’10 Patch: How to get a general solution that generates good patches? I1 I2 B ZhangASPLOS’10, LuciaMICRO’09, YuISCA’09, GaoASPLOS’11 abnormal data flow detectors W b ZhangASPLOS’11, ShiOOPSLA’10 R Wg Description: Interleavings that lead to software failure Patch: CFix Bug reports 14 Fix-Strategy Design Synchronization Enforcement Patch Testing & Selection Patch Merging Run-time Support Correctness Performance Simplicity Source code Mutual exclusion Mutual exclusion ... Order Order Patched binary Patched binary ... Patched binary Patched binary Selected binary . . . Selected binary Merged binary Final patched binary Fix-strategy design: what to fix Fix-Strategy Design Synchronization Enforcement Patch Testing & Selection Patch Merging Run-time Support 15 Challenges: Huge variety of bugs Two types of Concurrency bugs Atomicity violation Order violation Why these two? Real-world concurrency bug characteristics study[SHAN ASPLOS’08]: 97% either atomicity violation or order violation Either can be fixed by mutual exclusion or order enforcement 16 Fix-strategy design: how to fix Fix-Strategy Design Synchronization Enforcement Patch Testing & Selection Patch Merging Run-time Support 17 Challenges: Inaccurate root cause atomicity-violation Thread 1 if (ptr != NULL) { Thread 2 P R ptr->field = 1; } 18 C ptr = NULL; Fix-strategy for atomicity-voilation Thread 1 Thread 2 if (ptr != NULL) { ptr = NULL; ptr->field = 1; } 19 CFix: fix-strategy design Fix-Strategy Design Synchronization Enforcement Patch Testing & Selection Patch Merging Run-time Support 20 Challenges: Inaccurate root cause Huge variety of bugs Solution: A combination of mutual exclusion & order relationship enforcement Fix-strategies Overview AV Detector p A r c 21 OV Detector B Race Detector DU Detector Wb I1 I2 R Wg CFix: synchronization enforcement Fix-Strategy Design Synchronization Enforcement Patch Testing & Selection Patch Merging Run-time Support 22 Challenges: Correctness Performance simplicity Solution: Mutual exclusion enforcement: AFix [PLDI’11] Order relationship enforcement: OFix [OSDI’12] Atomicity violation in Fixing Input: three statements (p, c, r) with contexts p Thread 1 if (ptr != NULL) { Thread 2 ptr = NULL; c r ptr->field = 1; } Idea: making the code region from p to c be mutually exclusive with r 23 Mutual exclusion enforcement: AFix Approach: lock p r c Goal: Correctness: paired lock acquisition and release operations Performance: Make the critical section as small as possible 24 A naïve solution A naïve solution Add lock on edges reaching p Add unlock on edges leaving c Potential new bugs p p c c Could lock without unlock Could unlock without lock etc. 25 The AFix solution Assume p and c are in the same function f Step 1: find protected nodes in critical section Step 2: add lock operations p unprotected node protected node protected node unprotected node c Avoid those potential bugs mentioned 26 Subtle details p and c adjustment when they are in different functions Observation: people put lock and unlock in one function Find the longest common prefix of p’s and c’s stack traces Adjust p and c accordingly Put r into a critical section Do nothing if we can reach r from the p–c critical section Lock type: Lock with timeout: if critical section has blocking operations Reentrant lock: if recursion is possible within critical section 27 OFix: two order relationships … Ai … A Aj A1 B … B An destroy A1 B … initialization use An allA-B 28 ? firstA-B read OFix allA-B enforcement Approach: condition variable and flag Insert signal operations in A-threads Insert wait operation before B Rules A-thread signals exactly once when it will not execute more A A-thread signals as soon as possible B proceeds when each A-thread has signaled 29 OFix allA-B enforcement: A side How to identify the last A instance in one thread . . .; for (. . .) . . . ; // A . . .; A Each thread that executes A 30 exactly once as soon as it can execute no more A OFix allA-B enforcement: A side How to identify the last thread that executes A void main() { for (. . .) thread_create(thr_main); . . .; } void thr_main() { for (. . .) . . . ; // A . . .; } counter for signal threads void ofix_signal() { mutex_lock(L); =1 --; thread _create if ( == 0) cond_broadcast(con); mutex_unlock(L); A ++ } 31 OFix allA-B enforcement: B side Safe to execute only when B is 0 void ofix_wait() { mutex_lock(L); if ( != 0) cond_timedwait(con, L, t); mutex_unlock(L); } Give up if OFix knows that it introduces new deadlock Timed wait-operation to mask potential deadlocks 32 OFix firstA-B Basic enforcement A B When A may not execute Add a safety-net of signal with allA-B algorithm 33 CFix: patch testing & selection Fix-Strategy Design Synchronization Enforcement Patch Testing & Selection Patch Merging Run-time Support 34 Challenge: Multi-thread software testing Solution: CFix-patch oriented testing Patch testing principles Two ideas: No exhaustive testing, but patch oriented testing Leverage existing testing techniques, with extra heuristics The work-flow Step 1 Prune incorrect patches • Patches causing failures due to wrong fix strategies, etc Step 2 Prune slow patches Step 3 Prune complicated patches 35 Run once without external perturbation Reject if there is a time-out or failure Patches fixing wrong root cause Make software to fail deterministically Thread 1 Thread 2 ptr->field = 1; ptr = NULL; ptr->field = 1; 36 Implicit bad patch A failure in patch_b implies a failure in patch_a If patch_a is less restrictive than patch_b a Mutual Exclusion b c Order Relationships Helpful to prune patch_a Traditional testing may not find the failure in patch_a 37 CFix: patch merging Fix-Strategy Design Synchronization Enforcement Patch Testing & Selection Patch Merging Run-time Support 38 Challenge: One single programming mistake usually leads to multiple bug reports Solution: Heuristics to merge patches An example with multiple reports void buf_write() { p1 int tmp = buf_len + str_len; if (tmp > MAX) return; c1 p2 memcpy(buf[buf_len], str, str_len); r1 c2, r2 buf_len = tmp; } Too many lock/unlock operations Potential new deadlocks May hurt performance and simplicity 39 p1 c1 p2 r1 c2, r2 Related patch: a case of AFix Merge if p, c, or r is in some other patch’s critical sections lock(L1) p1 lock(L2) p2 c1 unlock(L1) c2 unlock(L2) unlock(L1) 40 lock(L1) r1 unlock(L1) lock(L1) lock(L2) r2 unlock(L2) unlock(L1) The merged patch for the example void buf_write() { p1 int tmp = buf_len + str_len; if (tmp > MAX) { p1 return; p1 c1 p2 c1,p2 r1 c2,r2 c2,r1,r2 } c1 p2 memcpy(buf[buf_len], str, str_len); r1 c2, r2 buf_len = tmp; } 41 CFix: run-time support Fix-Strategy Design Synchronization Enforcement Patch Testing & Selection Patch Merging Run-time Support 42 To understand whether there is a deadlock underlying time-out Low-overhead, and suitable for production runs Evaluation methodology APP. PBZIP2 x264 FFT HTTrack Mozilla-1 transmission ZSNES Apache MySQL-1 MySQL-2 Mozilla-2 Cherokee Mozilla-3 43 AV Detector OV Detector RA Detector DU Detector Evaluation result APP. PBZIP2 AV Detector Detector RA Detector DU Detector # of Ops 5 x264 7 FFT HTTrack 2 Mozilla-1 2 transmission ZSNES 44 OV 5 2 3 Apache MySQL-1 5 MySQL-2 9 Mozilla-2 Cherokee 2 Mozilla-3 5 3 3 Summary Software reliability is critical Fixing Concurrency bugs is costly and error-prone CFix uses some heuristics, with good results in practice 45 A combination of mutual exclusion and order enforcement Use testing to select the best patch Fix root cause without requiring detectors to report it Small overhead and good simplicity Questions ? Thank you 46