Triage: Diagnosing Production Run Failures at the User’s Site
Joseph Tucek, Shan Lu, Chengdu Huang, Spiros Xanthos and Yuanyuan Zhou
University of Illinois at Urbana-Champaign

Motivation
- Software failures are a major contributor to system downtime.
- Security holes.
- Software has grown in size, complexity and cost.
- Software testing has become more difficult.
- Software packages inevitably contain bugs (even production ones).

Motivation
- Result: software failures during production runs at the user’s site.
- One solution: offsite software diagnosis. Drawbacks:
  - Difficult to reproduce failure-triggering conditions.
  - Cannot provide timely online recovery (e.g. from fast Internet worms).
  - Programmers cannot be provided to every site.
  - Privacy concerns.
- Goal: automatically diagnose software failures that occur during production runs at the end-user site.
  - Understand a failure that has happened.
  - Find the root causes.
  - Minimize manual debugging.

Current state of the art
- Offsite diagnosis:
  - Interactive debuggers.
  - Program slicing.
  - Deterministic replay tools.
  - Core dump analysis (partial execution path construction).
- Primitive onsite diagnosis:
  - Unprocessed failure information collections.
- Large overhead makes it impractical for production sites.
- All require manual analysis.
- Privacy concerns.

Onsite Diagnosis
- Efficiently reproduce the failure that occurred (i.e. fast and automatically).
- Impose little overhead during normal execution.
- Require no human involvement.
- Require no prior knowledge.

Triage
- Captures the failure point and conducts just-in-time failure diagnosis with checkpoint-reexecution.
- Delta Generation and Delta Analysis.
- Automated, top-down, human-like software failure diagnosis protocol.
- Reports:
  - Failure nature and type.
  - Failure-triggering conditions.
  - Failure-related code/variable and the fault propagation chain.

Triage Architecture
3 groups of components:
1. Runtime Group.
2. Control Group.
3. Analysis Group.

Checkpoint & Reexecution
- Uses Rx (previous work by the authors).
- Rx checkpointing:
  - Uses fork()-like operations.
  - Keeps a copy of accessed files and file pointers.
  - Records messages using a network proxy.
- Replay may potentially be modified.

Lightweight Monitoring for detecting failures
- Must not impose high overhead.
- Cheapest way: catch fault traps:
  - Assertions
  - Access violations
  - Divide by zero
  - More…
- Extensions: branch histories, system call trace…
- Triage only uses exceptions and assertions.

Control layer
- Implements the Triage Diagnosis Protocol.
- Controls reexecutions with different inputs based on past results.
- Chooses the analysis technique.
- Collects results and sends them to off-site programmers.

Analysis Layer
- Techniques:

TDP: Triage Diagnosis Protocol
Each step is shown with the example result it produces:
- Simple replay: deterministic bug.
- Coredump analysis: stack/heap OK; segmentation fault in strlen().
- Dynamic bug detection: null-pointer dereference.
- Delta Generation: collection of good and bad inputs.
- Delta Analysis: code paths leading to the fault.
- Report.

TDP: Triage Diagnosis Protocol
- Example report

Protocol extensions and variations
- Add different debugging techniques.
- Reorder diagnosis steps.
- Omit steps (e.g. memory checks for Java programs).
- Protocol may be custom-designed for specific applications.
- Try to fix bugs:
  - Filter failure-triggering inputs.
  - Dynamically delete code (risky).
  - Change variable values.
  - Automatic patch generation (future work?).

Delta Generation
Two goals:
1. Generate many similar replays: some that fail and some that don’t.
2. Identify the signature of failure-triggering inputs.
Signatures may be used for:
- Failure analysis and reproduction.
- Input filtering (e.g. Vigilante, Autograph, etc.).

Delta Generation
- Changing the input:
  - Replay previously stored client requests via the proxy; try different subsets and combinations.
  - Isolate the bug-triggering part: data “fuzzing”.
  - Find non-failing inputs with minimum distance from failing ones.
  - Make protocol-aware changes.
  - Use a “normal form” of the input, if the specific triggering portion is known.
- Changing the environment:
  - Pad or zero-fill new allocations.
  - Change message order.
  - Drop messages.
  - Manipulate thread scheduling.
  - Modify the system environment.
- Make use of information from prior steps (e.g. target specific buffers).

Delta Generation
Results passed to the next stage:
- Break the code into basic blocks.
- For each replay, extract a vector of per-block execution counts and a block trace.
- The granularity can be changed.

Example revisited
- Good run:
  - Trace: AHIKBDEFEF…EG
  - Block vector: {A:1, B:1, D:1, E:11, F:10, G:1, H:1, I:1, K:1}
- Bad run:
  - Trace: AHIJBCDE
  - Block vector: {A:1, B:1, C:1, D:1, E:1, H:1, I:1, J:1}

Delta Analysis
Follows three steps:
1. Basic Block Vector (BBV) comparison: find the most similar pair of failing and non-failing replays, F and S.
2. Path comparison: compare the execution paths of F and S.
3. Intersection with backward slice: find the difference that contributes to the failure.

Delta Analysis: BBV Comparison
- The number of times each block is executed is recorded using instrumentation.
- Calculate the Manhattan distance between every pair of failing and non-failing replays (the minimum requirement can be relaxed to accept a merely similar pair).
- In the example, the difference (good minus bad) is {C:-1, E:10, F:10, G:1, J:-1, K:1}, giving a Manhattan distance of 24.

Delta Analysis: Path Comparison
- Consider execution order.
- Find where the failing and non-failing runs diverge.
- Compute the minimum edit distance, i.e. the minimum number of insertion, deletion, and substitution operations needed to transform one path into the other.
- Example:

Delta Analysis: Backward Slicing
- Want to eliminate differences that have no effect on the failure.
- Dynamic backward slicing: extracts a program slice consisting of all and only those instructions that lead to a given instruction’s execution.
- The starting point may be supplied by earlier steps of the protocol.
- Overhead is acceptable in post-hoc analysis.
- Optimization: dynamically build dependencies during replays.
- Experiments show that overhead is acceptably low.
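The BBV comparison and path comparison steps can be sketched on the example traces. This is a minimal illustration, not Triage's implementation: the function names are hypothetical, the elided middle of the good trace is assumed to be ten E/F iterations (which matches the block counts E:11, F:10), and a plain dynamic-programming edit distance stands in for whatever algorithm Triage actually uses.

```python
from collections import Counter


def block_vector(trace):
    """Count how many times each basic block appears in a trace."""
    return Counter(trace)


def manhattan_distance(bbv_a, bbv_b):
    """Sum of absolute per-block count differences (missing blocks count as 0)."""
    blocks = set(bbv_a) | set(bbv_b)
    return sum(abs(bbv_a[b] - bbv_b[b]) for b in blocks)


def closest_pair(failing, passing):
    """Pick the (failing, non-failing) trace pair with the smallest BBV distance."""
    return min(
        ((f, s) for f in failing for s in passing),
        key=lambda p: manhattan_distance(block_vector(p[0]), block_vector(p[1])),
    )


def first_divergence(a, b):
    """Index of the first position where two block traces differ."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))


def edit_distance(a, b):
    """Levenshtein distance using a row-by-row dynamic-programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # match or substitute
        prev = curr
    return prev[-1]


# Example traces; the "…" in the good run is expanded as an assumption.
good_trace = "AHIKBD" + "EF" * 10 + "EG"
bad_trace = "AHIJBCDE"

f, s = closest_pair([bad_trace], [good_trace])
print(manhattan_distance(block_vector(f), block_vector(s)))  # 24
print(first_divergence(s, f))                                # 3 (K vs. J)
print(edit_distance(s, f))                                   # 23
```

With more replays, `closest_pair` would be given the full lists of failing and non-failing traces, and only the selected pair would go on to the more expensive path comparison.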
Backward Slicing and Result Intersection

Limitations and Extensions
- Need to define a privacy policy for the results sent to programmers.
- Very limited success with patch generation.
- Does not handle memory leaks well.
- A failure must occur; incorrect operation without a failure is not handled.
- Difficult to reproduce bugs that take a long time to manifest.
- No support for deterministic replay on multi-processor architectures.
- False positives.

Evaluation Methodology
- Experimented with 10 real software failures in 9 applications.
- Triage is implemented on Linux (2.4.22).
- Hardware: 2.4 GHz Pentium 4, 512 KB L2 cache, 1 GB memory and 100 Mbps Ethernet.
- Triage checkpoints every 200 ms and keeps 20 checkpoints.
- User study: 15 programmers were given 5 bugs, and Triage’s report for some of the bugs. Compared the time to locate each bug with and without the report.

Bugs used for Evaluation
Name | Program | App Description | #LOC | Bug Type | Root Cause Description
Apache1 | apache-1.3.27 | A web server | 114K | Stack Smash | Long alias match pattern overflows a local array
Apache2 | apache-1.3.12 | A web server | 102K | Semantic (NULL ptr) | Missing certain part of URL causes NULL pointer dereference
CVS | cvs-1.11.4 | GNU version control server | 115K | Double Free | Error-handling code placed in wrong order leads to double free
MySQL | mysql-4.0.12 | A database server | 1028K | Data Race | Database logging error in case of data race
Squid | squid-2.3 | A web proxy cache server | 94K | Heap Buffer Overflow | Buffer length calculation misses special character cases
BC | bc-1.06 | Interactive algebraic language | 17K | Heap Buffer Overflow | Using wrong variable in for-loop end-condition
Linux | linux-extract | Extracted from linux-2.6.6 | 0.3K | Semantic (copy-paste error) | Forget-to-change variable identifier due to copy-paste
MAN | man-1.5h1 | Documentation tools | 4.7K | Global Buffer Overflow | Wrong for-loop end-condition
NCOMP | ncompress-1.2.4 | File (de)compression | 1.9K | Stack Smash | Fixed-length array cannot hold long input file name
TAR | tar-1.13.25 | GNU tar archive tool | 27K | Semantic (NULL ptr) | Directory property corner case is not well handled

Experimental Results
[Results figure; surviving label: "No input testing"]

Experimental Results
- For application bugs, Delta Generation only worked for BC and TAR.
- In all cases Triage correctly diagnoses the nature of the bug (deterministic or non-deterministic).
- In all 6 applicable cases Triage correctly pinpoints the bug type, buggy instruction, and memory location.
- When Delta Analysis is applied, it reduces the amount of data to be considered by 63% on average (best: 98%, worst: 12%).
- For MySQL: finds an example interleaving pair as a trigger.

Case Study 1: Apache
- Failure at ap_gregsub.
- Bug detector catches a stack smash in lmatcher.
- How can lmatcher affect try_alias_list?
  - The stack smash overwrites the stack frame above it, invalidating r.
  - The trace shows how lmatcher is called by try_alias_list.
- Failure is independent of the headers.
- Failure is triggered by requests for a specific resource.

Case Study 2: Squid
- Coredump analysis suggests a heap overflow.
- Happens at a strcat of two buffers.
- Fault propagation shows how the buffers were allocated.
- t has strlen(user) while the other buffer has strlen(user)*3.
- Input testing gives the failure-triggering input.
- Gives minimally different non-failing inputs.

Efficiency and Overhead
Normal execution overhead:
- Negligible effect caused by checkpointing.
- In no case over 5%.
- With 400 ms checkpointing intervals, overhead is 0.1%.

Efficiency and Overhead
Diagnosis efficiency:
- Except for Delta Analysis, all steps are efficient.
- All other diagnostic steps finish within 5 minutes.
- Delta Analysis time is governed by the edit distance D in the O(ND) computation (N is the number of blocks).
- The comparison step of Delta Analysis may run in the background.

User Study
- Real bugs: on average, programmers took 44.6% less time debugging using Triage reports.
- Toy bugs: on average, programmers took 18.4% less time debugging using Triage reports.

Questions?