Automated Whitebox Fuzz Testing (NDSS 2008)
  Patrice Godefroid, Microsoft (Research), pg@microsoft.com
  Michael Y. Levin, Microsoft (CSE), mlevin@microsoft.com
  David Molnar, UC Berkeley, dmolnar@eecs.berkeley.edu
  Presented by: Edmund Warner, University of Central Florida
  April 7, 2011

Acknowledgments
  Figures are taken directly from the paper or the original presentation slides
  Some slides are reused from the original presentation

Overview
  Definition of Whitebox Fuzz Testing
  The Search Algorithm
  SAGE (Scalable, Automated, Guided Execution)
  Test Findings
  Conclusions

What is Whitebox Fuzz Testing?
  Fuzz testing is a form of blackbox random testing
  Can be remarkably effective, but there are limitations
    Given the statement if (x == 10) then ..., the then branch has a 1 in 2^32 chance of being executed if x is a random 32-bit input
    Can provide low code coverage

Whitebox Fuzz Testing
  Combine fuzz testing with dynamic test generation
    Run the code with its input
    Collect constraints on inputs with symbolic execution
    Generate new constraints
    Solve constraints with a constraint solver
    Synthesize new inputs

Whitebox Fuzz Testing
  In theory, this approach can lead to full program path coverage
  In practice, it will fall short and the search will be incomplete:
    The number of execution paths in the program is huge
    Symbolic execution, constraint generation, and constraint solving are necessarily imprecise

The Search Algorithm
  With blackbox fuzzing, the error is unlikely to be caught (5 values out of 2^(8*4) 4-byte cases)
  Catching it is rather simple, however, with dynamic test generation

Dynamic Test Generation
  For instance, we run the program on the input “good”.
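The example program itself does not survive in this text. As a minimal sketch, assume a function that compares the four input bytes against "bad!" (the characters named in the path constraint on this slide) and treats three or more matches as the error. The failure threshold and every name below (`top`, `show`, `TARGET`) are assumptions for illustration, not taken from the slides. The sketch runs a concrete input and records which side of each comparison was taken:

```python
TARGET = "bad!"  # assumed: the characters the program compares against


def top(data):
    """Hypothetical stand-in for the example program.

    Returns (failed, path_constraint); path_constraint holds one
    (index, op, char) triple per conditional crossed during the run.
    """
    count = 0
    path_constraint = []
    for i, expected in enumerate(TARGET):
        if data[i] == expected:
            count += 1
            path_constraint.append((i, "==", expected))
        else:
            path_constraint.append((i, "!=", expected))
    failed = count >= 3  # assumed error condition
    return failed, path_constraint


def show(path_constraint):
    """Format a path constraint in the <i0 != 'b', ...> notation."""
    parts = [f"i{i} {op} '{c}'" for i, op, c in path_constraint]
    return "<" + ", ".join(parts) + ">"


failed, pc = top("good")
print(failed, show(pc))
```

Under these assumptions, none of the four comparisons succeeds for "good", so the recorded constraint consists of four inequalities and no error is reported.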
  We build a path constraint from the conditional statements crossed: <i0 != 'b', i1 != 'a', i2 != 'd', i3 != '!'>
  Create a new path constraint by negating the last condition: <i0 != 'b', i1 != 'a', i2 != 'd', i3 = '!'>
    Solving it yields a new input: i0 = 'g', i1 = 'o', i2 = 'o', i3 = '!'

Limitations
  Path Explosion
    Does not scale to large, realistic programs
    Can be alleviated with different methods in the search algorithm
  Imperfect Symbolic Execution
    Complex program statements (pointer manipulation)
    OS and library functions (cost)

The Search Algorithm
  Solution: Generational Search
    Places the initial input in a workList
    Runs the program on the initial input to check for bugs
    Processes the workList by selecting an element and expanding it
    Each child input is
      Run and checked for bugs
      Assigned a score
      Added to the workList

The Search Algorithm
  More on ExpandExecution
    Tests the program with the input
    Generates the path constraint (PC)
    Attempts to expand the path constraint
    If an expanded constraint is satisfiable, saves the resulting input for later execution

The Search Algorithm
  What does this mean?
    Given an input with its PC
    Attempts to expand all constraints in the PC
      Instead of just the last, as in a depth-first search
      Or the first, as in a breadth-first search
    A parameter, bound, is used to limit backtracking through parent nodes
    End result: cover the largest search space in the shortest amount of time

SAGE
  Scalable, Automated, Guided Execution
  Can test any file-reading program running on Windows by treating bytes read from files as symbolic input.
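Returning to the generational search described above, the workList loop and ExpandExecution can be sketched roughly as follows. This is an illustration under stated assumptions, not SAGE's implementation: the target program is a hypothetical stand-in that compares four bytes against "bad!" and fails on three or more matches, `solve` is a toy solver that only handles byte equalities and inequalities, and the score is simply the number of matched bytes, standing in for whatever scoring the real tool uses.

```python
import heapq

TARGET = "bad!"  # assumed example program: compares bytes against "bad!"


def run(data):
    """Run the hypothetical program; return (failed, path_constraint).

    Each constraint is (index, equal, char): input[index] == char if
    equal is True, input[index] != char otherwise.
    """
    pc = [(i, data[i] == c, c) for i, c in enumerate(TARGET)]
    failed = sum(equal for _, equal, _ in pc) >= 3  # assumed error condition
    return failed, pc


def solve(constraints, seed):
    """Toy solver: keep the seed's bytes wherever the constraints allow."""
    out = list(seed)
    for i, equal, c in constraints:
        if equal:
            out[i] = c
        elif out[i] == c:
            out[i] = "x" if c != "x" else "y"
    return "".join(out)


def expand_execution(data, bound):
    """Negate each constraint at position >= bound, keeping its prefix."""
    _, pc = run(data)
    children = []
    for j in range(bound, len(pc)):
        i, equal, c = pc[j]
        flipped = pc[:j] + [(i, not equal, c)]
        # The child's bound is j + 1, so it never re-flips earlier branches.
        children.append((solve(flipped, data), j + 1))
    return children


def generational_search(seed, max_runs=100):
    """Return the failing inputs found within max_runs child executions."""
    failures = [seed] if run(seed)[0] else []
    work_list = [(0, seed, 0)]  # entries are (negated score, input, bound)
    runs = 0
    while work_list and runs < max_runs:
        _, data, bound = heapq.heappop(work_list)  # highest score first
        for child, child_bound in expand_execution(data, bound):
            runs += 1
            failed, pc = run(child)
            if failed:
                failures.append(child)
            score = sum(equal for _, equal, _ in pc)  # stand-in score
            heapq.heappush(work_list, (-score, child, child_bound))
    return failures
```

Calling `expand_execution("good", 0)` produces four children from a single run, one per negated constraint, which is the difference from a depth-first or breadth-first search that would flip only one. Under the assumed error condition, `generational_search("good")` reaches five failing inputs.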
SAGE Architecture
  Instead of being source-based, SAGE uses machine-code-based instrumentation
  Multitude of languages and build processes
    No need for specific source, compiler, and build operations
    Slower to start, but encompasses much more
  Compiler and post-build transformations
    By performing symbolic execution on the binary code that actually ships, SAGE can also detect bugs in the compilation and post-processing tools
  Unavailability of source
    Source-based instrumentation may be difficult for self-modifying or JITed code
    SAGE doesn't need data types or structures that are not visible at the machine-code level

Constraint Generation
  SAGE is trace-based
  Uses replay of the trace to update the concrete and symbolic stores
  This allows constraints to be built on input values
  For conditional jumps, it uses bitvectors to tag the EFLAGS used by the jumps

Constraint Optimization
  SAGE employs a number of optimization techniques to improve speed and reduce memory consumption:
    Tag caching
    Unrelated constraint elimination
    Local constraint caching
    Flip count limit
    Concretization
    Constraint subsumption
  Constraint subsumption checks whether newly created constraints imply, or are implied by, existing constraints

Findings
  Generational Search vs. Depth-First Search
    On the Media1, 2, and 3 applications tested, DFS terminated in ~11 hours without finding anything
    GS ran slightly longer and found 15 crashes in 4 buckets in Media3
  Bogus files find few bugs
  Divergences are common: ~60%
  Most bugs are shallow
  Impact of the block-coverage heuristic
    Adding 10407 blocks instead of 10633; not very effective in most cases

Conclusions
  Most unique bugs are found from well-formatted input, and within a few generations
  The sample size may be limited, but the success in finding previously missed bugs speaks for the new search strategy
  SAGE still needs enhancement: precision, power

Contributions
  A critical vulnerability, the MS07-017 ANI bug, was found; it had been missed by extensive blackbox testing and static analysis
  A new search algorithm for systematic test generation was introduced, optimized for large applications
  Introduction and implementation of SAGE, which can scale to programs with hundreds of millions of instructions

Weaknesses
  The paper itself is hard to understand in certain areas
  The program's coverage is sometimes nondeterministic
    Same input, same program, same machine, different coverage

Improvements
  Paper: more figures explaining the heuristics and rules
  Nondeterminism: export input coverage results to a database to be checked so that nothing is repeated