Using Set Operations on Code Coverage Data to Discover Program Properties
by Nick Rutar

Motivation
- Many programs already have code coverage data
  - Various code coverage tools available
  - Widely explored area of research
  - Regression tests with coverage data becoming more common
- Code coverage data contains a wealth of information about the program
  - Data usually limited to how the program reports it
  - Want to milk the data for all it is worth
  - Possibly useful for finding errors in the program

Code Coverage
- Three main types (program usually instrumented)
  - Statement: every line of code
  - Conditional: every decision in the program (if/else)
  - Path: every path in the program
- Dynamic or static
- Usually presented as a composite of separate tests

Using Set Operations
- Why use set operations?
  - Most developers are familiar with sets
  - Data for statement coverage maps nicely onto sets
  - Possible to manipulate data easily and give glimpses of properties of the code
  - Most code coverage tools implicitly use sets anyway

Set Operations
- Union
  - Traditional coverage
- Intersection
  - Lines run on all tests
- Difference
  - Potential for locating errors
  - Probably the biggest stretch from what the data is currently being used for

Set Operations At Work

    #include <stdlib.h>  /* atoi */

    int main(int argc, char *argv[])
    {
        int x, y, z;
        x = y = z = 0;
        if (argc == 2)
            x = atoi(argv[1]);
        if (x == 1)
            y = 3;
        else if (x == 2)
            y = 4;
        if (y > 0)
            z = 5;
        else
            z = -2;
        return z;
    }

- Inputs
  - No input
  - 1
  - 2
- Union
- Intersection
- Difference

Off the Beaten Path
- Notation on sets: Diff, -; Union, U; Intersection, I
- U/I Bad Sets - U Good Sets
  - Sometimes gives a better basis for finding bad code
  - Closest example of prior work only dealt with one bad run at a time
- Any given test - itself
  - Gives you the empty set
- U (I of Sets & (U/I Bad Sets - U Good Sets))
  - Gives you a very rough slice of the program that went bad
- Manipulate data as seen fit for what you are looking for …

Other Code Coverage Info
- Pareto principle
  - Better known as the 80-20 rule
  - Pareto noticed that 80% of the land in Italy was owned by 20% of the people
  - Shows up in all kinds of domains
  - Nick’s high school: 80% of girls dated 20% of the boys
- Software 80-20 rule
  - 20% of the lines of code account for 80% of the runtime of the software
- Code coverage often has frequency information
  - Use that information to find performance bottlenecks

Implementation
- Create a tool that can use the set information
- Implementation details
  - Created in Java
  - Based on the output format of the LCOV coverage tool
  - Takes in pre-generated coverage information as input
  - Supports Union, Difference, and Intersection
  - Supports frequency information

Demo

Evaluation
- Test a large program against its regression tests
- Use Dyninst for evaluation
  - C++ program that does binary instrumentation
  - 100+ source files
  - ~30,000 LOC instrumented to create coverage data
  - Nightly build already has coverage capability with regression tests
- Verify that Union matches the coverage data given by the tool
- Use Difference to try to find errors
  - Series of tests with various inputs
  - See which inputs cause failure and locate lines to discover the error

Future Work
- For the tool
  - Create a template for insertion into the program
    - This program doesn’t care what language you are using
    - Just needs an input format to generate the initial sets
    - Specify the format in a text file; the program uses it to input data
  - Better visualization to specify points of interest
    - Highlight source code that still has active lines
  - Usability
    - Right now more of a proof of concept than a battle-hardened tool
- In general
  - More evaluation of using Diff for finding errors in the program
  - Evaluation of software bottlenecks
  - IDE integration

Questions???
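
Sketch: Set Operations At Work

The results on the "Set Operations At Work" slide were shown visually and are not recoverable here, so the following is a minimal Python sketch of how they could be computed. It is not the Java tool described in the talk. The statement numbering, the helper names (run, union, intersection, suspect_lines), and the choice to treat input 2 as a "bad" run are all assumptions made purely for illustration.

```python
# Hypothetical statement numbering for the example C program (not from the slides).
STATEMENTS = {
    1: "x = y = z = 0;",
    2: "if (argc == 2)",
    3: "x = atoi(argv[1]);",
    4: "if (x == 1)",
    5: "y = 3;",
    6: "else if (x == 2)",
    7: "y = 4;",
    8: "if (y > 0)",
    9: "z = 5;",
    10: "z = -2;",
    11: "return z;",
}


def run(arg=None):
    """Mirror the C example and return the set of statement ids executed.

    arg=None models running with no command-line argument (argc == 1).
    int() stands in for atoi(); only numeric strings are used here.
    """
    covered = {1, 2}          # initialisation and the argc test always run
    x = y = 0
    if arg is not None:
        covered.add(3)
        x = int(arg)
    covered.add(4)
    if x == 1:
        covered.add(5)
        y = 3
    else:
        covered.add(6)        # the else-if condition is only evaluated here
        if x == 2:
            covered.add(7)
            y = 4
    covered.add(8)
    covered.add(9 if y > 0 else 10)
    covered.add(11)
    return covered


def union(sets):
    """Traditional composite coverage: lines run by at least one test."""
    return set().union(*sets)


def intersection(sets):
    """Lines run by every test."""
    sets = list(sets)
    return set.intersection(*sets) if sets else set()


def suspect_lines(bad_runs, good_runs):
    """One reading of 'I Bad Sets - U Good Sets' from the slides."""
    return intersection(bad_runs) - union(good_runs)


# Coverage sets for the three inputs listed on the slide.
runs = {"none": run(), "1": run("1"), "2": run("2")}

print("Union:       ", sorted(union(runs.values())))
print("Intersection:", sorted(intersection(runs.values())))
print("'1' - 'none':", sorted(runs["1"] - runs["none"]))

# Assumption for illustration only: pretend input 2 is the failing run.
for line in sorted(suspect_lines([runs["2"]], [runs["none"], runs["1"]])):
    print("Suspect:", line, STATEMENTS[line])
```

Under this numbering the union covers all eleven statements, the intersection is {1, 2, 4, 8, 11}, and the difference between the run with input 1 and the run with no input is {3, 5, 9}. Treating input 2 as the only bad run narrows the suspect set to statement 7 (y = 4;), which is the kind of very rough slice the "Off the Beaten Path" slide describes.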