Memories of Bug Fixes Sunghun Kim, Kai Pan, and E. James Whitehead Jr., University of California, Santa Cruz Presented By Gleneesha Johnson CMSC 838P, Fall 2006 Source Code Repositories • Collection of code changes • Most long developed projects have one • Typically used to store histories and make backups • Knowledge contained within hasn’t been fully leveraged – Previous development experience – Changes that fix bugs are particularly interesting Horizontal Bug Finding Techniques • Applicable across all projects • Based on pre-defined bug patterns, theorem proving, and model checking • ESC/Java, FindBugs, JLint, PMD Vertical Bug Finding Techniques • Aim for project-specific bugs • Different project requirements, business logic, and semantics • BugMem – Learn bug patterns over time Bug Fix Memories • Project-specific bug and fix knowledge base – Developed by analyzing history of bug fixes – Can be used to detect bugs and suggests fixes Building Bug Fix Memories • Extract source code, change logs, and deltas – Kenyon • File change – Contains list of region pairs that show differences between two file versions. – Regions called hunks • Consist of source code lines • Identify changes where a bug was fixed – Keyword search – Search for references to bug reports Hunks • Hunk pair (HP) → deleted hunk (DH) (bug hunk (BH)) & added hunk (AH) (fix hunk (FH)) Extracting Components From Hunks • Extracts syntax patterns, components, from hunks • Parsing – extracts syntax components • Normalizing – generalizes syntax to match similar code • Filtering – eliminates noise – Information filtering – Diff filtering Raw Component Extraction (Parsing) • Preprocessing • Basic syntax lines – Differentiate between composite statements (if, for, while) and simple statements (method call, assignment, etc.) – all multi-line simple statements are on a single line. – conditional predicates of if, for, while, etc. all lie on a single line. • Raw components – Based on abstract syntax tree of basic syntax line – 4 kinds: static Java call, Java call, user-defined call, and non-call Normalization • If the variable type of a raw component is known, normalize variable – i.e., foo.flag → Foo.flag • Generate two components for raw components that contain numeric, boolean, char or string literals – One with and one without normalization for the literal – i.e., i = 1 → int = 1 & int = int Normalization Continued • For a method call, actual parameters are normalized to the type of the parameter • Normalization level – Indicates a component’s degree of normalization amongst others extracted from same raw component – Level increases with degree of normalization Examples Information Filtering • Information value – indicates how much unique information a component carries – Determined by summing information value of its elements • Information value threshold – used to filter components with little unique information – Two is used as a threshold in the paper Diff Filtering • Removes code unchanged between bug and fix hunks Searching Memories • Bugs found by searching for matching patterns in bug hunks • Suggestions made by returning code in corresponding fix hunks • Several options for component searching – Adjust degree of matching and omission of very common components Options Evaluation Evaluation Continued • Determine half and full hit rates – Half hit – bug previously seen – Full hit – bug and fix previously seen • Five open source projects Evaluation Continued Evaluation Continued Evaluation Continued Comparison With PMD Limitations • Missing Memories • Not applicable in initial stage of a project • Doesn’t catch cross-file relationships Uses • Bug Finding Tool – BugMem – Can be integrated into Eclipse • Code example repository – Useful to developers new to a project The End! Questions?