50.530: Software Engineering
Sun Jun, SUTD
Week 2: Automatic Testing

A Big View: Testing
[Figure: diagram with regions A, B and C; labels: the initial state, the behaviors we wanted, the behaviors we have.]

A Big View: Testing
[Figure: the same diagram with regions A and C; extra label: a test which shows a bug.]

Testing
• Methods: white-box testing, black-box testing, grey-box testing
• Levels: unit testing, integration testing, system testing, etc.
• Types: installation testing, compatibility testing, smoke and sanity testing, regression testing, acceptance testing, alpha testing, beta testing, functional/non-functional testing, combinatorial testing, performance testing, security testing, etc.

Research Question
Isn’t JUnit good enough? How do we automatically generate test cases so as to reveal bugs?

A Big View: Systematic Testing
[Figure: diagram with regions A, B and C; labels: the initial state, the behaviors we wanted, the behaviors we have.]

A Big View: Random Testing
[Figure: the same diagram with regions A and C; extra label: a test which shows a bug.]

KORAT: AUTOMATED TESTING BASED ON JAVA PREDICATES
Boyapati et al., ISSTA 2002, ACM SIGSOFT Distinguished Paper Award

Motivation
• It is important to be able to generate test cases automatically.
• It is important to generate test cases which are representative.
• Korat is just one example of systematic test case generation; it is similar in spirit to many systematic testing techniques (e.g., combinatorial testing, parameterized testing).

Example
    public class BinaryTree {
        public static class Node {
            Node left;
            Node right;
        }
        private Node root;
        private int size;
        public void remove(Node n) {
            // some code
        }
        …
    }
How do we test remove(Node n)?

Example
• How do we test remove(Node n)?
  – We need a valid BinaryTree object bt.
  – We need a valid Node object nd.
  – We need to know what is expected after executing bt.remove(nd).

Vocabulary
Class invariant:
• an invariant used to define what the valid objects of the class are
• e.g., size == 0 if root == null, and size equals the number of nodes in the tree

Vocabulary
Pre-condition (of a method)
• a condition which must be true prior to the execution of the method
• e.g., n must not be null.
The class invariant is always part of the pre-condition.

Vocabulary
Post-condition (of a method)
• a condition which must be true after the execution of the method
• e.g., after remove, size is decremented by 1.
The class invariant is always part of the post-condition.

Korat: Assumption
A class invariant is encoded as a method repOk(), which returns true if and only if the object is in a state that satisfies the class invariant.
    public boolean repOk() {
        if (root == null) return size == 0;
        Set<Node> visited = new HashSet<Node>();
        visited.add(root);
        LinkedList<Node> workList = new LinkedList<Node>();
        workList.add(root);
        while (!workList.isEmpty()) {
            Node current = workList.removeFirst();
            if (current.left != null) {
                // a node reached twice means sharing or a cycle
                if (!visited.add(current.left)) return false;
                workList.add(current.left);
            }
            if (current.right != null) {
                if (!visited.add(current.right)) return false;
                workList.add(current.right);
            }
        }
        return (visited.size() == size);
    }

Korat: Assumption
• Pre-condition and post-condition are encoded in the Java Modeling Language (JML).
    //@ public invariant repOk();   // class invariant for BinaryTree

    /*@ public normal_behavior      // specification for remove
      @   requires has(n);          // precondition
      @   ensures !has(n);          // postcondition
      @*/
    public void remove(Node n) {
        // ... method body
    }
This is probably too harsh a pre-condition?

Korat: Approach
[Flowchart: generate a BinaryTree bt and a Node n; check whether repOk() and the pre-condition are true; if so, execute bt.remove(n); then check whether the post-condition is true. Each check also has an "otherwise" branch.]

Finitization
• There are infinitely many candidates for bt and n.
  – For each variable in the class, define its domain.
[Figure: regions labeled "interesting bt" and "all possible bt".]

Finitization
    public static Finitization finBinaryTree(int NUM_Node) {
        Finitization f = new Finitization(BinaryTree.class);
        ObjSet nodes = f.createObjSet("Node", NUM_Node);
        nodes.add(null);
        f.set("root", nodes);
        f.set("Node.left", nodes);
        f.set("Node.right", nodes);
        return f;
    }

Finitization
Translation of finBinaryTree(3):
• nodes = {null, N0, N1, N2}
• BinaryTree.root is a member of nodes
• Node.left is a member of nodes
• Node.right is a member of nodes

Example Trees
With finBinaryTree(3), there are 4 objects: one BinaryTree object and three Node objects, which could be set up as follows.
[Figure: example trees.]

Finitization: the Space
• How many bt are there with finBinaryTree(3), assuming that bt.size is always set to the right value?
  – 4^7 (7 node-valued fields: root, plus left and right for each of the 3 nodes; each takes one of 4 values)
• How many bt are there with finBinaryTree(n)?
  – (n+1)^(2n+1)
[Figure: regions labeled "interesting bt" and "all possible bt".]

Filtering 1
• For each candidate bt and n, check the precondition of remove.
  If the pre-condition is not satisfied, ignore that tree.
[Figure: regions labeled "interesting bt", "invalid bt" and "all possible bt".]

Is the following bt valid according to repOk()?
[Figure: a candidate bt.]

Korat: Search Algorithm
1. Order all the elements in every class domain and every field domain.
   1. Node class ordering: <null, N0, N1, N2>
   2. Assume the domain of size is <3>.
2. Generate a candidate as a vector of field domain indices, e.g., [1,0,2,2,0,0,0,0].

Korat: Search Algorithm
3. Invoke repOk() to check if the candidate is valid, e.g., [1,0,2,2,0,0,0,0] is invalid.
4. Backtrack to generate the next candidate in line, e.g., [1,0,2,2,0,0,0,1].

Optimization 1
• During the execution of repOk(), Korat monitors the fields that repOk() accesses.
  – e.g., [0, 2, 3] for the following example
• If repOk() returns false, backtrack so that the next candidate differs in at least one accessed field.
  – e.g., try [1,0,2,3,0,0,0,0] after [1,0,2,2,0,0,0,0]
Is this justified?

Theory
• For non-deterministic repOk methods:
  – All candidates for which repOk() always returns true are generated.
  – Candidates for which repOk() always returns false are never generated.
  – Candidates for which repOk() sometimes returns true and sometimes false may or may not be generated.

Optimization 2
• If we generated the candidate above, we may not want to generate [1, 0, 3, 2, 0, 0, 0, 0].
Is this justified?
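
The search steps and Optimization 1 above can be sketched in a few lines of Java. This is a minimal sketch under stated assumptions, not Korat's implementation or API: the class name KoratSearchSketch and its helpers are hypothetical, candidates are plain int vectors, and the field order is assumed to be [root, size, N0.left, N0.right, N1.left, N1.right, N2.left, N2.right], which is consistent with the accessed fields [0, 2, 3] in the example. Optimization 2 is not implemented here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of a Korat-style search; not the Korat tool's API. */
final class KoratSearchSketch {
    static final int NODES = 3;               // as in finBinaryTree(3)
    static final int FIELDS = 2 + 2 * NODES;  // root, size, then left/right per node

    /** Node-valued fields range over <null, N0, N1, N2>; size has the single value <3>. */
    static int domainSize(int field) {
        return field == 1 ? 1 : NODES + 1;
    }

    /** Reads a field and records it the first time repOk touches it. */
    static int read(int[] cand, int field, List<Integer> accessed) {
        if (!accessed.contains(field)) accessed.add(field);
        return cand[field];
    }

    /** The size domain holds one value, so reading it always yields NODES. */
    static int sizeOf(int[] cand, List<Integer> accessed) {
        read(cand, 1, accessed);
        return NODES;
    }

    /** repOk over the vector encoding: index 0 is null, index k is node N(k-1). */
    static boolean repOk(int[] cand, List<Integer> accessed) {
        int root = read(cand, 0, accessed);
        if (root == 0) return sizeOf(cand, accessed) == 0;
        boolean[] visited = new boolean[NODES + 1];
        Deque<Integer> workList = new ArrayDeque<>();
        visited[root] = true;
        workList.add(root);
        int count = 1;
        while (!workList.isEmpty()) {
            int current = workList.removeFirst();
            for (int side = 0; side < 2; side++) {   // 0 = left, 1 = right
                int child = read(cand, 2 * current + side, accessed);
                if (child == 0) continue;
                if (visited[child]) return false;    // shared node or cycle
                visited[child] = true;
                count++;
                workList.add(child);
            }
        }
        return count == sizeOf(cand, accessed);
    }

    /** Returns {number of valid candidates, number of candidates checked}. */
    static int[] search() {
        int[] cand = new int[FIELDS];
        int valid = 0;
        int checked = 0;
        while (true) {
            List<Integer> accessed = new ArrayList<>();
            checked++;
            if (repOk(cand, accessed)) valid++;
            // Backtrack on accessed fields only: bump the last accessed field,
            // resetting any accessed field whose domain is exhausted.
            boolean advanced = false;
            for (int i = accessed.size() - 1; i >= 0 && !advanced; i--) {
                int f = accessed.get(i);
                if (cand[f] + 1 < domainSize(f)) {
                    cand[f]++;
                    advanced = true;
                } else {
                    cand[f] = 0;
                }
            }
            if (!advanced) return new int[] {valid, checked};
        }
    }
}
```

Under these assumptions, search() finds 30 valid candidates while calling repOk() on far fewer than the 4^7 = 16384 possible vectors. The 30 are the 3! = 6 relabelings of each of 5 tree shapes, which is consistent with the 5 trees reported once isomorphic candidates are removed.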
[Figure: object graphs over the nodes N2 and N1.]

Vocabulary
• Object graph: [Figure: an object graph over the nodes N2 and N1.]
• Isomorphic: two object graphs C and C’ are isomorphic iff there is a permutation per such that per(C) = C’ and per(C’) = C
  – e.g., per = {N1->N2, N2->N1}

Optimization 2
[Figure: the interesting bt with one representative per region; all candidates in the same region are isomorphic.]

Representative
Given the two graphs below
[Figure: two object graphs over the nodes N2 and N1.]
[1, 0, 2, 3, 0, 0, 0, 0] and [1, 0, 3, 2, 0, 0, 0, 0], Korat takes the latter as a representative, as it is “bigger”.

Implementation: Op 2
• When backtracking from [a, b, c, …, k, …],
  – Korat tries [a, b, c, …, k+1, …] if k+1 is smaller than or equal to any number in the vector which has the same associated type.
  – Korat tries [a, b, c, …, j+1, …] otherwise.

Example
When backtracking from [1, 0, 2, 2, 0, 0, 0, 0], Korat skips [1, 0, 2, 3, 0, 0, 0, 0] (since there is a “bigger” representative [1,0,3,2,0,0,0,0]), and continues with [1,0,3,0,0,0,0,0].

Result
Only 5 bt are generated, assuming size is always set to 3.

Evaluation
Is this biased?

Experiment I

Experiment II

Experiment III

Conclusion
• Korat generates test cases from a specified domain and a correctness specification.
• Korat reduces test cases based on
  – pre-condition
  – a simple learning
  – symmetry reduction

Exercise 1
Apply Korat to java.util.Stack by answering the following questions.
• What is the repOk()?
• What are the pre-condition and post-condition of the methods push and pop?
• How would you track which fields are accessed in repOk()?
• When are two stack objects isomorphic?

Discussion
Any thoughts on Korat?

FEEDBACK-DIRECTED RANDOM TEST GENERATION
Pacheco et al., ICSE 2007, cited 440+

Random Testing
• Easy to implement
• Yields lots of test cases
• Finds errors
  – 1990: Unix utilities
  – 1998: OS services
  – 2000: GUI applications
  – 2000: functional programs
  – 2005: object-oriented programs
  – 2007: flash memory, file systems
Perhaps simply got lucky?

Research Question
Which one is better: systematic testing or random testing?
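
To make "easy to implement" concrete, here is a minimal sketch of undirected random testing. Everything in it is an illustrative assumption rather than material from the slides: the class RandomTestingSketch, the deliberately buggy method bucketIndexBuggy, and the oracle (a bucket index must lie in 0..9). Inputs are drawn blindly, with no feedback from earlier runs.

```java
import java.util.Random;
import java.util.function.IntUnaryOperator;

/** Hypothetical example: undirected random testing of a bucket-index method. */
final class RandomTestingSketch {
    static final int BUCKETS = 10;

    /** Buggy on purpose: for a negative hash, Java's % yields a negative index. */
    static int bucketIndexBuggy(int hash) {
        return hash % BUCKETS;
    }

    /** Math.floorMod is never negative for a positive modulus. */
    static int bucketIndexFixed(int hash) {
        return Math.floorMod(hash, BUCKETS);
    }

    /** Feeds random ints to f and counts results outside 0..BUCKETS-1. */
    static int countViolations(IntUnaryOperator f, int trials, long seed) {
        Random rnd = new Random(seed);   // fixed seed keeps the run repeatable
        int violations = 0;
        for (int i = 0; i < trials; i++) {
            int index = f.applyAsInt(rnd.nextInt());
            if (index < 0 || index >= BUCKETS) violations++;
        }
        return violations;
    }
}
```

Calling countViolations(RandomTestingSketch::bucketIndexBuggy, 1000, 42L) reports violations, while the fixed version reports none. The bug is easy to hit here because roughly half of all random ints are negative; a bug that needs a very specific input would be far less likely to show up this way.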
Random vs Systematic
• Theoretical work suggests that random testing is as effective as more systematic input generation techniques
  – Duran et al. 1984 and Hamlet et al. 1990
• Some empirical studies suggest systematic is more effective than random
  – Ferguson et al. 1996: vs. chaining
  – Marinov et al. 2003: vs. bounded exhaustive
  – Visser et al. 2006: vs. model checking and symbolic execution
These studies use small benchmarks and do not measure error-revealing effectiveness.

Contributions
• Propose feedback-directed random test generation
  – Randomized creation of new test inputs is guided by feedback about the execution of previous inputs
  – Goal is to avoid redundant and illegal inputs
• Empirical evaluation
  – Evaluate coverage and error-detection ability on a large number of widely-used, well-tested libraries (780KLOC)
  – Compare against systematic input generation
  – Compare against undirected random input generation

Sample Test Case
    public static void test1() {
        LinkedList l1 = new LinkedList();
        Object o1 = new Object();
        l1.addFirst(o1);
        TreeSet t1 = new TreeSet(l1);
        Set s1 = Collections.unmodifiableSet(t1);
        Assert.assertTrue(s1.equals(s1));
    }

Randoop
• Input: a class with multiple public methods.
• Output: a set of test cases (sequences of method calls)
• Main idea:
  – Build test inputs incrementally: new test inputs extend previous ones
  – As soon as a test input is created, execute it
  – Use execution results to guide generation

The Oracle Problem
If we are to test automatically, we must know what the correct results are, but how?

Specification
How do we get a better specification in general?
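
Randoop's main idea above (extend previous inputs, execute immediately, let the outcome steer generation) can be sketched as follows. This is a hedged toy, not Randoop's implementation: the class FeedbackDirectedSketch, the fixed list of four LinkedList calls, and the encoding of a sequence as a list of call indices are all assumptions made for illustration. As a stand-in for a specification it uses the generic contract o.equals(o), and it treats a sequence as redundant when its resulting list equals() one already reached.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.Random;
import java.util.function.Consumer;

/** Hypothetical sketch of feedback-directed generation; not Randoop's implementation. */
final class FeedbackDirectedSketch {
    /** Calls the generator may append; the class under test is java.util.LinkedList. */
    static final List<Consumer<LinkedList<Integer>>> OPS = Arrays.asList(
            l -> l.addFirst(1),
            l -> l.add(2),
            l -> l.removeFirst(),   // throws NoSuchElementException on an empty list
            l -> l.clear());

    final List<List<Integer>> pool = new ArrayList<>();        // sequences worth extending
    final List<List<Integer>> errorSeqs = new ArrayList<>();   // sequences that threw
    final List<List<Integer>> violations = new ArrayList<>();  // o.equals(o) was false
    final List<LinkedList<Integer>> seenStates = new ArrayList<>();

    /** Replays a sequence of call indices on a fresh list; null means a call threw. */
    static LinkedList<Integer> execute(List<Integer> seq) {
        LinkedList<Integer> list = new LinkedList<>();
        try {
            for (int op : seq) OPS.get(op).accept(list);
            return list;
        } catch (RuntimeException e) {
            return null;
        }
    }

    void generate(int steps, long seed) {
        Random rnd = new Random(seed);
        pool.add(new ArrayList<>());          // the empty sequence: just a new list
        seenStates.add(new LinkedList<>());
        for (int i = 0; i < steps; i++) {
            // Build incrementally: extend a previously kept sequence by one random call.
            List<Integer> seq = new ArrayList<>(pool.get(rnd.nextInt(pool.size())));
            seq.add(rnd.nextInt(OPS.size()));
            LinkedList<Integer> state = execute(seq);   // execute as soon as it is created
            if (state == null) {                        // feedback: threw, never extend it
                errorSeqs.add(seq);
                continue;
            }
            if (!state.equals(state)) {                 // feedback: generic contract broken
                violations.add(seq);
                continue;
            }
            if (seenStates.contains(state)) continue;   // feedback: redundant, discard
            seenStates.add(state);
            pool.add(seq);
        }
    }
}
```

After new FeedbackDirectedSketch().generate(200, 7L), every sequence in pool replays without an exception and no two of them end in equal lists, so later extensions never start from an illegal or duplicate input. An undirected generator would keep re-creating both kinds.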
Algorithm

“There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies.”
C.A.R. Hoare, 1980

Randoop Example
    Date s = new Date(2006, 2, 14);
    assertTrue(s.equals(s));      // assert specification
How do we randomly generate constructor parameter values like 2006, 2, 14?

Randoop Example
    HashSet s = new HashSet();
    s.add("");                    // randomly pick a public method
    assertTrue(s.equals(s));      // assert specification
The default value "" for String is used since there is no other String in the system.

Randoop Example
    HashSet s = new HashSet();
    s.add("");                    // randomly pick a public method
    s.isEmpty();                  // randomly pick a public method
    assertTrue(s.equals(s));      // assert specification
A method is probably an observer method if it has no parameters; it is public and non-static; it returns primitive values; and its name is size, count, length, toString, or begins with get or is.

Randoop Example
    Date d = new Date(2006, 2, 14);
    d.setMonth(-1);               // randomly pick a public method; pre: argument >= 0
A sequence of method calls that results in an exception is added to errSeqs.

Randoop Example
    Date d = new Date(2006, 2, 14);
    d.setMonth(-1);               // randomly pick a public method; pre: argument >= 0
    d.setDay(5);
    assertTrue(d.equals(d));      // assert specification

Classifying a sequence
[Flowchart]
start → execute and check contracts → contract violated?
  – yes: minimize sequence → contract-violating test case
  – no: sequence redundant?
    • yes: discard sequence
    • no: components

Redundancy Checking
• Randoop maintains a set of objects for each type.
• A sequence (of method calls) is redundant if the objects created during its execution are members of the above set.
  – Use equals() to compare
  – Or use more sophisticated, user-defined checking

Some Randoop options
• Avoid use of null statically…
    Object o = new Object();
    LinkedList l = new LinkedList();
    l.add(null);
  …and dynamically
    Object o = returnNull();
    LinkedList l = new LinkedList();
    l.add(o);
• Biased random selection
  – Favor smaller sequences
  – Favor methods that have been less covered
  – Use constants mined from source code

Research Question
How effective would Randoop be? How do we judge whether one set of random test cases is better than another set?

Coverage
• Code block coverage: a set of random test cases is better if it covers more code blocks.
  – For instance, consider each branch as a block
• Predicate coverage: given a set of predicates, a set of random test cases is better if it covers more valuations of the predicates.
  – For instance, consider the predicates to be the propositions in the program.

Coverage Achieved by Randoop
    data structure              time (s)   branch cov.
    Bounded stack (30 LOC)      1          100%
    Unbounded stack (59 LOC)    1          100%
    BS Tree (91 LOC)            1          96%
    Binomial heap (309 LOC)     1          84%
    Linked list (253 LOC)       1          100%
    Tree map (370 LOC)          1          81%
    Heap array (71 LOC)         1          100%
Is this representative?

Predicate Coverage
[Figure: four plots of predicate coverage against time (seconds), for Binary tree, Binomial heap, Fibonacci heap and Tree map, each comparing feedback-directed, best systematic and undirected random generation.]

Bug Detection
                                                                                     LOC     Classes
    JDK (2 libraries: java.util, javax.xml)                                          53K     272
    Apache commons (5 libraries: logging, primitives, chain jelly, math, collections)   114K    974
    .Net framework (5 libraries)                                                     582K    3330
How would Korat perform on these examples?
Methodology
• Ran Randoop on each library
  – Used default time limit (2 minutes)
• Contracts:
  – o.equals(o) == true
  – o.equals(o) throws no exception
  – o.hashCode() throws no exception
  – o.toString() throws no exception
  – No null inputs and:
    • Java: no NullPointerExceptions
    • .NET: no NPEs, out-of-bounds, or illegal state exceptions

Results
                      test cases output   error-revealing test cases   distinct errors
    JDK               32                  29                           8
    Apache commons    187                 29                           6
    .Net framework    192                 192                          192
    Total             411                 250                          206

Errors found: examples
• JDK Collections classes have 4 methods that create objects violating the o.equals(o) contract
• Javax.xml creates objects that cause hashCode and toString to crash, even though the objects are well-formed XML constructs
• Apache libraries have constructors that leave fields unset, leading to NPE on calls of equals, hashCode and toString (this only counts as one bug)
• Many Apache classes require a call of an init() method before an object is legal, which led to many false positives
• .Net framework has at least 175 methods that throw an exception forbidden by the library specification (NPE, out-of-bounds, or illegal state exception)
• .Net framework has 8 methods that violate o.equals(o)
• .Net framework loops forever on a legal but unexpected input

Regression testing
• Randoop can create regression oracles
    Object o = new Object();
    LinkedList l = new LinkedList();
    l.addFirst(o);
    l.add(o);
    assertEquals(2, l.size());          // expected to pass
    assertEquals(false, l.isEmpty());   // expected to pass
• Generated test cases using JDK 1.5
  – Randoop generated 41K regression test cases
• Ran resulting test cases on
  – JDK 1.6 Beta
    • 25 test cases failed
  – Sun’s implementation of the JDK
    • 73 test cases failed
  – Failing test cases pointed to 12 distinct errors
  – These errors were not found by the extensive compliance test suite that Sun provides to JDK developers

Evaluation: summary
• Feedback-directed random test generation:
  – Is effective at finding errors
    • Discovered several errors in real code
      (e.g., JDK, .NET framework core libraries)
  – Can outperform systematic input generation
    • On previous benchmarks and metrics (coverage), and
    • On a new, larger corpus of subjects, measuring error detection
  – Can outperform undirected random test generation

Conclusion
• Feedback-directed random test generation
  – Finds errors in widely-used, well-tested libraries
  – Can outperform systematic test generation
  – Can outperform undirected test generation
• Randoop:
  – Easy to use: just point it at a set of classes
  – Has real clients: used by product groups at Microsoft
• A mid-point in the systematic-random space of input generation techniques

Exercise 2
Apply Randoop, manually, to OrderSet.java
• Create 2 valid tests, one redundant test and one illegal sequence.
• Create a test case to expose the bug.

Research Question
How do we improve Korat or Randoop?