Metamorphic Testing Techniques to Detect Defects in Applications without Test Oracles
Christian Murphy
Thesis Defense
April 12, 2010

Overview
- Software testing is important!
- Certain types of applications are particularly hard to test because there is no “test oracle”
  - Machine Learning, Discrete Event Simulation, Optimization, Scientific Computing, etc.
- Even when there is no oracle, it is possible to detect defects if properties of the software are violated
- My research introduces and evaluates new techniques for testing such “non-testable programs” [Weyuker, Computer Journal’82]

Motivating Example: Machine Learning

Motivating Example: Simulation
[Figure: Length of Stay versus Utilization. Plots LOS (units of time) and Doctor, Nurse, Triage, and Clerk utilization (percent utilization) against number of beds.]

Problem Statement
- Partial oracles may exist for a limited subset of the input domain in applications such as Machine Learning, Discrete Event Simulation, Scientific Computing, Optimization, etc.
- Obvious errors (e.g., crashes) can be detected with certain inputs or testing techniques
- However, it is difficult to detect subtle computational defects in applications without test oracles in the general case

What do I mean by “defect”?
- Deviation of the implementation from the specification
- Violation of a sound property of the software
- “Discrete localized” calculation errors
  - Off-by-one
  - Incorrect sentinel values for loops
  - Wrong comparison or mathematical operator
- Misinterpretation of specification
  - Parts of input domain not handled
  - Incorrect assumptions made about input

Observation
- Many programs without oracles have properties such that certain changes to the input yield predictable changes to the output
- We can detect defects in these programs by looking for any violations of these “metamorphic properties”
- This is known as “metamorphic testing” [T.Y. Chen et al., Info. & Soft.
Tech. vol.44, 2002]

Research Goals
- Facilitate the way that metamorphic testing is used in practice
- Develop new testing techniques based on metamorphic testing
- Demonstrate the effectiveness of metamorphic testing techniques

Hypotheses
- For programs that do not have a test oracle, an automated approach to metamorphic testing is more effective at detecting defects than other approaches
- An approach that conducts function-level metamorphic testing in the context of a running application will further increase the effectiveness
- It is feasible to continue this type of testing in the deployment environment, with minimal impact on the end user

Contributions
1. A set of guidelines to help identify metamorphic properties
2. New empirical studies comparing the effectiveness of metamorphic testing to other approaches
3. An approach for detecting defects in non-deterministic applications called Heuristic Metamorphic Testing
4. A new testing technique called Metamorphic Runtime Checking based on function-level metamorphic properties
5. A generalized technique for testing in the deployment environment called In Vivo Testing

Outline
- Background
- Related Work
- Metamorphic Testing
- Metamorphic Testing Empirical Studies
- Metamorphic Runtime Checking
- Future Work & Conclusion

Other Approaches [Baresi & Young, 2001]
- Formal specifications
  - A complete specification is essentially a test oracle
- Embedded assertions
  - Can check that the software behaves as expected
- Algebraic properties
  - Used to generate test cases for abstract datatypes
- Trace checking & log file analysis
  - Analyze intermediate results and sequence of executions

Metamorphic Testing [Chen et al., 2002]
[Diagram: the initial test case x is run through f to produce f(x). A transformation function t, based on metamorphic properties of f, produces the new test case t(x), which is run through f to produce f(t(x)).]
- f(x) and f(t(x)) are “pseudo-oracles”
- If new test case output f(t(x)) is as expected, it is not necessarily correct
- However, if f(t(x)) is not as expected, either f(x) or f(t(x)) – or both!
– is wrong

Metamorphic Testing Example
- Consider a function to determine the standard deviation of a set of numbers

  Initial input:      a    b    c    d    e    f     -> std_dev -> s
  New test case #1:   c    e    b    a    f    d     -> std_dev -> s ?
  New test case #2:   a+2  b+2  c+2  d+2  e+2  f+2   -> std_dev -> s ?
  New test case #3:   2a   2b   2c   2d   2e   2f    -> std_dev -> 2s ?

Empirical Study
- Is metamorphic testing more effective than other approaches in detecting defects in applications without test oracles?
- Approaches investigated
  - Metamorphic Testing: using metamorphic properties of the entire application
  - Runtime Assertion Checking: using Daikon-detected program invariants
  - Partial Oracle: simple inputs for which correct output can easily be determined

Applications Investigated
- Machine Learning
  - C4.5: decision tree classifier
  - MartiRank: ranking
  - Support Vector Machines (SVM): vector-based classifier
  - PAYL: anomaly-based intrusion detection system
- Discrete Event Simulation
  - JSim: used in simulating hospital ER
- Information Retrieval
  - Lucene: Apache framework’s text search engine
- Optimization
  - gaffitter: genetic algorithm approach to bin-packing problem

Methodology
- Mutation testing was used to seed defects into each application
  - Comparison operators were reversed
  - Math operators were changed
  - Off-by-one errors were introduced
- For each program, we created multiple versions, each with exactly one mutation
- We ignored mutants that yielded outputs that were obviously wrong, caused crashes, etc.
- Effectiveness is determined by measuring what percentage of the mutants were “killed”

Experimental Results
[Figure: bar chart of % of mutants killed by Partial Oracle, Runtime Assertion Checking, and Metamorphic Testing for C4.5, MartiRank, SVM, PAYL, JSim, Lucene, gaffitter, and TOTAL.]

Analysis of Results
- Assertions are good for checking bounds and relationships but not for changes to values
- Metamorphic testing is particularly good for detecting errors in loop conditions
- Metamorphic testing was not very effective for PAYL (5%) and gaffitter (33%)
  - fewer properties identified
  - defects had little impact on output

Metamorphic Runtime Checking
- Results of previous study revealed limitations of scope and robustness in metamorphic testing
- What if we consider the metamorphic properties of individual functions and check those properties as the entire program is running?
- A combination of metamorphic testing and runtime assertion checking

Metamorphic Runtime Checking
- Tester specifies the metamorphic properties of individual functions using a special notation in the code (based on JML)
- Pre-processor instruments code with corresponding metamorphic tests
- Tester runs entire program as normal (e.g., to perform system tests)
- Violation of any property reveals a defect

MRC Model of Execution
- The metamorphic test is conducted at the same point in the program execution as the original function call
- The metamorphic test runs in parallel with the rest of the application
[Diagram: function f is about to be executed with input x in state S. Main path: create a sandbox for the test, execute f(x) to get result, send result to test, program continues. Metamorphic test path: transform input to get t(x), execute f(t(x)), compare outputs, report violations.]

Empirical Study
- Can Metamorphic Runtime Checking detect defects not found by system-level metamorphic testing?
- Same mutants used in previous study
  - 29% were not found by metamorphic testing
- Metamorphic properties identified at function level using suggested guidelines

Experimental Results
[Figure: stacked bar chart of % of defects detected for C4.5, MartiRank, SVM, PAYL, JSim, Lucene, gaffitter, and TOTAL. Series: MRC only, Both, MT only.]

Analysis of Results
- Scope: Function-level testing allowed us to:
  - identify additional metamorphic properties
  - execute more tests
- Robustness: Metamorphic testing “inside” the application detected subtle defects that did not have much effect on the overall program output

Combined Results
[Figure: bar chart comparing Partial Oracle, Runtime Assertion Checking, Metamorphic Testing, and MT + MRC for C4.5, MartiRank, SVM, PAYL, JSim, Lucene, gaffitter, and TOTAL.]

Results
- Demonstrated that metamorphic testing advances the state of the art in detecting defects in applications without test oracles
- Proved that Metamorphic Runtime Checking will reveal defects not found by using system-level properties
- Showed that it is feasible to continue this type of testing in the deployment environment, with minimal impact on the end user

Short-Term Opportunities
- Automatic detection of metamorphic properties
  - Using dynamic and/or static techniques
- Fault localization
  - Once a defect has been detected, figure out where it occurred and how to fix it
- Implementation issues
  - Reducing overhead
  - Handling external databases, network traffic, etc.

Long-Term Directions
- Testing of multi-process or distributed applications in these domains
- Collaborative defect detection and notification
- Investigate the impact on the software development processes used in the domains of non-testable programs

Contributions & Accomplishments
1. A set of metamorphic testing guidelines
2. New empirical studies
3. Heuristic Metamorphic Testing
4. Metamorphic Runtime Checking
5. In Vivo Testing
Publications cited on this slide:
- [Murphy, Shen, Kaiser; ISSTA’09]
- [Xie, Ho, Murphy, Kaiser, Xu, Chen; QSIC’09]
- [Murphy, Kaiser, Hu, Wu; SEKE’08]
- [Murphy, Shen, Kaiser; ICST’09]
- [Murphy, Kaiser, Vo, Chu; ICST’09]
- [Murphy, Vaughan, Ilahi, Kaiser; AST’10]

Thank you!

Backup Slides!

Motivation

Assessment of Quality
- 1994: Hatton et al. pointed out a “disturbing” number of defects due to calculation errors in scientific computing software [TSE vol.20]
- 2007: Hatton reports that “many scientific results are corrupted, perhaps fatally so, by undiscovered mistakes in the software used to calculate and present those results” [Computer vol.40]

Complexity vs. Effectiveness
[Figure: approaches plotted by Complexity (horizontal axis) and Effectiveness (vertical axis): Metamorphic Runtime Checking, System-level Metamorphic Testing, Formal Specifications, Embedded Assertions, Algebraic Specifications, Trace Checking & Log Analysis.]

Metamorphic Properties

Categories of Metamorphic Properties
- Additive: Increase (or decrease) numerical values by a constant
- Multiplicative: Multiply numerical values by a constant
- Permutative: Randomly permute the order of elements in a set
- Invertive: Negate the elements in a set
- Inclusive: Add a new element to a set
- Exclusive: Remove an element from a set
- Compositional: Compose a set

Sample Metamorphic Properties
1. Permuting the order of the examples in the training data should not affect the model
2. If all attribute values in the training data are multiplied by a positive constant, the model should stay the same
3. If all attribute values in the training data are increased by a positive constant, the model should stay the same
4. Updating a model with a new example should yield the same model created with training data originally containing that example
5. If all attribute values in the training data are multiplied by -1, and an example to be classified is also multiplied by -1, the classification should be the same
6. Permuting the order of the examples in the testing data should not affect their classification
7. If all attribute values in the training data are multiplied by a positive constant, and an example to be classified is also multiplied by the same positive constant, the classification should be the same
8. If all attribute values in the training data are increased by a positive constant, and an example to be classified is also increased by the same positive constant, the classification should be the same

Other Classes of Properties (1)
- Statistical: Same mean, variance, etc.
  as the original
- Heuristic: Approximately equal to the original
- Semantically Equivalent: Domain specific

Other Classes of Properties (2)
- Noise Based: Add/change data that should not affect result
- Partial: Change to part of input only affects part of output
- Compositional: New input relies on original output
  - ShortestPath(a, b) = ShortestPath(a, c) + ShortestPath(c, b)

Automatic Detection of Properties
- Static
  - Use machine learning to model what code that exhibits certain properties looks like, then determine whether other code matches that model
  - Use symbolic execution to check “algebraically”
- Dynamic
  - Observe multiple executions and infer properties

Automated Metamorphic Testing

Automated Metamorphic Testing
- Tester specifies the application’s metamorphic properties
- Test framework does the rest:
  - Transform inputs
  - Execute program with each input
  - Compare outputs according to specification

AMST Model

Specifying Metamorphic Properties

Heuristic Metamorphic Testing

Statistical Metamorphic Testing
- Introduced by Guderlei & Mayer in 2007
- The application is run multiple times with the same input to get a mean value μ0 and variance σ0
- Metamorphic properties are applied
- The application is run multiple times with the new input to get a mean value μ1 and variance σ1
- If the means are not statistically similar, then the property is considered violated

Heuristic Metamorphic Testing
- When we expect that a change to the input will produce “similar” results, but cannot determine the expected similarity in advance
- Use input X to generate outputs M1 through Mk
- Use some metric to create a profile of the outputs
- Use input X’ (created according to a metamorphic property) to generate outputs N1 through Nk
- Create a profile of those outputs
- Use statistical techniques (e.g.
  Student t-test) to check that the profile of outputs N is similar to that of outputs M

Heuristic Metamorphic Testing
[Diagram: input x is run n times through the non-deterministic function nd_f to produce y1…yn, from which a profile is built. The transformed input t(x) is run n times through nd_f to produce y’1…y’n, from which a second profile is built.]
- Do the profiles demonstrate the expected relationship?

HMT Example
[Figure: a data set P containing missing values (“?”) is permuted to give P’, and both are sorted. Is sort P = sort P’?]
- Build a profile based on normalized equivalence
- Build a profile based on normalized equivalence and compare it statistically to the first profile

HMT Empirical Study
- Is Heuristic Metamorphic Testing more effective than other approaches in detecting defects in non-deterministic applications without test oracles?
- Approaches investigated
  - Heuristic Metamorphic Testing
  - Embedded Assertions
  - Partial Oracle
- Applications investigated
  - MartiRank: sorting sparse data sets
  - JSim: non-deterministic event timing

HMT Study Results & Analysis
- Heuristic Metamorphic Testing killed 59 of the 78 mutants
- Partial oracle and assertion checking were ineffective for JSim because no single execution was outside the specified range

Metamorphic Runtime Checking

Extensions to JML

Creating Test Functions

  /*@
    @meta std_dev(\multiply(A, 2)) == \result * 2
   */
  public double __std_dev(double[] A) {
      ...
  }

  protected boolean __MRCtest0_std_dev(double[] A, double result) {
      return Columbus.approximatelyEqualTo(
          __std_dev(Columbus.multiply(A, 2)), result * 2);
  }

Instrumentation

  public double std_dev(double[] A) {
      // call original function and save result
      double result = __std_dev(A);
      // create sandbox
      int pid = Columbus.createSandbox();
      // program continues as normal
      if (pid != 0)
          return result;
      else {
          // run test in child process
          if (!__MRCtest0_std_dev(A, result))
              Columbus.fail(); // handle failure
          Columbus.exit(); // clean up
          // not reached if Columbus.exit() ends the child process (assumed);
          // present only so that every path returns a value
          return result;
      }
  }

MRC: Case Studies
- We investigated the WEKA and RapidMiner toolkits for Machine Learning in Java
- For WEKA, we tested four apps:
  - Naïve Bayes, Support Vector Machines (SVM), C4.5 Decision Tree, and k-Nearest Neighbors
- For RapidMiner, we tested one app:
  - Naïve Bayes

MRC: Case Study Setup
- For each of the five apps, we specified 4-6 metamorphic properties of selected methods (based on our knowledge of the expected behavior of the overall application)
- Testing was conducted using data sets from UCI Machine Learning Repository
- Goal was to determine whether the properties held as expected

MRC: Case Study Findings
- Discovered defects in WEKA k-NN and WEKA Naïve Bayes related to modifying the machine learning “model”
  - This was the result of a variable not being updated appropriately
- Discovered a defect in RapidMiner Naïve Bayes related to determining confidence
  - There was an error in the calculation

Metamorphic Testing Experimental Study

Approaches Not Investigated
- Formal specification
  - Issues related to completeness
  - Prev.
    work converted specifications to invariants
- Algebraic properties
  - Not appropriate at system-level
  - Automatic detection only supported in Java
- Log/trace file analysis
  - Need more detailed knowledge of implementation
- Pseudo-oracles
  - None appropriate for applications investigated

Methodology: Metamorphic Testing
- Each variant (containing one mutation) acted as a pseudo-oracle for itself:
  - Program was run to produce an output with the original input dataset
  - Metamorphic properties applied to create new input datasets
  - Program run on new inputs to create new outputs
  - If outputs not as expected, the mutant had been killed (i.e. the defect had been detected)

Methodology: Partial Oracle
- Data sets were chosen so that the correct output could be calculated by hand
- These data sets were typically smaller than the ones used for other approaches
- To ensure fairness, the data sets were selected so that the line coverage was approximately the same for each approach

Methodology: Runtime Assertion Checking
- Daikon was used to detect program invariants in the “gold standard” implementation
- Because Daikon can generate spurious invariants, programs were run with a variety of inputs, and obvious spurious invariants were discarded
- Invariants then checked at runtime

Defects Detected in Study #1

Study #1: SVM Results
- Permuting the input was very effective at killing off-by-one mutants
- Many functions in SVM analyze a set of numbers (mean, standard dev, etc.)
- Off-by-one mutants caused some element of the set to be omitted
- By permuting, a different number would be omitted
- This revealed the defect

Study #1: SVM Example
- Permuting the input reveals this defect because both m_I1 and m_I4 will be different
- Partial oracle does not because only one element is omitted, so one will remain the same; for small data sets, this did not affect the overall result

Study #1: C4.5 Results
- Negating the input was very effective
- C4.5 creates a decision tree in which nodes contain clauses like “if attr_n > α then class = C”
- If the data set is negated, those nodes should change to “if attr_n ≤ -α then class = C”, i.e. both the operator and the sign of α change
- In most cases, only one of the changes occurred

Study #1: C4.5 Example
- Mutant causes ClassFreq to have negative values, violating assertion
- Permuting the order of elements does not affect the output in this case

Study #1: MartiRank Results
- Permuting and negating were effective at killing comparison operator mutants
- MartiRank depends heavily on sorting
- Permuting and negating change which numbers get sorted and what the result should be, thus inducing the differences in the final sorted list

Study #1: Effectiveness of Properties

Study #1: Lucene Results
- Most mutants gave a non-zero score to the term “foo”, thus L3 detected the defect

Study #1: gaffitter Results
- G1: increasing the number of generations should increase the overall quality
- G2: multiplying item and bin sizes by a constant should not affect the solution
- Most of the defects killed by G1 related to incorrectly selecting candidate solutions

Empirical Studies: Threats to Validity
- Representativeness of selected programs
- Types of defects
- Data sets
- Daikon-generated program invariants
- Selection of metamorphic properties

Metamorphic Runtime Checking Experimental Study

Study #2 Results
- If we only consider functions for which metamorphic properties were identified, there were 189 total mutants
- MRC detected 96.3%,
  compared to 67.7% for system-level metamorphic testing

Study #2 PAYL Results
- Both functions call numerous other functions, but we can circumvent restrictions on the input domain
- Permuting input tends to kill off-by-one mutants

Study #2 gaffitter Results

Study #2: gaffitter Example
- Genetic Algorithm takes two sets and “crosses over” at a particular element
    (1 2 3 4 5) crossed with (6 7 8 9)  ->  1 2 3 9
- Metamorphic Property: If we switch the order, the new output should be predictable
    (6 7 8 9) crossed with (1 2 3 4 5)  ->  6 7 8 4 5
- Simply, the elements not included in the original cross-over

Study #2: gaffitter Example
- Now consider a defect in which the cross-over happens at the wrong point
    (1 2 3 4 5) crossed with (6 7 8 9)  ->  1 2 3 8 9
    (6 7 8 9) crossed with (1 2 3 4 5)  ->  6 7 8 3 4 5
- Metamorphic property is violated: elements 3 and 8 should not appear in both sets

Study #2: gaffitter Example
[Figure: the cross-over of (1 2 3 4 5) and (6 7 8 9) under the correct implementation and under the erroneous implementation; the two results, 1 2 3 9 and 1 2 3 8 9, differ only in element 8.]
- This defect is only detected by system-level metamorphic testing if element 8 has any impact on the “quality” of the final solution. However, a single element is unlikely to do so.

Study #2 Lucene Results
- MRC killed three mutants not killed by MT
- All three were in the idf function

Study #2: Lucene Example
- Search query results are ordered according to a score
    Query “ROMEO or JULIET”:
      Act 3 Scene 5    5.837
      Act 2 Scene 4    4.681
      Act 5 Scene 1    3.377
- Consider a defect in which the scores are off by one. The results stay the same because only the order is important.
    Query “ROMEO or JULIET” with the defect:
      Act 3 Scene 5    6.837
      Act 2 Scene 4    5.681
      Act 5 Scene 1    4.377
- Partial oracle does not reveal this defect because the scores cannot be calculated in advance.

Study #2: Lucene Example
- System-level metamorphic property: changing the query order shouldn’t affect result
    Queries “ROMEO or JULIET” and “JULIET or ROMEO” both return:
      Act 3 Scene 5    6.837
      Act 2 Scene 4    5.681
      Act 5 Scene 1    4.377
- Even though the defect exists, the property still holds and the defect is not detected.

Study #2: Lucene Example
- The score itself is computed as the result of many subcalculations.
    Score(q) = ∑Similarity(f)*Weight(qi) + … + idf(q) + …
- Metamorphic Runtime Checking can detect that there is an error in this function by checking its individual (mathematical) properties.

In Vivo Testing

Generalization of MRC
- In Metamorphic Runtime Checking, the software tests itself
- Why only run metamorphic tests?
- Why limit ourselves only to applications without test oracles?
- Why not allow the software to continue testing itself as it runs in the production environment?

In Vivo Testing
- An approach whereby software tests itself in the production environment by running any type of test (unit, integration, “parameterized unit”, etc.) at specified program points
- Tests are run in a sandbox so as not to affect the original process
- Invite implementation: less than half a millisecond overhead per test

Example of Defect: Cache

  private int numItems = 0;        // number of items in the cache
  private int currSize = 0;        // their size (in bytes)
  private int maxCapacity = 1024;  // maximum capacity, in bytes

  public int getNumItems() {
      return numItems;
  }

  public boolean addItem(CacheItem i) throws ... {
      numItems++;  // defect: should only be incremented within "if" block
      if (currSize + i.size < maxCapacity) {
          add(i);
          currSize += i.size;
          return true;
      } else {
          return false;
      }
  }

Insufficient Unit Test

  public void testAddItem() {
      Cache c = new Cache();
      assert(c.addItem(new CacheItem()));
      assert(c.getNumItems() == 1);
      assert(c.addItem(new CacheItem()));
      assert(c.getNumItems() == 2);
  }

1. Assumes an empty/new cache
2.
   Doesn’t take into account various states that the cache can be in

Defects Targeted
1. Unit tests that make incomplete assumptions about the state of objects in the application
2. Possible field configurations that were not tested in the lab
3. A legal user action that puts the system in an unexpected state
4. A sequence of unanticipated user actions that breaks the system
5. Defects that only appear intermittently

In Vivo: Model of Execution
[Flowchart: function is about to be executed -> Run a test? If no: execute function, rest of program continues. If yes: create sandbox and fork; the test process runs the test and stops, while the original process executes the function and the rest of the program continues.]

Writing In Vivo Tests

  /* Method to be tested */
  public boolean addItem(CacheItem i) { . . . }

  /* JUnit style test */
  public void testAddItem() {
      Cache c = new Cache();
      if (c.addItem(new CacheItem()))
          assert (c.getNumItems() == 1);
  }

  /* In Vivo style test */
  public boolean testAddItem(CacheItem i) {
      Cache c = this;
      int oldNumItems = getNumItems();
      if (c.addItem(i))
          return (c.getNumItems() == oldNumItems+1);
      else
          return true;
  }

Instrumentation

  /* Method to be tested */
  public boolean __addItem(CacheItem i) { . . . }

  /* In Vivo style test */
  public boolean testAddItem(CacheItem i) { ... }

  public boolean addItem(CacheItem i) {
      if (Invite.runTest("Cache.addItem")) {
          Invite.createSandboxAndFork();
          if (Invite.isTestProcess()) {
              if (testAddItem(i) == false) Invite.fail();
              else Invite.succeed();
              Invite.destroySandboxAndExit();
          }
      }
      return __addItem(i);
  }

In Vivo Testing: Case Studies
- Applied testing approach to two caching systems
  - OSCache 2.1.1
  - Apache JCS 1.3
- Both had known defects that were found by users (no corresponding unit tests for these defects)
- Goal: demonstrate that “traditional” unit tests would miss these but In Vivo testing would detect them

In Vivo Testing: Experimental Setup
- An undergraduate student created unit tests for the methods that contained the defects
- These tests passed in “development”
- Student was then asked to convert the unit tests to In Vivo tests
- Driver created to simulate real usage in a “deployment environment”

In Vivo Testing: Discussion
- In Vivo testing revealed all defects, even though unit testing did not
- Some defects only appeared in certain states, e.g.
  when the cache was at full capacity
  - These are the very types of defects that In Vivo testing is targeted at
- However, the approach depends heavily on the quality of the tests themselves

In Vivo Testing: Performance

More Robust Sandboxes
- “Safe” test case selection [Willmor and Embury, ICSE’06]
- Copy-on-write database snapshots
  - MS SQL Server v8

In Vivo Testing: Related Work
- Self-checking Software
  - COTS components [S.Beydeda, COMPSAC’06]
- Gamma [A.Orso et al., ISSTA’02]
- Skoll [A.Memon et al., ICSE’03]
- Cooperative Bug Isolation [B.Liblit et al., PLDI’03]
- Property-based Software Testing
  - D.Rosenblum: runtime assertion checking
  - I.Nunes: checking algebraic properties [ICFEM’06]

Related Work

Limitations of Other Approaches
- Formal specification languages
  - Issues related to completeness
  - Balance between expressiveness and implementability
- Algebraic properties
  - Useful for data structures, but not for arbitrary functions or entire programs
  - Limitations of previous work in runtime checking
- Log/trace file analysis
  - Requires careful planning in advance

Previous Work in MT
- T.Y.Chen et al.: applying metamorphic testing to applications without oracles [Info. & Soft. Tech. vol.44, 2002]
- Domain-specific testing
  - Graphics [J.Mayer and R.Guderlei, QSIC’07]
  - Bioinformatics [T.Y.Chen et al., BMC Bioinf. 10(24), 2009]
  - Middleware [W.K.Chan et al., QSIC’05]
  - Others…

Previous Studies [Hu et al., SOQUA’06]
- Invariants hand-generated
- Smaller programs
- Only deterministic applications
- Didn’t consider partial oracle

Developer Effort [Hu et al., SOQUA’06]
- Students were given three-hour training sessions on MT and on assertion checking
- Given three hours to identify metamorphic properties and program invariants
- Averaged about the same number of metamorphic properties as invariants
- The metamorphic properties were more effective at killing mutants

Fault Localization
- Delta debugging [Zeller, FSE’02]
  - Compare trace of failed execution vs.
    successful ones
- Cooperative Bug Isolation [Liblit et al., PLDI’03]
  - Numerous instances report results and failed execution is compared to those
- Statistical approach [Baah, Gray, Harrold; SoQUA’06]
  - Combines model of normal behavior with runtime monitoring
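
Backup: Sketch of the Standard Deviation Example

The following is a minimal sketch, not code from the talk, of how the three properties on the Metamorphic Testing Example slide (permute, add a constant, multiply by a constant) might be checked, and of how the permutative property can kill an off-by-one mutant as described on the Study #1 SVM slides. All names here (StdDevMetamorphicSketch, stdDevMutant, rotate, violatedProperties) are hypothetical and chosen for illustration only. The permutation is a fixed rotation rather than a random shuffle so that the run is repeatable, and the mutant is assumed to omit the last element of the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Illustrative sketch only; none of these names come from the talk's tooling. */
final class StdDevMetamorphicSketch {

    /** Population standard deviation over the first n elements of a. */
    private static double stdDevOfPrefix(double[] a, int n) {
        double mean = 0;
        for (int i = 0; i < n; i++) mean += a[i];
        mean /= n;
        double sumSq = 0;
        for (int i = 0; i < n; i++) sumSq += (a[i] - mean) * (a[i] - mean);
        return Math.sqrt(sumSq / n);
    }

    /** Intended behavior: every element is used. */
    static double stdDev(double[] a) {
        return stdDevOfPrefix(a, a.length);
    }

    /** Seeded off-by-one mutant (assumed form): the last element is omitted. */
    static double stdDevMutant(double[] a) {
        return stdDevOfPrefix(a, a.length - 1);
    }

    /** Tolerant comparison, since floating-point sums depend on element order. */
    static boolean approxEqual(double x, double y) {
        double scale = Math.max(1.0, Math.max(Math.abs(x), Math.abs(y)));
        return Math.abs(x - y) <= 1e-9 * scale;
    }

    /** Deterministic permutation: moves the last element to the front. */
    static double[] rotate(double[] a) {
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) out[(i + 1) % a.length] = a[i];
        return out;
    }

    /** Additive transformation: adds c to every element. */
    static double[] add(double[] a, double c) {
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) out[i] = a[i] + c;
        return out;
    }

    /** Multiplicative transformation: multiplies every element by c. */
    static double[] multiply(double[] a, double c) {
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) out[i] = a[i] * c;
        return out;
    }

    /** Runs f on x and on each t(x); returns the names of violated properties. */
    static List<String> violatedProperties(ToDoubleFunction<double[]> f, double[] a) {
        double s = f.applyAsDouble(a);
        List<String> violated = new ArrayList<>();
        if (!approxEqual(f.applyAsDouble(rotate(a)), s)) violated.add("permutative");
        if (!approxEqual(f.applyAsDouble(add(a, 2)), s)) violated.add("additive");
        if (!approxEqual(f.applyAsDouble(multiply(a, 2)), 2 * s)) violated.add("multiplicative");
        return violated;
    }

    public static void main(String[] args) {
        double[] input = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println("original: "
            + violatedProperties(StdDevMetamorphicSketch::stdDev, input));
        System.out.println("mutant:   "
            + violatedProperties(StdDevMetamorphicSketch::stdDevMutant, input));
    }
}
```

In this sketch the additive and multiplicative properties still hold for the mutant, because the same element is omitted before and after the transformation. Only the permutation changes which element is dropped, which is consistent with the observation that permuting the input was effective at killing off-by-one mutants.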