Lecture 7: Advanced Topics in Testing

Mutation Testing
• Mutation testing evaluates test suites for their inherent quality, i.e. their ability to reveal errors.
• We need an objective method to determine this quality.
• It differs from structural coverage, since it tries to define "what is an error".
• The basic idea is to inject defined errors into the SUT and evaluate whether a given test suite finds them.
• "Killing a mutation"

Basic Idea
• We can statistically estimate (say) the number of fish in a lake by releasing a number of marked fish, and then counting the marked fish in a subsequent small catch.
• Example: release 20 marked fish.
• Catch 40 fish, of which 4 are marked.
• Then 1 in 10 is marked, so we estimate 10 × 20 = 200 fish in the lake.
• Can we pursue the same idea with marked SW bugs?

Mutations and Mutants
• The "marked fish" are injected errors, termed mutations.
• The mutated code is termed a mutant.
• Example: replace < by > in one Boolean expression:
    if ( x < 0 ) then …   becomes   if ( x > 0 ) then …
• If the test suite finds the mutation, we say this particular mutant is killed.
• Make a large set of mutants, typically using a checklist of known mutations (a mutation tool).

Mutation Score
• The idea is that if we can kill a mutant, we can identify a real bug too.
• Mutants which are semantically equivalent to the original code are called equivalents.
• Write Q ≡ P if Q and P are equivalents.
• Clearly we cannot kill equivalents.
• Mutation score (%) = (number of killed mutants / number of non-equivalent mutants) × 100
• (A small worked sketch of killing and scoring follows at the end of this part.)

Why Should It Work?
• Two assumptions are used in this field: the competent programmer hypothesis, i.e. "the program is mostly correct", and the coupling effect.

Semantic Neighbourhoods
• Let Φ be the set of all programs semantically close to P (defined in various ways).
• Φ is the neighbourhood of P.
• Let T be a test suite, and f : D → D a functional specification of P.
• Traditionally we assume: (∀t ∈ T : P.t = f(t)) ⟹ (∀x ∈ D : P.x = f(x))
• i.e. T is a reliable test suite.
• In general this requires exhaustive testing.

Competent Programmer Hypothesis
• P is pathological iff P ∉ Φ.
• Assume programmers have some competence.
• Mutation testing assumption: either P is pathological, or else (∀t ∈ T : P.t = f(t)) ⟹ (∀x ∈ D : P.x = f(x)).
• We can now focus on building a test suite T that distinguishes P from all other programs in Φ.

Coupling Effect
• The competent programmer hypothesis reduces the problem from infinite to finite.
• But the remaining problem is still too large.
• The coupling effect says that there is a small subset μ ⊆ Φ such that we only need to distinguish P from all programs in μ by tests in T.

Problems
• Can we be sure the coupling effect holds? Do simple syntactic changes define such a set?
• Can we detect and count equivalents? If we cannot kill a mutant Q, is Q ≡ P, or is Q just hard to kill?
• How large is μ? It may still be too large to be practical.

Equivalent Mutants
• Offutt and Pan [1997] estimated that 9% of all mutants are equivalent.
• Bybro [2003] concurs, with an estimate of 8%.
• Automatic detection algorithms (basically static analysers) detect about 50% of these.
• These use theorem proving (verification) techniques.

Coupling Effect
• For Q an incorrect version of P:
    semantic error size of Q = Pr( Q.x ≠ P.x ) for a randomly chosen input x
• If for every semantically large fault there is an overlap with at least one small syntactic fault, then the coupling effect holds.
• Selective mutation is based on a small set of semantically small, "hard to kill" errors.
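To make killing and scoring concrete, here is a minimal sketch in plain Java (all class and method names are illustrative, not taken from any real mutation tool). It applies two hand-made relational-operator mutants to the one-line predicate from the example above and scores two test suites against them:

import java.util.List;
import java.util.function.IntPredicate;

public class MutationDemo {

    // Original program under test: is x negative?
    static boolean original(int x) { return x < 0; }

    // Mutant 1: < replaced by > (the example from the slides).
    static boolean mutant1(int x) { return x > 0; }

    // Mutant 2: < replaced by <= ; it differs from the original only at x == 0.
    static boolean mutant2(int x) { return x <= 0; }

    // A mutant is killed if some test input makes it disagree with the original.
    static boolean killed(IntPredicate mutant, List<Integer> tests) {
        return tests.stream().anyMatch(t -> mutant.test(t) != original(t));
    }

    public static void main(String[] args) {
        List<Integer> weakSuite   = List.of(-5, 7);     // misses the x == 0 boundary
        List<Integer> strongSuite = List.of(-5, 0, 7);  // covers the boundary

        List<IntPredicate> mutants =
                List.of(MutationDemo::mutant1, MutationDemo::mutant2);

        for (List<Integer> suite : List.of(weakSuite, strongSuite)) {
            long kills = mutants.stream().filter(m -> killed(m, suite)).count();
            // Mutation score (%) = killed / non-equivalent mutants * 100.
            // Both mutants here are non-equivalent, so the denominator is 2.
            System.out.printf("suite %s: mutation score = %.0f%%%n",
                    suite, 100.0 * kills / mutants.size());
        }
    }
}

The weak suite kills only mutant 1 (score 50%), while adding the boundary input 0 also kills mutant 2 (score 100%): the score rewards test suites that probe boundary behaviour.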
Early Research: 22 Standard (Fortran) Mutation Operators
AAR  Array reference for array reference replacement
ABS  Absolute value insertion
ACR  Array reference for constant replacement
AOR  Arithmetic operator replacement
ASR  Array reference for scalar replacement
CAR  Constant for array reference replacement
CNR  Comparable array name replacement
CRP  Constant replacement
CSR  Constant for scalar variable replacement
DER  DO statement end replacement
DSA  DATA statement alterations
GLR  GOTO label replacement
LCR  Logical connector replacement
ROR  Relational operator replacement
RSR  RETURN statement replacement
SAN  Statement analysis
SAR  Scalar for array replacement
SCR  Scalar for constant replacement
SDL  Statement deletion
SRC  Source constant replacement
SVR  Scalar variable replacement
UOI  Unary operator insertion

Recent Research: Java Mutation Operators
The first letter of each operator name indicates its category:
  A = access control
  E = common programming mistakes
  I = inheritance
  J = Java-specific features
  O = method overloading
  P = polymorphism

AMC  Access modifier change
EAM  Accessor method change
EMM  Modifier method change
EOA  Reference assignment and content assignment replacement
EOC  Reference comparison and content comparison replacement
IHD  Hiding variable deletion
IHI  Hiding variable insertion
IOD  Overriding method deletion
IOP  Overriding method calling position change
IOR  Overridden method rename
IPC  Explicit call of parent's constructor deletion
ISK  super keyword deletion
JDC  Java-supported default constructor creation
JID  Member variable initialisation deletion
JSC  static modifier change
JTD  this keyword deletion
OAO  Argument order change
OAN  Argument number change
OMD  Overloaded method deletion
OMR  Overloaded method contents change
PMD  Instance variable declaration with parent class type
PNC  new method call with child class type
PPD  Parameter variable declaration with child class type
PRV  Reference assignment with other compatible type

Practical Example
• The triangle program
• Myers' "complete" test suite (13 test cases)
• Bybro's [2003] Java mutation tool and code
• Result: 88% mutation score, 96% statement coverage
• (A hand-made sketch of two operators applied to this program appears below.)

Status of Mutation Testing
• Various strategies exist: weak mutation, interface mutation, specification-based mutation.
• The version presented here is called strong mutation.
• Many mutation tools are available on the internet.
• The cost of generating mutants and detecting equivalents has come down.
• Mutation testing is not yet widely used in industry.
• Is it still considered "academic", or simply not understood?
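As an illustration of mutation operators at work, here is a minimal sketch in plain Java applying two classical operators from the list above, ROR and LCR, by hand to a fragment of the triangle program (illustrative only; a real tool such as Bybro's generates mutants automatically):

public class TriangleMutants {

    // Original fragment: sides a, b, c form a valid triangle iff each side
    // is shorter than the sum of the other two.
    static boolean isValid(int a, int b, int c) {
        return a < b + c && b < a + c && c < a + b;
    }

    // ROR mutant (relational operator replacement): the first < becomes <=.
    // Only a degenerate input with a == b + c, e.g. (3, 1, 2), kills it.
    static boolean isValidROR(int a, int b, int c) {
        return a <= b + c && b < a + c && c < a + b;
    }

    // LCR mutant (logical connector replacement): the first && becomes ||.
    // Killed by invalid triangles such as (1, 1, 3), where the first
    // conjunct holds but a later one fails.
    static boolean isValidLCR(int a, int b, int c) {
        return a < b + c || b < a + c && c < a + b;
    }

    public static void main(String[] args) {
        int[][] tests = { {3, 4, 5}, {1, 1, 3}, {3, 1, 2} };  // a tiny suite
        for (int[] t : tests) {
            System.out.printf("(%d,%d,%d): original=%b  ROR=%b  LCR=%b%n",
                    t[0], t[1], t[2],
                    isValid(t[0], t[1], t[2]),
                    isValidROR(t[0], t[1], t[2]),
                    isValidLCR(t[0], t[1], t[2]));
        }
    }
}

Note that only the degenerate boundary case (3, 1, 2) kills the ROR mutant, which is why thorough triangle test suites must include such inputs.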
Learning-based Testing
1. Specification-based black-box testing
2. The learning-based testing (LBT) paradigm
   - connections between learning and testing
   - testing as a search problem
   - testing as an identification problem
   - testing as a parameter inference problem
3. Example frameworks:
   1. procedural systems
   2. Boolean reactive systems

Specification-based Black-box Testing
1. System requirement (Sys-Req)
2. System under test (SUT)
3. Test verdict pass/fail (oracle step)
[Diagram: the test case generator (TCG, implemented by a constraint solver) derives a test case from Sys-Req; the SUT (a language runtime) executes it and produces an output; the oracle (a constraint checker) compares the output against Sys-Req and returns a pass/fail verdict.]

Procedural System Example: Newton's Square Root Algorithm
• Precondition: x ≥ 0.0; postcondition: | y*y – x | ≤ ε
• TCG (constraint solver): the input x = 4.0 satisfies x ≥ 0.0
• SUT (Newton code): produces the output y = 2.0
• Oracle (constraint checker): x = 4.0, y = 2.0 satisfies | y*y – x | ≤ ε, so the verdict is pass

Reactive System Example: Coffee Machine
• Sys-Req: always( in = $1 implies after(10, out = coffee) )
• TCG (constraint solver): input in0 := $1
• SUT (coffee machine): output out11 := coffee
• Oracle (constraint checker): in0 := $1, out11 := coffee satisfies always( in = $1 implies after(10, out = coffee) ), so the verdict is pass

Key Problem: Feedback
How can we modify this architecture to:
1. improve the next test case using previous test outcomes?
2. execute a large number of good-quality tests?
3. obtain good coverage?
4. find bugs quickly?

Learning-Based Testing
[Diagram: the TCG–SUT–oracle loop is extended with a learner, which builds a system model (Sys-Model) from the observed inputs and outputs and feeds it back to the TCG.]
• "Model-based testing without a model"

Basic Idea
LBT is a search heuristic that:
1. incrementally learns an SUT model,
2. uses generalisation to predict bugs,
3. uses the best prediction as the next test case,
4. refines the model according to the test outcome.

Abstract LBT Algorithm
1. Use (i1, o1), … , (ik, ok) to learn a model Mk.
2. Model check Mk against Sys-Req.
3. Choose the "best counterexample" ik+1 from step 2.
4. Execute ik+1 on the SUT to produce ok+1.
5. Check whether (ik+1, ok+1) violates Sys-Req:
   a) yes: terminate with ik+1 as a bug;
   b) no: go to step 1.
The difficulties lie in the technical details … (a code sketch of this loop follows after the Boolean reactive systems example below).

General Problems
The difficulty is to find combinations of models, requirement languages and checking algorithms (M, L, A) such that:
1. the models M are:
   - expressive,
   - compact,
   - partial and/or local (an abstraction method),
   - easy to manipulate and learn;
2. M and L are feasible to model check with A.

Incremental Learning
• Real systems are too large to be learned completely.
• Complete learning is not necessary to test many requirements (e.g. use cases).
• We therefore use incremental (on-the-fly) learning:
  – generate a sequence of refined models M0 ⊑ M1 ⊑ … ⊑ Mi
  – convergence in the limit

Example: Boolean Reactive Systems
1. SUT: reactive systems
2. Model: deterministic Kripke structures
3. Requirements: propositional linear temporal logic (PLTL)
4. Learning: the IKL incremental learning algorithm
5. Model checker: NuSMV
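The abstract algorithm above can be read as a simple loop. Here is a minimal skeleton in Java; the Learner, ModelChecker, Sut and Oracle interfaces are hypothetical placeholders standing in for real components such as IKL and NuSMV, not any actual tool's API:

import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class LbtSketch {

    // One observed input/output pair (i_k, o_k).
    record Observation(String input, String output) {}

    interface Learner {        // step 1: learn a model M_k from the observations
        Object learn(List<Observation> history);
    }
    interface ModelChecker {   // steps 2-3: model check M_k, pick a counterexample
        Optional<String> bestCounterexample(Object model, String sysReq);
    }
    interface Sut {            // step 4: execute a test input on the real system
        String execute(String input);
    }
    interface Oracle {         // step 5: does the observed pair violate Sys-Req?
        boolean violates(Observation obs, String sysReq);
    }

    // Returns a bug-revealing input, if one is found within the test budget.
    static Optional<String> lbt(String sysReq, Learner learner, ModelChecker mc,
                                Sut sut, Oracle oracle, int maxIterations) {
        List<Observation> history = new ArrayList<>();
        for (int k = 0; k < maxIterations; k++) {
            Object model = learner.learn(history);                       // step 1
            Optional<String> cex = mc.bestCounterexample(model, sysReq); // steps 2-3
            if (cex.isEmpty()) break;          // the model predicts no violation
            Observation obs =
                    new Observation(cex.get(), sut.execute(cex.get()));  // step 4
            if (oracle.violates(obs, sysReq))                            // step 5a
                return Optional.of(obs.input()); // a real bug in the SUT
            history.add(obs);                    // step 5b: refine and repeat
        }
        return Optional.empty();                 // no bug found within the budget
    }
}

In the Boolean reactive instantiation above, the model would be a deterministic Kripke structure, the requirement a PLTL formula, the learner IKL and the model checker NuSMV.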
LBT Architecture
[Diagram: the LBT tool architecture, combining the TCG, SUT, oracle and learner as described above.]

A Case Study: Elevator
[Diagram: the learned elevator model — a Kripke structure whose states are labelled with the propositions W1–W3, cl, Stop and @1–@3, and whose transitions are labelled with the inputs C1–C3 and tick.]

Elevator Results

Req    t first (sec)  t total (sec)  MCQ first  MCQ tot  PQ first  PQ tot   RQ first  RQ tot
Req 1  0.34           1301.3         1.9        81.7     1574      729570   1.9       89.5
Req 2  0.49           1146           3.9        99.6     2350      238311   2.9       98.6
Req 3  0.94           525            1.6        21.7     6475      172861   5.7       70.4
Req 4  0.052          1458           1.0        90.3     15        450233   0.0       91
Req 5  77.48          2275           1.2        78.3     79769     368721   20.5      100.3
Req 6  90.6           1301           2.0        60.9     129384    422462   26.1      85.4

Conclusions
• A promising approach …
• A flexible, general heuristic: many models and requirement languages seem possible.
• Many SUT types might be testable: procedural, reactive, real-time, etc.

Open Questions
• Benchmarking?
• Scalability? (abstraction, infinite state?)
• Efficiency? (of model checking and learning?)