Assessing the Influence of Multiple Test Case Selection on Mutation Experiments

Marcio E. Delamaro and Jeff Offutt
Universidade de São Paulo, Brazil (delamaro@icmc.usp.br, www.icmc.usp.br/pessoas/delamaro/)
George Mason University, USA (offutt@gmu.edu, www.cs.gmu.edu/~offutt/)

Mutation 2014 © Delamaro & Offutt

A Recent Experimental Procedure

[Diagram: from program P, create mutants M; add tests to T until MS = 100, creating a “universe” of tests]

Experimental Procedure

[Diagram: P, T, and M, with per-operator mutant sets Mop1, Mop2, ..., Mop75 and one test set for each: Top1, Top2, ..., Top75]

“Only one test set? Not good enough!”

Additional Test Sets

[Diagram: as on the previous slide, but each per-operator test set is replaced by N test sets, e.g. Top1-1, Top1-2, ..., Top1-i, ..., Top1-N]

Multiple Test Sets

1) How many test sets are needed?
2) Do additional test sets really help?

Perceived Benefit
• Individual test sets may vary in effectiveness because of the specific values
• Generating N test sets may overcome that variance

Does reality match perception?

Answering the Question

We decided to answer this question by measuring the performance of each of 10 sets of tests and studying their variances

Experimental Setup

• Subjects: 39 C programs
  – 1 to 20 functions (189 total)
  – 7 to 390 LOC (2,853 total)
• Mutation tool: Proteum
  – 104 to 11,100 mutants (66,480 total)
  – We used mutation score as a proxy for effectiveness
• Tests: Hand-constructed test sets to kill all non-equivalent mutants (the test universe U)
  – 5 to 142 tests (814 total)
  – Equivalence determined by hand
  – 3 to 2,062 equivalent mutants (7,829 total)

Collecting Data

For each program:
1. Generated statement deletion (SSDL) mutants
2. Created 10 sets of tests to kill all SSDL mutants
  – All tests taken from the universe U
  – Tests picked in random order from U
3. Measured size of each test set
4. Computed MS of each test set on all mutants
5. Collected statistics of distribution and central tendency for each test set: mean, median, min, max, standard deviation

Research Questions

RQ1: How different are different SSDL-adequate test sets in terms of mutation score?
RQ2: How different are different SSDL-adequate test sets in terms of cost (number of tests)?

Biggest and Smallest

Largest and smallest spreads in mutation scores of SSDL-adequate tests over all mutants

Program    LOC      SD       MS: Max - Min
P4         9        .0904    .2971
P19        9        .0901    .2448
P38        10       .0707    .1892
P22        349      .0042    .0148
P28        390      .0034    .0112
P31        56       .0029    .0097
Average    73.15    .0071    .0245

Program Size vs. Spread

• Is the spread correlated with the program size?
• Spearman rank correlation is used to compare two series of numbers for correlation
  – 1 or -1 means they are perfectly correlated
  – 0 means no correlation

Strong correlations
• LOC and SD: -.65
• LOC and Max-Min: -.63

Good news for experimentalists …
Creating 10 test sets for a 10-line program is easy. Creating 10 test sets for a 1000-line program is impractical!

Average Spread

Stat               Value
Average Minimum    .9093
Average Maximum    .9338
MS: Max-Min        .0245
SD                 .0071

One-way ANOVA: no statistical differences among means

Threats to Validity

• Representativeness of programs
  – Different sources, different domains
• Size of programs
  – Most studies of this nature are related to unit testing
  – Large programs would be impractical
• Manual steps
  – Constructing the universe of tests
  – Identifying equivalent mutants
• A single comparison point: SSDL mutation
  – Other criteria could be used
• We used 10 sets
  – Would results be different with 5 or 100?
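The last threat can be explored by rerunning the Collecting Data steps with a different number of test sets. The Python sketch below is not the authors' tooling; it only illustrates steps 2 to 5 under stated assumptions. The kill matrix `kills`, the toy mutant IDs, and the names `build_adequate_set`, `mutation_score`, and `collect` are all hypothetical. It also assumes that a randomly picked test is kept only if it kills a still-live SSDL mutant, and that the standard deviation is the sample standard deviation; the slides state neither.

```python
import random
import statistics


def build_adequate_set(universe, kills, targets, rng):
    """Pick tests from the universe in random order until all targets are killed."""
    order = list(universe)
    rng.shuffle(order)
    selected, alive = [], set(targets)
    for test in order:
        if not alive:
            break
        newly_killed = kills[test] & alive
        if newly_killed:  # assumption: drop tests that kill nothing new
            selected.append(test)
            alive -= newly_killed
    if alive:
        raise ValueError("the universe does not kill every target mutant")
    return selected


def mutation_score(test_set, kills, non_equivalent):
    """Fraction of non-equivalent mutants killed by the test set."""
    killed = set()
    for test in test_set:
        killed |= kills[test]
    return len(killed & non_equivalent) / len(non_equivalent)


def collect(universe, kills, ssdl, non_equivalent, n_sets=10, seed=0):
    """Build n_sets SSDL-adequate sets (n_sets >= 2) and summarize their scores."""
    rng = random.Random(seed)
    sets = [build_adequate_set(universe, kills, ssdl, rng) for _ in range(n_sets)]
    scores = [mutation_score(s, kills, non_equivalent) for s in sets]
    return {
        "sizes": [len(s) for s in sets],
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "min": min(scores),
        "max": max(scores),
        "sd": statistics.stdev(scores),  # assumption: sample standard deviation
    }


# Toy data (hypothetical): test name -> set of mutant IDs it kills.
kills = {"t1": {1, 2}, "t2": {2, 3}, "t3": {4}, "t4": {1, 3, 5}}
universe = list(kills)
ssdl = {1, 3}                     # mutants produced by statement deletion
non_equivalent = {1, 2, 3, 4, 5}  # all mutants the universe kills

for n in (5, 10, 100):
    print(n, collect(universe, kills, ssdl, non_equivalent, n_sets=n))
```

Varying `n_sets` while holding the seed fixed gives a rough view of how the reported spread depends on the number of test sets.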
Conclusions

• Previous researchers assumed selecting only one adequate test set could interfere with results
• So created multiple test sets
• But this assumption was made without evidence!

Key Findings

• We found significant differences among different test sets
  – For some programs, but not all
• Differences were less with larger programs
• Differences statistically disappeared when averaged over all 39 programs

Recommendations

• If only a few, small subjects are used, use multiple test sets
• If many or larger subjects are used, don’t bother

Contact

Jeff Offutt
offutt@gmu.edu
http://cs.gmu.edu/~offutt/

Marcio Delamaro
delamaro@icmc.usp.br
http://www.icmc.usp.br/pessoas/delamaro/
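The finding that differences shrink with larger programs rests on the Spearman rank correlation between LOC and spread. As a minimal sketch of how such a coefficient is computed, the Python below ranks both series (tied values share the mean of their positions) and takes the Pearson correlation of the ranks. The function names are illustrative. The input is only the six programs listed under Biggest and Smallest, so the result is not the reported -.65, which was computed over all 39 programs.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values get the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors.

    Undefined (division by zero) if either series is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# LOC and SD for P4, P19, P38, P22, P28, P31 (the six extreme programs only).
loc = [9, 9, 10, 349, 390, 56]
sd = [0.0904, 0.0901, 0.0707, 0.0042, 0.0034, 0.0029]

print(f"Spearman rho (six programs only): {spearman(loc, sd):.2f}")
```

A negative coefficient here means that, within this subset, programs with more lines of code tend to show a smaller standard deviation in mutation score.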