Assessing the Influence of Multiple Test Case Selection on Mutation Experiments

Marcio E. Delamaro and Jeff Offutt
George Mason University & Universidade de São Paulo
USA & Brazil
www.cs.gmu.edu/~offutt/
offutt@gmu.edu
delamaro@icmc.usp.br
www.icmc.usp.br/pessoas/delamaro/
A Recent Experimental Procedure
[Diagram: mutants M are created from program P; tests are added to T until MS = 100, creating a “universe” of tests]
Mutation 2014
© Delamaro & Offutt
Experimental Procedure
[Diagram: from program P, mutant sets Mop1, Mop2, ..., Mop75 are created; one test set is drawn from T for each (Top1, Top2, ..., Top75) and run against all mutants M. Callout: “Only one test set? Not good enough!”]
Additional Test Sets
[Diagram: from program P, mutant sets Mop1, Mop2, ..., Mop75 are created; N test sets are drawn from T for each (Top1-1 ... Top1-N, Top2-1 ... Top2-N, ..., Top75-1 ... Top75-N) and run against all mutants M]
Multiple Test Sets
1) How many test sets are needed?
2) Do additional test sets really help?
Perceived Benefit
Individual test sets may vary in effectiveness
because of the specific test values chosen
Generating N test sets may overcome that variance
Does reality match perception?
Answering the Question
We decided to answer this question
by measuring the performance of
each of 10 sets of tests and studying
their variances
Experimental Setup
• Subjects: 39 C programs
– 1 to 20 functions (189 total)
– 7 to 390 LOC (2,853 total)
• Mutation tool: Proteum
– 104 to 11,100 mutants (66,480 total)
– We used mutation score as a proxy for effectiveness
• Tests: Hand-constructed test sets to kill all non-equivalent mutants (the test universe U)
– 5 to 142 tests (814 total)
– Equivalence determined by hand
– 3 to 2,062 equivalent mutants (7,829 total)
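As a minimal sketch of the effectiveness proxy, the mutation score can be computed as below. This assumes the usual definition (killed mutants divided by non-equivalent mutants); the function name and the example counts are hypothetical and are not taken from Proteum or from the study's data.

```python
def mutation_score(killed: int, total: int, equivalent: int) -> float:
    """Fraction of non-equivalent mutants that a test set kills (assumed definition)."""
    non_equivalent = total - equivalent
    if non_equivalent <= 0:
        raise ValueError("no non-equivalent mutants to score against")
    if not 0 <= killed <= non_equivalent:
        raise ValueError("killed must lie between 0 and the non-equivalent count")
    return killed / non_equivalent


# Hypothetical counts, for illustration only.
example_score = mutation_score(killed=90, total=104, equivalent=4)
```

Under this definition, a test universe that kills every non-equivalent mutant scores 1.0.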
Collecting Data
For each program:
1. Generated statement deletion (SSDL) mutants
2. Created 10 sets of tests to kill all SSDL mutants
– All tests taken from the universe U
– Tests picked in random order from U
3. Measured size of each test set
4. Computed MS of each test set on all mutants
5. Collected statistics of distribution and central tendency for each test set
– Mean, median, min, max, standard deviation
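The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' tooling: the kill matrix, test names, and mutant names are invented; a randomly picked test is kept only if it kills a still-live SSDL mutant (the slides do not say how unhelpful tests were handled); and the standard deviation is the sample standard deviation.

```python
import random
import statistics


def select_adequate_set(universe, kills, target, rng):
    """Pick tests from the universe in random order until all target mutants die.

    Assumption: a test is kept only if it kills at least one live target mutant.
    """
    order = list(universe)
    rng.shuffle(order)
    live = set(target)
    chosen = []
    for test in order:
        if not live:
            break
        newly_killed = kills[test] & live
        if newly_killed:
            chosen.append(test)
            live -= newly_killed
    if live:
        raise ValueError("universe cannot kill every target mutant")
    return chosen


def score_on(tests, kills, mutants):
    """Mutation score of a test set measured against the given mutants."""
    killed = set()
    for test in tests:
        killed |= kills[test]
    return len(killed & mutants) / len(mutants)


# Hypothetical kill matrix: test name -> non-equivalent mutants it kills.
kills = {
    "t1": {"m1", "m2"},
    "t2": {"m2", "m3"},
    "t3": {"m4"},
    "t4": {"m1", "m5"},
    "t5": {"m3", "m4", "m6"},
}
all_mutants = set().union(*kills.values())
ssdl_mutants = {"m1", "m3", "m4"}  # hypothetical SSDL subset

rng = random.Random(2014)  # fixed seed so the sketch is repeatable
universe = sorted(kills)
test_sets = [select_adequate_set(universe, kills, ssdl_mutants, rng) for _ in range(10)]
sizes = [len(ts) for ts in test_sets]
scores = [score_on(ts, kills, all_mutants) for ts in test_sets]

summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "min": min(scores),
    "max": max(scores),
    "sd": statistics.stdev(scores),  # sample SD; an assumption
}
```

The spread reported in the results corresponds to `summary["max"] - summary["min"]` and `summary["sd"]` for each program.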
Research Questions
RQ1: How different are different SSDL-adequate test sets in terms of mutation score?
RQ2: How different are different SSDL-adequate test sets in terms of cost (number of tests)?
Biggest and Smallest
Largest and smallest spreads in mutation scores
of SSDL-adequate tests over all mutants
Program    LOC      SD       MS: Max – Min
P4         9        .0904    .2971
P19        9        .0901    .2448
P38        10       .0707    .1892
P22        349      .0042    .0148
P28        390      .0034    .0112
P31        56       .0029    .0097
Average    73.15    .0071    .0245
Program Size vs. Spread
• Is the spread correlated with the program size?
• Spearman rank correlation measures how strongly two series of numbers are monotonically related
– 1 or -1 means they are perfectly correlated
– 0 means no correlation
Strong correlations
LOC and SD: -.65
LOC and Max-Min: -.63
The negative sign means larger programs showed smaller spreads
Good news for experimentalists ...
Creating 10 test sets for a 10-line program is easy.
Creating 10 test sets for a 1000-line program is impractical!
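For readers unfamiliar with the statistic, here is a minimal pure-Python sketch of Spearman rank correlation with average ranks for ties. The helper names are hypothetical. The input is only the six programs listed in the table of largest and smallest spreads, so the result illustrates the direction of the relationship and is not expected to reproduce the -.65 computed over all 39 programs.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (var_x * var_y) ** 0.5


# LOC and SD for P4, P19, P38, P22, P28, P31 (the six programs in the table).
loc = [9, 9, 10, 349, 390, 56]
sd = [0.0904, 0.0901, 0.0707, 0.0042, 0.0034, 0.0029]
rho = spearman(loc, sd)  # negative: larger programs, smaller spread
```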
Average Spread
Stat               Value
Average Minimum    .9093
Average Maximum    .9338
MS: Max-Min        .0245
SD                 .0071
One-way ANOVA: no statistically significant differences among means
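A one-way ANOVA compares the variation between group means with the variation inside the groups. The sketch below computes only the F statistic. How the study grouped its scores is not stated on the slide, so the grouping and the numbers here are hypothetical, and the function name is an assumption. Turning F into a p-value needs the F distribution, which a statistics library such as SciPy (`scipy.stats.f_oneway`) provides.

```python
import statistics


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 groups and more observations than groups")
    grand_mean = sum(sum(g) for g in groups) / n
    means = [statistics.mean(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        raise ValueError("F is undefined when there is no within-group variation")
    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical mutation scores: one group per test set, one value per program.
hypothetical_groups = [
    [0.90, 0.92, 0.94],
    [0.91, 0.93, 0.92],
    [0.92, 0.90, 0.93],
]
f_stat, df_b, df_w = one_way_anova_f(hypothetical_groups)
```

A small F relative to the critical value for (df_b, df_w) is what "no differences among means" looks like in practice.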
Threats to Validity
• Representativeness of programs
– Different sources, different domains
• Size of programs
– Most studies of this nature are related to unit testing
– Large programs would be impractical
• Manual steps
– Constructing the universe of tests
– Identifying equivalent mutants
• A single comparison point—SSDL mutation
– Other criteria could be used
• We used 10 sets
– Would results be different with 5 or 100?
Conclusions
Previous researchers assumed selecting
only one adequate test set could interfere
with results
So created multiple test sets
But this assumption was made
without evidence !!
Key Findings
We found significant differences among
different test sets
For some programs, but not all
Differences were less with larger programs
Differences statistically disappeared
when averaged over all 39 programs
Recommendations
If only a few, small subjects are used,
use multiple test sets
If many or larger subjects are used,
don’t bother
Contact
Jeff Offutt
offutt@gmu.edu
http://cs.gmu.edu/~offutt/
Marcio Delamaro
delamaro@icmc.usp.br
http://www.icmc.usp.br/pessoas/delamaro/