The Influence of Size and Coverage on
Test Suite Effectiveness
Akbar Siami Namin
Department of Computer Science
Texas Tech University at Abilene
Abilene, TX, USA
akbar.namin@ttu.edu

James H. Andrews
Department of Computer Science
University of Western Ontario
London, Ontario, Canada
andrews@csd.uwo.ca
International Symposium on Software Testing and Analysis, ISSTA 2009, Chicago, USA
July 2009
Outline
- Motivation
- Related work
- Experimental procedure
- Data analysis
- Case studies
- Discussion
- Conclusion and research direction
Motivation
A Familiar Procedure: Coverage-based Test Adequacy
[Figure, built up over six slides: a test suite is run on Program P, whose faults are marked in the legend. Figure labels: Test Suite, Program P, Coverage Degree (Line Coverage), Test Suite Effectiveness. Each slide adds one test case and updates two counters, listed below in the order they appear on the slide.]

Test suite      Counter (of 3)  Counter (of 20)
(empty)         0/3             0/20
TC-1            1/3             9/20
TC-1 to TC-2    2/3             13/20
TC-1 to TC-3    2/3             16/20
TC-1 to TC-4    2/3             18/20
TC-1 to TC-5    3/3             20/20
Motivation
(Known) Variables Involved in Coverage-based Test Adequacy
[Figure: the same diagram in its final state: test suite TC-1 to TC-5, counters 3/3 and 20/20, with the labels Test Suite, Program P, Coverage Degree (Line Coverage), and Test Suite Effectiveness.]
Motivation
Coverage Degree: The Influencing Variable
[Figure: the same diagram, with "Influence?" marked next to Coverage Degree.]
Motivation
Is Coverage Degree the Only Influencing Variable?
[Figure: the same diagram, with the label Size added to the test suite and a second "Influence?" mark.]
Motivation
The Size of Test Suite has Increased from 1 to 5
The purpose of this study
[Figure: the same diagram, with Size and Coverage Degree each marked "Influence?".]
Motivation
The Research Question
- Is the effectiveness of a test suite due to:
  - Its size?
  - Its structural coverage degree?
- What are the impacts of size and coverage on the effectiveness of a test suite?
Motivation
Visualizations – Relationships Between Pairs of Variables (2D)
Motivation
Visualizations – Relationship Among All Three Variables (3D)
Related work
- [Frankl and Weiss, 1993; Frankl and Iakounenko, 1998]
  - Test suites that achieve higher coverage tend to be more effective (fault detection power)
  - Effectiveness is constant until high coverage levels are achieved, at which point it increases rapidly
- [Andrews et al., 2005, 2006]
  - Mutants can act similarly to real faults
  - Study confirmed the linkage between coverage and effectiveness (mutant detection power)
  - Coverage-based test suites are more effective than random-based test suites of the same size
Related work (Cont’d)
- [Rothermel et al., 2002]
  - Test suites reduced while preserving coverage are more effective than suites reduced to the same size by randomly eliminating test cases
- In all of the above:
  - The support is indirect, because the increase in effectiveness might be a result of the method of construction
Experimental Procedure
Goal and Approach
- Goal: study the relationships among coverage degree, size, and effectiveness
- Approach
  - Generate a set of test suites of various sizes
  - Compute their coverage degree
  - Compute their mutant detection power
  - Apply appropriate statistical analysis to determine the relationship among:
    - Independent variables “size” and “coverage degree”
    - Dependent variable “mutant detection rate”
Experimental Procedure
Test suite generation and coverage measurement
- For each program
  - 100 random-based test suites of each size from 1 to 50
    - Variable “SIZE”, 1 ≤ SIZE ≤ 50
- For each test suite:
  - Measured the block, decision, C-use, and P-use coverage degrees using ATAC
    - Variable “CovDeg”, with one instance for each of the four coverage criteria
    - Conducted four similar analyses, one per criterion
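The suite-generation step above can be sketched as follows. This is a minimal illustration and not the authors' tooling: the function name `generate_suites`, the test-case identifiers, and the choice to sample without replacement are all assumptions.

```python
import random

def generate_suites(test_pool, max_size=50, suites_per_size=100, seed=0):
    """Return {size: [suite, ...]} with randomly drawn suites of each size.

    Assumption: each suite is drawn without replacement from the pool.
    """
    rng = random.Random(seed)
    suites = {}
    for size in range(1, max_size + 1):
        suites[size] = [rng.sample(test_pool, size) for _ in range(suites_per_size)]
    return suites

# Hypothetical pool of test-case identifiers for one subject program.
pool = [f"t{i}" for i in range(1052)]
suites = generate_suites(pool)
```

With the defaults this yields 50 sizes times 100 suites, that is 5000 suites per program, each of which would then be measured for coverage and mutant detection.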
Experimental Procedure
Test suite effectiveness
- For each program
  - Re-used mutants from an earlier study [Siami Namin et al., 2008]
    - Generated using the Proteum mutant generator
- Also, for each test suite
  - Measured mutant detection rate “AM”
    - Used as the measure of test suite effectiveness
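A mutant detection rate of this kind can be computed from a kill matrix, as in the sketch below. The slides do not spell out the formula, so the denominator (non-equivalent mutants only) is an assumption, and `mutant_detection_rate`, `kills`, and the identifiers are hypothetical names.

```python
def mutant_detection_rate(suite, kills, non_equivalent):
    """Fraction of non-equivalent mutants killed by at least one test in the suite.

    Assumption: AM = (killed non-equivalent mutants) / (all non-equivalent mutants).
    """
    killed = set()
    for test in suite:
        killed |= kills.get(test, set())
    killed &= non_equivalent
    return len(killed) / len(non_equivalent)

# Hypothetical kill matrix: test case -> set of mutant IDs it kills.
kills = {"t1": {1, 2}, "t2": {2, 3}, "t3": set()}
non_equivalent = {1, 2, 3, 4}

am_small = mutant_detection_rate(["t1"], kills, non_equivalent)
am_large = mutant_detection_rate(["t1", "t2", "t3"], kills, non_equivalent)
```

Here `t2` contributes only one new kill because mutant 2 was already killed by `t1`, which is the overlap effect the later discussion of log(SIZE) relies on.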
Experimental Procedure
Description of subject programs – The Siemens set

Programs      #Lines of Code  #Test Cases  #Mutants  #Selected  #Equivalent
printtokens   343             4130         11741     1966       415
printtokens2  355             4115         10266     1963       21
replace       513             5542         23847     1969       0
schedule      296             2650         4130      1964       204
schedule2     263             2710         6552      1964       467
tcas          137             1608         4935      4935       0
totinfo       281             1052         8767      1958       218
Data Analysis
Proportions of feasible coverage for all criteria using ATAC

Programs      %Block        %Decision     %C-Uses       %P-Uses
printtokens   95 (231/242)  94 (102/108)  98 (179/183)  93 (146/157)
printtokens2  99 (242/244)  98 (158/161)  99 (132/134)  99 (207/210)
replace       97 (277/287)  94 (154/163)  94 (367/389)  89 (378/426)
schedule      99 (159/161)  95 (52/55)    99 (114/115)  96 (75/78)
schedule2     98 (169/173)  94 (83/88)    97 (88/91)    94 (83/88)
tcas          99 (105/106)  90 (45/50)    98 (42/43)    91 (31/34)
totinfo       96 (145/151)  81 (70/86)    93 (141/152)  77 (98/128)
Data Analysis
Statistical Techniques Applied
- Visualizations – Shown earlier in this talk
- ANCOVA
- Principal component analysis
- Correlation of coverage and effectiveness
- Regression models
Data Analysis
ANCOVA
- Variables (factors)
  - Continuous dependent variable: Mutant detection rate
  - Continuous independent variable: Coverage degree
  - Discrete independent variable: Size
- p-values
  - < 0.001 for the two independent variables (factors)
    - Both size and coverage degree strongly influence effectiveness
  - There is often an interaction between the two variables
Data Analysis
Principal Component Analysis
Data Analysis
Correlation of Coverage and Effectiveness
Data Analysis
Purposes of Generating Regression Models
- To determine whether:
  - Including COVDEG in the models improves their goodness of fit
  - Transforming the data affects the goodness of fit
Data Analysis
Regression Models Generated and Examined
- AM | log(AM) ~ SIZE
- AM | log(AM) ~ log(SIZE)
- AM | log(AM) ~ COVDEG
- AM | log(AM) ~ log(COVDEG)
- AM | log(AM) ~ SIZE + COVDEG
- AM | log(AM) ~ log(SIZE) + COVDEG
- AM | log(AM) ~ SIZE + log(COVDEG)
- AM | log(AM) ~ log(SIZE) + log(COVDEG)
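Reading "AM | log(AM)" as "the response is either AM or log(AM)" (an interpretation of the slide notation, not something the slides state explicitly), the candidate models can be enumerated as a cross product. The variable names below are illustrative.

```python
from itertools import product

# Two candidate responses and eight candidate right-hand sides, as listed above.
responses = ["AM", "log(AM)"]
predictors = [
    "SIZE",
    "log(SIZE)",
    "COVDEG",
    "log(COVDEG)",
    "SIZE + COVDEG",
    "log(SIZE) + COVDEG",
    "SIZE + log(COVDEG)",
    "log(SIZE) + log(COVDEG)",
]

# Each combination is one regression model to fit and compare.
models = [f"{y} ~ {x}" for y, x in product(responses, predictors)]
```

Under this reading there are 16 models per program and coverage criterion.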
Data Analysis
A Summary Comparison of the Regression Models
- AM | log(AM) ~ SIZE + COVDEG: better than
  - AM | log(AM) ~ SIZE
  - AM | log(AM) ~ COVDEG
- AM | log(AM) ~ log(SIZE): better than
  - AM | log(AM) ~ SIZE
- Important indication:
  - Information about SIZE or COVDEG alone does not yield as good a prediction of effectiveness as information about both SIZE and COVDEG
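The kind of comparison summarized above can be sketched with a small least-squares fit scored by adjusted R². Everything here is illustrative: the data are synthetic (generated in the code, not taken from the study), the helper names are hypothetical, and the fit includes an intercept, which the slides do not specify.

```python
import math
import random

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def adjusted_r2(columns, y):
    """Fit y ~ intercept + columns by least squares and return adjusted R^2."""
    n, p = len(y), len(columns)
    rows = [[1.0] + [c[i] for c in columns] for i in range(n)]
    k = p + 1
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Synthetic data only: AM is generated from log(SIZE) and COVDEG plus noise.
rng = random.Random(1)
size = [rng.randint(1, 50) for _ in range(500)]
log_size = [math.log(s) for s in size]
covdeg = [0.5 + 0.1 * ls + rng.gauss(0, 0.05) for ls in log_size]
am = [0.1 * ls + 0.5 * c + rng.gauss(0, 0.01) for ls, c in zip(log_size, covdeg)]

fits = {
    "SIZE": adjusted_r2([size], am),
    "COVDEG": adjusted_r2([covdeg], am),
    "SIZE + COVDEG": adjusted_r2([size, covdeg], am),
    "log(SIZE) + COVDEG": adjusted_r2([log_size, covdeg], am),
}
```

Because the synthetic response is built from both predictors, the two-variable fits score higher than either single-variable fit. That mirrors the shape of the comparison on the slide but says nothing about the study's actual numbers.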
Data Analysis
The Best Regression Model
- AM | log(AM) ~ SIZE
- AM | log(AM) ~ log(SIZE)
- AM | log(AM) ~ COVDEG
- AM | log(AM) ~ log(COVDEG)
- AM | log(AM) ~ SIZE + COVDEG
- AM | log(AM) ~ log(SIZE) + COVDEG (marked on the slide as the best model)
- AM | log(AM) ~ SIZE + log(COVDEG)
- AM | log(AM) ~ log(SIZE) + log(COVDEG)
Data Analysis
A summary of the linear models AM = B1.log(SIZE) + B2.CovDeg

Programs      Measure      Block  Decision  C-Uses  P-Uses
printtokens   Adjusted R²  0.999  0.999     0.999   0.998
              MSE          0.000  0.000     0.000   0.001
printtokens2  Adjusted R²  0.998  0.998     0.998   0.998
              MSE          0.000  0.000     0.001   0.000
replace       Adjusted R²  0.971  0.971     0.970   0.972
              MSE          0.012  0.011     0.012   0.011
schedule      Adjusted R²  0.998  0.998     0.998   0.998
              MSE          0.001  0.001     0.001   0.001
schedule2     Adjusted R²  0.997  0.996     0.997   0.997
              MSE          0.002  0.002     0.002   0.002
tcas          Adjusted R²  0.959  0.960     0.960   0.960
              MSE          0.014  0.014     0.014   0.014
totinfo       Adjusted R²  0.995  0.994     0.994   0.994
              MSE          0.004  0.004     0.004   0.004
Data Analysis
Predicted vs. actual AM for AM ~ B1.log(size)+B2.coverage
Case Studies
Cross-checking With Other Programs
- gzip.c (SIR Repository)
  - 5680 LOC; 214 test cases
- concordance.c
  - Introduced for the first time as a subject program
  - Originally developed by Ralph L. Meyer
  - Jamie Andrews at UWO
    - Organized the code into one single file
    - 13 real faults identified
    - 372 test cases designed (black-box testing)
  - 1490 LOC
Case Studies
Mutant Generation for the Subject Programs of Case Studies
- gzip.c
  - Used Proteum (Delamaro et al.) to generate mutants
    - 108 operators, 493402 mutants
  - Used the sufficient set of operators identified by Siami Namin et al.
    - 28 operators, 38621 mutants (7.8%)
  - For feasibility, selected 1% of the sufficient set
    - 28 operators, 317 mutants
- concordance.c
  - 867 non-equivalent mutants generated using the mutant generator used by Andrews et al.
Case Studies
Procedures for the Subject Programs of Case Studies
- Similar procedure for generating test suites
- Coverage tool: gcov
  - Line coverage
- Mutant detection rates also computed
Case Studies
Goodness of fit of models measured by adjusted R²

Model of AM or AF    gzip.c (AM)  concordance.c (AM)  concordance.c (AF)
size                 0.4796       0.8254              0.9127
coverage             0.8103       0.9973              0.9010
size+coverage        0.8259       0.9986              0.9579
log(size)            0.5563       0.9663              0.9628
log(size)+coverage   0.9905       0.9988              0.9643
Case Studies
Predicted vs. actual AM for AM ~ B1.log(size)+B2.coverage
Case Studies
Predicted vs. actual AF for AF ~ B1.log(size)+B2.coverage
Discussion
A Non-Linear Relationship among SIZE, COVDEG, and AM
AM ~ B1.log(SIZE) + B2.COVDEG
- Explaining the log(SIZE) part of the model:
  - It becomes harder to find new faults as the test suite grows
    1. Adding a test case to a test suite improves the effectiveness if the added test case finds additional faults
    2. The faults detected by the added test case are likely to have already been revealed by the test suite
    3. The added test case is unlikely to improve the effectiveness if the test suite is already big enough
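The diminishing return implied by the log(SIZE) term can be made concrete with a little arithmetic. In this sketch the coefficient `b1 = 0.1` is a made-up value, the logarithm is assumed to be natural, and coverage is assumed to stay fixed when the test case is added.

```python
import math

def marginal_gain(size, b1=0.1):
    """Predicted change in AM from growing a suite from `size` to `size + 1` tests.

    Assumptions: natural log, hypothetical coefficient b1, COVDEG held constant.
    """
    return b1 * (math.log(size + 1) - math.log(size))

# Gain from adding one test case at several suite sizes.
gains = {n: marginal_gain(n) for n in (1, 5, 10, 50)}
```

The gain equals b1 * log(1 + 1/SIZE), so it shrinks roughly like b1/SIZE: going from 1 to 2 tests adds far more predicted effectiveness than going from 50 to 51.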
Discussion
A Non-Linear Relationship among SIZE, COVDEG, and AM
AM ~ B1.log(SIZE) + B2.COVDEG
- Explaining the COVDEG part of the model:
  - Faults are associated with particular elements in the code
    - A test case exercising some elements associated with the faults is more likely to force a failure than one that does not
  - Regardless of the size of a test suite, a fault is more likely to be exposed by a test case if that test case covers new elements
Discussion
Implications for Software Testers
1. Achieving high coverage leads to higher effectiveness
2. Because of the log(SIZE) + COVDEG form of the model:
   - Achieving higher coverage becomes more important than size as size grows
Conclusion & Research Directions
Influence of SIZE and COVDEG on Effectiveness of Test Suites
- Conclusion
  - Both SIZE and COVDEG independently influence the effectiveness
  - The relationship is not linear
    - AM ~ B1.log(SIZE) + B2.COVDEG
  - concordance.c as a new subject program
- Future work
  - More experimental studies are needed
    - To validate the results
    - To validate the generated models
Thank You