The Influence of Size and Coverage on Test Suite Effectiveness

Akbar Siami Namin
Department of Computer Science, Texas Tech University at Abilene, Abilene, TX, USA
akbar.namin@ttu.edu

James H. Andrews
Department of Computer Science, University of Western Ontario, London, Ontario, Canada
andrews@csd.uwo.ca

International Symposium on Software Testing and Analysis, ISSTA 2009, Chicago, USA, July 2009

Outline
- Motivation
- Related work
- Experimental procedure
- Data analysis
- Case studies
- Discussion
- Conclusion and research direction

Motivation
A Familiar Procedure: Coverage-based Test Adequacy
[Figure, built up over six slides: test cases TC-1 to TC-5 are added one at a time to the test suite for Program P. Legend: marker = faults. The values shown at each step are:]

Test suite         Coverage degree (line coverage)   Test suite effectiveness
(no test cases)    0/3                               0/20
TC-1               1/3                               9/20
TC-1 to TC-2       2/3                               13/20
TC-1 to TC-3       2/3                               16/20
TC-1 to TC-4       2/3                               18/20
TC-1 to TC-5       3/3                               20/20

Motivation
(Known) Variables Involved in Coverage-based Test Adequacy
[Figure: the same diagram with the full test suite (TC-1 to TC-5; coverage degree 3/3; test suite effectiveness 20/20)]

Motivation
Coverage Degree: The Influencing Variable
[Figure: the same diagram, with the coverage degree (line coverage) annotated "Influence?"]
Motivation
Is Coverage Degree the Only Influencing Variable?
[Figure: the same diagram (TC-1 to TC-5; coverage degree 3/3; test suite effectiveness 20/20), with both the test suite size and the coverage degree (line coverage) annotated "Influence?"]

Motivation
The Size of the Test Suite Has Increased from 1 to 5
[Figure: the same diagram, with the callout "The purpose of this study"]

Motivation
The Research Question
- Is the effectiveness of a test suite due to:
  - its size?
  - its structural coverage degree?
- What are the impacts of size and coverage on the effectiveness of a test suite?

Motivation
Visualizations - Relationships Between Pairs of Variables (2D)
[Figure]

Motivation
Visualizations - Relationship Among All Three Variables (3D)
[Figure]

Related work
[Frankl and Weiss, 1993; Frankl and Iakounenko, 1998]
- Test suites that achieve higher coverage tend to be more effective (fault detection power)
- Effectiveness is constant until high coverage levels are achieved, at which point it increases rapidly
[Andrews et al., 2005, 2006]
- Mutants can act similarly to real faults
- The study confirmed the linkage between coverage and effectiveness (mutant detection power)
- Coverage-based test suites are more effective than random test suites of the same size

Related work (cont'd)
[Rothermel et al., 2002]
- Test suites reduced while preserving coverage are more effective than those reduced to the same size by randomly eliminating test cases
In all of the above:
- The support is indirect, because the increase in effectiveness might be a result of the method of construction

Experimental Procedure
Goal and Approach
Goal
- Study the relationships among coverage degree, size, and effectiveness
Approach
- Generate a set of test suites of various sizes
- Compute their coverage degree
- Compute their mutant detection power
- Apply appropriate statistical analysis to determine the relationship among:
  - Independent variables "size" and "coverage degree"
  - Dependent variable "mutant detection rate"

Experimental Procedure
Test suite generation and coverage measurement
For each program:
- 100 random test suites of each size from 1 to 50
- Variable "SIZE", 0 < SIZE < 51
For each test suite:
- Measured the block, decision, C-use, and P-use coverage degrees using ATAC
- Variable "COVDEG", with four instances, one for each of the four coverage criteria
- Conducted four similar analyses, one per coverage criterion

Experimental Procedure
Test suite effectiveness
For each program:
- Re-used mutants from an earlier study [Siami Namin et al., 2008]
- Mutants were generated using the Proteum mutant generator
For each test suite:
- Measured the mutant detection rate "AM" (test suite effectiveness)

Experimental Procedure
Description of subject programs - The Siemens set

Program        #Lines of Code   #Test Cases   #Mutants   #Selected   #Equivalent
printtokens    343              4130          11741      1966        415
printtokens2   355              4115          10266      1963        21
replace        513              5542          23847      1969        0
schedule       296              2650          4130       1964        204
schedule2      263              2710          6552       1964        467
tcas           137              1608          4935       4935        0
totinfo        281              1052          8767       1958        218

Data Analysis
Proportions of feasible coverage for all criteria using ATAC

Program        %Block         %Decision      %C-Uses        %P-Uses
printtokens    95 (231/242)   94 (102/108)   98 (179/183)   93 (146/157)
printtokens2   99 (242/244)   98 (158/161)   99 (132/134)   99 (207/210)
replace        97 (277/287)   94 (154/163)   94 (367/389)   89 (378/426)
schedule       99 (159/161)   95 (52/55)     99 (114/115)   96 (75/78)
schedule2      98 (169/173)   94 (83/88)     97 (88/91)     94 (83/88)
tcas           99 (105/106)   90 (45/50)     98 (42/43)     91 (31/34)
totinfo        96 (145/151)   81 (70/86)     93 (141/152)   77 (98/128)

Data Analysis
Statistical Techniques Applied
- Visualizations (shown earlier in this talk)
- ANCOVA
- Principal component analysis
- Correlation of coverage and effectiveness
- Regression models

Data Analysis
ANCOVA
Variables (factors):
- Continuous dependent variable: mutant detection rate
- Continuous independent variable: coverage degree
- Discrete independent variable: size
Results:
- p-values < 0.001 for the two independent variables (factors)
- Both size and coverage
degree strongly influence effectiveness
- There is often an interaction between the two variables

Data Analysis
Principal Component Analysis
[Figure]

Data Analysis
Correlation of Coverage and Effectiveness
[Figure]

Data Analysis
Purposes of Generating Regression Models
To determine whether:
- including COVDEG in the models improves their goodness of fit
- transforming the data affects the goodness of fit

Data Analysis
Regression Models Generated and Examined
("AM | log(AM)" means the response is either AM or log(AM))
- AM | log(AM) ~ SIZE
- AM | log(AM) ~ log(SIZE)
- AM | log(AM) ~ COVDEG
- AM | log(AM) ~ log(COVDEG)
- AM | log(AM) ~ SIZE + COVDEG
- AM | log(AM) ~ log(SIZE) + COVDEG
- AM | log(AM) ~ SIZE + log(COVDEG)
- AM | log(AM) ~ log(SIZE) + log(COVDEG)

Data Analysis
A Summary Comparison of the Regression Models
- AM | log(AM) ~ SIZE + COVDEG is better than both:
  - AM | log(AM) ~ SIZE
  - AM | log(AM) ~ COVDEG
- AM | log(AM) ~ log(SIZE) is better than:
  - AM | log(AM) ~ SIZE
- Important indication: information about SIZE or COVDEG alone does not yield as good a prediction of effectiveness as information about both SIZE and COVDEG

Data Analysis
The Best Regression Model
[Slide repeats the eight models listed under "Regression Models Generated and Examined"; the model used on the following slides is AM ~ B1 * log(SIZE) + B2 * COVDEG]

Data Analysis
A summary of the linear models AM = B1 * log(SIZE) + B2 * COVDEG

Program        Block             Decision          C-Uses            P-Uses
               Adj. R2   MSE     Adj. R2   MSE     Adj. R2   MSE     Adj. R2   MSE
printtokens    0.999     0.000   0.999     0.000   0.999     0.000   0.998     0.001
printtokens2   0.998     0.000   0.998     0.000   0.998     0.001   0.998     0.000
replace        0.971     0.012   0.971     0.011   0.970     0.012   0.972     0.011
schedule       0.998     0.001   0.998     0.001   0.998     0.001   0.998     0.001
schedule2      0.997     0.002   0.996     0.002   0.997     0.002   0.997     0.002
tcas           0.959     0.014   0.960     0.014   0.960     0.014   0.960     0.014
totinfo        0.995     0.004   0.994     0.004   0.994     0.004   0.994     0.004

Data Analysis
Predicted vs. actual AM for AM ~ B1 * log(SIZE) + B2 * COVDEG
[Figures: four slides of predicted vs. actual AM plots]

Case Studies
Cross-checking With Other Programs
gzip.c (SIR Repository)
- 5680 LOC; 214 test cases
concordance.c
- Introduced for the first time as a subject program
- Originally developed by Ralph L. Meyer
- Jamie Andrews at UWO organized the code into one single file
- 13 real faults identified
- 372 test cases designed (black-box testing)
- 1490 LOC

Case Studies
Mutant Generation for the Subject Programs of the Case Studies
gzip.c
- Used Proteum (Delamaro et al.) to generate mutants: 108 operators, 493402 mutants
- Used the sufficient set of operators identified by Siami Namin et al.: 28 operators, 38621 mutants (7.8%)
- For feasibility, selected 1% of the sufficient set: 28 operators, 317 mutants
concordance.c
- 867 non-equivalent mutants generated using the mutant generator used by Andrews et al.

Case Studies
Procedures for the Subject Programs of the Case Studies
- Similar procedure for generating test suites
- Coverage tool: gcov
- Coverage criterion: line coverage
- Mutant detection rates also computed

Case Studies
Goodness of fit of models measured by adjusted R2

Model of AM or AF      Adj. R2 gzip.c AM   Adj. R2 concordance.c AM   Adj. R2 concordance.c AF
size                   0.4796              0.8254                     0.9127
coverage               0.8103              0.9973                     0.9010
size+coverage          0.8259              0.9986                     0.9579
log(size)              0.5563              0.9663                     0.9628
log(size)+coverage     0.9905              0.9988                     0.9643

Case Studies
Predicted vs. actual AM for AM ~ B1 * log(SIZE) + B2 * COVDEG
[Figure]

Case Studies
Predicted vs. actual AF for AM ~ B1 * log(SIZE) + B2 * COVDEG
[Figure]

Discussion
A Non-Linear Relationship among SIZE, COVDEG, and AM
AM ~ B1 * log(SIZE) + B2 * COVDEG
Explaining the log(SIZE) part of the model: faults become harder to find as the test suite grows
1. Adding a test case to a test suite improves effectiveness only if the added test case finds another fault
2. The faults detected by the added test case are likely to have been revealed already by the existing test suite
3. The added test case is unlikely to improve effectiveness if the test suite is already big enough

Discussion
A Non-Linear Relationship among SIZE, COVDEG, and AM
AM ~ B1 * log(SIZE) + B2 * COVDEG
Explaining the COVDEG part of the model:
- Faults are associated with particular elements in the code
- A test case exercising elements associated with a fault is more likely to force a failure than one that does not
- Regardless of the size of a test suite, a fault is more likely to be exposed by a test case if that test case covers new elements

Discussion
Implications for Software Testers
1. Achieving high coverage leads to higher effectiveness
2. Because the model has the form log(SIZE) + COVDEG, achieving higher coverage becomes more important than size as size grows

Conclusion & Research Directions
Influence of SIZE and COVDEG on Effectiveness of Test Suites
Conclusion
- Both SIZE and COVDEG independently influence effectiveness
- The relationship is not linear: AM ~ B1 * log(SIZE) + B2 * COVDEG
- concordance.c introduced as a new subject program
Future work
- More experimental studies are needed:
  - to validate the results
  - to validate the generated models

Thank You
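
A minimal sketch of how a model of the form AM ~ B1 * log(SIZE) + B2 * COVDEG (from the Data Analysis slides) could be fitted by ordinary least squares. This is an illustration only, not the authors' analysis scripts: the function names are hypothetical, the no-intercept form is assumed from how the model is written on the slides, and the data in the demo is synthetic (generated from made-up coefficients 0.1 and 0.6), not taken from the study. Adjusted R2 is not computed here; only the MSE is.

```python
import math


def fit_log_size_coverage(sizes, covdegs, ams):
    """Least-squares fit of AM = b1*log(SIZE) + b2*COVDEG (no intercept).

    Hypothetical helper: solves the 2x2 normal equations directly.
    """
    xs = [math.log(s) for s in sizes]
    sxx = sum(x * x for x in xs)
    scc = sum(c * c for c in covdegs)
    sxc = sum(x * c for x, c in zip(xs, covdegs))
    sxy = sum(x * y for x, y in zip(xs, ams))
    scy = sum(c * y for c, y in zip(covdegs, ams))
    det = sxx * scc - sxc * sxc
    if abs(det) < 1e-12:
        raise ValueError("log(SIZE) and COVDEG are collinear; fit is not unique")
    # Cramer's rule on [[sxx, sxc], [sxc, scc]] @ [b1, b2] = [sxy, scy]
    b1 = (sxy * scc - sxc * scy) / det
    b2 = (sxx * scy - sxc * sxy) / det
    return b1, b2


def predict(b1, b2, size, covdeg):
    """Predicted mutant detection rate for one test suite."""
    return b1 * math.log(size) + b2 * covdeg


def mean_squared_error(b1, b2, sizes, covdegs, ams):
    """Mean squared difference between actual and predicted AM."""
    errors = [
        (y - predict(b1, b2, s, c)) ** 2
        for s, c, y in zip(sizes, covdegs, ams)
    ]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Synthetic, noise-free data (assumption): one suite per size 1..50,
    # with a made-up coverage curve and made-up true coefficients.
    sizes = list(range(1, 51))
    covdegs = [1 - 0.7 / math.sqrt(s) for s in sizes]
    ams = [0.1 * math.log(s) + 0.6 * c for s, c in zip(sizes, covdegs)]

    b1, b2 = fit_log_size_coverage(sizes, covdegs, ams)
    print("B1 =", b1, "B2 =", b2)
    print("MSE =", mean_squared_error(b1, b2, sizes, covdegs, ams))
```

With real measurements, sizes, covdegs, and ams would hold one entry per generated test suite, and the fit would be repeated once per coverage criterion. Plotting predict(...) against the measured AM would give the kind of predicted vs. actual comparison shown on the figure slides.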