Comparing Pairwise Testing in Software Product Lines: A Framework Assessment

Roberto E. Lopez-Herrejon¹, Javier Ferrer², Evelyn Nicole Haslinger¹, Francisco Chicano², Alexander Egyed¹, and Enrique Alba²

¹ Systems Engineering and Automation, Johannes Kepler University, Linz, Austria
{roberto.lopez, evelyn.haslinger, alexander.egyed}@jku.at
² University of Malaga, Malaga, Spain
{ferrer, chicano, eat}@lcc.uma.es

Abstract. Software Product Lines (SPLs) are families of related software systems which provide different feature combinations. In the past few years many SPL testing approaches have been proposed, among them those that support pairwise testing. Recently a comparison framework for pairwise testing in SPLs has been proposed; however, it has only been applied to two algorithms, neither of them search based, and only on a limited number of case studies. In this paper we address these two limitations by comparing two search based approaches with a leading greedy alternative on 181 case studies taken from two different sources. We make a critical assessment of this framework based on an analysis of how useful it is for selecting the best algorithm among the existing approaches and an examination of the correlation we found between its metrics.

Keywords: Feature Models, Genetic Algorithms, Simulated Annealing, Software Product Lines, Pairwise Testing

1 Introduction

A Software Product Line (SPL) is a family of related software systems which provide different feature combinations [1]. The effective management and realization of variability – the capacity of software artifacts to vary [2] – can lead to substantial benefits such as increased software reuse, faster product customization, and reduced time to market. Systems are being built, more and more frequently, as SPLs rather than individual products because of several technological and marketing trends. This fact has created an increasing need for testing approaches that are capable of coping with the large numbers of feature combinations that characterize SPLs. Many testing alternatives have been put forward [3–6]. Salient among them are those that support pairwise testing [7–13].

With all these pairwise testing approaches available, the question now is: how do they compare? To answer this question, Perrouin et al. have recently proposed a comparison framework [14]. However, this framework has only been applied to two algorithms, neither of them search based, and only on a limited number of case studies. In this paper we address these two limitations by comparing two search based approaches with a leading greedy alternative on 181 case studies taken from two different sources. Our interest (and our contribution) is in answering these two questions:

– RQ1: Is there any correlation between the framework's metrics?
– RQ2: Is this framework useful to select the best one among the testing algorithms?

Based on the obtained answers, we make a critical assessment of the framework and suggest open avenues for further research.

The organization of the paper is as follows. In Section 2 we provide some background on the Feature Models used in Software Product Lines. Section 3 describes Combinatorial Interaction Testing and how it is applied to test SPLs, and Section 4 presents the algorithms used in our experimental section. In Section 5 the comparison framework of Perrouin et al. is presented, and it is evaluated in Section 6.
Some related work is highlighted in Section 7, and we conclude the paper and outline future work in Section 8.

2 Feature Models and Running Example

Feature models have become a de facto standard for modelling the common and variable features of an SPL and their relationships, which collectively form a tree-like structure. The nodes of the tree are the features, which are depicted as labelled boxes, and the edges represent the relationships among them. Thus, a feature model denotes the set of feature combinations that the products of an SPL can have [15].

Figure 1 shows the feature model of our running example, the Graph Product Line (GPL), a standard SPL of basic graph algorithms that has been extensively used as a case study in the product line community [16]. A product has feature GPL (the root of the feature model), which contains its core functionality: a driver program (Driver) that sets up the graph examples (Benchmark) to which a combination of graph algorithms (Algorithms) is applied. The types of graphs (GraphType) can be either directed (Directed) or undirected (Undirected), and can optionally have weights (Weight). Two graph traversal algorithms (Search) are available: either Depth First Search (DFS) or Breadth First Search (BFS). A product must provide at least one of the following algorithms: numbering of nodes in the traversal order (Num), connected components (CC), strongly connected components (SCC), cycle checking (Cycle), shortest path (Shortest), or minimum spanning trees with Prim's algorithm (Prim) or Kruskal's algorithm (Kruskal).

In a feature model, each feature (except the root) has one parent feature and can have a set of child features. Notice here that a child feature can only be included in a feature combination of a valid product if its parent is included as well. The root feature is always included.

Fig. 1. Graph Product Line Feature Model

There are four kinds of feature relationships:
– Mandatory features are depicted with a filled circle. A mandatory feature is selected whenever its respective parent feature is selected. Examples are features Driver and GraphType.
– Optional features are depicted with an empty circle. An optional feature may or may not be selected if its respective parent feature is selected. An example is feature Weight.
– Exclusive-or relations are depicted as empty arcs crossing over a set of lines connecting a parent feature with its child features. They indicate that exactly one of the features in the exclusive-or group must be selected whenever the parent feature is selected. For example, if feature Search is selected, then either feature DFS or feature BFS must be selected.
– Inclusive-or relations are depicted as filled arcs crossing over a set of lines connecting a parent feature with its child features. They indicate that at least one of the features in the inclusive-or group must be selected if the parent is selected. If, for instance, feature Algorithms is selected, then at least one of the features Num, CC, SCC, Cycle, Shortest, Prim, and Kruskal must be selected.

Besides the parent-child relations, features can also relate across different branches of the feature model with the so-called Cross-Tree Constraints (CTC). Figure 1 shows some of the CTCs of our feature model³. For instance, Cycle requires DFS means that whenever feature Cycle is selected, feature DFS must also be selected. As another example, Prim excludes Kruskal means that both features cannot be selected at the same time in any product. These constraints, as well as those implied by the hierarchical relations between features, are usually expressed and checked using propositional logic; for further details refer to [18].
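As an illustration (not the encoding of any of the tools discussed later; helper names are ours, feature names follow Figure 1), the following minimal Python sketch checks a candidate selection against a sample of the GPL rules just described:

```python
# Illustrative validity check for a GPL selection against a sample of the
# relationships and CTCs described above (not the encoding of any tool).

ALGORITHMS = {"Num", "CC", "SCC", "Cycle", "Shortest", "Prim", "Kruskal"}

def valid_gpl(sel):
    """sel: set of selected feature names; returns True if the sampled rules hold."""
    rules = [
        "GPL" in sel,                                               # root always selected
        "Driver" in sel and "GraphType" in sel,                     # mandatory children
        "GraphType" not in sel or
            (("Directed" in sel) != ("Undirected" in sel)),         # exclusive-or group
        "Search" not in sel or (("DFS" in sel) != ("BFS" in sel)),  # exclusive-or group
        "Algorithms" not in sel or bool(ALGORITHMS & sel),          # inclusive-or group
        "Cycle" not in sel or "DFS" in sel,                         # CTC: Cycle requires DFS
        not ({"Prim", "Kruskal"} <= sel),                           # CTC: Prim excludes Kruskal
    ]
    return all(rules)

fs0 = {"GPL", "Driver", "Benchmark", "GraphType", "Weight",
       "Algorithms", "Undirected", "Prim"}
print(valid_gpl(fs0))                 # True: fs0 from Table 1 satisfies the sampled rules
print(valid_gpl(fs0 | {"Kruskal"}))   # False: Prim and Kruskal exclude each other
```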
Definition 1. Feature List (FL) is the list of features in a feature model. The FL for the GPL feature model is [GPL, Driver, Benchmark, GraphType, Directed, Undirected, Weight, Search, DFS, BFS, Algorithms, Num, CC, SCC, Cycle, Shortest, Prim, Kruskal].

³ In total, the feature model has 13 CTCs; for further details please refer to [16] and [17].

Table 1. Sample Feature Sets of GPL: eight valid feature sets fs0–fs7, one per row, with selected features ticked and unselected features left empty. Shortened column labels: Driver (Dri), GraphType (Gtp), Weight (W), Search (Se), Algorithms (Alg), Benchmark (B), Directed (D), Undirected (U), Num (N), Cycle (Cyc), Shortest (Sh), Kruskal (Kru).

Definition 2. A feature set, also called product in SPL, is a 2-tuple [sel, sel̄] where sel and sel̄ are respectively the set of selected and not-selected features of a member product⁴. Let FL be a feature list; thus sel, sel̄ ⊆ FL, sel ∩ sel̄ = ∅, and sel ∪ sel̄ = FL. The terms p.sel and p.sel̄ respectively refer to the set of selected and not-selected features of product p.

Definition 3. A feature set fs is valid in feature model fm, i.e. valid(fs, fm) holds, iff fs does not contradict any of the constraints introduced by fm. We denote with FS the set of valid feature sets for a feature model (we omit the feature model in the notation to keep it light).

Definition 4. A feature f is a core feature if it is selected in all the valid feature sets of a feature model fm, and a variant feature if it is selected only in some of them.

For example, the feature set fs0 = [{GPL, Driver, GraphType, Weight, Algorithms, Benchmark, Undirected, Prim}, {Search, Directed, DFS, BFS, Num, CC, SCC, Cycle, Shortest, Kruskal}] is valid. As another example, a feature set with both features DFS and BFS selected would not be valid because it violates the constraint of the exclusive-or relation, which establishes that these two features cannot appear selected together in the same feature set. In our running example, the feature model denotes 73 valid feature sets. Some of them are depicted in Table 1, where selected features are ticked and unselected features are left empty. For the sake of space, in this table we use shortened names as column labels for some features (e.g. Dri for feature Driver). Notice that features GPL, Driver, Benchmark, GraphType and Algorithms are core features and the remaining features are variant features.

⁴ Definition based on [18].

3 Combinatorial Interaction Testing in SPLs

Combinatorial Interaction Testing (CIT) is a testing approach that constructs samples to drive the systematic testing of software system configurations [19]. When applied to SPL testing, the idea is to select a representative subset of products where interaction errors are more likely to occur, rather than testing the complete product family [19]. In this section we provide the basic terminology of CIT within the context of SPLs.

Definition 5. A t-set ts is a 2-tuple [sel, sel̄] representing a partially configured product, defining the selection of t features of the feature list FL, i.e. ts.sel ∪ ts.sel̄ ⊆ FL ∧ ts.sel ∩ ts.sel̄ = ∅ ∧ |ts.sel ∪ ts.sel̄| = t. We say that a t-set ts is covered by a feature set fs iff ts.sel ⊆ fs.sel ∧ ts.sel̄ ⊆ fs.sel̄.
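A minimal sketch of this covering relation (our own helper names, not code from any of the cited tools), using fs0 from Table 1:

```python
# Sketch of the covering relation: a t-set (sel, unsel) is covered by a feature
# set (product) iff its selected features are selected in the product and its
# not-selected features are not selected in the product.

def covers(fs_sel, fs_unsel, ts_sel, ts_unsel):
    return ts_sel <= fs_sel and ts_unsel <= fs_unsel

# fs0 from Table 1 (selected / not-selected features)
fs0_sel = {"GPL", "Driver", "GraphType", "Weight", "Algorithms",
           "Benchmark", "Undirected", "Prim"}
fs0_unsel = {"Search", "Directed", "DFS", "BFS", "Num", "CC",
             "SCC", "Cycle", "Shortest", "Kruskal"}

# The 2-set [{Driver}, {Prim}] is not covered by fs0 (Prim is selected in fs0) ...
print(covers(fs0_sel, fs0_unsel, {"Driver"}, {"Prim"}))   # False
# ... whereas the 2-set [{}, {SCC, CC}] is covered by fs0.
print(covers(fs0_sel, fs0_unsel, set(), {"SCC", "CC"}))   # True
```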
Definition 6. A t-set ts is valid in a feature model fm if there exists a valid feature set fs that covers ts. The set of all valid t-sets for a feature model is denoted with TS (here too we omit the feature model in the notation for the sake of clarity).

Definition 7. A t-wise covering array tCA for a feature model fm is a set of valid feature sets that covers all valid t-sets denoted by fm⁵. In this paper we also use the term test suite to refer to a covering array.

⁵ Definition based on [20].

Let us illustrate these concepts for pairwise testing, meaning t=2. From the feature model in Figure 1, a valid 2-set is [{Driver},{Prim}]. It is valid because the selection of feature Driver and the non-selection of feature Prim do not violate any constraints. As another example, the 2-set [{Kruskal, DFS}, ∅] is valid because there is at least one feature set, for instance fs1 in Table 1, where both features are selected. The 2-set [∅, {SCC, CC}] is also valid because there are valid feature sets that do not have any of these features selected, for instance feature sets fs0, fs1, and fs3. Notice however that the 2-set [∅, {Directed, Undirected}] is not valid. This is because feature GraphType is present in all the feature sets (it is a mandatory child of the root), so either Directed or Undirected must be selected. In total, our running example has 418 valid 2-sets.

Based on Table 1, the three valid 2-sets just mentioned are covered as follows. The 2-set [{Driver},{Prim}] is covered by feature sets fs1, fs2, fs3, fs4, fs6, and fs7. Similarly, the 2-set [{Kruskal, DFS}, ∅] is covered by feature set fs1, and [∅, {SCC, CC}] is covered by feature sets fs0, fs2, and fs3.

4 Algorithms Overview

In this section we briefly sketch the three testing algorithms we analyse in this paper: CASA, PGS, and ICPL.

4.1 CASA Algorithm

CASA is a simulated annealing algorithm that was designed to generate n-wise covering arrays for SPLs [9]. CASA relies on three nested search strategies. The outermost search performs one-sided narrowing, pruning the potential size of the test suite to be generated by only decreasing the upper bound. The mid-level search performs a binary search for the test suite size. The innermost search strategy is the actual simulated annealing procedure, which tries to find a pairwise test suite of size N for feature model FM. For more details on CASA please refer to [9].

4.2 Prioritized Genetic Solver

The Prioritized Genetic Solver (PGS) is an evolutionary approach proposed by Ferrer et al. [21] that constructs a test suite taking priorities into account during the generation. PGS is a constructive genetic algorithm that adds one new product to the partial solution in each iteration until all pairwise combinations are covered. In each iteration the algorithm tries to find the product that adds the most coverage to the partial solution. This paper extends and adapts PGS for SPL testing. PGS has been implemented using jMetal [22], a Java framework aimed at the development, experimentation, and study of metaheuristics for solving optimization problems. For further details on PGS, please refer to [21].
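At a high level, both PGS and the greedy ICPL approach described next follow a constructive outer loop that repeatedly adds the candidate product covering the most still-uncovered pairs. The sketch below only illustrates that loop; the candidate generation and scoring are naive placeholders, not the actual search steps of either tool:

```python
# Illustrative outer loop shared, at a high level, by constructive approaches such
# as PGS and ICPL: repeatedly add the candidate product covering the most
# still-uncovered pairs.  PGS uses a genetic search and ICPL a greedy set-cover
# construction to obtain each product; here candidates are simply given as a list.

def pairs_covered_by(product_sel, product_unsel, uncovered):
    """Return the uncovered pairs (sel, unsel) that this product covers."""
    return {ts for ts in uncovered
            if ts[0] <= product_sel and ts[1] <= product_unsel}

def build_covering_array(candidate_products, valid_pairs):
    """candidate_products: list of (sel, unsel) sets; valid_pairs: set of (frozenset, frozenset)."""
    uncovered = set(valid_pairs)
    suite = []
    while uncovered:
        # pick the candidate that covers the most still-uncovered pairs
        best = max(candidate_products,
                   key=lambda p: len(pairs_covered_by(p[0], p[1], uncovered)))
        gained = pairs_covered_by(best[0], best[1], uncovered)
        if not gained:
            raise ValueError("no candidate covers the remaining pairs")
        suite.append(best)
        uncovered -= gained
    return suite
```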
4.3 ICPL

ICPL is a greedy approach to generate n-wise test suites for SPLs, which has been introduced by Johansen et al. [20]⁶. It is basically an adaptation of Chvátal's algorithm to solve the set cover problem. First, the set TS of all valid t-sets that need to be covered is generated. Next, the first feature set (product) fs is produced by greedily selecting a subset of t-sets in TS that constitute a valid product in the input feature model, and fs is added to the (initially empty) test suite tCA. Henceforth all t-sets that are covered by product fs are removed from TS. ICPL then proceeds to generate products and add them to the test suite tCA until TS is empty, i.e. until all valid t-sets are covered by at least one product. To increase ICPL's performance, Johansen et al. made several enhancements to the algorithm; for instance, they parallelized the data-independent processing steps. For further details on ICPL please refer to [20].

⁶ ICPL stands for "ICPL Covering array generation algorithm for Product Lines".

5 Comparison Framework

In this section we present the four metrics that constitute the framework proposed by Perrouin et al. for the comparison of pairwise testing approaches for SPLs [14]. We define them based on the terminology presented in Section 2 and Section 3. For the following metric definitions, let tCA be a t-wise covering array of feature model fm.

Metric 1. Test Suite Size is the number of feature sets selected in a covering array for a feature model, namely n = |tCA|.

Metric 2. Performance is the time required for an algorithm to compute a test suite (covering array).

Metric 3. Test Suite Similarity. This metric is defined based on Jaccard's similarity index and applied to variable features. Let FM be the set of all possible feature models, fs and gs be two feature sets in FS, and var : FS × FM → FL be an auxiliary function that returns, from the selected features of a feature set, those features that are variable according to a feature model. The similarity index of two feature sets is thus defined as follows:

Sim(fs, gs, fm) = |var(fs, fm) ∩ var(gs, fm)| / |var(fs, fm) ∪ var(gs, fm)| if var(fs, fm) ∪ var(gs, fm) ≠ ∅, and 0 otherwise   (1)

And the similarity value for the entire covering array is defined as:

TestSuiteSim(tCA, fm) = ( Σ_{fsi ∈ tCA} Σ_{fsj ∈ tCA} Sim(fsi, fsj, fm) ) / |tCA|²   (2)

It should be noted that the second case of the similarity index, when there are no variable features in either feature set, is not part of the originally proposed comparison framework [14]. We added it because we found several feature sets formed only with core features in the feature models analysed in Section 6.3.

Metric 4. Tuple Frequency. Let occurrence : TS × 2^FS → ℕ be an auxiliary function that counts the occurrences of a t-set in the feature sets of a covering array of a single feature model. The metric is thus defined as:

TupleFrequency(ts, tCA) = occurrence(ts, tCA) / |tCA|   (3)

The first two metrics are the standard measurements used for comparison between different testing algorithms, not only within the SPL domain. To the best of our understanding, the intuition behind Test Suite Similarity is that the more dissimilar the feature sets are (i.e. as the value approaches 0), the higher the chances of detecting faulty behaviour when the corresponding t-wise tests are instrumented and performed. Along the same lines, the rationale behind Tuple Frequency is that the lower this number, the lower the chances of repeatedly executing the same t-wise tests. Let us provide some examples for the latter two metrics.
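A minimal sketch (our own helper names, not code from [14] or from the tools compared here) of how Equations (1)–(3) can be computed, assuming products are represented as pairs of selected and not-selected feature sets:

```python
# Sketch of the framework metrics from Equations (1)-(3).
# A product is a pair (sel, unsel) of sets; `variant` is the set of variant features.

def var(product, variant):
    """Selected variable features of a product (the var function of Equation (1))."""
    return product[0] & variant

def sim(fs, gs, variant):
    """Jaccard similarity over selected variable features, Equation (1)."""
    union = var(fs, variant) | var(gs, variant)
    if not union:
        return 0.0
    return len(var(fs, variant) & var(gs, variant)) / len(union)

def test_suite_similarity(tca, variant):
    """Average pairwise similarity over the covering array, Equation (2)."""
    return sum(sim(f, g, variant) for f in tca for g in tca) / len(tca) ** 2

def tuple_frequency(ts_sel, ts_unsel, tca):
    """Fraction of products in the covering array that cover the t-set, Equation (3)."""
    occurrence = sum(1 for sel, unsel in tca if ts_sel <= sel and ts_unsel <= unsel)
    return occurrence / len(tca)
```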
Consider for instance feature sets fs0, fs1, fs2 and fs7 from Table 1. The variable features in those feature sets are:

var(fs0, gpl) = {Undirected, Weight, Prim}
var(fs1, gpl) = {Undirected, Weight, Search, DFS, CC, Kruskal}
var(fs2, gpl) = {Directed, Search, DFS, Num, Cycle}
var(fs7, gpl) = {Undirected, Weight, Search, DFS, CC, Num, Cycle}

An example is the similarity value between feature sets fs0 and fs2, that is Sim(fs0, fs2, gpl) = 0/8 = 0.0. The value is zero because those two feature sets do not have any selected variable features in common. Now consider Sim(fs1, fs7, gpl) = 5/8 = 0.625, which yields a high value because those feature sets have the majority of their selected variable features in common.

To illustrate the Tuple Frequency metric, let us assume that the set of feature sets in Table 1 is a 2-wise covering array of GPL, denoted as tCAgpl⁷. For example, the 2-set ts0 = [{Driver},{Prim}] is covered by feature sets fs1, fs2, fs3, fs4, fs6, and fs7. Thus, its frequency is equal to occurrence(ts0, tCAgpl)/8 = 6/8 = 0.75. As another example, the 2-set ts1 = [{Kruskal, DFS}, ∅] is covered by feature set fs1. Thus its frequency is equal to occurrence(ts1, tCAgpl)/8 = 1/8 = 0.125.

⁷ There are 24 2-wise pairs, out of the 418 pairs that GPL contains, which are not covered by the feature sets of Table 1.

6 Evaluation

For the comparison of CASA, PGS and ICPL we created two groups of feature models based on their provenance. In this section, we describe the salient characteristics of both groups, their evaluation with the comparison framework of Section 5, and a thorough comparison analysis of the three algorithms. All the data and code used in our analysis are available at [17].

6.1 Feature Model Selection

Open Source Projects. The first group of feature models comes from existing SPLs whose feature models and code implementations are readily available as open source projects. Some of them were reverse engineered from single systems and some were developed as SPLs from the outset. We selected the feature models for this group from the SPL Conqueror project [23]. The goal of this project is to provide reliable estimates of measurable non-functional properties such as performance, main memory consumption, and footprint. Table 2 summarizes the 13 feature models of this group. The rationale behind the creation of this group is that they represent real case studies, used both in academic and industrial environments.

SPL Name        NF  NP     Description
Apache          10  256    web server
BerkeleyDBF      9  256    database
BerkeleyDBM      9  3840   database
BerkeleyDBP     27  1440   database
Curl            14  1024   command line
LinkedList      26  1440   data structures
LLVM            12  1024   compiler tool
PKJab           12  72     messenger
Prevayler        6  32     persistence
SensorNetwork   27  16704  simulation
Wget            17  8192   file retrieval
x264            17  2048   video encoding
ZipMe            8  64     compression library
NF: Number of Features, NP: Number of Products, Description: SPL domain
Table 2. Open SPL Projects Characteristics.

Feature Models Repository. The second group of selected feature models comes from the SPLOT repository [24]. This repository contains a large number
of feature models that are not necessarily associated with open source projects and that have been made available to the product line community carrying out research on the formal analysis and reasoning of feature models. From this source, we selected 168 feature models with: i) a number of features ranging from 10 to 67, with a median of 16.5 and an interquartile range of 10, and ii) a number of products between 14 and 73,728, with a median of 131.5 and an interquartile range of 535.25. The rationale behind the creation of this group was to assess the framework on a larger and more diverse set of feature models. Figure 2 depicts the selected feature models based on their number of features and number of products. Notice that the filled circles correspond to the open source projects.

Fig. 2. Summary of analysed feature models: number of products (log scale) against number of features, for the SPLOT and SPL Conqueror models.

6.2 Experimental Set Up

The three algorithms, CASA, PGS and ICPL, are non-deterministic. For this reason we performed 30 independent runs to allow for a meaningful statistical analysis. All the executions were run in a cluster of 16 machines with Intel Core2 Quad Q9400 processors (4 cores per processor) at 2.66 GHz and 4 GB memory, running Ubuntu 12.04.1 LTS and managed by the HTCondor 7.8.4 cluster manager. Since we have 3 algorithms and 181 feature models, the total number of independent runs is 3 · 181 · 30 = 16,290. Once we obtained the resulting test suites we applied the metrics defined in Section 5, and we report summary statistics of these metrics. In order to check whether the differences between the algorithms are statistically significant or just a matter of chance, we applied the Wilcoxon rank-sum test [25] and highlight in the tables the differences that are statistically significant.

6.3 Analysis

The first thing we noticed is the inadequacy of the Tuple Frequency metric for our comparison purposes. Its definition, as shown in Equation (3), applies on a per-tuple basis. In other words, given a test suite, we should compute this metric for each tuple and provide the histogram of the tuple frequencies. This is what the authors of [14] do. This means we would have to show 16,290 histograms, one per test suite. Obviously, this is not viable, so we considered alternatives to summarize all this information. The first idea, also present in [14], is to compute the average of the tuple frequencies in a test suite taking into account all the tuples. Unfortunately, we found that this average says nothing about the test suite: we can prove, using counting arguments, that it depends only on the number of features and the number of valid tuples of the model, so it is the same for all the test suites associated with a feature model. The reader can find the formal proof in the paper's additional resources [17]. In the following we omit any information related to the Tuple Frequency metric and defer to future work the issue of how to summarize it.

We computed the remaining three metrics on all the independent runs of the three algorithms. In order to answer RQ1 we calculated Spearman's rank correlation coefficient for each pair of metrics. Table 3 shows the results obtained, plus the correlation values with the number of products and the number of features of the feature models (the first two columns and rows of the table).

             Products  Features  Size    Performance  Similarity
Products      1         0.591     0.593   0.229       -0.127
Features      0.591     1         0.545   0.283       -0.159
Size          0.593     0.545     1       0.278       -0.371
Performance   0.229     0.283     0.278   1            0.168
Similarity   -0.127    -0.159    -0.371   0.168        1
Table 3. Spearman's correlation coefficients over all models and algorithms.
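Such a matrix can be computed directly from the per-run metric records. The sketch below uses SciPy's spearmanr; the record layout and the values are our own placeholders, not the paper's data:

```python
# Sketch of how a correlation matrix like Table 3 can be obtained with SciPy.
from scipy.stats import spearmanr

runs = [  # dummy placeholder records, one per (algorithm, model, run) triple
    {"products": 256,  "features": 10, "size": 8,  "performance": 217.7, "similarity": 0.30},
    {"products": 1024, "features": 14, "size": 12, "performance": 276.4, "similarity": 0.26},
    {"products": 72,   "features": 12, "size": 7,  "performance": 195.8, "similarity": 0.34},
]

metrics = ["products", "features", "size", "performance", "similarity"]
columns = {m: [r[m] for r in runs] for m in metrics}

for a in metrics:
    for b in metrics:
        rho, _ = spearmanr(columns[a], columns[b])
        print(f"{a:12s} {b:12s} rho = {rho:+.3f}")
```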
We can observe a positive and relatively high correlation among the number of products, the number of features, and the test suite size. This is somewhat expected: the number of valid products tends to increase when more features are added to a feature model. In the same sense, more features usually imply more combinations of features that must be covered by the test suite, which usually means that more test cases must be added. Regarding performance, we expect the algorithms to take more time to generate the test suites for larger models. The positive correlation between the performance and the three previous size-related measures (products, features and size) supports this idea. However, the value is too low (less than 0.3) to clearly claim that larger models require more computation time. The correlation coefficients between the similarity metric and the rest of the metrics are low and change sign, which means that there is no correlation. We would expect larger test suites to contain more similar products (a positive correlation with size), but this is not the case in general. In our opinion, the justification for this somewhat unexpected result lies in the definition of the similarity metric. In Equation (1) the similarity between two products does not take into account the features that are not present in either product, which we would argue is also a source of similarity. In other words, two products are similar not only due to their shared selected features, but also due to their shared unselected features. Thus, our conclusion regarding this metric is that it is not well-defined and the intuition on how it should behave does not hold, as evidenced in Table 3.

Let us now analyse the metric results grouped by algorithm to address RQ2. In Table 4 we show the results for the models in the SPLOT repository and in Table 5 the results for the 13 open source models. In these two tables we highlight with dark gray the results that are better, with a statistically significant difference, than those of the other two algorithms, and with light gray the results that are better than those of only one of the other algorithms.

Algorithm  Size   Performance  Similarity
CASA       9.31   3617.26      0.3581
ICPL       11.06  304.11       0.3309
PGS        10.88  77245.21     0.3635
Table 4. Average of the metrics computed on the test suites generated by the three algorithms for the models in SPLOT.

The results for the SPLOT models reveal that CASA is the best algorithm regarding the size of the test suite (with a statistically significant difference) and PGS is the second best. If we focus on computation time, ICPL is the clear winner followed by CASA; PGS is outperformed by CASA in both test suite size and computation time. Regarding the similarity metric, ICPL is the algorithm providing the most dissimilar products and CASA is the second one. This is somewhat counterintuitive: ICPL is the algorithm producing the largest test suites, and since the number of features in a feature model is finite, we would expect a test suite with more products to contain more similar products. The results of ICPL, however, contradict this idea. In our opinion the reason is, again, that the similarity measure is not well-defined, because it considers only the features that appear in a product and not the ones that do not appear in both products, even though the t-tuples with unselected features must all be covered as well.
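The dark and light gray highlighting in Tables 4 and 5 rests on the per-model Wilcoxon rank-sum comparisons over the 30 runs described in Section 6.2. A minimal sketch of such a check (with dummy values, not the paper's data):

```python
# Sketch of a per-model significance check between two algorithms using the
# Wilcoxon rank-sum test over the independent runs.
from scipy.stats import ranksums

sizes_casa = [6, 6, 6, 7, 6, 6]   # dummy placeholder per-run test-suite sizes
sizes_icpl = [8, 8, 8, 8, 8, 8]

stat, p_value = ranksums(sizes_casa, sizes_icpl)
if p_value < 0.05:
    print(f"difference is statistically significant (p = {p_value:.4f})")
else:
    print(f"difference could be due to chance (p = {p_value:.4f})")
```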
Model          Algorithm  Size   Performance  Similarity
Apache         CASA       6.00   550.00       0.3614
Apache         ICPL       8.00   217.70       0.3044
Apache         PGS        8.07   20139.17     0.3482
BerkeleyDBF    CASA       6.00   583.33       0.3660
BerkeleyDBF    ICPL       7.00   169.63       0.3232
BerkeleyDBF    PGS        8.13   20007.07     0.3485
BerkeleyDBM    CASA       30.00  7600.00      0.3456
BerkeleyDBM    ICPL       31.00  511.70       0.2684
BerkeleyDBM    PGS        30.87  188908.70    0.3731
BerkeleyDBP    CASA       9.80   3500.00      0.3904
BerkeleyDBP    ICPL       10.00  365.83       0.3576
BerkeleyDBP    PGS        11.37  51854.07     0.3796
SensorNetwork  CASA       10.67  1650.00      0.3748
SensorNetwork  ICPL       13.00  431.43       0.3630
SensorNetwork  PGS        13.00  69371.53     0.3806
Prevayler      CASA       6.00   533.33       0.3518
Prevayler      ICPL       8.00   153.10       0.2677
Prevayler      PGS        6.70   20138.13     0.3690
LinkedList     CASA       12.03  2533.33      0.4055
LinkedList     ICPL       14.00  451.43       0.3996
LinkedList     PGS        14.87  75813.10     0.4081
Curl           CASA       8.00   933.33       0.3527
Curl           ICPL       12.00  276.37       0.2634
Curl           PGS        12.07  42065.53     0.3541
LLVM           CASA       6.00   583.33       0.3620
LLVM           ICPL       9.00   228.77       0.2320
LLVM           PGS        8.80   26453.40     0.3622
PKJab          CASA       6.00   616.67       0.3726
PKJab          ICPL       7.00   195.80       0.3439
PKJab          PGS        7.80   24700.60     0.3675
Wget           CASA       9.00   733.33       0.3560
Wget           ICPL       12.00  343.33       0.2685
Wget           PGS        12.40  51234.40     0.3499
x264           CASA       16.00  2966.67      0.3510
x264           ICPL       17.00  343.20       0.2568
x264           PGS        16.90  70243.93     0.3668
ZipMe          CASA       6.00   533.33       0.3654
ZipMe          ICPL       7.00   160.90       0.3428
ZipMe          PGS        7.33   23574.37     0.3781
Table 5. Average of the metrics computed on the test suites generated by the three algorithms for the 13 open source models.

Let us finally analyse the results of the algorithms for the open source models. Regarding the test suite size we again observe that CASA obtains the best results (smaller test suites). The second best in this case is usually ICPL (with the only exception of Prevayler). Regarding the computation time, ICPL is the fastest algorithm, followed by CASA, in all the models; PGS is the slowest algorithm. If we observe the similarity metric, the trend is also clear: ICPL obtains the lowest similarity measures in all the models. In summary, the results obtained for the open source models support the same conclusions obtained for the SPLOT models: CASA obtains the smallest test suites, while ICPL is the fastest algorithm and also obtains the most dissimilar products.

7 Related Work

In the area of Search-Based Software Engineering a major research focus has been software testing [26, 27]. A recent overview by McMinn highlights the major achievements made in the area and some of the open questions and challenges that remain [28]. There exists substantial literature on SPL testing [3–6]; however, except for a few efforts [9, 13], the application of search based techniques to SPLs remains largely unexplored. A related work on computing covering arrays with a simulated annealing approach is presented by Torres-Jimenez et al. [29]. They propose improvements to the state-of-the-art techniques for Constrained Combinatorial Interaction Testing (CCIT), an extension of CIT that also considers constraints on the inputs. In the search based literature, there exists a plethora of articles that compare testing algorithms using different metrics. For example, Mansour et al. [30] compare five algorithms for regression testing using eight different metrics (including quantitative and qualitative criteria). Similarly, Uyar et al. [31] compare different metrics implemented as fitness functions to solve the problem of test input generation in software testing. To the best of our knowledge, in the literature on test case generation there is no well-known comparison framework for the research and practitioner community to use.
Researchers usually apply their methods to open source programs and directly compute some metrics such as the success rate, the number of test cases, and performance. The closest to a common comparison framework can be traced back to the work of Rothermel and Harrold [32], who propose a framework for regression testing.

8 Conclusions and Future Work

In this paper, we make an assessment of the comparison framework for pairwise testing of SPLs proposed by Perrouin et al. on two search based algorithms (CASA and PGS) and one greedy approach (ICPL). We used a total of 181 feature models from two different provenance sources. We identified two shortcomings of the two novel metrics of this framework: similarity does not consider features that are not selected in a product, and tuple frequency is applicable on a per-tuple basis only. Overall, CASA obtains the smallest test suites, while ICPL is the fastest algorithm and also obtains the most dissimilar products. We plan to expand our work by considering other testing algorithms and larger feature models but, most importantly, by researching other metrics that could provide more useful and discerning criteria for algorithm selection. For this latter goal we will follow the guidelines for metrics selection suggested in [33]. In addition, we want to study how to enhance the effectiveness of covering arrays by considering information from the associated source code, along the lines of Shi et al. [34].

Acknowledgments. This research is partially funded by the Austrian Science Fund (FWF) project P21321-N15 and Lise Meitner Fellowship M1421-N15, and by the Spanish Ministry of Economy and Competitiveness and FEDER under contract TIN2011-28194 (roadME project).

References

1. Pohl, K., Bockle, G., van der Linden, F.J.: Software Product Line Engineering: Foundations, Principles and Techniques. Springer (2005)
2. Svahnberg, M., van Gurp, J., Bosch, J.: A taxonomy of variability realization techniques. Softw., Pract. Exper. 35(8) (2005) 705–754
3. Engström, E., Runeson, P.: Software product line testing - a systematic mapping study. Information & Software Technology 53(1) (2011) 2–13
4. da Mota Silveira Neto, P.A., do Carmo Machado, I., McGregor, J.D., de Almeida, E.S., de Lemos Meira, S.R.: A systematic mapping study of software product lines testing. Information & Software Technology 53(5) (2011) 407–423
5. Lee, J., Kang, S., Lee, D.: A survey on software product line testing. In: 16th International Software Product Line Conference. (2012) 31–40
6. do Carmo Machado, I., McGregor, J.D., de Almeida, E.S.: Strategies for testing products in software product lines. ACM SIGSOFT Software Engineering Notes 37(6) (2012) 1–8
7. Perrouin, G., Sen, S., Klein, J., Baudry, B., Traon, Y.L.: Automated and scalable t-wise test case generation strategies for software product lines. In: ICST, IEEE Computer Society (2010) 459–468
8. Oster, S., Markert, F., Ritter, P.: Automated incremental pairwise testing of software product lines. In Bosch, J., Lee, J., eds.: SPLC. Volume 6287 of Lecture Notes in Computer Science., Springer (2010) 196–210
9. Garvin, B.J., Cohen, M.B., Dwyer, M.B.: Evaluating improvements to a metaheuristic search for constrained interaction testing. Empirical Software Engineering 16(1) (2011) 61–102
10. Hervieu, A., Baudry, B., Gotlieb, A.: Pacogen: Automatic generation of pairwise test configurations from feature models. In Dohi, T., Cukic, B., eds.: ISSRE, IEEE (2011) 120–129
11. Lochau, M., Oster, S., Goltz, U., Schürr, A.: Model-based pairwise testing for feature interaction coverage in software product line engineering. Software Quality Journal 20(3-4) (2012) 567–604
12. Cichos, H., Oster, S., Lochau, M., Schürr, A.: Model-based coverage-driven test suite generation for software product lines. In: Proceedings of Model Driven Engineering Languages and Systems. (2011) 425–439
13. Ensan, F., Bagheri, E., Gasevic, D.: Evolutionary search-based test generation for software product line feature models. In Ralyté, J., Franch, X., Brinkkemper, S., Wrycza, S., eds.: CAiSE. Volume 7328 of Lecture Notes in Computer Science., Springer (2012) 613–628
14. Perrouin, G., Oster, S., Sen, S., Klein, J., Baudry, B., Traon, Y.L.: Pairwise testing for software product lines: comparison of two approaches. Software Quality Journal 20(3-4) (2012) 605–643
15. Kang, K., Cohen, S., Hess, J., Novak, W., Peterson, A.: Feature-Oriented Domain Analysis (FODA) Feasibility Study. Technical Report CMU/SEI-90-TR-21, Software Engineering Institute, Carnegie Mellon University (1990)
16. Lopez-Herrejon, R.E., Batory, D.S.: A standard problem for evaluating product-line methodologies. In Bosch, J., ed.: GCSE. Volume 2186 of Lecture Notes in Computer Science., Springer (2001) 10–24
17. Lopez-Herrejon, R.E., Ferrer, J., Haslinger, E.N., Chicano, F., Egyed, A., Alba, E.: Paper code and data repository (2013) http://neo.lcc.uma.es/staff/javi/resources.html
18. Benavides, D., Segura, S., Cortés, A.R.: Automated analysis of feature models 20 years later: A literature review. Inf. Syst. 35(6) (2010) 615–636
19. Cohen, M.B., Dwyer, M.B., Shi, J.: Constructing interaction test suites for highly-configurable systems in the presence of constraints: A greedy approach. IEEE Trans. Software Eng. 34(5) (2008) 633–650
20. Johansen, M.F., Haugen, Ø., Fleurey, F.: An algorithm for generating t-wise covering arrays from large feature models. In: 16th International Software Product Line Conference. (2012) 46–55
21. Ferrer, J., Kruse, P.M., Chicano, J.F., Alba, E.: Evolutionary algorithm for prioritized pairwise test data generation. In Soule, T., Moore, J.H., eds.: GECCO, ACM (2012) 1213–1220
22. Durillo, J.J., Nebro, A.J.: jMetal: A Java framework for multi-objective optimization. Advances in Engineering Software 42(10) (2011) 760–771
23. Siegmund, N., Rosenmüller, M., Kästner, C., Giarrusso, P.G., Apel, S., Kolesnikov, S.S.: Scalable prediction of non-functional properties in software product lines: Footprint and memory consumption. Information & Software Technology 55(3) (2013) 491–507
24. SPLOT: Software Product Line Online Tools (2013) http://www.splotresearch.org
25. Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures. Chapman & Hall/CRC, 4th edition (2007)
26. Harman, M., Mansouri, S.A., Zhang, Y.: Search-based software engineering: Trends, techniques and applications. ACM Comput. Surv. 45(1) (2012) 11
27. de Freitas, F.G., de Souza, J.T.: Ten years of search based software engineering: A bibliometric analysis. In Cohen, M.B., Cinnéide, M.Ó., eds.: SSBSE. Volume 6956 of Lecture Notes in Computer Science., Springer (2011) 18–32
28. McMinn, P.: Search-based software testing: Past, present and future. In: ICST Workshops, IEEE Computer Society (2011) 153–163
29. Torres-Jimenez, J., Rodriguez-Tello, E.: New bounds for binary covering arrays using simulated annealing. Information Sciences 185 (2012) 137–152
30. Mansour, N., Bahsoon, R., Baradhi, G.: Empirical comparison of regression test selection algorithms. Journal of Systems and Software 57 (2001) 79–90
31. Uyar, H.T., Uyar, A.S., Harmanci, E.: Pairwise sequence comparison for fitness evaluation in evolutionary structural software testing. In: Genetic and Evolutionary Computation Conference. (2006) 1959–1960
32. Rothermel, G., Harrold, M.J.: A framework for evaluating regression test selection techniques. In: Proceedings of the 16th International Conference on Software Engineering. (1994) 201–210
33. Meneely, A., Smith, B.H., Williams, L.: Validating software metrics: A spectrum of philosophies. ACM Trans. Softw. Eng. Methodol. 21(4) (2012) 24
34. Shi, J., Cohen, M.B., Dwyer, M.B.: Integration testing of software product lines using compositional symbolic execution. In de Lara, J., Zisman, A., eds.: FASE. Volume 7212 of Lecture Notes in Computer Science., Springer (2012) 270–284
35. de Almeida, E.S., Schwanninger, C., Benavides, D., eds.: 16th International Software Product Line Conference, SPLC '12, Salvador, Brazil, September 2-7, 2012, Volume 1. ACM (2012)
36. Whittle, J., Clark, T., Kühne, T., eds.: Model Driven Engineering Languages and Systems, 14th International Conference, MODELS 2011, Wellington, New Zealand, October 16-21, 2011, Proceedings. Volume 6981 of Lecture Notes in Computer Science., Springer (2011)