
Leveraging Field Data for Impact Analysis and
Regression Testing
Alessandro Orso, Taweesup Apiwattanapong, and Mary Jean Harrold
College of Computing
Georgia Institute of Technology
{orso|term|harrold}@cc.gatech.edu
ABSTRACT
Software products are often released with missing functionality, errors, or incompatibilities that may result in failures,
inferior performance, or user dissatisfaction. In previous
work, we presented the Gamma approach, which facilitates
remote analysis and measurement of deployed software and
permits gathering of program-execution data from the field.
In this paper, we investigate the use of the Gamma approach
to support and improve two fundamental tasks performed by
software engineers during maintenance: impact analysis and
regression testing. We present a new approach that leverages field data to perform these two tasks. The approach
is efficient in that the kind of field data that we consider
requires limited space and little instrumentation. We also
present a set of empirical studies that we performed, on a
real subject and on a real user population, to evaluate the
approach. The results of the studies show that the use of
field data is effective and, for the cases considered, can considerably affect the results of dynamic analyses.
Categories and Subject Descriptors
D.2.5 [Software Engineering]: Testing and Debugging—
Monitors, Testing tools, Tracing
General Terms
Algorithms, Experimentation, Reliability, Verification
Keywords
Software engineering, Gamma technology, impact analysis,
regression testing
1. INTRODUCTION
In recent years, we have witnessed a fundamental shift
in the way in which software is developed and deployed.
Decades ago, relatively few software systems existed, and
these systems were often custom-built and run on a limited number of mostly disconnected computers. After the
PC revolution, the number of computers and software systems has increased dramatically. Moreover, the continuous
growth of the Internet has significantly increased the connectivity of computing devices and led to a situation in which
most of these systems and computers are interconnected.
Although these changes create new software development
challenges, such as shortened development cycles and increased frequency of software updates, they also represent
new opportunities that, if suitably exploited, may provide
solutions to both new and existing software quality and performance problems. For example, consider quality assurance tasks such as testing and analysis. With rare exceptions, these activities are currently performed in-house, on
developer platforms, using developer-provided inputs. The
underlying assumption is that the software is tested and analyzed in the same way that it will be used in the field.
Unfortunately, this assumption is rarely satisfied. Consequently, considerable resources can be wasted on exercising
configurations that do not occur in the field and code entities that are not exercised by actual users. Conversely,
in-house testing can miss exercising configurations and behaviors that actually occur in the field, thus lowering our
confidence in the testing.
Our previous research suggests that software development
can significantly benefit by augmenting analysis and measurement tasks performed in-house with analysis and measurement tasks performed on the software deployed in the
field. We call this the Gamma [11] approach. There are two
main advantages of the Gamma approach: (1) analyses rely
on actual field data, rather than on synthetic user data, and
(2) analyses leverage the vast and varied resources of a user
community, not just the limited, and often homogeneous,
resources found in a typical development environment.
There are many scenarios in which the Gamma approach
can be exploited, and numerous tasks that can benefit from
using the Gamma approach. In previous work, we investigated the use of the Gamma approach for collecting coverage information from deployed instances of the software
for use in creating user profiles, determining classes of users
of the software, and assessing the costs and identifying the
issues associated with collecting field data [1, 4]. We also
investigated the use of Gamma for visualization of program-execution data collected from the field for use in investigating aspects of the behavior of software after deployment [10].
In this paper, we investigate the use of the Gamma approach to support and improve the tasks of maintaining
evolving software. We address two fundamental tasks that
are routinely performed by software engineers in maintaining evolving software: impact analysis and regression testing. To the best of our knowledge, all existing techniques
for performing these tasks (e.g., [2, 3, 9, 15, 17]) rely on
in-house data only.
We present techniques for performing these tasks that
use field data collected from actual users of the deployed
software. To use the techniques in realistic settings, we must
constrain both the instrumentation required to collect the
field data and the amount of field data collected. Our techniques require only lightweight instrumentation and collect
data on the order of a few kilobytes per execution.
We also present a set of empirical studies that evaluate our
approach. For the studies, we used a real program and a real
setting: we instrumented a version of a program-analysis
tool developed within our research group, released it to a set
of users, and collected dynamic data while the users used the
tool for their work. Our studies show that, for this system,
field data can be effective in improving the quality of the
dynamic analyses we considered. The studies also show that,
for the same system, traditional techniques would compute
results that do not reflect the actual use of the system.
The main contributions of this paper are:
• a technique to perform impact analysis that leverages
data collected from the field;
• a technique to support regression testing that leverages
data collected from the field; and
• a set of empirical studies that show how those dynamic
analyses can benefit from the use of real field data.
2. SCENARIO
In this section, we sketch the scenario that we investigate.
We refer to the scenario throughout the paper to illustrate
our techniques.
D is the developer of a software product P , whose latest
version has been released to a number of users. During maintenance, D is considering modifying P . Before performing
the changes, D assesses the impact of such changes on P
and, indirectly, on the users. Based on the assessment, D
selects the changes to be applied to P and, after performing
the changes, obtains a new version of the software product, P′. At this point, D retests P′ by selecting a subset of the test suite used to test P and by adding new test cases to test the affected and added features. After testing is complete, D releases P′ to the users.
This scenario involves two main tasks on the developer’s
side: predictive impact analysis and regression testing. Because we want to use field data to support these tasks, we
impose two requirements on D.
The first requirement is that D provides a method for instrumenting the software. There are a number of ways in
which this can be done, including performing instrumentation before releasing the software, providing an instrumenter
on-site that performs the instrumentation on demand, or
dynamically updating the software while it is running [12].
For this work, we assume that developer D adds lightweight
instrumentation to program P before releasing it.
D can add instrumentation at different levels and to different extents based on the context in which P is used. For
example, beta testers may tolerate a considerably higher
overhead than users of the final product. In the rest of the
paper, we assume that D instruments P to gather information about the coverage of either basic blocks¹ (hereafter referred to as blocks) or methods.
The second requirement is that, when users execute P ,
coverage information is collected and sent back to D. The
way information is collected may vary. For example, for a
platform that is always connected, the data may be sent
back as soon as they are produced, whereas for a mostly
disconnected platform, the data could be stored locally and
sent back when a connection is available. We assume that D
collects, over time, a set of execution data per user. The data
for each execution contain the coverage information related
to that execution. For example, if the execution information
collected is method coverage, the data for each execution
consist of the set of methods covered by that execution.
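To make this concrete, the following is a minimal sketch, in Python, of how such per-execution coverage data might be represented once collected; the user and method names and the variable names are purely illustrative and are not part of the Gamma tooling.

from typing import Dict, List, Set

# One execution is recorded as the set of entities it covered
# (method names here; basic blocks at a finer granularity).
Execution = Set[str]

# Field data: for each user, the list of executions collected over time.
field_data: Dict[str, List[Execution]] = {
    "user1": [{"A.parse", "A.resolve"}, {"A.parse"}],
    "user2": [{"B.buildCFG", "A.parse", "A.resolve"}],
}

# E, the flat collection of execution-data sets used by the algorithms in Section 3.
E: List[Execution] = [e for runs in field_data.values() for e in runs]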
3. IMPACT ANALYSIS
Software impact analysis is the task of estimating the
parts of the software that can be affected if a proposed software change is made. Impact-analysis information can be
used when planning changes, making changes, and tracking
the effects of changes (e.g., in the case of regression errors).
One traditional way to estimate the impact of one or more
changes is to use static information about the code (i.e.,
dependences) to identify the parts of the software that may
be affected by such changes. For example, inserting changes
in the code and performing forward slicing² from the change
points yield a set of statements that may be affected by the
changes. For another example, transitive closure on the call graph³ starting from the method(s) where the changes occur
results in an (unsafe) estimate of the set of methods affected
by the changes [3]. These techniques are generally imprecise
and tend to overestimate the effects of changes.
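As an illustration of this traditional call-graph technique, the sketch below computes reachability from the changed methods over "A may call B" edges; the dictionary-based graph encoding and the function name are assumptions made for illustration, not an implementation taken from [3].

from collections import deque
from typing import Dict, Set

def call_graph_impact(call_graph: Dict[str, Set[str]], changed: Set[str]) -> Set[str]:
    # Transitive closure on the call graph starting from the changed methods.
    # call_graph maps each method to the set of methods it may call.
    impacted = set(changed)
    worklist = deque(changed)
    while worklist:
        m = worklist.popleft()
        for callee in call_graph.get(m, set()):
            if callee not in impacted:
                impacted.add(callee)
                worklist.append(callee)
    return impacted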
Recently, researchers have considered more precise ways
to assess the impact of changes, using information gathered
during execution. For example, Law and Rothermel defined
a technique for impact analysis [9] that uses whole path profiling [8] to estimate the effects of changes.
Although techniques based on such execution traces can
achieve better results than traditional impact-analysis approaches [9], they are constrained by the quality of the data
used: if these techniques are applied in-house, the use of
synthetic inputs may limit their effectiveness; if the set of
inputs is inadequate, so are the results of the analysis.
However, using real data gathered from the field may not
be an option for these techniques: (1) the size of the execution traces generated during execution can easily approach
thousands of megabytes, and (2) algorithms that compress
the traces to a reasonable size can be computationally expensive and cannot be straightforwardly used on-line, while
the execution traces are produced.
To use field data, a technique must constrain both the
instrumentation required to collect the data and the data
collected for each execution. Our approach for impact analysis requires only lightweight instrumentation and collects
data on the order of a few kilobytes per execution.
¹ A basic block is a sequence of statements in which control enters at the first statement and exits at the last statement without halt or possibility of branching except at the end.
² Forward slicing determines, for a point p in program P and a set of variables V, those statements in P that may be affected by the values of variables in V at p [19].
³ A call graph is a directed graph in which nodes represent functions (or methods) and an edge between two nodes A and B means that A may call B.
algorithm ChangeImpact
input:   Program P,
         Set of execution data E; e ∈ E is a set of entities,
         Changed entity c
output:  Set IE of entities impacted by c
declare: Set of affected executions AEx, initially empty,
         Set of entities SL
begin ChangeImpact
/* identify users' executions that go through c */
1: for each execution-data set e ∈ E do
2:   if e ∩ {c} ≠ ∅ then
3:     AEx = AEx ∪ e
4:   end if
5: end for
/* identify entities in users' executions affected by c */
6: SL = forward slice of P starting at c with variables in c
7: IE = SL ∩ AEx
8: return IE
end ChangeImpact

Figure 1: Algorithm for performing impact analysis for an entity-level change.
3.1 Impact Analysis Using Field Data
According to the scenario described in Section 2, in considering a set of candidate changes for program P , developer D analyzes the expected impact of each change on P
to make an informed decision about which changes should
be implemented and which changes should be postponed.
In illustrating the technique, we use the term entities to
indicate either blocks or methods, depending on the level
of instrumentation and detail chosen by developer D. We
use C to represent a change to be performed on program
P . C consists of one or more entity-level changes c, which
are changes to single entities in P . In the general case,
developer D will have a set of candidate changes C1 , C2 ,
. . ., Cn to perform on the system, each one consisting of one
or more entity-level changes. We use E to represent a set
of execution data e, where each e is expressed as the set of
entities traversed by an execution.
Given a change C, our technique for impact analysis identifies an impact set—a set of entities potentially impacted
by change C. To this end, our technique computes, for each
entity-level change c in C, an approximate dynamic slice
based on the execution data for executions that traverse c.
The impact set is the union of the slices computed for each
c in C.
The algorithms to compute the impact set, ChangeImpact
and ImpactAnalysis, are shown in Figures 1 and 2, respectively. ChangeImpact performs change-impact analysis for
an entity-level change c, whereas ImpactAnalysis uses the
results of ChangeImpact to compute the impact set for a set
of entity-level changes C.
Algorithm ChangeImpact takes three inputs: (1) a program P , (2) a set of execution data for P , E, and (3) a
changed entity c. The algorithm consists of two main steps.
In the first step (lines 1–5), the algorithm identifies those
executions that are affected by c. To this end, the algorithm
identifies the set of users’ executions that traverse c and
stores them in AEx, the set of affected executions.
In the second step (lines 6–8), ChangeImpact identifies
the entities impacted by c. First, the algorithm computes a
forward static slice SL using c and the variables in c as the
slicing criterion. Then, the algorithm computes IE as the
intersection of SL and AEx, and returns it. IE is the set
of entities impacted by c.
algorithm ImpactAnalysis
input:   Program P,
         Set of execution data E; e ∈ E is a set of entities,
         Change C; c ∈ C is an entity-level change
output:  IEC, the impact set for C, initially empty
use:     ChangeImpact(P, E, c) returns the affected entities set IE
         for program P, executions E, and change c.
begin ImpactAnalysis
1: for each change c ∈ C do
2:   IEC = IEC ∪ ChangeImpact(P, E, c)
3: end for
4: return IEC
end ImpactAnalysis

Figure 2: Algorithm for performing impact analysis for a change C.

Algorithm ImpactAnalysis uses ChangeImpact and computes the impact set for change C, which may consist of multiple entity-level changes. This algorithm also takes three inputs: (1) a program P, (2) a set of execution data E for P, and (3) a change C.
The algorithm consists of a loop (lines 1–3) that simply
invokes ChangeImpact for each change c in C and computes
the union of the resulting IE sets. After all c’s are processed,
the resulting set, IEC, is returned (line 4). IEC provides
developer D with an estimate of the parts of P that would
be impacted by C, according to the way P is actually used
in the field.
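The following is a minimal sketch of ChangeImpact and ImpactAnalysis over the set-based execution data described above; the forward_slice parameter stands in for the static slicer, which is outside the scope of the sketch, and all names are illustrative.

from typing import Callable, Iterable, List, Set

def change_impact(executions: List[Set[str]], c: str,
                  forward_slice: Callable[[str], Set[str]]) -> Set[str]:
    # Lines 1-5 of ChangeImpact: union of the entities of every
    # execution that traverses the changed entity c.
    affected = set()
    for e in executions:
        if c in e:
            affected |= e
    # Lines 6-8: intersect the static forward slice from c with the
    # entities appearing in the affected executions.
    return forward_slice(c) & affected

def impact_analysis(executions: List[Set[str]], change: Iterable[str],
                    forward_slice: Callable[[str], Set[str]]) -> Set[str]:
    # ImpactAnalysis: union of the impact sets of all entity-level changes in C.
    impact_set = set()
    for c in change:
        impact_set |= change_impact(executions, c, forward_slice)
    return impact_set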
3.2 User-Sensitive Impact Analysis
A useful byproduct of using field data for impact analysis
is that it lets developer D assess the impact on the users
of a given change or set of changes. By knowing how users
use the software, D can estimate how and to what extent
various changes can affect different users.⁴
To the best of our knowledge, this is a new kind of impact analysis that provides the developer with another piece
of information to further support the decision about which
changes to integrate into the system.
There are a number of ways in which we can use field data
to compute the impact on the user population. We present
three such computations.
Collective Percentage (CP) is the percentage of users’
executions affected by C. CP is the ratio of the number of
executions in E that traverse at least one change c in C to
the total number of executions in E.
CP = \frac{\| e \in E \text{ traversing at least one } c \in C \|}{\| E \|} \qquad (1)
For example, if C consists of two entity-level changes, and
the execution data collected show that 1,500 of the 10,000
executions exercise at least one of the two changes, CP is
15%. This measure can give developer D a general idea of
the overall impact of the changes on the user population,
based on the way the program is actually being used.
Percentage Per User (PPUi ) is the percentage of users’
executions affected by C per user. To compute P P U i , we
use Equation 1 applied to the execution data for user Ui ,
rather than to all executions in E. For each Ui , the result is
the percentage of executions of Ui , Ei , that may be affected
by the changes.
PPUi =
ke ∈ Ei traversing at least one c ∈ Ck
kEi k
(2)
⁴ The term user refers to a role instead of an actual entity. For a large user population, we may need to aggregate the field information at different levels (e.g., per deployment site or per company).
PPU_i provides finer-grained information about the effects
of changes than CP . Because CP is cumulative over all
executions, it can underestimate or overestimate the impact
of a change on particular users. For example, a value of 50%
for CP for a change C may underestimate the impact of C
on some users—those for which most executions traverse C.
In such cases, finer-grained information provided by P P U i
may lead to a more informed decision on whether or not to
implement C.
Percentage of Affected Users (PAU) is the percentage
of all users U affected by C. This measure is obtained by
considering as affected those users for which the percentage
of affected executions is greater than zero.
PAU = \frac{\| U_i : PPU_i > 0 \|}{\| U \|} \qquad (3)
This third measure lets developer D reason about the possible effects of changes in terms of the number of users of
deployed instances of the software.
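A minimal sketch of how CP, PPU_i, and PAU could be computed from per-user execution data follows; the function and variable names are illustrative, and executions are assumed to be represented as sets of traversed entities, as in Section 3.1.

from typing import Dict, List, Set

def cp(E: List[Set[str]], change: Set[str]) -> float:
    # Equation 1: percentage of all executions that traverse at least one c in C.
    return 100.0 * sum(1 for e in E if e & change) / len(E)

def ppu(per_user: Dict[str, List[Set[str]]], change: Set[str]) -> Dict[str, float]:
    # Equation 2: the same ratio, restricted to each user's executions E_i.
    return {u: cp(runs, change) for u, runs in per_user.items()}

def pau(per_user: Dict[str, List[Set[str]]], change: Set[str]) -> float:
    # Equation 3: percentage of users with at least one affected execution.
    values = ppu(per_user, change)
    return 100.0 * sum(1 for v in values.values() if v > 0) / len(values)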
Developer D can use all three kinds of information to make
decisions during maintenance. For example, D could decide
to postpone some change(s) because the impact on the users
would be significant. For another example, D may decide to
release the new version of P only to a subset of the users—
the ones least affected by the changes.
4. REGRESSION TESTING
As software evolves during development and maintenance,
regression testing is applied to modified software to provide
confidence that the changed parts behave as intended and
that the unchanged parts have not been adversely affected
by the modifications.
For P′, a modified version of program P, let C be the set of entity-level changes between P and P′, T be the test suite used to test P, and T′ the test suite to regression test P′. When developing T′, three problems arise: (1) which test cases in T should be used to test P′ (regression test selection); (2) which new test cases must be developed (test-suite augmentation); and (3) in which order the test cases should be run (test-suite prioritization).
There are a number of ways to obtain T′ that differ in the kind of analysis performed and in precision and efficiency (e.g., [5, 6, 14, 15, 16, 18, 20]). Our technique leverages field data to support regression testing. We use the results of impact analysis to help select, augment, and prioritize T′.
First, based on coverage information for P , our technique
selects an initial set T′ that consists of all test cases in T
that traverse at least one change in C.
Then, we use the set of affected entities identified by impact analysis to assess whether, according to the field data, T′ is adequate for P′. This step is based on the intuition that, in the field, executions that traverse changed parts of the code are more likely to traverse affected entities than other entities in the program. Therefore, we consider T′ adequate for P′ if, for each entity-level change c and each affected entity ae in the IE set for c, there exists at least one test case in T′ that traverses ae after traversing c. We call the affected entities for which no such test case exists critical entities. If T′ is not adequate for P′, T′ will need to be augmented with additional test cases to cover the critical entities.
algorithm RegressionTesting
input:   Program P,
         Set of test cases T for P,
         Set of execution data E; e ∈ E is a set of entities,
         Change C; c ∈ C is an entity-level change
output:  Set T′ of test cases in T that traverse at least one critical entity, initially empty,
         Sets of critical entities CE[], one for each entity-level change
use:     cov(t) returns the entities in P covered by test case t,
         ChangeImpact(P, E, c) returns the impact set IE
         for program P, executions E, and change c.
declare: Impact set IE,
         Set of entities exercised by a test case EXIE
begin RegressionTesting
 1: for each change c ∈ C do
 2:   IE = ChangeImpact(P, E, c)
 3:   CE[c] = IE
 4:   for each test case t ∈ T do
      /* If the test case does not traverse c, skip it */
 5:     if cov(t) ∩ {c} == ∅ then
 6:       continue
 7:     end if
 8:     T′ = T′ ∪ {t}
      /* Identify affected entities covered by t */
 9:     EXIE = cov(t) ∩ IE
10:     if EXIE ≠ ∅ then
11:       CE[c] = CE[c] − EXIE
12:     end if
13:   end for
14: end for
15: return T′, CE[]
end RegressionTesting

Figure 3: Algorithm for identifying the initial T′ and the critical entities.
Additionally, we can use the information on the affected
entities to prioritize the test cases in T′, by giving a higher
priority to test cases that cover a higher number of affected
entities. By doing so, we prioritize according to the way
in which we expect the program to be used in the field.
This way of prioritizing can be combined with other existing prioritization techniques (e.g., [16, 18]). For example,
the number of affected entities covered can be an additional
parameter of the prioritization.
Figure 3 shows algorithm RegressionTesting, which selects the initial set T′ and computes the set of critical entities. The algorithm takes four inputs: (1) a program P, (2)
a set of test cases for P , T , (3) a set of execution data for
P , E, and (4) a change C.
For each entity-level change c in C (loop 1–14), the algorithm first identifies the entities impacted by c, IE, using
algorithm ChangeImpact (see Section 3.1) and initializes the
set of critical entities for change c, CE[c], to IE.
Then, for each test case t in T (loop 4–13), the algorithm
checks whether t traverses change c and, if not, continues
to the next test case (lines 5–7). If t does traverse c, the
algorithm inserts t in T′ (line 8) and inserts the affected
entities exercised by t in set EXIE (line 9). If EXIE is not
empty, the algorithm removes the entities in EXIE from
the set of critical entities for c (lines 10 and 11).
When all entity-level changes have been processed, T′ contains all test cases that traverse at least one change, and CE[] contains one set of critical entities for each change c in C. At this point, RegressionTesting returns T′ and CE[]
(line 15).
As stated above, each critical entity is an additional test requirement that should be satisfied before releasing P′—the requirement can be expressed as the coverage of that entity by a test case that also traverses c. Developer D can thus use the sets of critical entities to augment T′. D can also use the information about the number of affected entities covered by the test cases in T′ to prioritize the test cases.
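The sketch below mirrors RegressionTesting and the prioritization heuristic just described, assuming set-based coverage data; cov and change_impact are placeholders for the coverage query and for the ChangeImpact routine of Section 3.1 (with the static slicer already bound), and all names are illustrative rather than part of any actual tool.

from typing import Callable, Dict, List, Set

def regression_testing(T: List[str], E: List[Set[str]], change: Set[str],
                       cov: Callable[[str], Set[str]],
                       change_impact: Callable[[List[Set[str]], str], Set[str]]):
    # Figure 3: select T' and compute the critical entities CE[c] for each c.
    T_prime: Set[str] = set()
    CE: Dict[str, Set[str]] = {}
    for c in change:                       # loop 1-14
        IE = change_impact(E, c)           # line 2
        CE[c] = set(IE)                    # line 3
        for t in T:                        # loop 4-13
            if c not in cov(t):            # lines 5-7: t does not traverse c
                continue
            T_prime.add(t)                 # line 8
            CE[c] -= cov(t) & IE           # lines 9-12: drop affected entities covered by t
    return T_prime, CE                     # line 15

def prioritize(T_prime: Set[str], impact_set: Set[str],
               cov: Callable[[str], Set[str]]) -> List[str]:
    # Order the selected test cases by the number of affected entities they cover.
    return sorted(T_prime, key=lambda t: len(cov(t) & impact_set), reverse=True)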
Our technique is unsafe for two reasons. First, it does not
consider the sequence in which entities are executed, thus
missing the distinction between executions that traverse an
entity before and after traversing a change c. Second, it uses
information about coverage on version P to approximate information about P′, although executions with the same inputs may cover different entities in P and P′. However, we expect such approximations to lower the overhead of the technique, thus making it practical and still effective.
5. EMPIRICAL EVALUATION
To validate the techniques that we presented and to assess
the usefulness of using field data for impact analysis and
regression testing, we performed a set of empirical studies.
5.1 Experimental Infrastructure
For the studies, we used a real system: the Java Architecture for Bytecode Analysis (Jaba),⁵ a framework for analyzing Java programs that was developed in Java within our research group. Jaba consists of 550 classes, 2,800 methods, and approximately 60 KLOC. Jaba provides components that read
bytecode from Java class files and perform analyses such as
control flow and data flow.
We instrumented Jaba for different kinds of coverage and
released it to a set of users who agreed to have information
collected during execution.
We distributed the first release of the instrumented Jaba
to nine users, who used it for two months. This first release
helped us tune the approach in terms of instrumentation,
data collection, and interaction with the user’s platform [10].
Using the information that we obtained from this first release, we created a second instrumented version of Jaba,
and distributed it to 11 users. The studies reported in this
paper are based on the data collected using the second release of our tool. Five of the 11 users had already used
Jaba for their work (and were part of the first data collection experiment), whereas the other six users had just
started projects that involved the use of Jaba.
Seven of the eleven users involved in the studies are working in our lab: four are part of our group and use Jaba for
their research; another two are students working in our department who use Jaba for two graduate-level projects; the
last one is a Ph.D. student who is using a regression testing
tool built on top of Jaba. The remaining four users are
three researchers and a student working in three different
universities, one of which is abroad.
When the users run Jaba, dynamic information is collected and sent back to a server in our lab that continuously collects and stores this information. The information
includes profiling and coverage data at block and method
levels. The data for the different executions are stored in a
database and can be retrieved at various levels of granularity
and aggregation (e.g., per user, per day, or per execution).
To instrument and collect the data, we used our Gammatella tool [10]. When instrumenting, the tool also includes in the program the network-communication code that
is used to send data back to our central server. On the
server side, the tool performs both the data-collection and
the data-storage tasks.
⁵ http://www.cc.gatech.edu/aristotle/Tools/jaba.html
Using Gammatella, we gathered data for 18 weeks, during which we collected approximately 2,000 executions for
both versions of Jaba. The data set that we use for the
studies consists of the 1,100 executions collected during the
ten weeks after the release of the second version of Jaba.
Although we collected coverage information at both block
and method levels, for the empirical studies we used only
method-level data. We made this decision for two main
reasons: (1) instrumentation to collect coverage information
at the method level has lower overhead than instrumentation
for block coverage and is more likely to be used in practice,
and (2) results at the method level are less precise than
results at the block level (i.e., good results at the method
level imply as good, if not better, results at the block level).
5.2 Study 1: Impact Analysis Using Field Data
The goal of the first study is twofold: (1) to assess whether
using field data, instead of synthetic data, can yield different
analysis results, and (2) to assess the effectiveness of our
technique over traditional approaches to impact analysis.
To achieve our goal, we performed an experiment in which
we compared the results of performing predictive impact
analysis using four techniques:
Dynamic impact analysis using field data. Developer
D applies our technique for predictive impact analysis (Section 3.1) and leverages field data; the field data used are
described in Section 5.1. We refer to this technique as the
FIELD technique.
Dynamic impact analysis using in-house data. D uses
our technique for predictive impact analysis, except that
D uses in-house data (i.e., coverage data for the internal
regression test suite). As a regression test suite, we used the
test suite that has been developed for Jaba over the years.
We refer to this technique as the IN-HOUSE technique.
Transitive closure on the call graph. Given C, expressed in terms of changed methods, D estimates the impact set by computing a transitive closure on the call graph
starting from nodes that correspond to changed methods.
We refer to this technique as the CALL-GRAPH technique.
Static slicing. D estimates the impact set by performing
forward slicing from the change points. In the study, we approximate a forward slice by computing simple reachability
on the interprocedural control-flow graph. We refer to this
technique as the SLICING technique.
To compare these techniques, we use the size of the impact sets they compute; a similar approach was used by
Law and Rothermel [9]. Comparing the impact sets computed by techniques FIELD and IN-HOUSE lets us assess
the effect of using field data on the results of the analysis
in a controlled way: the only parameter that varies between
the two techniques is the set of execution data considered.
Comparing the impact sets computed by technique FIELD
and techniques CALL-GRAPH and SLICING lets us assess
the results of our technique compared to traditional impact-analysis approaches.
In the experiment, the impact analysis techniques considered are the independent variables, whereas the resulting impact sets computed by each technique are the dependent variables. A parameter of the experiment is the set of
changes C considered. We performed two instances of the
experiment, using different sets of changes.
Table 1: Summary data for the comparison of techniques FIELD (FL), IN-HOUSE (IH), CALL-GRAPH (CG), and SLICING (SL), performed considering one change per method at a time.

      FL      IH      CG      SL       FL/IH  FL-IH  IH-FL   FL/CG   FL/SL
AVG   257.39  330.75  71.42   1974.91  0.88   34.95  108.31  70.08   0.11
STD   341.39  378.93  272.85  1036.38  0.3    78.56  204     156.78  0.17
MAX   802     806     1763    2531     2.09   676    761     802     1

5.2.1 Experiment with Single Changes
In the first experiment, we considered a set of 2,800 changes,
each consisting of one change per method.⁶ For each change,
we computed the impact sets for the program using the four
techniques and compared the results. Because of the size
of the results, we cannot represent them in tabular form.
Thus, we provide a summary of the results in Table 1.
In the table, we report a number of measures. FL, IH, CG,
and SL are the sizes of the impact sets computed by the
FIELD, IN-HOUSE, CALL-GRAPH, and SLICING techniques, respectively. FL/IH is the ratio of the size of the
impact set computed by technique FIELD to the size of the
impact set computed by technique IN-HOUSE; FL/CG and
FL/SL are defined analogously. FL-IH is the set difference
between the impact set computed by technique FIELD and
the impact set computed by technique IN-HOUSE (i.e., the
number of methods that are considered impacted by technique FIELD, but are not considered impacted by technique
IN-HOUSE). IH-FL is defined analogously.
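For clarity, the sketch below shows how these measures could be derived from two impact sets; it assumes the impact sets are available as sets of method names and is purely illustrative.

from typing import Dict, Set

def impact_set_measures(fl: Set[str], ih: Set[str]) -> Dict[str, float]:
    # FL, IH: impact-set sizes; FL/IH: their ratio;
    # FL-IH and IH-FL: sizes of the corresponding set differences.
    return {
        "FL": len(fl),
        "IH": len(ih),
        "FL/IH": len(fl) / len(ih) if ih else float("nan"),
        "FL-IH": len(fl - ih),
        "IH-FL": len(ih - fl),
    }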
For each measure, we report the following information:
AVG, the average value of the measure computed for each
of the 2,800 changes considered; STD, its standard deviation; and MAX, the maximum value assumed. For example,
row AVG for column FL-IH shows the average number of
methods that are in the impact set produced by technique
FIELD, but not in the impact set produced by technique
IN-HOUSE, computed over the 2,800 changes considered.
This experiment is significant because it considers a complete distribution of changes within a program. However,
the experiment is representative only of situations in which
all changes to the program consist of only one entity-level
change and are independent from one another.
In most cases, changes consist of multiple correlated entity-level changes—a change in method m1 may require a change
in method m2 to be implemented, and so on. In those cases,
the developer must estimate the joint impact of multiple
entity-level changes at once. To address those cases, we
performed a second experiment.
5.2.2 Experiment with Real Changes
In this experiment, we address the case of multiple correlated entity-level changes that must be performed together.
To ensure the meaningfulness of the considered changes,
instead of randomly aggregating method-level changes, we
used a set of real changes for the subject program. To this
end, we extracted the last 21 versions of Jaba from our
CVS repository. For each (version, subsequent-version) pair
(vi ,vi+1 ) of Jaba, we identified the changes between the two
versions and, for each change, (1) mapped it to the method
m containing the change, and (2) added m to the set of
changes Ci . The resulting sets of 20 changes, C1 to C20 , are
the sets that we used for our experiment.
⁶ Because the experiment was conducted at the method level, the location of the change within the method is irrelevant.
Table 2 shows the number of methods changed for each
of the 20 sets. As the table shows, the number of methods
changed ranges from a minimum of 2, for change sets C6
and C16 , to a maximum of 178, for change set C8 .
The results of computing the impact sets using the four
considered techniques are shown in Table 3. The measures
reported in each column of Table 3 are the same as those
reported in Table 1, except that the first column, C, shows
the set of changes considered. Moreover, because for this
experiment we consider only 20 (instead of 2,800) sets of
changes, in the table we show both the results per set of
changes and the summary results.
5.2.3 Discussion
We discuss the results of this first study by addressing its
two goals separately.
The first goal was to assess whether the use of field data,
rather than synthetic data, can yield different analysis results. To this end, we consider the differences between the
results of techniques FIELD and IN-HOUSE. The data presented in Tables 1 and 3 clearly show that the results of
the analysis are affected significantly by the kind of data
considered.
Consider, for example, the results in Table 3: for 18 of
the 20 changes (all but C5 and C16 ), a significant number of
methods (68–139) are in the impact set computed by FIELD
but not in the impact set computed by IN-HOUSE and vice versa (97–122). Similar results can be observed in Table 1.
Also in this case, techniques FIELD and IN-HOUSE compute on average fairly dissimilar impact sets.
Considering Table 3, note that, for all change sets considered, the sizes of the impact sets computed by techniques
FIELD and IN-HOUSE are almost identical—what differs
is the composition of those sets. For the results in Table 1,
the situation is not as extreme: the average value for FL/IH
is 0.88, and its standard deviation is 0.3.
Note also that the above results are not due to the fact
that the sets of entities covered by the in-house test suite
and by the field executions are mostly disjoint. In fact,
the internal test suite and the field executions both cover
approximately 65% of the code and their coverage sets have
an 85% overlap.
The second goal of the study was to assess the effectiveness of our technique compared to traditional approaches
to impact analysis. The results in the tables show that our
technique FIELD computes impact sets that are considerably and consistently smaller (11% and 30% on average, for
the two experiments) than technique SLICING. Because,
unlike technique SLICING, technique FIELD is unsafe, we
cannot generally claim that FIELD produces better results
than SLICING. However, when the user population we considered is a good representative of the actual user population, which should be the case when using stable field data, technique FIELD provides, by definition, accurate information on the actual impact of the changes.
Table 2: Number of methods changed in the sets of real versions considered.

C sets     C1  C2  C3  C4  C5  C6  C7  C8   C9  C10  C11  C12  C13  C14  C15  C16  C17  C18  C19  C20
# methods  15  3   15  6   3   2   3   178  95  12   87   28   6    61   22   2    61   5    6    89
Table 3: Results for the comparison of techniques FIELD (FL), IN-HOUSE (IH), CALL-GRAPH (CG), and SLICING (SL), performed considering real changes.

C     FL      IH      CG      SL      FL/IH  FL-IH   IH-FL  FL/CG     FL/SL
C1    776     784     1608    2519    0.99   96      104    0.4876    0.31
C2    778     771     1688    2519    1.01   116     109    0.4568    0.31
C3    778     784     1546    2519    0.99   110     116    0.5071    0.31
C4    780     778     56      2522    1      112     110    13.8929   0.31
C5    617     617     10      2521    1      0       0      61.7000   0.24
C6    750     765     1546    2519    0.98   86      101    0.4948    0.30
C7    791     796     7       2519    0.99   97      102    113.7143  0.31
C8    806     794     1896    2617    1.02   126     114    0.4188    0.31
C9    822     785     1751    2551    1.05   139     102    0.4483    0.32
C10   789     800     1547    2519    0.99   111     122    0.5171    0.31
C11   737     766     1833    2528    0.96   68      97     0.4179    0.29
C12   802     797     1593    2522    1.01   113     108    0.5003    0.32
C13   805     788     1575    2519    1.02   120     103    0.5003    0.32
C14   797     784     1724    2524    1.02   122     109    0.4548    0.32
C15   773     751     35      2525    1.03   127     105    21.4571   0.30
C16   0       0       3       2519    1      0       0      0.0000    0.00
C17   790     767     1625    2523    1.03   130     107    0.4720    0.31
C18   753     759     1546    2519    0.99   98      104    0.4909    0.30
C19   763     761     322     2520    1      99      97     2.3634    0.30
C20   819     793     1561    2522    1.03   131     105    0.5080    0.32
AVG   736.3   732     1173.6  2527.3  1.01   100.05  95.75  10.99     0.29
STD   174.11  172.14  729.68  21.73   0.02   37.22   32.46  27.37     0.07
MAX   822     800     1896    2617    1.05   139     122    113.71    0.32
As far as the comparison with technique CALL-GRAPH
is concerned, the results in Table 3 show that, in five of
20 cases—for change sets C4 , C5 , C7 , C15 , C19 —technique
FIELD selected impact sets that are larger (in some cases
significantly larger) than the corresponding sets computed
by CALL-GRAPH. These results indicate that the impact
sets computed with the CALL-GRAPH technique can be
highly inaccurate and have little or nothing to do with the
behavior of the program. In the remaining 15 cases, technique FIELD selected impact sets of about half the size of
the corresponding sets computed by CALL-GRAPH. The
results in Table 1 are similar in nature and show an even
larger difference between the results for the two techniques.
Also in this case, we cannot draw a general conclusion that
technique FIELD computes more accurate results than technique CALL-GRAPH. Nevertheless, we can conclude that
technique CALL-GRAPH is not likely to provide a good estimate when we want to assess the effect of changes based
on the program usage.
Our findings that static-analysis-based techniques, such as
SLICING and CALL-GRAPH, can produce overestimates
and, in the case of CALL-GRAPH, unsafe estimates, are
consistent with the results obtained by Law and Rothermel [9].
5.3 Study 2: User-Sensitive Impact Analysis
The goal of the second study is to assess the degree of
variation among the different users in the way they use the
program. The presence of different profiles among the users
is an important indicator of the usefulness of using field data:
it is difficult (if possible at all) to recreate the variety in the user population with synthetic data produced in-house, and such variety may considerably affect the results of dynamic analysis.
To reach our goal, we conducted the following experiment: we computed the impact of each change set considered
for the first experiment of Study 1 (i.e., 2,800 independent
changes, one per method), on the user population, in terms
of CP, PPU, and PAU (Section 3.2). For space reasons, we
show the results only for CP and PAU.
Figures 4 and 5 contain scatter plots that show the distribution of CP and PAU, respectively, when different methods
are modified. The horizontal axes represent the identifier of
the changed method, ranging from 1 to 2,800. The vertical
axes represent the values of CP and PAU, respectively.
As the figures show, both CP and PAU vary dramatically
depending on the location of the change, and both measures
appear evenly distributed in the data space. This result
indicates a considerable variety in users’ behaviors, which
can make it difficult to create synthetic data to exercise the
program in the same way in which users exercise it.
To gather evidence of this difficulty, we performed an additional study: we measured the differences in the value of
CP computed using our in-house data (i.e., coverage data
for the internal regression test suite) and the field data. For
each of the 2,800 changes considered, we compared the value
of CP. The difference between CP computed using in-house
and field data was 15% on average, with a standard deviation of 10 and a maximum of 52%. These results also show
that in-house data may provide a poor approximation of
field data. For example, if we use the number of executions
affected by a change as an estimate of the impact of such a
change, we compute results that can differ significantly for
field and in-house data.
Note that, for the study, we considered changes at the method level, and thus we computed an overestimate of the actual impact. If the real change is in a part of the method that is not in the main path within the method (i.e., the change is not dominated by the method's entry), then the number of affected executions, and possibly users, would in general be lower than what we have found, and the distribution of the impact information even more varied.

[Figure 4: Distribution of CP based on the location of the changes. Scatter plot; x-axis: method changed (1–2,800); y-axis: percentage of executions affected.]

[Figure 5: Distribution of PAU based on the location of the changes. Scatter plot; x-axis: method changed (1–2,800); y-axis: number of users affected.]

Table 4: Number of critical methods for 20 real changes (the total number of methods is 2,800).

Change    C1      C2      C3      C4      C5      C6    C7      C8    C9     C10
Avg #CM   361.43  48.33   223.13  272.5   331     145   72      19    301    84.58
Change    C11     C12     C13     C14     C15     C16   C17     C18   C19    C20
Avg #CM   0.95    204.93  232.83  127.19  304.59  0     215.26  74.4  99.17  96.14
Although we performed this study only to assess the variety in the users’ behavior, the results provided us with
some interesting insights. For example, we found that CP,
PPU, and PAU are complementary measures: in Jaba, our
experimental subject, there are several methods for which
a change in the method affects less than 15% of the executions (low CP), but affects 100% of the users (high PAU).
For another example, we realized that there are many cases
in which almost no users are affected at all by the changes.
5.4 Study 3: Regression Testing Using Field Data
Our third study is divided into two parts with distinct goals.
The goal of the first part of the study is to assess whether
the use of field data instead of in-house data actually yields
different requirements for regression testing. To this end, we
computed the set of critical entities (methods, in this case)
for the 20 real changes considered in Study 1 and for our
internal regression test suite.
Table 4 shows the results of this study. In the table we report, for each change C, the average number of critical methods for that change (Avg #CM ). We compute Avg #CM by
averaging the number of critical methods in the CE[] sets
for C, obtained using the algorithm presented in Section 4.
The table provides additional evidence of the importance
of using field data. For most changes, the number of critical
methods is fairly high and for some changes it is extremely
high. Consider, for example, changes C1, C5, C9, and C15,
for which the number of critical methods is greater than
300. For those changes, there are more than 300 methods
that may be executed in the field by executions that traverse
at least one change and that our regression test suite does
not adequately exercise. The data show that there are only a
few cases in which the in-house regression testing adequately
exercises the program with respect to its use in the field.
In most cases, the complexity of the program under test
makes it impractical to develop a generally adequate set of
tests. Because the number of critical entities is a measure
of the adequacy of the (regression) testing performed in-house, the use of field data can help testing by providing
precise directives on where to improve existing test suites.
In fact, based on the collected field data, we are extending
our internal test suite to (re)test Jaba similarly to the way
it is actually used.
The goal of the second part of the study is to assess the
imprecision introduced by the use of coverage information
for P to estimate information on P′ (see Section 4). To
achieve our goal, we studied how the coverage at the method
level changed among a set of versions of Jaba. We first
selected 11 versions of Jaba and ran all versions against a
set of 200 test cases, while collecting coverage information
at the method level. Then, for each (version, consecutive-version) pair (vi, vi+1) of Jaba and for each test case, we
compared the coverage for the two versions; to compare the
coverage, we assumed that methods with the same fully qualified name and signature corresponded to each other.
Finally, we averaged the number of methods covered in vi
but not in vi+1 over all test cases.
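A minimal sketch of this per-test-case comparison follows, assuming coverage is recorded as sets of fully qualified method signatures keyed by test case; the function and parameter names are illustrative only.

from typing import Dict, Set

def avg_coverage_differences(cov_old: Dict[str, Set[str]],
                             cov_new: Dict[str, Set[str]]):
    # Avg #CB: methods covered in v_i but not in v_{i+1}, averaged over test cases;
    # Avg #CA: the converse. Methods are matched by fully qualified name and signature.
    tests = cov_old.keys() & cov_new.keys()
    avg_cb = sum(len(cov_old[t] - cov_new[t]) for t in tests) / len(tests)
    avg_ca = sum(len(cov_new[t] - cov_old[t]) for t in tests) / len(tests)
    return avg_cb, avg_ca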
The differences in coverage between versions provide an
indicator of the imprecision that we introduce by estimating
P 0 ’s coverage using P ’s coverage.
Table 5 shows the results of this study. For each version vi
of the program, the table reports the number of corresponding methods changed from vi to vi+1 (#MC ), the number
of test cases affected by the changes (#TCA), the average
number of corresponding methods covered in vi but not in
vi+1 (Avg #CB ), and the average number of corresponding
methods covered in vi+1 but not in vi (Avg #CA).
Table 5: Results for the estimate of coverage across 10 different versions of a program.

Ver   #MC  #TCA  Avg #CB (%)  Avg #CA (%)
v1    13   198   5 (0.31%)    0
v2    64   78    0            0
v3    11   198   0            0
v4    4    198   0            0
v5    4    198   1 (0.06%)    0
v6    12   198   1 (0.06%)    9 (0.54%)
v7    23   198   0            1 (0.06%)
v8    6    198   0            0
v9    1    21    0            0
v10   2    156   5 (0.3%)     1 (0.06%)
As the results show, the differences in coverage are negligible, even for changes that involve a number of methods
and are traversed by most test cases. In the worst case,
from v6 to v7 , the imprecision introduced by using coverage information collected on the old version of the program
is 10 methods, which is 0.6% of the total number of methods covered. Therefore, at least for the case we considered,
estimating coverage information does not introduce any considerable imprecision.
5.5 Threats to Validity
Like any empirical validation, ours has limitations. Some
threats to the validity of the studies are described along with
the studies. In the following, we discuss the limitations that
apply to the overall experimental design and setting.
Some limitations involve external validity. First, we have
considered the application of our techniques to a single program and test suite. Second, we have considered only a
limited number of users and, thus, collected only a small set
of field data. Therefore, we cannot claim generality for our
results. However, our subject program is a real program,
the test suite that we used for the experiments is the actual
regression test suite for the program, the users involved in
the experiment are real users, and the changes considered
are real. Nevertheless, additional studies with other subjects
are needed to address such questions of external validity.
Other limitations involve internal and construct validity.
We have approximated static slicing with reachability, which
may produce imprecise results. This imprecision generally
results in computing larger impact sets than those that slicing would compute, therefore affecting the results of the
comparison of the four techniques in Study 1. We have also
assumed that there is a given degree of stability in the users’
behavior over time. When we have collected enough historical data, we will need to study whether and to what extent
this assumption holds.
In short, our results support an existence argument: cases
exist in which it is feasible to use field data and their use
can produce benefits in impact analysis. Therefore, these
results motivate us to perform further research, followed by
carefully controlled experimentation, to investigate whether
such results will generalize.
6. RELATED WORK
To the best of our knowledge, our work represents the
first attempt at using field data to directly support impact
analysis and regression testing. However, other researchers
have investigated the idea of performing quality-assurance
activities, such as analysis and testing, after deployment.
The Perpetual Testing project recognizes the need to develop “seamless, perpetual analysis and testing of software
through development, deployment and evolution,” and proposes Residual Testing [13]. Although related, Residual
Testing uses field data with a different goal: continuously
monitoring for fulfillment of test obligations that were not
satisfied in the development environment.
Another approach is Expectation-Driven Event Monitoring (EDEM). EDEM uses software agents deployed over the
Internet to collect application-usage data to help developers
improve the fit between application design and use [7]. This
approach addresses the problem of monitoring deployed software and collecting field data, but it focuses on the humancomputer-interaction aspects of the problem.
Other related work does not consider the use of field data,
but it performs similar kinds of analyses.
Srivastava and Thiagarajan [18] present a system, Echelon, for prioritizing the set of test cases to be rerun on
an application during regression testing. Their technique is
based on identifying changes and mappings between old and
new versions of a program and on coverage estimation.
Law and Rothermel [9] define a technique for impact analysis based on executing a program with a set of inputs, collecting compressed traces for those inputs, and using the
traces to predict impact sets. The technique can improve
the accuracy and the precision of existing techniques based
on static analysis, but there is no evidence that it can be
efficiently used in the field.
7. CONCLUSION
In this paper, we have presented the results of our investigation of how field data can be leveraged to support
and improve maintenance tasks. In particular, we focused
on impact analysis and regression testing. We defined new
techniques to perform those two tasks based on field data
and we performed an empirical evaluation of our approach.
Although preliminary, the results of our empirical evaluation are significant because they were obtained on a real
subject distributed to real users. These results are promising in that they show the twofold importance of field data:
(1) real users are different from simulated users, and (2) real
users are different from one another. Both aspects are generally not captured by in-house data; the use of such data
can therefore hamper the effectiveness of dynamic analyses.
Importantly, the studies performed and their results fueled further research, by suggesting a number of directions
for future work.
First, we will perform studies on the stability of users’ behavior. One of the underlying assumptions of our approach
is that the users’ behavior is stable, and thus historical field
data provide useful information for the future. The data
that we have so far are too limited to perform a study and
assess whether our assumption holds. Therefore, we will
expand our user population by releasing Jaba to a larger
number of users. We will use the field data collected on a
larger number of users and for a larger period of time to
study the dynamics of the user population and its behavior.
Second, we will investigate the use of statistical analysis to
perform clustering on the field data and to study whether it
is possible to identify discriminating characteristics among
users. We are especially interested in the use of these techniques to perform anomaly detection of deployed software.
We will also investigate the use of clustering to identify representative executions or users. The ability to do so would
be extremely useful for testing and analysis tasks (e.g., it
may allow the selection of a small set of test cases that approximate well the users’ behavior).
Third, we will continue to investigate the use of field data
for impact analysis and regression testing. We will verify
whether the impact sets computed by our technique reflect
the actual impact of changes in the field. To this end, we
will apply our technique to estimate the impact of future
changes to Jaba and use the data gathered from the field
after the new releases to assess how good our estimate is.
We will also experiment with prioritizing test cases based on
information on critical entities and study how this approach
affects the fault-detection capabilities of regression testing.
Fourth, we will investigate which other execution-related
information may be leveraged for dynamic analysis tasks. In
particular, we will consider information that is not control-flow related, such as memory consumption or flow of data in
the program, which may provide important information on
how a program is used in the field.
Fifth, based on our findings when we start collecting data
from an increasing number of users, we will investigate scalability issues. For example, we might discover that collecting
complete information from all users is not a viable solution
when a large number of users are involved, and thus define sampling techniques. For another example, we may discover
that monitoring at the site level, rather than at the user
level, lets us perform more efficient data collection.
Finally, we will investigate efficient mechanisms to record
users’ actions and inputs to enable recreation, at least partially, of users’ execution in-house (e.g., to augment a regression test suite). This challenging problem involves both
static analysis, to identify parts of the code where inputs are
received and that are good candidates for instrumentation,
and dynamic analysis, to collect input data.
Acknowledgments
This work was supported in part by National Science Foundation awards CCR-9988294, CCR-0096321, CCR-0205422,
SBE-0123532 and EIA-0196145 to Georgia Tech, and by the
State of Georgia to Georgia Tech under the Yamacraw Mission. Anil Chawla helped with the development of Gammatella and its use for our experiments. Saurabh Sinha
and the anonymous reviewers provided valuable feedback
on a previous version of the paper.
8. REFERENCES
[1] T. Apiwattanapong and M. J. Harrold. Selective path profiling. In Proc. of the ACM Workshop on Program Analysis for Software Tools and Engineering, pages 35–42, Nov 2002.
[2] R. S. Arnold and S. A. Bohner. Impact analysis - towards a framework for comparison. In Proc. of the International Conf. on Software Maintenance, pages 292–301, Sep 1993.
[3] S. Bohner and R. Arnold. Software Change Impact Analysis. IEEE Computer Society Press, Los Alamitos, CA, USA, 1996.
[4] J. Bowring, A. Orso, and M. J. Harrold. Monitoring deployed software using software tomography. In Proc. of the ACM Workshop on Program Analysis for Software Tools and Engineering, pages 2–8, Nov 2002.
[5] Y. F. Chen, D. S. Rosenblum, and K. P. Vo. TestTube: A system for selective regression testing. In Proc. of the 16th International Conf. on Software Engineering, pages 211–222, May 1994.
[6] M. J. Harrold, J. Jones, T. Li, D. Liang, A. Orso, M. Pennings, S. Sinha, S. Spoon, and A. Gujarathi. Regression test selection for Java software. In Proc. of the ACM Conf. on Object-Oriented Programming, Systems, Languages, and Applications, pages 312–326, Oct 2001.
[7] D. M. Hilbert and D. F. Redmiles. Extracting usability information from user interface events. ACM Computing Surveys, 32(4):384–421, Dec 2000.
[8] J. Larus. Whole program paths. In Proc. of the ACM Conf. on Programming Language Design and Implementation, pages 1–11, May 1999.
[9] J. Law and G. Rothermel. Whole program path-based dynamic impact analysis. In Proc. of the 25th International Conf. on Software Engineering, pages 308–318, May 2003.
[10] A. Orso, J. Jones, and M. J. Harrold. Visualization of program-execution data for deployed software. In Proc. of the ACM Symposium on Software Visualization, pages 67–76, Jun 2003.
[11] A. Orso, D. Liang, M. J. Harrold, and R. Lipton. Gamma system: Continuous evolution of software after deployment. In Proc. of the International Symposium on Software Testing and Analysis, pages 65–69, Jul 2002.
[12] A. Orso, A. Rao, and M. J. Harrold. A technique for dynamic updating of Java software. In Proc. of the IEEE International Conf. on Software Maintenance, pages 649–658, Oct 2002.
[13] C. Pavlopoulou and M. Young. Residual test coverage monitoring. In Proc. of the 21st International Conf. on Software Engineering, pages 277–284, May 1999.
[14] G. Rothermel and M. J. Harrold. Selecting tests and identifying test coverage requirements for modified software. In Proc. of the ACM International Symposium on Software Testing and Analysis, pages 169–184, Aug 1994.
[15] G. Rothermel and M. J. Harrold. A safe, efficient regression test selection technique. ACM Transactions on Software Engineering and Methodology, 6(2):173–210, Apr 1997.
[16] G. Rothermel, R. H. Untch, C. Chu, and M. J. Harrold. Prioritizing test cases for regression testing. IEEE Transactions on Software Engineering, 27(10):929–948, Oct 2001.
[17] B. G. Ryder and F. Tip. Change impact analysis for object-oriented programs. In Proc. of the ACM Workshop on Program Analysis for Software Tools and Engineering, pages 46–53, Jun 2001.
[18] A. Srivastava and J. Thiagarajan. Effectively prioritizing tests in development environment. In Proc. of the International Symposium on Software Testing and Analysis, pages 97–106, Jul 2002.
[19] F. Tip. A survey of program slicing techniques. Journal of Programming Languages, 3:121–189, 1995.
[20] F. Vokolos and P. Frankl. Pythia: A regression test selection tool based on text differencing. In Proc. of the International Conf. on Reliability, Quality, and Safety of Software Intensive Systems, May 1997.