Metamorphic Testing: Techniques to Detect Defects in Applications without Test Oracles
Christian Murphy
Thesis Defense, April 12, 2010
Overview

- Software testing is important!
- Certain types of applications are particularly hard to test because there is no "test oracle": Machine Learning, Discrete Event Simulation, Optimization, Scientific Computing, etc.
- Even when there is no oracle, it is possible to detect defects if properties of the software are violated.
- My research introduces and evaluates new techniques for testing such "non-testable programs" [Weyuker, Computer Journal '82]
2
Motivating Example: Machine Learning
3
Motivating Example: Simulation

[Chart: "Length of Stay versus Utilization" for a hospital ER simulation; x-axis: number of beds, left y-axis: units of time (LOS), right y-axis: percent utilization; series: LOS, Doctor Utilization, Nurse Utilization, Triage Utilization, Clerk Utilization]
4
Problem Statement

- Partial oracles may exist for a limited subset of the input domain in applications such as Machine Learning, Discrete Event Simulation, Scientific Computing, Optimization, etc.
- Obvious errors (e.g., crashes) can be detected with certain inputs or testing techniques.
- However, it is difficult to detect subtle computational defects in applications without test oracles in the general case.
5
What do I mean by "defect"?

Deviation of the implementation from the specification; violation of a sound property of the software.

"Discrete localized" calculation errors:
- Off-by-one errors
- Incorrect sentinel values for loops
- Wrong comparison or mathematical operator

Misinterpretation of the specification:
- Parts of the input domain not handled
- Incorrect assumptions made about the input
6
Observation

- Many programs without oracles have properties such that certain changes to the input yield predictable changes to the output.
- We can detect defects in these programs by looking for any violations of these "metamorphic properties".
- This is known as "metamorphic testing" [T.Y. Chen et al., Info. & Soft. Tech. vol.44, 2002]
7
Research Goals

- Facilitate the way that metamorphic testing is used in practice
- Develop new testing techniques based on metamorphic testing
- Demonstrate the effectiveness of metamorphic testing techniques
8
Hypotheses

- For programs that do not have a test oracle, an automated approach to metamorphic testing is more effective at detecting defects than other approaches.
- An approach that conducts function-level metamorphic testing in the context of a running application will further increase the effectiveness.
- It is feasible to continue this type of testing in the deployment environment, with minimal impact on the end user.
9
Contributions

1. A set of guidelines to help identify metamorphic properties
2. New empirical studies comparing the effectiveness of metamorphic testing to other approaches
3. An approach for detecting defects in non-deterministic applications, called Heuristic Metamorphic Testing
4. A new testing technique based on function-level metamorphic properties, called Metamorphic Runtime Checking
5. A generalized technique for testing in the deployment environment, called In Vivo Testing
10
Outline

- Background: Related Work, Metamorphic Testing
- Metamorphic Testing Empirical Studies
- Metamorphic Runtime Checking
- Future Work & Conclusion
11
Other Approaches [Baresi & Young, 2001]

- Formal specifications: a complete specification is essentially a test oracle
- Embedded assertions: can check that the software behaves as expected
- Algebraic properties: used to generate test cases for abstract datatypes
- Trace checking & log file analysis: analyze intermediate results and the sequence of executions
12
Metamorphic Testing [Chen et al., 2002]

An initial test case x is run through the function f to produce f(x). A transformation function t, based on the metamorphic properties of f, turns x into a new test case t(x), which is run through f to produce f(t(x)). The outputs f(x) and f(t(x)) are "pseudo-oracles" for each other.

- If the new test case output f(t(x)) is as expected, it is not necessarily correct.
- However, if f(t(x)) is not as expected, either f(x) or f(t(x)) (or both!) is wrong.
13
Metamorphic Testing Example

Consider a function to determine the standard deviation of a set of numbers:

  Initial input:    a, b, c, d, e, f              ->  std_dev  ->  s
  New test case #1: c, e, b, a, f, d              ->  std_dev  ->  s?
  New test case #2: a+2, b+2, c+2, d+2, e+2, f+2  ->  std_dev  ->  s?
  New test case #3: 2a, 2b, 2c, 2d, 2e, 2f        ->  std_dev  ->  2s?
14
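A minimal runnable Java sketch of these three metamorphic tests (my own illustration; the stdDev implementation and the floating-point tolerance are assumptions, not the thesis code):

import java.util.Arrays;

public class StdDevMetamorphicTest {
    static final double EPS = 1e-9;

    // Function under test: population standard deviation.
    static double stdDev(double[] a) {
        double mean = Arrays.stream(a).average().orElse(0.0);
        double sumSq = Arrays.stream(a).map(x -> (x - mean) * (x - mean)).sum();
        return Math.sqrt(sumSq / a.length);
    }

    static void check(String name, double actual, double expected) {
        if (Math.abs(actual - expected) > EPS)
            System.out.println("Metamorphic property violated: " + name);
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
        double s = stdDev(x);

        // Test case #1 (permutative): permuting the input must not change the result.
        check("permute", stdDev(new double[]{3, 5, 2, 1, 6, 4}), s);

        // Test case #2 (additive): adding a constant to every element must not change it.
        check("add 2", stdDev(Arrays.stream(x).map(v -> v + 2).toArray()), s);

        // Test case #3 (multiplicative): doubling every element must double the result.
        check("times 2", stdDev(Arrays.stream(x).map(v -> 2 * v).toArray()), 2 * s);
    }
}

If stdDev is correct, no output is produced; a seeded defect (say, a wrong loop bound) would typically violate at least one of the three checks.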
Outline

- Background: Related Work, Metamorphic Testing
- Metamorphic Testing Empirical Studies
- Metamorphic Runtime Checking
- Future Work & Conclusion
15
Empirical Study

Is metamorphic testing more effective than other approaches in detecting defects in applications without test oracles?

Approaches investigated:
- Metamorphic Testing: using metamorphic properties of the entire application
- Runtime Assertion Checking: using Daikon-detected program invariants
- Partial Oracle: simple inputs for which the correct output can easily be determined
16
Applications Investigated

- Machine Learning
  - C4.5: decision tree classifier
  - MartiRank: ranking
  - Support Vector Machines (SVM): vector-based classifier
  - PAYL: anomaly-based intrusion detection system
- Discrete Event Simulation
  - JSim: used in simulating a hospital ER
- Information Retrieval
  - Lucene: the Apache framework's text search engine
- Optimization
  - gaffitter: genetic algorithm approach to the bin-packing problem
17
Methodology

Mutation testing was used to seed defects into each application (an illustrative mutant follows below):
- Comparison operators were reversed
- Math operators were changed
- Off-by-one errors were introduced

For each program, we created multiple versions, each with exactly one mutation. We ignored mutants that yielded obviously wrong outputs, caused crashes, etc. Effectiveness is determined by measuring what percentage of the mutants were "killed".
18
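For instance, an illustrative mutant of this kind (my example, not one of the actual seeded defects) reverses a single comparison operator, silently turning a maximum into a minimum while still producing plausible-looking output:

public class MutantExample {
    // Correct version returns the largest element; the seeded mutant
    // reverses ">" to "<", silently computing the minimum instead.
    static double max(double[] a) {
        double m = a[0];
        for (int i = 1; i < a.length; i++)
            if (a[i] > m)   // mutant: "a[i] < m"
                m = a[i];
        return m;
    }

    public static void main(String[] args) {
        System.out.println(max(new double[]{3, 1, 4, 1, 5}));  // 5.0; the mutant prints 1.0
    }
}

Without an oracle, 1.0 looks as plausible as 5.0, which is why such mutants are a good model of hard-to-detect defects.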
Experimental Results

[Bar chart: % of mutants killed (0-100) for C4.5, MartiRank, SVM, PAYL, JSim, Lucene, gaffitter, and TOTAL, comparing Partial Oracle, Runtime Assertion Checking, and Metamorphic Testing]
19
Analysis of Results

- Assertions are good for checking bounds and relationships, but not for changes to values.
- Metamorphic testing is particularly good at detecting errors in loop conditions.
- Metamorphic testing was not very effective for PAYL (5%) and gaffitter (33%): fewer properties were identified, and the defects had little impact on the output.
20
Outline

- Background: Related Work, Metamorphic Testing
- Metamorphic Testing Empirical Studies
- Metamorphic Runtime Checking
- Future Work & Conclusion
21
Metamorphic Runtime Checking

- Results of the previous study revealed limitations of scope and robustness in metamorphic testing.
- What if we consider the metamorphic properties of individual functions, and check those properties as the entire program is running?
- A combination of metamorphic testing and runtime assertion checking.
22
Metamorphic Runtime Checking

- The tester specifies the metamorphic properties of individual functions using a special notation in the code (based on JML).
- A pre-processor instruments the code with corresponding metamorphic tests.
- The tester runs the entire program as normal (e.g., to perform system tests).
- A violation of any property reveals a defect.
23
MRC Model of Execution

When a function f is about to be executed with input x in state S, the metamorphic test is conducted at the same point in the program execution as the original function call. The framework executes f(x) to get the result, and the program continues. The metamorphic test runs in parallel with the rest of the application, in a sandbox: transform the input to get t(x), execute f(t(x)), receive the original result, compare the outputs, and report any violations.
24
Empirical Study

Can Metamorphic Runtime Checking detect defects not found by system-level metamorphic testing?

- Same mutants used in the previous study: 29% were not found by metamorphic testing.
- Metamorphic properties were identified at the function level using the suggested guidelines.
25
Experimental Results

[Bar chart: % of defects detected (0-100) for C4.5, MartiRank, SVM, PAYL, JSim, Lucene, gaffitter, and TOTAL, broken down into "MRC only", "Both", and "MT only"]
26
Analysis of Results

- Scope: function-level testing allowed us to identify additional metamorphic properties and execute more tests.
- Robustness: metamorphic testing "inside" the application detected subtle defects that did not have much effect on the overall program output.
27
Combined Results

[Bar chart: % of mutants killed (0-100) for C4.5, MartiRank, SVM, PAYL, JSim, Lucene, gaffitter, and TOTAL, comparing Partial Oracle, Runtime Assertion Checking, Metamorphic Testing, and MT + MRC]
28
Outline

- Background: Related Work, Metamorphic Testing
- Metamorphic Testing Empirical Studies
- Metamorphic Runtime Checking
- Future Work & Conclusion
29
Results

- Demonstrated that metamorphic testing advances the state of the art in detecting defects in applications without test oracles.
- Showed that Metamorphic Runtime Checking reveals defects not found by using system-level properties.
- Showed that it is feasible to continue this type of testing in the deployment environment, with minimal impact on the end user.
30
Short-Term Opportunities

- Automatic detection of metamorphic properties, using dynamic and/or static techniques
- Fault localization: once a defect has been detected, figure out where it occurred and how to fix it
- Implementation issues: reducing overhead; handling external databases, network traffic, etc.
31
Long-Term Directions

- Testing of multi-process or distributed applications in these domains
- Collaborative defect detection and notification
- Investigating the impact on the software development processes used in the domains of non-testable programs
32
Contributions & Accomplishments

1. A set of metamorphic testing guidelines [Murphy, Kaiser, Hu, Wu; SEKE'08]
2. New empirical studies [Xie, Ho, Murphy, Kaiser, Xu, Chen; QSIC'09]
3. Heuristic Metamorphic Testing [Murphy, Shen, Kaiser; ISSTA'09]
4. Metamorphic Runtime Checking [Murphy, Shen, Kaiser; ICST'09]
5. In Vivo Testing [Murphy, Kaiser, Vo, Chu; ICST'09] [Murphy, Vaughan, Ilahi, Kaiser; AST'10]
33
Thank you!
34

Backup Slides!

Motivation
35
Assessment of Quality

- 1994: Hatton et al. pointed out a "disturbing" number of defects due to calculation errors in scientific computing software [TSE vol.20]
- 2007: Hatton reports that "many scientific results are corrupted, perhaps fatally so, by undiscovered mistakes in the software used to calculate and present those results" [Computer vol.40]
36
Complexity vs. Effectiveness

[Chart plotting testing approaches by complexity (x-axis) against effectiveness (y-axis): Metamorphic Runtime Checking, System-level Metamorphic Testing, Formal Specifications, Embedded Assertions, Algebraic Specifications, Trace Checking & Log Analysis]
37
Metamorphic Properties
38
Categories of Metamorphic Properties

- Additive: increase (or decrease) numerical values by a constant
- Multiplicative: multiply numerical values by a constant
- Permutative: randomly permute the order of elements in a set
- Invertive: negate the elements in a set
- Inclusive: add a new element to a set
- Exclusive: remove an element from a set
- Compositional: compose a set

(A code sketch of these transformations follows the list.)
39
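To make these categories concrete, here is a hedged Java sketch of the first six as reusable input transformations (illustrative helper names, not the thesis framework's API):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Transforms {
    // Additive: increase (or decrease) every value by a constant.
    static List<Double> add(List<Double> in, double c) {
        List<Double> out = new ArrayList<>();
        for (double v : in) out.add(v + c);
        return out;
    }

    // Multiplicative: multiply every value by a constant.
    static List<Double> multiply(List<Double> in, double c) {
        List<Double> out = new ArrayList<>();
        for (double v : in) out.add(v * c);
        return out;
    }

    // Permutative: randomly permute the order of the elements.
    static List<Double> permute(List<Double> in) {
        List<Double> out = new ArrayList<>(in);
        Collections.shuffle(out);
        return out;
    }

    // Invertive: negate every element.
    static List<Double> negate(List<Double> in) { return multiply(in, -1); }

    // Inclusive: add a new element to the set.
    static List<Double> include(List<Double> in, double v) {
        List<Double> out = new ArrayList<>(in);
        out.add(v);
        return out;
    }

    // Exclusive: remove the element at the given index from the set.
    static List<Double> exclude(List<Double> in, int index) {
        List<Double> out = new ArrayList<>(in);
        out.remove(index);
        return out;
    }
}

Each transformation is then paired, per property, with the change it predicts in the output (unchanged, scaled, negated, etc.).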
Sample Metamorphic Properties

1. Permuting the order of the examples in the training data should not affect the model.
2. If all attribute values in the training data are multiplied by a positive constant, the model should stay the same.
3. If all attribute values in the training data are increased by a positive constant, the model should stay the same.
4. Updating a model with a new example should yield the same model created with training data originally containing that example.
5. If all attribute values in the training data are multiplied by -1, and an example to be classified is also multiplied by -1, the classification should be the same.
6. Permuting the order of the examples in the testing data should not affect their classification.
7. If all attribute values in the training data are multiplied by a positive constant, and an example to be classified is also multiplied by the same positive constant, the classification should be the same.
8. If all attribute values in the training data are increased by a positive constant, and an example to be classified is also increased by the same positive constant, the classification should be the same.
40
Other Classes of Properties (1)

- Statistical: same mean, variance, etc. as the original
- Heuristic: approximately equal to the original
- Semantically Equivalent: domain specific
41
Other Classes of Properties (2)

- Noise Based: add or change data that should not affect the result
- Partial: a change to part of the input only affects part of the output
- Compositional: the new input relies on the original output, e.g.
  ShortestPath(a, b) = ShortestPath(a, c) + ShortestPath(c, b)
42
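The equality above holds only when c actually lies on a shortest path from a to b; the generally checkable form is the inequality d(a,b) <= d(a,c) + d(c,b) for every c. A minimal sketch of that check, with Floyd-Warshall standing in for the function under test (my illustration, not thesis code):

public class ShortestPathMetamorphicCheck {
    static final double EPS = 1e-9;

    // Stand-in for the function under test: all-pairs shortest paths.
    static double[][] allPairs(double[][] w) {
        int n = w.length;
        double[][] d = new double[n][n];
        for (int i = 0; i < n; i++) d[i] = w[i].clone();
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    d[i][j] = Math.min(d[i][j], d[i][k] + d[k][j]);
        return d;
    }

    // Compositional property, relaxed to an inequality that must hold
    // for EVERY intermediate vertex c; a violation reveals a defect.
    static boolean propertyHolds(double[][] d, int a, int b) {
        for (int c = 0; c < d.length; c++)
            if (d[a][b] > d[a][c] + d[c][b] + EPS) return false;
        return true;
    }

    public static void main(String[] args) {
        double INF = 1e9;
        double[][] w = {
            {0,   3,   INF, 7},
            {3,   0,   2,   INF},
            {INF, 2,   0,   1},
            {7,   INF, 1,   0}
        };
        double[][] d = allPairs(w);
        for (int a = 0; a < w.length; a++)
            for (int b = 0; b < w.length; b++)
                if (!propertyHolds(d, a, b))
                    System.out.println("Violation at (" + a + "," + b + ")");
    }
}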
Automatic Detection of Properties

- Static
  - Use machine learning to model what code that exhibits certain properties looks like, then determine whether other code matches that model
  - Use symbolic execution to check "algebraically"
- Dynamic
  - Observe multiple executions and infer properties (sketched below)
43
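A rough sketch of the dynamic approach (my own illustration, not the thesis prototype): run the function on many random inputs, apply candidate transformations, and report the candidate properties that were never violated.

import java.util.Arrays;
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

public class DynamicPropertyInference {
    static final double EPS = 1e-6;

    interface ArrayFunction { double apply(double[] a); }

    public static void main(String[] args) {
        // Function being observed: the range (max - min) of a data set.
        ArrayFunction f = a -> Arrays.stream(a).max().getAsDouble()
                             - Arrays.stream(a).min().getAsDouble();

        // Candidate properties: an element-wise input transform paired
        // with the output change it predicts.
        String[] names = {"f(x+5) == f(x)", "f(2x) == 2*f(x)"};
        DoubleUnaryOperator[] inputTransforms = {x -> x + 5, x -> 2 * x};
        DoubleUnaryOperator[] outputPredictions = {y -> y, y -> 2 * y};

        boolean[] heldOnAllTrials = {true, true};
        Random rng = new Random(42);
        for (int trial = 0; trial < 100; trial++) {
            double[] input = rng.doubles(10, -100, 100).toArray();
            double y = f.apply(input);
            for (int t = 0; t < names.length; t++) {
                double[] tx = Arrays.stream(input).map(inputTransforms[t]).toArray();
                if (Math.abs(f.apply(tx) - outputPredictions[t].applyAsDouble(y)) > EPS)
                    heldOnAllTrials[t] = false;   // candidate property violated
            }
        }
        for (int t = 0; t < names.length; t++)
            if (heldOnAllTrials[t])
                System.out.println("Inferred candidate property: " + names[t]);
    }
}

Properties inferred this way are only hypotheses (they held on the observed runs), so a tester would still need to confirm them.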
Automated Metamorphic Testing
44
Automated Metamorphic Testing

The tester specifies the application's metamorphic properties, and the test framework does the rest (a toy harness is sketched below):
- Transform the inputs
- Execute the program with each input
- Compare the outputs according to the specification
45
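A toy version of such a framework might look as follows (a sketch of the idea only, not the actual implementation): the tester supplies (transformation, output relation) pairs, and the harness transforms, re-executes, and compares.

import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;
import java.util.function.Function;
import java.util.function.UnaryOperator;

public class MetamorphicHarness {

    // One specified property: an input transformation plus the relation
    // that must hold between the original and follow-up outputs.
    record Property<I, O>(String name, UnaryOperator<I> transform,
                          BiPredicate<O, O> outputsRelated) {}

    // The framework "does the rest": transform, re-execute, compare.
    static <I, O> void run(Function<I, O> program, I input,
                           List<Property<I, O>> properties) {
        O original = program.apply(input);
        for (Property<I, O> p : properties) {
            O followUp = program.apply(p.transform().apply(input));
            if (!p.outputsRelated().test(original, followUp))
                System.out.println("Property violated: " + p.name());
        }
    }

    public static void main(String[] args) {
        Function<double[], Double> sum = a -> Arrays.stream(a).sum();
        Property<double[], Double> doubling = new Property<>(
            "sum(2x) == 2*sum(x)",
            a -> Arrays.stream(a).map(v -> 2 * v).toArray(),
            (o, n) -> Math.abs(n - 2 * o) < 1e-9);
        run(sum, new double[]{1, 2, 3}, List.of(doubling));
    }
}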
AMST Model
46
Specifying Metamorphic Properties
47
Heuristic Metamorphic Testing
48
Statistical Metamorphic Testing

- Introduced by Guderlei & Mayer in 2007
- The application is run multiple times with the same input to get a mean value μ0 and variance σ0
- Metamorphic properties are applied
- The application is run multiple times with the new input to get a mean value μ1 and variance σ1
- If the means are not statistically similar, then the property is considered violated
49
Heuristic Metamorphic Testing

Used when we expect that a change to the input will produce "similar" results, but cannot determine the expected similarity in advance:
- Use input X to generate outputs M1 through Mk
- Use some metric to create a profile of the outputs
- Use input X' (created according to a metamorphic property) to generate outputs N1 through Nk
- Create a profile of those outputs
- Use statistical techniques (e.g., Student's t-test) to check that the profile of outputs N is similar to that of outputs M (a sketch follows below)
50
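A hedged sketch of the statistical comparison step, assuming equal-size independent samples and Welch's t-statistic with a fixed critical value (about 2.0 for a two-tailed test at roughly 95% confidence with moderate degrees of freedom; a real implementation would look up the critical value for the actual degrees of freedom):

public class HeuristicMetamorphicCheck {

    static double mean(double[] xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s / xs.length;
    }

    static double variance(double[] xs) {
        double m = mean(xs), s = 0;
        for (double x : xs) s += (x - m) * (x - m);
        return s / (xs.length - 1);   // sample variance
    }

    // Welch's t-statistic for two independent samples.
    static double tStatistic(double[] a, double[] b) {
        double se = Math.sqrt(variance(a) / a.length + variance(b) / b.length);
        return (mean(a) - mean(b)) / se;
    }

    // True if the two output profiles are statistically similar,
    // i.e., the metamorphic property is NOT considered violated.
    static boolean profilesSimilar(double[] original, double[] followUp) {
        final double CRITICAL = 2.0;   // assumed ~95% two-tailed critical value
        return Math.abs(tStatistic(original, followUp)) < CRITICAL;
    }

    public static void main(String[] args) {
        double[] m = {9.8, 10.1, 10.0, 9.9, 10.2};   // profile of outputs M1..Mk
        double[] n = {10.0, 9.9, 10.1, 10.3, 9.8};   // profile of outputs N1..Nk
        System.out.println(profilesSimilar(m, n)
            ? "Property holds (profiles similar)"
            : "Property violated");
    }
}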
Heuristic Metamorphic Testing

[Diagram: input x is run through the non-deterministic function nd_f n times to produce outputs y1…yn, which form one profile; the transformed input t(x) is run through nd_f n times to produce outputs y'1…y'n, which form a second profile. Do the profiles demonstrate the expected relationship?]
51
HMT Example

[Diagram: a data set containing known values (1-4) and missing values ("?") is sorted, producing profile P; a permuted copy of the data set is sorted, producing profile P'. Each profile is built based on normalized equivalence, and the two profiles are compared statistically.]
52
HMT Empirical Study

Is Heuristic Metamorphic Testing more effective than other approaches in detecting defects in non-deterministic applications without test oracles?

Approaches investigated:
- Heuristic Metamorphic Testing
- Embedded Assertions
- Partial Oracle

Applications investigated:
- MartiRank: sorting sparse data sets
- JSim: non-deterministic event timing
53
HMT Study Results & Analysis

- Heuristic Metamorphic Testing killed 59 of the 78 mutants.
- The partial oracle and assertion checking were ineffective for JSim because no single execution was outside the specified range.
54
Metamorphic Runtime Checking
55
Extensions to JML
56
Creating Test Functions

/*@
  @meta std_dev(\multiply(A, 2)) == \result * 2
  */
public double __std_dev(double[] A) {
    ...
}

protected boolean __MRCtest0_std_dev(double[] A, double result) {
    return Columbus.approximatelyEqualTo(
        __std_dev(Columbus.multiply(A, 2)), result * 2);
}
57
Instrumentation

public double std_dev(double[] A) {
    // call original function and save result
    double result = __std_dev(A);
    // create sandbox
    int pid = Columbus.createSandbox();
    // program continues as normal
    if (pid != 0) return result;
    else {  // run test in child process
        if (!__MRCtest0_std_dev(A, result))
            Columbus.fail();  // handle failure
        Columbus.exit();      // clean up
    }
}
58
MRC: Case Studies

- We investigated the WEKA and RapidMiner toolkits for Machine Learning in Java.
- For WEKA, we tested four apps: Naïve Bayes, Support Vector Machines (SVM), C4.5 Decision Tree, and k-Nearest Neighbors.
- For RapidMiner, we tested one app: Naïve Bayes.
59
MRC: Case Study Setup

- For each of the five apps, we specified 4-6 metamorphic properties of selected methods (based on our knowledge of the expected behavior of the overall application).
- Testing was conducted using data sets from the UCI Machine Learning Repository.
- The goal was to determine whether the properties held as expected.
60
MRC: Case Study Findings

- Discovered defects in WEKA k-NN and WEKA Naïve Bayes related to modifying the machine learning "model": the result of a variable not being updated appropriately.
- Discovered a defect in RapidMiner Naïve Bayes related to determining confidence: there was an error in the calculation.
61
Metamorphic Testing Experimental Study
62
Approaches Not Investigated

- Formal specification: issues related to completeness; previous work converted specifications to invariants
- Algebraic properties: not appropriate at the system level; automatic detection only supported in Java
- Log/trace file analysis: needs more detailed knowledge of the implementation
- Pseudo-oracles: none appropriate for the applications investigated
63
Methodology: Metamorphic Testing

Each variant (containing one mutation) acted as a pseudo-oracle for itself:
- The program was run to produce an output with the original input dataset
- Metamorphic properties were applied to create new input datasets
- The program was run on the new inputs to create new outputs
- If the outputs were not as expected, the mutant had been killed (i.e., the defect had been detected)
64
Methodology: Partial Oracle

- Data sets were chosen so that the correct output could be calculated by hand.
- These data sets were typically smaller than the ones used for the other approaches.
- To ensure fairness, the data sets were selected so that the line coverage was approximately the same for each approach.
65
Methodology: Runtime Assertion Checking

- Daikon was used to detect program invariants in the "gold standard" implementation.
- Because Daikon can generate spurious invariants, the programs were run with a variety of inputs, and obviously spurious invariants were discarded.
- The invariants were then checked at runtime.
66
Defects Detected in Study #1
67
Study #1: SVM Results

- Permuting the input was very effective at killing off-by-one mutants.
- Many functions in SVM analyze a set of numbers (mean, standard deviation, etc.).
- Off-by-one mutants caused some element of the set to be omitted.
- By permuting, a different number would be omitted, and this revealed the defect.
68
Study #1: SVM Example

- Permuting the input reveals this defect because both m_I1 and m_I4 will be different.
- The partial oracle does not, because only one element is omitted, so one set will remain the same; for small data sets, this did not affect the overall result.
69
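To make the mechanism concrete, here is a constructed illustration (not the actual SVM code): an off-by-one loop bound omits one element, and permuting the input changes which element is omitted, so the two outputs disagree.

public class OffByOneDemo {
    // Buggy mean: the loop bound "length - 1" omits the last element.
    static double buggyMean(double[] a) {
        double sum = 0;
        for (int i = 0; i < a.length - 1; i++)   // off-by-one defect
            sum += a[i];
        return sum / a.length;
    }

    public static void main(String[] args) {
        double[] x        = {1, 2, 3, 4, 100};
        double[] permuted = {100, 4, 3, 2, 1};
        // Permutative property: the mean must not depend on element order.
        double r1 = buggyMean(x), r2 = buggyMean(permuted);
        if (Math.abs(r1 - r2) > 1e-9)
            System.out.println("Property violated: " + r1 + " vs " + r2);
    }
}

On the original input the omitted element is 100; on the permuted input it is 1, so the outputs differ and the defect is exposed without ever knowing the correct mean.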
Study #1: C4.5 Results

- Negating the input was very effective.
- C4.5 creates a decision tree in which nodes contain clauses like "if attr_n > α then class = C".
- If the data set is negated, those nodes should change to "if attr_n ≤ -α then class = C", i.e., both the operator and the sign of α should change.
- In most cases, only one of the two changes occurred.
70
Study #1: C4.5 Example

- The mutant causes ClassFreq to have negative values, violating an assertion.
- Permuting the order of elements does not affect the output in this case.
71
Study #1: MartiRank Results

- Permuting and negating were effective at killing comparison operator mutants.
- MartiRank depends heavily on sorting.
- Permuting and negating change which numbers get sorted and what the result should be, thus inducing the differences in the final sorted list.
72
Study #1: Effectiveness of Properties
73
Study #1: Lucene Results

Most mutants gave a non-zero score to the term "foo"; thus property L3 detected the defect.
74
Study #1: gaffitter Results

- G1: increasing the number of generations should increase the overall quality
- G2: multiplying item and bin sizes by a constant should not affect the solution
- Most of the defects killed by G1 were related to incorrectly selecting candidate solutions
75
Empirical Studies: Threats to Validity

- Representativeness of the selected programs
- Types of defects
- Data sets
- Daikon-generated program invariants
- Selection of metamorphic properties
76
Metamorphic Runtime Checking Experimental Study
77
Study #2 Results

- If we only consider functions for which metamorphic properties were identified, there were 189 total mutants.
- MRC detected 96.3%, compared to 67.7% for system-level metamorphic testing.
78
Study #2: PAYL Results

- Both functions call numerous other functions, but we can circumvent restrictions on the input domain.
- Permuting the input tends to kill off-by-one mutants.
79
Study #2 gaffitter Results
80
Study #2: gaffitter Example

The genetic algorithm takes two sets and "crosses over" at a particular element:

  {1 2 3 4 5} x {6 7 8 9}  ->  {1 2 3 9} and {6 7 8 4 5}

Metamorphic property: if we switch the order of the two sets, the new output should be predictable:

  {6 7 8 9} x {1 2 3 4 5}  ->  {6 7 8 4 5} and {1 2 3 9}

Simply put, each new child consists of the elements not included in the original cross-over.
81
Study #2: gaffitter Example

Now consider a defect in which the cross-over happens at the wrong point:

  {1 2 3 4 5} x {6 7 8 9}  ->  {1 2 3 8 9} and {6 7 8 3 4 5}

The metamorphic property is violated: elements 3 and 8 should not appear in both sets.
82
Study #2: gaffitter Example

Correct implementation:
  {1 2 3 4 5} x {6 7 8 9}  ->  {1 2 3 9}

Erroneous implementation:
  {1 2 3 4 5} x {6 7 8 9}  ->  {1 2 3 8 9}

This defect is only detected by system-level metamorphic testing if element 8 has any impact on the "quality" of the final solution. However, a single element is unlikely to do so. (A function-level check is sketched below.)
83
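A sketch of this function-level property as a check over a simple one-point crossover (illustrative code, not gaffitter's implementation): the children from crossover(a, b) and crossover(b, a) must together contain exactly the elements of a and b.

import java.util.ArrayList;
import java.util.List;

public class CrossoverMetamorphicCheck {
    // One-point crossover: head of p1 before the point + tail of p2 from the point.
    static List<Integer> crossover(List<Integer> p1, List<Integer> p2, int point) {
        List<Integer> child = new ArrayList<>(p1.subList(0, point));
        child.addAll(p2.subList(point, p2.size()));
        return child;
    }

    public static void main(String[] args) {
        List<Integer> a = List.of(1, 2, 3, 4, 5);
        List<Integer> b = List.of(6, 7, 8, 9);
        int point = 3;

        List<Integer> c1 = crossover(a, b, point);   // [1, 2, 3, 9]
        List<Integer> c2 = crossover(b, a, point);   // [6, 7, 8, 4, 5]

        // Metamorphic property: c1 and c2 together are exactly a plus b,
        // with no element duplicated or lost.
        List<Integer> combined = new ArrayList<>(c1);
        combined.addAll(c2);
        List<Integer> expected = new ArrayList<>(a);
        expected.addAll(b);
        combined.sort(null);
        expected.sort(null);
        if (!combined.equals(expected))
            System.out.println("Metamorphic property violated");
    }
}

The wrong-point defect above duplicates elements 3 and 8 across the children, so this check flags it even though the defect barely affects solution quality.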
Study #2: Lucene Results

- MRC killed three mutants not killed by MT.
- All three were in the idf function.
84
Study #2: Lucene Example

Search query results are ordered according to a score. For the query "ROMEO or JULIET": Act 3 Scene 5 (5.837), Act 2 Scene 4 (4.681), Act 5 Scene 1 (3.377).

Consider a defect in which the scores are off by one: Act 3 Scene 5 (6.837), Act 2 Scene 4 (5.681), Act 5 Scene 1 (4.377). The results stay the same, because only the order is important.

The partial oracle does not reveal this defect because the scores cannot be calculated in advance.
85
Study #2: Lucene Example

System-level metamorphic property: changing the query order shouldn't affect the result. "ROMEO or JULIET" and "JULIET or ROMEO" both return Act 3 Scene 5 (6.837), Act 2 Scene 4 (5.681), Act 5 Scene 1 (4.377).

Even though the defect exists, the property still holds, and the defect is not detected.
86
Study #2: Lucene Example

The score itself is computed as the result of many sub-calculations:

  Score(q) = Σ Similarity(f) * Weight(qi) + … + idf(q) + …

Metamorphic Runtime Checking can detect that there is an error in this function by checking its individual (mathematical) properties, as sketched below.
87
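For example, idf's mathematical properties can be checked in isolation. This sketch assumes the classic idf formula 1 + log(numDocs / (docFreq + 1)); the version under test in the study may differ, so both the formula and the properties are illustrative.

public class IdfMetamorphicCheck {
    // Assumed formula; stands in for the function under test.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        for (int df = 1; df < 500; df++) {
            // Property 1: idf must not increase as document frequency grows.
            if (idf(df + 1, numDocs) > idf(df, numDocs) + 1e-12)
                System.out.println("Monotonicity violated at docFreq=" + df);
            // Property 2: scaling numDocs and (docFreq + 1) by the same
            // constant leaves idf unchanged, since only their ratio matters.
            if (Math.abs(idf(2 * df + 1, 2 * numDocs) - idf(df, numDocs)) > 1e-9)
                System.out.println("Scale invariance violated at docFreq=" + df);
        }
    }
}

Many mutants inside idf break one of these function-level checks even when the system-level ranking (and thus any system-level property) is unaffected.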
In Vivo Testing
88
Generalization of MRC

In Metamorphic Runtime Checking, the software tests itself.
- Why only run metamorphic tests?
- Why limit ourselves only to applications without test oracles?
- Why not allow the software to continue testing itself as it runs in the production environment?
89
In Vivo Testing

- An approach whereby software tests itself in the production environment by running any type of test (unit, integration, "parameterized unit", etc.) at specified program points.
- Tests are run in a sandbox so as not to affect the original process.
- Invite implementation: less than half a millisecond of overhead per test.
90
Example of Defect: Cache

private int numItems = 0, currSize = 0;   // number of items in the cache,
                                          // and their size (in bytes)
private int maxCapacity = 1024;           // maximum capacity, in bytes

public int getNumItems() {
    return numItems;
}

public boolean addItem(CacheItem i) throws ... {
    numItems++;   // DEFECT: should only be incremented within the "if" block
    if (currSize + i.size < maxCapacity) {
        add(i); currSize += i.size;
        return true;
    } else { return false; }
}
91
Insufficient Unit Test

public void testAddItem() {
    Cache c = new Cache();
    assert(c.addItem(new CacheItem()));
    assert(c.getNumItems() == 1);
    assert(c.addItem(new CacheItem()));
    assert(c.getNumItems() == 2);
}

1. Assumes an empty/new cache
2. Doesn't take into account the various states that the cache can be in
92
Defects Targeted

1. Unit tests that make incomplete assumptions about the state of objects in the application
2. Possible field configurations that were not tested in the lab
3. A legal user action that puts the system in an unexpected state
4. A sequence of unanticipated user actions that breaks the system
5. Defects that only appear intermittently
93
In Vivo: Model of Execution

When a function is about to be executed, the framework decides whether to run a test. If no, the function executes and the rest of the program continues. If yes, the process forks: the parent executes the function and continues as normal, while the child creates a sandbox, runs the test, and stops.
94
Writing In Vivo Tests

/* Method to be tested */
public boolean addItem(CacheItem i) { . . . }

/* JUnit style test */
public void testAddItem() {
    Cache c = new Cache();
    assert(c.addItem(new CacheItem()));
    assert(c.getNumItems() == 1);
}

/* In Vivo style test */
public boolean testAddItem(CacheItem i) {
    Cache c = this;
    int oldNumItems = getNumItems();
    if (c.addItem(i))
        return (c.getNumItems() == oldNumItems + 1);
    else return true;
}
95
Instrumentation

/* Method to be tested */
public boolean __addItem(CacheItem i) { . . . }

/* In Vivo style test */
public boolean testAddItem(CacheItem i) { ... }

public boolean addItem(CacheItem i) {
    if (Invite.runTest("Cache.addItem")) {
        Invite.createSandboxAndFork();
        if (Invite.isTestProcess()) {
            if (testAddItem(i) == false) Invite.fail();
            else Invite.succeed();
            Invite.destroySandboxAndExit();
        }
    }
    return __addItem(i);
}
96
In Vivo Testing: Case Studies

- Applied the testing approach to two caching systems: OSCache 2.1.1 and Apache JCS 1.3.
- Both had known defects that were found by users (with no corresponding unit tests for these defects).
- Goal: demonstrate that "traditional" unit tests would miss these defects but In Vivo testing would detect them.
97
In Vivo Testing: Experimental Setup

- An undergraduate student created unit tests for the methods that contained the defects; these tests passed in "development".
- The student was then asked to convert the unit tests to In Vivo tests; a driver was created to simulate real usage in a "deployment environment".
98
In Vivo Testing: Discussion

- In Vivo testing revealed all of the defects, even though unit testing did not.
- Some defects only appeared in certain states, e.g., when the cache was at full capacity; these are the very types of defects that In Vivo testing targets.
- However, the approach depends heavily on the quality of the tests themselves.
99
In Vivo Testing: Performance
100
More Robust Sandboxes

- "Safe" test case selection [Willmor and Embury, ICSE'06]
- Copy-on-write database snapshots (MS SQL Server v8)
101
In Vivo Testing: Related Work

- Self-checking software
  - Gamma [A. Orso et al., ISSTA'02]
  - Skoll [A. Memon et al., ICSE'03]
  - Cooperative Bug Isolation [B. Liblit et al., PLDI'03]
  - COTS components [S. Beydeda, COMPSAC'06]
- Property-based software testing
  - D. Rosenblum: runtime assertion checking
  - I. Nunes: checking algebraic properties [ICFEM'06]
102
Related Work
103
Limitations of Other Approaches

- Formal specification languages: issues related to completeness; a balance between expressiveness and implementability
- Algebraic properties: useful for data structures, but not for arbitrary functions or entire programs; limitations of previous work in runtime checking
- Log/trace file analysis: requires careful planning in advance
104
Previous Work in MT

- T.Y. Chen et al.: applying metamorphic testing to applications without oracles [Info. & Soft. Tech. vol.44, 2002]
- Domain-specific testing
  - Graphics [J. Mayer and R. Guderlei, QSIC'07]
  - Bioinformatics [T.Y. Chen et al., BMC Bioinf. 10(24), 2009]
  - Middleware [W.K. Chan et al., QSIC'05]
  - Others…
105
Previous Studies [Hu et al., SOQUA'06]

- Invariants were hand-generated
- Smaller programs
- Only deterministic applications
- Didn't consider a partial oracle
106
Developer Effort [Hu et al., SOQUA'06]

- Students were given three-hour training sessions on MT and on assertion checking, then three hours to identify metamorphic properties and program invariants.
- They averaged about the same number of metamorphic properties as invariants, but the metamorphic properties were more effective at killing mutants.
107
Fault Localization

- Delta debugging [Zeller, FSE'02]: compare the trace of a failed execution vs. successful ones
- Cooperative Bug Isolation [Liblit et al., PLDI'03]: numerous instances report results, and a failed execution is compared to those
- Statistical approach [Baah, Gray, Harrold; SOQUA'06]: combines a model of normal behavior with runtime monitoring
108