>> Andreas Zeller: Okay. Thank you very much, Tom, and thank you very much for having me here, and thank you very much for attending. I know that several of you sacrificed the spring holidays or have already sent their families off on holiday while you're desperately waiting to get into your cars to get to the airport.
So I'm going to start by introducing myself. So I'm Andreas Zeller and this is Earth, this is
where I come from. More precisely, this is Europe. And in Europe, you have this patch,
this is Germany, and this tiny orange dot here is the state of Saarland. The State of Saarland is the smallest of Germany's states; it's a little bit larger than Seattle County. And in this tiny patch of a state there is the city of Saarbrücken, which is right on the border with France. That's also among the more interesting parts of Saarbrücken, the part close to France, because there is essentially only one thing that's really interesting there, for which you may want to stop, and that's the computer science cluster, which is composed of Saarland University, the German Research Center for Artificial Intelligence and two Max Planck Institutes, which do basic research.
The picture you see there is actually slightly outdated, because if you look on the other side of the road, this is what you see there: currently there are seven new buildings for computer science being constructed. So I understand you folks have just moved into a new building here and you know what that looks like. This is what we're going to move into, or expand into, in the next year.
This is the guy in charge of these buildings. This is Hans-Peter Lenhof. He's the head of Bioinformatics, and you can see that he's pretty happy, because, well, you see, buildings like this one are actually a more or less risk-free thing. You set them up. While you're setting them up, you can walk through them, you can see where everything is in place, and if something doesn't work properly you're still able to fix things. Even if a wall is completely wrong, you can move the wall to another place. There's always stuff you can change, and essentially you'll have plenty of time to figure out whether it works or not, and there's little chance that the building as a whole will simply collapse at some particular point.
Now as I said, this guy is happy because he's in charge of constructing buildings. If you're building software, however, you're also constructing a big artifact, which actually is just as complex as a building or even more complex. But if something is wrong in there, it can easily take the entire construction down. That is, in this building, say there was a screw somewhere over here which is in the wrong place, and this simple screw may be able to tear the entire building down. This is something which, when I was developing and releasing software, always made me very, very worried.
Well, maybe this is a bit drastic, but it's certainly a property of software that can give us a
couple sleepless nights in particular before a release. So what do we do about that?
Well, of course we test our software before we release it, but how do we know that we actually tested our software well enough?
There's a couple of criteria which you can learn if you go into a software engineering course. You can, for instance, learn that you should make sure that your tests cover every single statement, that every single statement is executed at least once while you're testing. Or better yet, you can think about having every single branch being executed, you can think about having every single condition evaluated to true and false at least once, and you can make sure that every loop is executed at least a number of times. These are all called coverage criteria, and these are criteria by which you can measure whether your test suite makes sense or not.
These are frequently used in industry, these are frequently taught, and this is something you have to remember if you go to an oral exam on testing. But the question is: do these so-called coverage criteria, in particular branch coverage up here, which says that you need to check every single branch of your program, actually make sense?
There's an interesting quote by Elaine Weyuker of AT&T, and she said that whether these coverage criteria for testing are actually adequate or not, whether they make any sense or not, well, we can only define intuitively, that is, from our own expertise. We think that this makes sense, but what she pointed out is that in practice, what you usually have is a defect distribution in your program that can look very, very different.
This is defect density as we mined it from the AspectJ version history. Every single rectangle here is a single class, and the redder a class is, the more defects it had during its history. So what you see here, for instance: over here the very red, bright ones had 46 bugs over time; over here the more black ones had zero bugs over time.
So this defect density is usually distributed according to something we call a Pareto distribution, which means that 20% of all modules or 20% of all classes have about 80% of all defects. And the problem with that is that if you now go and try to achieve something like statement coverage or branch coverage, you want to go and make sure every single branch is executed at least once, you may go on and focus on improving your tests in these black areas again and again and again, where no defect has ever been found, and instead neglect these red areas where there are plenty of defects. It turns out that this occurrence of defects, this defect density, is not necessarily related to the structural properties, that is, it isn't the case that in these areas you simply have more statements to cover or more branches to cover. So this defect distribution actually is independent from the structural criteria we've seen so far.
So people have thought about that already in the past, actually already 30 years ago, and they came up with an alternative to define, or to measure, whether a test suite is adequate or not. And this idea is mutation testing. The concept of mutation testing is actually to simulate bugs and to see whether your test suite finds them.
It works like this. You have a program, and from this program you create a number of copies, and in these copies you now introduce artificial bugs, which are called mutations. A mutation, for instance, may change a constant by a specific factor, or may change operators. The idea is to simulate the bugs that programmers make in real life, too. And once you have created these copies with these artificial bugs introduced, you test them. And if your test suite is good, then it should find these artificial bugs; so far, so good. However, if it turns out that your test suite does not find these artificially seeded bugs, the assumption is that it would also fail to find real bugs in the code. And this is something you have to worry about.
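A minimal sketch of this loop in Java; the Mutant and TestSuite interfaces are hypothetical placeholders, not part of any real framework:

    import java.util.List;

    // Sketch of the basic mutation testing loop described above.
    class MutationScore {
        interface Mutant { }                                  // one program copy with one seeded bug
        interface TestSuite { boolean detects(Mutant m); }    // true if some test fails on m

        static double score(List<Mutant> mutants, TestSuite tests) {
            int killed = 0;
            for (Mutant m : mutants) {
                if (tests.detects(m)) {                       // the artificial bug was found
                    killed++;
                }
            }
            // Fraction of seeded bugs the suite detects; a low value indicates a weak suite.
            return (double) killed / mutants.size();
        }
    }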
Now here's an example of a mutation. This is a piece of code from AspectJ, the Checker class, and this is one of the well-known compareTo methods, which is supposed to return zero or a negative or a positive value, depending on whether the object being compared with is equal, less or greater than the receiver of the method. In this case we have a special compareTo method that simply returns zero, which means that when it comes to checkers, objects are simply equal.
And a mutation, for instance, could go and change this compareTo method from zero over here simply to one, meaning that whenever you compare two objects A and B, one of them would always be reported as the greater of the two. It turns out that if you change this from zero to one, it has no effect on the AspectJ test suite whatsoever. The test suite simply couldn't care less what the return value of compareTo is, which in that case obviously indicates a weakness in the test suite.
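In a simplified sketch (the real AspectJ Checker class is more involved than this), the situation looks like the following:

    // Simplified sketch: a compareTo that declares all Checker objects equal,
    // and the mutation described above shown as a comment.
    class Checker implements Comparable<Checker> {
        @Override
        public int compareTo(Checker other) {
            return 0;   // original: all checkers compare as equal
            // mutant:  return 1;   -- one object would always compare as greater,
            // yet no test in the AspectJ suite notices the difference
        }
    }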
Now there's plenty of mutation operators that you can think of. This is from the book Software Testing and Analysis by Mauro Pezzè and Michal Young. You can replace constants by other constants, scalar variables by constants, array references by constants. You can insert absolute values, you can insert or remove method calls, you can replace operators. There's plenty of things that people have thought about in the past in order to simulate what programmers are doing.
The question is: does this actually work? And again, there have been a number of studies on that, with in particular two very interesting and very important results. First, generated mutants are similar to real faults in software, which means that by using a small number of mutation operators, one is able to simulate real faults as they occur in software. Second, when you do mutation testing, it's generally more powerful than statement or branch coverage when it comes to improving a test suite, and it's also better than data flow coverage, which is another set of coverage criteria that refer to data rather than to control flow. So generally, mutation testing in these studies has been found to be effective, it has been found to work. And this is actually why we wanted to use mutation testing as a means to assess the quality of a test suite, because what I initially wanted to do was this: we wanted to predict defects in software, and we said, well, maybe the chance of a defect increases when the test quality decreases, and therefore we wanted to use mutation testing in order to measure the quality of tests. This was our initial idea, simply to come along as a user of mutation testing and to apply it.
And it didn't really work. It didn't really work because of two issues. One of these issues was known before; we expected that. And the other came completely unexpected. The first thing we worried about was that the size of the programs being used in mutation testing studies was somewhat limited. This, for instance, is the mid program, a Fortran program (we can also make a Java version of it), which returns the middle of three numbers, that is, the one number that is neither the minimum nor the maximum of these three numbers. This is the size of programs that would be used in papers on mutation testing. Of course, when you write a paper you come up with a small example that fits on one slide, such that you can illustrate your approach. This is perfectly normal, but this is also the size of program to which mutation testing was actually applied.
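A Java version of this mid program might look as follows; this is a sketch in the spirit of the textbook example, not the exact code from any particular paper:

    // Returns the middle of three numbers: the one that is neither the
    // minimum nor the maximum. A classic subject of mutation testing papers.
    class Mid {
        static int mid(int x, int y, int z) {
            int m = z;
            if (y < z) {
                if (x < y) {
                    m = y;
                } else if (x < z) {
                    m = x;
                }
            } else {
                if (x > y) {
                    m = y;
                } else if (x > z) {
                    m = x;
                }
            }
            return m;
        }
    }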
Well, of course we knew that mutation testing would work better on small programs, and we knew that because of the other issue we expected: efficiency. Mutation testing is an expensive technique, because for every single mutation you apply to the program, you must rerun the test suite, which means that you run the test suite over and over and over again, which means that it only works in practice if you have a fully automated test suite. That's important. But still, it requires quite a lot of computing resources. This was the issue we knew about and this was the issue we faced. So I wanted to figure out how to make mutation testing efficient, as efficient as possible, because we, being a university, also have limited computing resources. So we came up with a couple of ideas that had already been discussed in the literature, and we implemented them all.
So for instance, we wanted to work on Java programs, so rather than manipulating the source code and recompiling it, we went and manipulated the bytecode directly; we simply replaced particular bytecodes with different bytecodes. This was very efficient, because we didn't have to go through all the recompilation. We focused on a very small number of mutation operators, simple mutation operators that had been shown in an earlier study to actually simulate the real faults made by programmers.
In particular, numerical constants are replaced by plus or minus one, or by zero. Negating a branch condition simply means going along the other way, and we also replace arithmetic operators, plus by minus and division by multiplication and vice versa. We also implemented a couple of run-time conditions that would allow us to switch mutations on and off without even changing the bytecode again. And finally, we also checked for coverage, that is, we only bothered to mutate code that would actually be executed in the program; if some piece of code would not be executed, we didn't even bother to change it.
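The run-time switch can be pictured roughly like this; the flag mechanism shown here is purely illustrative and is not Javalanche's actual encoding:

    // Sketch of a run-time-switchable mutation. Instead of producing one class file
    // per mutant, the mutated expression is guarded by a flag, so the same bytecode
    // can behave as the original or as the mutant without recompilation.
    class Mutations {
        static boolean enabled(String mutationId) {
            // The system property name is hypothetical, for illustration only.
            return mutationId.equals(System.getProperty("active.mutation"));
        }
    }

    class SwitchableChecker implements Comparable<SwitchableChecker> {
        @Override
        public int compareTo(SwitchableChecker other) {
            // Original behavior returns 0; running with -Dactive.mutation=CHECKER_CMP_1
            // activates the mutated value 1 instead.
            return Mutations.enabled("CHECKER_CMP_1") ? 1 : 0;
        }
    }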
Here's an example of one of these optimizations. Again the Checker class, again the compareTo method, and we have this mutation changing zero into one. It's not found by the AspectJ test suite for a simple reason: it's never executed, and therefore, since it's not executed, we can of course mutate it like crazy; this will not have any effect whatsoever.
So this was the issue we knew about, and after we built a framework, it turned out to work reasonably well. For instance, one of the test programs is Jaxen, an XPath engine, 12,000 lines of code, definitely more than this mid program with 25 lines of code, and if we ran it on one of our machines it took six and a half CPU hours. You divide that by the number of CPUs you have available and you get close to the real time that it takes. Our machine had eight CPUs, so it took about one hour of real time to run it all. That's okay; you can live with that.
Which meant that, all in all, we were able to apply mutation testing to, well, mid-size Java programs, which was perfectly okay. Then, however, came the second problem, and the second problem hit us hard. It hit us when we did not expect it at all, and we were really, really bothered. This problem is the inspection problem, and this is a problem which you rarely read about; I think in the entire book of Pezzè and Young there's only one sentence devoted to it. This problem is known as the problem of so-called equivalent mutants. If you apply a mutation to a program, it may simply leave the program semantics completely unchanged, which means that such an equivalent mutant cannot be detected by any test, and there is no way to improve your tests to catch this equivalent mutant.
Also, an equivalent mutant is not exactly representative of a real bug, because real bugs actually change the program semantics. Now, deciding whether a mutation to a program changes the semantics of the program is unfortunately an undecidable problem. If you could solve that problem, you would also solve the halting problem, and probably also P equals NP and whatnot at the same time. But the halting problem is undecidable, so what you need to do is look at these mutants, inspect them manually, and decide whether a mutation actually changed the program semantics, whether it's equivalent or not.
And this is a very tedious task. Let me show you an example of an equivalent mutant. Here's one. Again, we have a compareTo method; this time it comes from the BcelAdvice class, which again is a class in AspectJ. And again we are changing the return value of this compareTo method, but this time we're changing this plus 1 value over here into plus 2. Now you think, well, would this be equivalent or not? Does this have an effect on the program semantics? Normally, a compareTo method is supposed to return a positive value or a zero or a negative value, so the actual value returned should not make any difference. But how do you know that the program actually uses this compareTo method as intended, by comparing the sign of the return value, that is, a.compareTo(b) > 0? Maybe some programmer writes a.compareTo(b) == +1, and then it would have an impact.
So you actually have to go and check all the instances of compareTo. I think if you grep for compareTo in the AspectJ code you end up with 4,000 different instances, so you have to be somewhat smarter than that to find all the relevant occurrences. However, it turns out that in the end this change from plus 1 to plus 2 has no impact whatsoever in AspectJ, and there's no way such a mutation can help you in improving your test suite.
Now we went through a number of such mutations and checked them. It turned out that in Jaxen, again, this 12,000-line program, we took a sample of 20 mutants, and in this sample 40% of the mutants that were not detected were equivalent, which means that we have 40% false positives. If you work on static program analysis to find bugs, 40% false positives may not necessarily seem much, but this takes time.
Every single mutation took us on average 30 minutes to figure out whether it was equivalent or not, because if it was not equivalent, we also had to write a proper test that would catch it. But 30 minutes per mutation is quite a lot, in particular if you realize that almost 2,000 mutations were not detected, which means that if you go and check every single mutation on whether it's equivalent or not, this will take you about 1,000 hours, or, for a Microsoft programmer, make that 10 weeks. And 10 weeks of looking at one mutation after another is not necessarily what you want to do.
So we had thought that efficiency was the main issue, and we made everything run as fast as possible, but what does it matter if your check takes six hours, when afterwards you have work for an entire quarter? Or even worse, this ratio of 40% grows as you improve the test suite, because if you think of having the perfect test suite that catches every single bug, it would also catch every single nonequivalent mutant. Every single mutant that is not caught by your perfect test suite would be equivalent, and therefore, as your test suite improves towards this 100%, the number of nonequivalent mutants decreases, but the number of equivalent mutants stays the same, and therefore, as you improve your test suite, you get more and more false positives.
And all these false positives are just worthless; they just waste your time. With statement coverage, if you find a statement which you cannot reach by any test, you've found a bug; that's great. However, in mutation testing, if you find a change which you cannot detect by any test, well, then you're back to zero. And all of this is a serious problem in mutation testing. Actually, when we read the papers, we found a couple of sentences saying that the human effort needed to actually apply mutation testing was almost prohibitive. So mutation testing: in principle, a good idea, but in practice it takes lots of human effort. And that's when we were slightly disappointed, because at this point we had already invested a year, a man-year, on building an appropriate mutation testing infrastructure, which was very efficient.
So the problem arises how do we go and determine these equivalent mutants? How can
we tell whether a mutant has impact on the program semantics or not?
This word impact actually was what drove our next step. What we wanted to have is big impact: a tiny, tiny change to the program that would create lots and lots of differences in the program execution, like this snowboarder down here, differences that would propagate through the entire program execution and make everything very, very different. And this avalanche model is actually what then became the center of our framework.
So this is what we want to have: a tiny change that has lots and lots of impact all across the execution. How do we measure this? Well, our idea was to look for changes in what constitutes the semantics of individual functions, that is, pre- and postconditions. And where do we get pre- and postconditions from, because normally they are not specified?
So we applied a technique called dynamic invariants to learn the pre- and postconditions of individual functions. Dynamic invariant detection is a technique which was invented by Mike Ernst, who is now at the University of Washington, just across the pond. The idea is that you have a number of program runs, and from these program runs you determine an invariant, that is, a property that holds for all the individual executions. How does this work? Well, once you have such an invariant, you can then check it for individual executions, check whether the invariant holds or not, and compare the two.
Here is a little example. This is again a piece of Java code, an example which, well, should take you about 30 seconds to figure out what it does. If you run this program with 100 randomly generated arrays of different length, you will find that there is one specific precondition, that is, N is always the size of the array, B is never null, and N is between 7 and 13. This is the precondition that is determined by dynamic invariants, and this is the postcondition: B is unchanged and the return value is the sum over all elements of the array.
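In Java, the example might look like this, with the dynamically inferred invariants written as comments (the concrete bounds are the ones reported in the talk):

    // The array-sum example, with the invariants that dynamic inference would
    // report from 100 random runs written as comments.
    class Sum {
        // Precondition (observed over all runs): b != null, n == b.length, 7 <= n <= 13
        // Postcondition: b is unchanged, result == sum of b[0..n-1]
        static int sum(int[] b, int n) {
            int s = 0;
            for (int i = 0; i < n; i++) {
                s += b[i];
            }
            return s;
        }
    }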
And this set of invariants is obtained as follows. We get a trace from all these runs, then we filter the invariants that hold across all of them, and these are then reported as pre- and postconditions. This is how Daikon, Mike Ernst's tool, works. We adapted it and also made Daikon much more efficient in order to make it scalable, and if you have ever worked with Daikon, yes, there is definitely a need for that. We were very happy because at the time Mike Ernst was on sabbatical at Saarland, so he was just around the corner and we could easily do that.
So here's an example. Here's another method from AspectJ, signatureToStringInternal. What this function does is encode the signature of a function, signature meaning the types of the arguments that go in there and come out there. And if we change this index here from minus 1 to 0, this means that we're going to change the way identifiers and signatures are encoded in the program. It turns out that this little change which we make to the signatureToStringInternal method, this little mutation, actually changes the pre- and postconditions of methods all over the place. This little change up here has an impact on overall 40 different methods; in 40 different methods the pre- or postcondition is now violated after applying this change, which means that the semantics of these individual functions has changed.
Better yet, if I make this change, not only does it have an impact on overall 40 invariants, it is also not detected by any of the AspectJ unit tests. And to make matters really worse, the part which I have omitted here actually has a comment from the programmer which says this should be thoroughly tested, which unfortunately it is not. So you can change all this internal signature stuff and it is never caught by the AspectJ test suite, which implies that if anything else were wrong in here, it would also not be found by the real tests.
So we called our framework Javalanche, because of the avalanche effect. And it took us a year to build it. It implements all the nice, efficient techniques as we've seen them before, and it goes and ranks the mutations by impact; that is, those mutations with the highest impact get the highest attention from the programmer. And this is how Javalanche works when it does mutation testing. Again, we have the program, but now we instrument the program with invariant checkers which are going to show us whether the semantics of an individual function has changed. Again, we create copies, and we put our invariant checkers into these copies just as well.
Now we go and mutate these copies by applying appropriate mutations. Then we run the tests, and if the tests find the mutant, it's all fine; we don't worry about those. But if a test does not find the mutation, and this happens, then we check whether individual invariants have been violated, whether there has been a change in the function semantics, in the function behavior. And if we now have multiple mutations that are not found by the test suite, then we focus on the ones which have the highest number of invariant violations, because these are the ones with the highest impact.
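A minimal sketch of this ranking step; the Mutant interface is a hypothetical placeholder standing in for Javalanche's internals:

    import java.util.List;
    import java.util.stream.Collectors;

    // Among the mutants the test suite did NOT detect, sort by the number of
    // violated invariants, highest impact first.
    class ImpactRanking {
        interface Mutant {
            boolean wasDetected();        // some test failed on this mutant
            int violatedInvariants();     // invariant checkers that fired under this mutant
        }

        static List<Mutant> rankSurvivors(List<Mutant> mutants) {
            return mutants.stream()
                    .filter(m -> !m.wasDetected())
                    .sorted((a, b) -> Integer.compare(b.violatedInvariants(),
                                                      a.violatedInvariants()))
                    .collect(Collectors.toList());
        }
    }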
So again, we learn these invariants from the test suite, we insert invariant checkers into
the code. We detect the impact of mutations and we focus on those where the most
invariants have been violated. That is where we find the highest impact. So this was our
idea. But now again the question for us...does this actually work? Does this make
sense?
So we went and built an evaluation for that. First, we checked whether mutations with impact are less likely to be equivalent. Second, we checked whether mutations with impact are more likely to be detected by the test suite, and third, we checked whether those mutations with the highest impact are the most likely to be detected.
We did this on a number of evaluation subjects. We started with Jaxen, which is right in the middle with 12,450 lines of code, up to AspectJ, which is almost 100,000 lines of code. AspectJ is the AOP extension to Java. There is Barbecue, which is a bar code reader; Commons Lang, which is a helper library; Jaxen, which is an XPath engine; Joda-Time, which is a date and time library; JTopas, a set of parser tools; and XStream, which is a tool for object serialization to XML. Different sizes, different numbers of tests.
We applied all the mutations there were; our mutation scheme generates a finite number of mutations, so we applied them all, something between 1,500 mutations for JTopas and 47,000 mutations for AspectJ. You can see that the detection rate varies a lot from project to project. For XStream, almost all mutations were detected; for AspectJ, only roughly half of the mutations were detected. I'm not sure whether you can directly compare these measures and whether this says anything about the respective test suites, but the XStream test suite is really, really tough, whereas the AspectJ test suite is, let's say, great for research if you want to improve tests. We very much like AspectJ; also, if you do static analysis, AspectJ is your friend, or if you do mining of bugs, it's really great. It's a wonderful subject in every respect.
>> Question: These numbers are for nonequivalent mutations?
>> Andreas Zeller: So far we don't care about equivalence or not; this is just the mutations that were detected. Now, if you apply mutation testing, you would focus on those that survived here, and you would check every single mutation; if it's not equivalent, then you should improve your test suite such that it would find it.
>> Question: Wondering about XStream, it must imply that there are not very many equivalent mutations, too. So --
>> Andreas Zeller: That's right, too.
>> Question: So I wonder what that says about it?
>> Andreas Zeller: This number of 92% of mutations being detected of course also means that the number of equivalent mutants cannot be more than 8%, because detected always implies changing the program semantics and therefore nonequivalent. We will actually use this very relationship in a moment to show the relationship between impact and nonequivalence.
Just a brief word about performance. I said that it's feasible in practice, meaning that AspectJ, the largest of our programs, 100,000 lines of code, took 40 CPU hours. Of course, if you have programs that are larger than that, 10 million, 20 million lines of code, you probably will not reach these numbers; I guess there it takes 40 CPU hours for one single test run alone. But for these medium-sized programs, we are already pretty happy.
Learning invariants is an expensive thing. It took 22 CPU hours, but this is something
you have to do only once for the entire time you ever want to apply mutation testing.
Creating these checkers is somewhat expensive, but learning is the most expensive so
far.
So these are the results that we got. First question: are mutations with impact less likely to be equivalent? What we did here is we took a sample of the Jaxen mutants. As I said, it takes lots of time to manually assess them, so we had to go for a sample; otherwise my PhD student would still be busy for the remainder of the year, or probably for five more years, to check all the nondetected mutants.
In this sample we checked whether they were equivalent. We had two people with the task of writing a test case; if they could not write a test case, then the mutant was deemed equivalent, and all in all this took 30 minutes per mutation. This is where the time of 30 minutes comes from. And what we found was that in the set of mutations that actually violated invariants, that is, mutations with impact, 10 out of 12 were nonequivalent, that is, they actually could be detected by a test, and only two of them were equivalent. In the set of non-violating mutants, that is, those which did not violate any invariant, it turned out that eight were equivalent when it came to overall program behavior, whereas only four of them were nonequivalent. And this is a significant relationship: mutants with impact, in our sample, are significantly less likely to be equivalent. That is, if you generalize the results of the sample, you generally want to focus on mutants that actually have an impact, because then you don't have to wade through all these equivalent ones.
For our next experiment we wanted to go for something more automated, because, as I said, checking for equivalence manually takes lots of time. And our idea was to use exactly the detection rate of the test suite, because if I have a mutation that is detected by a test suite, it's not equivalent by definition, because it changed the program semantics. And if I now have a general mutation scheme whose mutations are frequently detected by the test suite, then this means that this mutation scheme generates a small number of equivalent mutants; that is, the more mutations are detected, the better my generation scheme is in terms of generating nonequivalent mutants.
In other words, on a more logical level, this is our reasoning. If a mutation is detected by
a test, it must be nonequivalent by definition. And if we can show that a mutation that
has impact is also frequently detected by a test, then we can show that mutations that
have impact also are likely to be nonequivalent. And this relationship is the relationship
we have to show and this is what we wanted to do in our test, whether mutations with
impact are more likely to be detected.
And this is what we found in all of our subjects. This was automated, so we could run it on as many programs as we wanted. Here are again our seven programs, and this is the detection rate for non-violating mutants: a non-violating mutant in AspectJ had a 60% chance to be detected, whereas in Commons Lang it was 80%, and in XStream 80% as well. However, if you take the subset of mutants that actually violate invariants, it turns out that in six out of the seven projects they are detected more frequently than the non-violating mutants, which shows that if you have an impact on invariants, the chance of getting your mutation detected by an actual test is higher, and therefore these mutants are less likely to be equivalent.
Again, you should focus on those mutations that have impact. Yes?
>> Question: And this is to be expected, because the invariants were dynamically inferred starting from the same test suite, right?
>> Andreas Zeller: Yes. Since the invariants were detected from the test suite, anything that's different from the behavior in the test suite would also trigger these invariants. That's right. However, this was not necessarily something that we expected at this point. Also, the same thing actually happens if you don't learn from the entire test suite, but only learn from one single execution. If you learn from only one single execution, you will still find the very same effect.
You can also think of these invariants simply as a means of detecting a difference in the execution: the more invariants you violate, the greater the difference spreads across the program, and therefore the higher the chance of the difference actually reaching the end of the execution, or turning up in the overall result as something that is testable.
>> Question: (Inaudible) dynamic invariants are an abstraction --
>> Andreas Zeller: Uh-huh.
>> Question: Of properties observed --
>> Andreas Zeller: Yes.
>> Question: -- through the execution of what is basically the test suite.
>> Andreas Zeller: That's right. Yes.
>> Question: Should we characterize this --
>> Andreas Zeller: Yes. The assumption of course is also that the test suite, which is
designed by humans, makes some sense initially. It's not created randomly, it's created
on purpose, such that it covers the most important properties of the program, except for
those that are not detected in there.
So far this was just the absolute question of whether a mutation had impact or not. We also checked those mutations with the highest impact, because we said, well, if you have the choice between individual mutations, you go for those with the highest impact first. Again, we have the detection rate for non-violating mutants, which is the same. But now we compare this against the top 5% of violating mutants, that is, the mutants that violate the highest number of invariants, and you can see that this is always higher, in every single program, up to 100% in three of these projects, which means that if you look at the top 5% of mutants, every single one would be detected. And if you had any in here which would not be detected, then this would be the very first thing to concentrate upon, because this would be a mutant which changes everything in the entire execution, but still is not found by the test suite.
So mutations with the highest impact are the most likely to be detected, and therefore also the least likely to be equivalent.
The detection rates for AspectJ are actually very neat: if you look at the top 1% of mutants, the detection rate is close to 100%, and it goes down as you go down the ranking. The same also applies for Commons, Jaxen, Joda-Time and JTopas. Barbecue is an exception: it stays at the top, but there's no decrease in here. The reason for that is that Barbecue has a relatively low number of mutants that are nondetected, and therefore we don't see this nice behavior for Barbecue. But still, for six out of seven projects the detection rate is always the highest for those mutations that are ranked at the top.
So, in terms of evaluation: are mutations with impact less likely to be equivalent? Yes. Are mutations with impact more likely to be detected and therefore nonequivalent? Yes. Are those with the highest impact the most likely to be detected? Yes, as well.
This means in practice that when you are doing mutation testing, you should focus on those mutations which have impact, and focus on those with the highest impact first, which at the same time addresses the issue of having to deal with lots and lots of equivalent mutants. So, the two essential problems of mutation testing: efficiency and inspection. Efficiency one can deal with for programs of limited size, and inspection, the second issue we faced, is also something you can deal with.
What this also means in practice is that, in the past, mutation testing was limited to very small programs, of a size where you could still easily determine whether a mutation was equivalent or not; this was the mid program with 16 lines of code. Now we can apply mutation testing to programs which are much larger than that, up to 100,000 lines of code. This is what we actually do at this point, and we are pretty happy to have made mutation testing so much more scalable.
I know that some of you folks are working on the Windows (inaudible) things; yes, you would have to add another dimension of scale to that, probably larger than this entire wall. I understand that. Still, we are pretty happy with the advances in scalability we have made so far. So, speaking about future work: initially we came to this from a pure user perspective, but then we got hooked, and now we're doing research in this direction because it became interesting. We want to find out how effective mutation testing is on a large scale, and we are actually the first ones who can do that, because we are able to do this on a large scale without sacrificing the lives of thousands and thousands of interns who would need to classify all these mutants. And we want to compare it to traditional coverage metrics and the like, and see whether mutation testing is better than traditional metrics, not only for small programs, but also for large or, say, medium-sized programs.
We're looking into alternative measures of impact; in particular, we're looking into coverage these days, that is, checking not only whether the data has changed in terms of invariants, but also whether the coverage has changed. Checking whether the coverage has changed is something that can be done in a much more efficient way than learning invariants, so this overhead goes away. This is what we're looking into these days.
One of my pet ideas is adaptive mutation testing. The idea here is that you put in a number of mutations, you check those that have survived, you select those with the highest impact, and from this small set of high-impact survivors you generate a new batch of mutations. You play this again and again, each time selecting those with the highest impact, which is some sort of meta-mutation approach, mutations that mutate into new mutations. The idea is simply that you want to let the mutations with the highest impact survive, in order to be able to focus on those changes you can make to your program which tear everything apart, but which still are not caught by your test suite.
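As a sketch of this idea only (it is future work, not an existing Javalanche feature, and all names here are hypothetical):

    import java.util.List;
    import java.util.stream.Collectors;

    // Adaptive mutation testing: keep the surviving mutations with the highest
    // impact and derive the next batch from them, generation after generation.
    class AdaptiveMutation {
        interface Mutation {
            boolean survivesTests();      // not detected by the current test suite
            int impact();                 // e.g. number of violated invariants
            List<Mutation> derive();      // follow-up mutations generated from this one
        }

        static List<Mutation> evolve(List<Mutation> seed, int generations, int keep) {
            List<Mutation> current = seed;
            for (int g = 0; g < generations; g++) {
                current = current.stream()
                        .filter(Mutation::survivesTests)                            // keep survivors
                        .sorted((a, b) -> Integer.compare(b.impact(), a.impact()))  // highest impact first
                        .limit(keep)
                        .flatMap(m -> m.derive().stream())                          // next batch
                        .collect(Collectors.toList());
            }
            return current;
        }
    }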
And finally, this concept of looking at the impact of an individual change, and measuring it, also has a number of other applications. What we're thinking of, for instance, is to use this as a measure for predicting risk. You can think of it like this: if I simulate a change to a particular component and then measure the impact of this simulated change, I could think, wow, if changing this very component has so much impact all across the program execution, this component must be very crucial for the program to work, or to work right. And if you look at defect density like this, it may well be that components that have many, many defects are precisely those where changes have a great impact and which have been frequently changed in the past. If you can determine that a change in here has lots and lots of impact, changes the program behavior, introduces real failures, and you also find out that this component has been changed frequently or recently in the past, then you would assume that this makes a very good defect predictor for the future.
So, coming to the conclusion, or to sum all this up: there are two major problems with mutation testing. It's a great idea in principle, but there are issues with efficiency and issues with inspection. We built this into Javalanche, which focuses on the mutations with the highest impact on invariants. Focusing on the mutations with the highest impact makes sense because this actually reduces the number of equivalent mutants, and all in all this addresses the fear of the tester, as well as our own fear. I'm very happy to have gotten rid of this fear, and we're able to apply mutation testing to much, much larger programs than we were originally able to.
And with that, thank you for your attention. I'm happy to take questions.
(applause)
Patrif?
>> Question: So -- so the end game here is to basically be able to tell whether or not a
given test suite is good or lousy?
>> Andreas Zeller: Yeah.
>> Question: So for the examples that you considered, I mean, it would be nice to show what traditional metrics, like just branch coverage or statement coverage --
>> Andreas Zeller: Uh-huh.
>> Question: And see what conclusion one could come up with, like this test suite is probably lousy because it only covers like 25% of the code.
>> Andreas Zeller: Uh-huh.
>> Question: Versus what you could come up with with this framework.
>> Andreas Zeller: Yes. This is on our agenda, and I'd really like to have an answer to that. The reason why this is on our agenda and why we are still working on it is that there's unfortunately lots and lots of manual work in there, because the traditional way of comparing coverage metrics against each other is actually to use mutation testing: you put mutants into the program and then check whether a specific coverage criterion is better at finding these artificially seeded bugs than another one.
>> Question: Yes. But for large programs not necessarily, because the larger the program, the more dead code there is.
>> Andreas Zeller: Yes. That's right.
>> Question: And so now I have a medium-size program.
>> Andreas Zeller: Uh-huh.
>> Question: And 50% of it is dead code. 25% is reading that code and 25% is not exercisable, just basically testing interface.
>> Andreas Zeller: Uh-huh.
>> Question: Now I have two test suites, basically.
>> Andreas Zeller: Uh-huh.
>> Question: One good one that exercises 50% of the code.
>> Andreas Zeller: Uh-huh.
>> Question: -- and (inaudible) --
>> Andreas Zeller: Yes.
>> Question: -- coverage.
>> Andreas Zeller: Yes.
>> Question: I have one that is lousy and just exercises 10% of the code.
>> Andreas Zeller: Uh-huh.
>> Question: So it still would be interesting, I mean, you could just show what the coverage is. You see that exactly.
>> Andreas Zeller: Yes.
>> Question: It is (inaudible) to compute.
>> Question: The danger with perhaps an approach like this is that your prediction of, like, equivalent mutants is really driven by the (inaudible) itself. So if I have a lousy (inaudible), for instance, it may very well be that the mutant is not going to be detected by that suite. So it is not going to help me improve my test suite from 10% coverage to 50% coverage.
>> Andreas Zeller: Uh-huh. There's two things. One is the assumption that 50% coverage is always better than 10% coverage, because if your 10% coverage happens to be in those areas which were marked as red, for instance, then you can eradicate 90% of the bugs, whereas your 50% coverage may be in the black areas and may find 0% or 10% of all bugs.
That is one thing. And this is actually the reason why we wanted to look into mutation testing: because mutation testing also takes into account whether a defect propagates across the program, how it propagates, and whether it actually can be detected by a test. So this is a sort of built-in advantage of this kind of bug seeding. Of course, you have to take into account whether these bugs are similar to real bugs and all this; there are a number of other issues that come with that.
The second question is the dependency on these dynamic invariants. The thing is, there is a risk of the test suite perpetuating itself in some way, because you learn invariants from the test suite, and then you focus on those mutations that violate these invariants, and you improve your test suite again in a particular direction.
For one thing, it may be that this direction is actually the right direction, because the test suite originally makes sense and you are completing the test suite using the same properties again and again. Number two is that if you actually can create lots and lots of impact by violating these invariants in the program, these may actually be the very places that you want to look into. And number three is that we actually obtain very similar results if we just look for changes in coverage, or if we look for changes with respect to only one individual execution, without taking the entire test suite into account.
If we look for changes in coverage, for instance, and this is another experiment we ran, you also find that the more difference in coverage you have, and in particular the more widespread your difference in coverage is, the more this correlates with detectability; and in samples you find that it also correlates with being equivalent or nonequivalent.
That is the problem with coverage, and this is why invariants are valuable: we know that this works, and we've shown that. The problem with coverage, which we're working on right now, is that there are many, many ways to establish differences in coverage. In particular loops: you can count the number of loop executions, but does the number of loop executions have an impact on the program semantics? There are a couple of things we have to weed out here. However, for dynamic invariants, this works very well, and I'm happy to take the risk of the test suite perpetuating itself into account. Long answer.
Yes, Rob?
>> Question: So it seems like Javalanche has two major sources of heuristics in it. One is, what are the mutations that you choose to introduce. And then the other is, what are the invariants that you choose to try out.
>> Andreas Zeller: Uh-huh.
>> Question: Kind of in the back of my mind I was thinking through, okay, this is an interesting design space, how would I coordinate those two.
>> Andreas Zeller: Uh-huh. Yes.
>> Question: And it's kind of subtle. I mean, it's not, for example, you know, so say you're going to check for equalities and inequalities, you know, when you introduce invariants.
>> Andreas Zeller: Yes.
>> Question: That doesn't necessarily mean you only want to mutate equalities and inequalities.
>> Andreas Zeller: Yes.
>> Question: I mean, that would be dumb. You take constants and that might be good. So trying to find heuristics that kind of match and complement each other, that seems like a complicated little (inaudible) --
>> Andreas Zeller: Yes. We were conservative at this point in terms of mutations. These are not mutations that we chose ourselves, but mutations that have been shown in a different context to actually correlate with real faults of programmers. The question again is, does this apply to Java programs? Does it apply to really big programs and all? This is also something that now, with a framework like this, you can evaluate.
The second thing is the dynamic invariants. For the invariants, we essentially went with Daikon out of the box, as it was designed; we did not have to decide on that again. So we simply took the Daikon tool out of the box and then we ran it. Then after a week or so we stopped the execution and decided to make it more efficient, because it had already run for one week. We simply retained those invariant patterns which would cover 80% of all invariants, we only checked for those, and lo and behold, Daikon ran 100 times faster from then on. Mike Ernst was very happy to hear that, because he's frequently confronted with people complaining about the poor performance of Daikon on real programs. And this is essentially what we did in here.
So these decisions make up a big design space, but we didn't make these decisions ourselves; we simply relied on what had been determined by other researchers. Of course, this is all stuff to investigate, and in order to investigate this you first need a framework, a framework that gets rid of all these equivalent mutants, because if you want to examine whether a specific set of mutations makes more sense than another and you don't have this filter for equivalent mutants up front, you have to spend endless hours just analyzing whether a mutant would be equivalent or not, because that skews every single result.
If you come up with a set of mutations that generates lots and lots of equivalent mutants (for instance, simply replacing every constant by itself would make a great mutation scheme), you will find that it generates lots of mutants that are not detected by the test suite, but all of these are equivalent. So this is always what you have to get rid of first.
>> Question: I guess you can also, module by module, figure out historically what has changed most about that program.
>> Andreas Zeller: Yes.
>> Question: And then replay those accessible bugs.
>> Andreas Zeller: Yes. This is also part of our agenda. I already mentioned this idea of adaptive mutation testing. You can make mutation testing adaptive with respect to impact, but you could also go and make mutation testing adaptive with respect to the past history, that is, focus on those mutations that are more similar to past ones. What you need there, again also for adaptive mutation testing, is some sort of vocabulary from which you can compose appropriate mutations. What you will find, though, is that real bugs typically tend to have fixes that span maybe sometimes just one line, sometimes a dozen lines, and showing that your mutations are equivalent to these is not that easy. But we want to have a framework that allows us to do that.
Andrew?
>> Question: So (inaudible) of the Daikon performance enhancements in some sense could be considered mutations to the Daikon code. Did you run Javalanche over that?
(laughter)
>> Question: Did you accidentally change the semantics of Daikon?
>> Andreas Zeller: We actually did not change the semantics of Daikon. Daikon has a library of invariants, of invariant patterns that it checks against, and what we did is we simply reduced this library to the very basic invariants that cover, as I said, 80% of all invariants. So we first ran Daikon on everything, though not on AspectJ, because that doesn't work, at least not on our machine, which only has a paltry 32 gigabytes of RAM. But we ran it on the smaller programs like Jaxen, ran Daikon as a whole, then went for the reduction of the library, and we found that 80% of the invariants were the same. And therefore, we went with the much smaller version.
Yep?
>> Question: (Inaudible) case -- so the usage model seems to be to find inadequacies in your tests, which means you add new tests. How feasible is it to have a good test suite for something like AspectJ, because it seems like you keep adding more and more tests. Is it feasible to actually do that, if your goal is successful and (inaudible) --
>> Andreas Zeller: Well, in the long run, if you believe in mutation testing, what you want to have is this: you apply 1,000 mutations and every single mutation is found by the test suite. That's very nice. Then you have a very good test suite that finds all these artificially seeded bugs.
You can of course increase the smartness of your mutations, at some point coming up with very subtle things that would not be detected. Okay, then you would show that your test suite is also able to find very subtle bugs. But for us, what we found from looking at the mutations that were not detected is that there are simply places in your program which are covered by the test suite, which are being executed, so far, so good, but in which a single change can create lots and lots of impact through the program. This means that the test suite, although it executes this part, obviously does not check properly for the results, or is not accurate enough, or is not strong enough at this point. Because this is one thing: you can create a test which simply executes (inaudible) but which simply doesn't care about the results, or which simply checks whether it crashes or not. But if the test suite does not check the result, you achieve perfect coverage, yet your test suite may still be of very bad quality. And this is something that would not happen with mutation testing.
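A small sketch of this point, assuming JUnit 4 and reusing the mid example from earlier: the first test executes the code but can never fail, so it contributes coverage yet kills no mutants; the second one can.

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    public class WeakVsStrongTest {
        @Test
        public void executesButChecksNothing() {
            Mid.mid(1, 2, 3);                  // contributes coverage, asserts nothing:
                                               // no mutant of mid() can make this test fail
        }

        @Test
        public void checksTheResult() {
            assertEquals(2, Mid.mid(1, 2, 3)); // a mutant that changes the result is killed
        }
    }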
>> Question: (Inaudible) -- a good test suite is even feasible for the whole (inaudible)?
>> Andreas Zeller: Is building a good test suite actually feasible? Well, over here at Microsoft Research you have people who are working on exhaustive testing techniques, model-based testing in particular, people who are building really, really extensive tests. Well, this is also something I'd like to verify: how resilient are these super-duper nice test generation systems when it comes to artificially seeded bugs? You will probably find that these techniques are very, very good at testing your program exhaustively, so it's actually feasible, yes. And parts of this test generation, large parts of it, can be automated.
Also, you can apply mutation testing to any kind of automated quality assurance. If you have some fancy static analysis that finds these bugs, or even if you have maybe a whole army of theorem provers and you have formal specifications and all, you can apply the very same technique: you seed the bug in there, and if your theorem prover doesn't find it, oh, you obviously need to work on your axioms at this point.
>> Question: (Inaudible) -- feasible, right? Ship stuff whether (inaudible), but it's testing
because -- or it's good enough testing.
>> Andreas Zeller: Uh-huh.
>> Question: But the kind of cool thing about this is, especially if you're working with a lot of (inaudible) tests, the previous measure was code coverage under tests, and you have no idea if it actually (inaudible) product code, but you don't know if it's actually doing confirmation in (inaudible).
>> Andreas Zeller: Uh-huh.
>> Question: And a lot of the early code is actually just all APIs, and be sure it doesn't (inaudible) --
>> Andreas Zeller: Yeah. There is (inaudible).
>> Question: You don't even check the (inaudible) --
>> Andreas Zeller: Uh-huh.
>> Question: The (inaudible), they get kind of (inaudible) --
>> Andreas Zeller: Uh-huh.
>> Question: And they are kind of useless and in fact they cover up from the --
>> Andreas Zeller: That's the big advantage of mutation testing: it simply simulates real bugs and forces you to really improve your tests to catch at least the simulated bugs. This does not necessarily mean they will also catch real bugs, but if real bugs have any resemblance to the mutations that we apply in here, then yes, this should work.
And even if you improve your test suite just to catch these very simple bugs, like changing a constant from zero to 1, or adding plus 1 or minus 1, there are so many bugs that come from off-by-one errors, for instance, that could be caught this way. And as you said, if a return value is not checked, this will be found out by mutation testing, but not by traditional coverage metrics.
>> Question: (Inaudible) the mutations you were interested in, how did you decide on what that set was, and was there perhaps any evidence (inaudible) to say that the mutations being picked were aligned with bug types that actually occur?
>> Andreas Zeller: Yes. The mutations that we picked came from an earlier study, let me see, this was by James Andrews, Lionel Briand and Yvan Labiche, who did a study on that. And they compared this small set of mutations with the real bug history of a program called Space, which is a medium-size C program which had a total of 20 bugs in the past. It's been used by the European Space Agency to navigate satellites across the globe, so it's a real program. And they found that the small set of mutations was equivalent, or equivalent enough, to actually be compared with the real bugs that the Space program had.
The Space program is 20 years old at this point, and it is not object-oriented, and it may well be that object-oriented programs have their own set of mistakes that people make. At this point we simply relied on what we found in the literature, because initially we started simply as users of this technique. But with the tool that we have at hand, we can now actually examine all these questions, because we can go and check whether a specific selection of mutation operators is better than another, or whether mutation testing as a whole gives you better results than traditional coverage metrics. In some way this is just the beginning, and at this point the design space is restricted to what is already there, what others did in the past. But we can of course explore it from now on.
Yes?
>> Question: So this is a little bit unrelated. So you mentioned that it took about 30 minutes --
>> Andreas Zeller: Yes.
>> Question: -- to determine whether the mutants were equivalent.
>> Andreas Zeller: Uh-huh.
>> Question: Did knowing the invariants, when that was used during the study, make that timing faster? Or did people actually use this new information that you had in determining whether (inaudible) --
>> Andreas Zeller: Oh, that's an interesting point. The question is whether having these invariants would make it easier for people to decide whether mutants were equivalent or not. In the study that we did, the invariants played no role, because when we did the study, we didn't have any invariants at that point. It may well be that if you do have invariants, this eases program comprehension, in particular figuring out what an individual function actually should do, and then I would expect that this is faster.
However, these 30 minutes are just an average. The larger your program becomes, the more tedious it becomes to examine this. And as in the quote of Phyllis Franklin, which you had in the beginning, even for a small program, figuring out whether a random change has an impact or not is incredibly difficult. See, if you're working with changes made by real people and you want to figure out what their impact is, these are changes that were made on purpose. There's some intelligence behind that, at least that is what we expect; there is some intelligence and some meaning behind it. You expect it to stay in line with the at least implied semantics of the program. If you apply random changes, you can forget all about that reasoning; instead you have to go and really figure out what this particular mutation would have an impact on. You have to follow the dependencies, you have to infer the implicit semantics of the function, and this is where invariants actually help, because they make it more explicit. And you have to check every single occurrence all over the program.
It's like making a random change to the Windows code at some point. Oops. Where will this all have an impact, where will this spread? If you don't have a means to measure this, even if you go with, say, static analysis and think about where this would have an impact, which other places you have to look into, it's still very likely that you will spend more than just 10 minutes on that.
>> Question: Any more questions? Then let's thank Andreas again.
(applause)
>> Andreas Zeller: Thank you very much.