>> Andreas Zeller: Okay. Thank you very much, Tom, and thank you very much for having me here, and thank you very much for attending. I know that several of you sacrificed the spring holidays, or have already sent your families off on holiday while you're desperately waiting to get into your cars and drive to the airport. So I'm going to start by introducing myself. I'm Andreas Zeller, and this is Earth, this is where I come from. More precisely, this is Europe. And in Europe you have this patch, this is Germany, and this tiny orange dot here is the state of Saarland. Saarland is the smallest of Germany's states; it's a little bit larger than Seattle County. And in this tiny patch of a state there is the city of Saarbrücken, which is right on the border to France. That's also among the most interesting things about Saarbrücken, that it is close to France, because there is essentially only one thing there that's really interesting, for which you may want to stop, and that's the computer science cluster, which is composed of Saarland University, the German Research Center for Artificial Intelligence, and two Max Planck Institutes, which do basic research. The picture you see here is actually slightly outdated, because if you look on the other side of the road, this is what you see there: currently there are seven new buildings for computer science being constructed. I understand you folks have just moved into a new building here, so you know what that looks like. This is what we're going to move into, or expand into, in the next year. This is the guy in charge of these buildings, Hans-Peter Lenhof. He's the head of bioinformatics, and you can see that he's pretty happy, because, well, constructing buildings like this one is actually a more or less risk-free thing. You set them up; while you're setting them up, you can walk through them, you can see whether everything is in place, and if something doesn't work properly you're still able to fix things. Even if a wall is completely wrong, you can move the wall to another place. There's always stuff you can change, you have plenty of time to figure out whether it works or not, and there's little chance that the building as a whole will simply collapse at some particular point. Now, as I said, this guy is happy because he's in charge of constructing buildings. If you're building software, however, you're also constructing a big, big artifact, which is just as complex as a building or even more complex. But if something is wrong in there, it can easily take the entire construction down. That is, if in this building there were a screw somewhere over here which is in the wrong place, this simple screw might be able to tear the entire building down. This is something which, while I was developing and releasing software, always made me very, very worried. Well, maybe this is a bit drastic, but it's certainly a property of software that can give us a couple of sleepless nights, in particular before a release. So what do we do about that? Well, of course we test our software before we release it, but how do we know that we actually tested our software well enough? There are a couple of criteria, which you can learn if you go into a software engineering course.
You can, for instance, learn that you should make sure that your tests cover every single statement, that every single statement is executed at least once while you're testing. Or, better yet, you can think about having every single branch executed, or having every single condition evaluated to true and to false at least once, or making sure that every loop is executed at least a certain number of times. These are all called coverage criteria, and they are criteria by which you can measure whether your test suite makes sense or not. They are frequently used in industry, they are frequently taught, and they are something you have to remember if you go to an oral exam on testing. But the question is whether these so-called coverage criteria, in particular the ones up here that say you need to check every single branch of your program, actually make sense. There's an interesting quote by Elaine Weyuker of AT&T, who said that whether these coverage criteria for testing are actually adequate or not, whether they make any sense or not, we can only define intuitively, that is, from our own expertise: we think that this makes sense. But what she pointed out is that in practice the defect distribution in your program can look very, very different. This is defect density as we mined it from the AspectJ version history. Every single rectangle here is a single class, and the redder a class is, the more defects it had during its history. So what you see here, for instance, is that there are places, the very red, bright ones over here, which had 46 bugs over time, whereas over here the more black ones had zero bugs over time. This defect density is usually distributed according to what we call a Pareto distribution, which means that 20% of all modules, or 20% of all classes, have about 80% of all defects. And the problem with that is that if you now go and try to achieve something like statement coverage or branch coverage, making sure every single branch is executed at least once, you may go on and focus on improving your tests again and again and again in areas where no defect has ever been found, and instead neglect these red areas where there are plenty of defects. It turns out that this occurrence of defects, this defect density, is not necessarily related to the structural properties; it isn't the case that in these areas you simply have more statements or more branches to cover. So this defect distribution is actually independent of the structural criteria we've seen so far. People thought about that already in the past, actually already 30 years ago, and they came up with an alternative way to define or to measure whether a test suite is adequate or not. And this idea is mutation testing. The concept of mutation testing is to simulate bugs and to see whether your test suite finds them. It works like this: you have a program, and from this program you create a number of copies, and in these copies you introduce artificial bugs, which are called mutations. A mutation, for instance, may change a constant by a specific factor, or may change an operation. The idea is to simulate the bugs that programmers make in real life, too.
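To make this concrete, here is a minimal, hypothetical sketch in Java; the class and method names are made up for illustration and do not come from any of the programs discussed in the talk. In practice each mutant would live in its own copy of the program; they are shown side by side here only to keep the example short.

    // Original (hypothetical) method.
    public class Account {
        public int fee(int amount) {
            if (amount > 100) {
                return amount / 10;
            }
            return amount + 1;
        }

        // Mutant 1: numerical constant replaced (100 becomes 101).
        public int feeMutant1(int amount) {
            if (amount > 101) {
                return amount / 10;
            }
            return amount + 1;
        }

        // Mutant 2: branch condition negated.
        public int feeMutant2(int amount) {
            if (!(amount > 100)) {
                return amount / 10;
            }
            return amount + 1;
        }

        // Mutant 3: arithmetic operator replaced (division becomes multiplication).
        public int feeMutant3(int amount) {
            if (amount > 100) {
                return amount * 10;
            }
            return amount + 1;
        }
    }

A test suite would be considered adequate with respect to these mutants if, for each of them, at least one test that passes on the original fails on the mutant.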
And once you have created these copies with the artificial bugs introduced, you test them. If your test suite is good, then it should find these artificial bugs; so far, so good. However, if it turns out that your test suite does not find these artificially seeded bugs, the assumption is that your test suite would not find real bugs in the code either. If the test suite does not find artificial bugs, chances are it would also fail to find real bugs, and this is something you have to worry about. Now here's an example of a mutation. This is a piece of code from AspectJ, the Checker class, and this is one of the well-known compareTo methods, which is supposed to return zero, a negative, or a positive value depending on whether the object being compared is equal to, less than, or greater than the receiver of the method. In this case we have a special compareTo method that always returns zero, which means that when it comes to Checkers, objects are simply equal. And a mutation, for instance, could change this compareTo method from returning zero to returning one, meaning that whenever you compare two objects A and B, B would always be the bigger of the two. It turns out that if you change this from zero to one, it has no effect on the AspectJ test suite whatsoever. The test suite simply couldn't care less what the return value of compareTo is, which in this case obviously indicates a weakness in the test suite. Now, there are plenty of mutation operators you can think of. This is from the book on software testing and analysis by Mauro Pezzè and Michal Young. You can replace constants by other constants, scalar variables by constants, array references by constants; you can insert absolute values; you can insert and remove method calls; you can replace operators. There are plenty of things that people have thought about in the past in order to simulate what programmers are doing. The question is: does this actually work? And again there have been a number of studies on that, with particularly interesting and important results. First, generated mutants are similar to real faults in software: by using a small number of mutation operators, one is able to simulate real faults as they occur in software. Second, mutation testing is generally more powerful than statement or branch coverage when it comes to improving a test suite, and it's also better than data flow coverage, which is another set of coverage criteria that refer to data rather than to control flow. So generally, in these studies, mutation testing has been found to be effective; it has been found to work. And this is actually why we wanted to use mutation testing as a means to assess the quality of a test suite. What I initially wanted to do was predict defects in software, and we said, well, maybe the chance of a defect increases when the test quality decreases, and therefore we wanted to use mutation testing in order to measure the quality of tests. This was our initial idea, simply to come in as users of mutation testing and to apply it. And it didn't really work. It didn't really work because of two issues. One of these issues was known before; we expected that. The other came completely unexpectedly. The first thing we worried about was that the size of the programs used in mutation testing studies was somewhat limited.
This, for instance, is the mid program, a Fortran program (we can also make a Java version of it), which returns the middle of three numbers, that is, the one number that is neither the minimum nor the maximum of the three. This is the size of program that is used in papers on mutation testing. Of course, when you write a paper, you come up with a small example that fits on one slide, such that you can illustrate your approach. This is perfectly normal, but this is also the size of programs on which mutation testing was actually applied. Well, and of course we knew that mutation testing would work better on small programs, and we knew that because of the other issue we expected: efficiency. Mutation testing is an expensive technique, because for every single mutation you apply to the program, you must rerun the test suite, which means that you run the test suite over and over and over again, which means that it only works in practice if you have a fully automated test suite. That's important. But it still requires quite a lot of computing resources. This was the issue we knew about, and this was the issue we faced. So I wanted to figure out how to make mutation testing as efficient as possible, because we, being a university, also have limited computing resources. So we took a couple of ideas that had already been discussed in the literature, and we implemented them all. For instance, we wanted to work on Java programs, so rather than manipulating the source code and recompiling it, we manipulated the byte code directly; we simply replaced particular byte codes with different byte codes. This was very efficient, because we didn't have to go through all the recompilation. We focused on a very small number of simple mutation operators that had been shown in the earlier study to simulate the real faults made by programmers. In particular, we replace numerical constants by plus or minus one or by zero; we negate branch conditions, which means simply going the other way; and we replace arithmetic operators, plus by minus and divide by multiply and vice versa. We also implemented a couple of run-time conditions that allow us to switch mutations on and off without changing the byte code again. And finally, we also checked for coverage, that is, we only bothered to mutate code that would actually be executed; if some piece of code would not be executed, we didn't even bother to change it. Here's an example of one of these optimizations. Again the Checker class, again the compareTo method, and we have this mutation changing zero into one. It's not found by the AspectJ test suite for a simple reason: it's never executed, and therefore, since it's not executed, we can of course mutate it like crazy; this will not have any effect whatsoever. So this was the issue we knew about, and after we built a framework, it turned out to work reasonably well. For instance, one of our test programs is an XPath engine, 12,000 lines of code, definitely more than this mid program with 25 lines of code, and when we ran it on one of our machines, it took six and a half CPU hours. You divide that by the number of CPUs you have available and you get close to the real time that it takes.
Our machine had eight CPUs, so it took about one hour of real time to run it all. That's okay; you can live with that. Which meant that, all in all, we were able to apply mutation testing to, well, mid-size Java programs, which was perfectly okay. Then, however, came the second problem, and the second problem hit us hard. It hit us when we did not expect it at all, and we were really, really bothered. This problem is the inspection problem, and it is a problem you rarely read about; I think in the entire book of Pezzè and Young there's only one sentence devoted to it. This problem is known as the problem of so-called equivalent mutants. If you apply a mutation to a program, it may leave the program semantics completely unchanged, which means that it cannot be found by any test, which also means that such an equivalent mutant cannot be detected by any test and there is no way to improve your test suite to catch it. Also, an equivalent mutant is not exactly representative of a real bug, because real bugs are what actually change the program semantics. Now, deciding whether a mutation changes the semantics of a program is unfortunately an undecidable problem. If you could solve that problem, you would also solve the halting problem, and probably also P equals NP and whatnot in the same turn. But the halting problem is undecidable, so what you need to do is look at these mutants, inspect them manually, and decide whether a mutation actually changes the program semantics, whether it's equivalent or not. And this is a very tedious task. Let me show you an example of an equivalent mutant. Here's one. Again we have a compareTo method; this time it comes from the BcelAdvice class, which again is a class in AspectJ. And again we are changing the return value of this compareTo method, but this time we're changing this plus one value over here into plus two. So this is a compareTo method. Now you might think: would this be equivalent or not? Does this have an effect on the program semantics? Normally a compareTo method is supposed to return a positive value, zero, or a negative value, so the actual value returned should not make any difference. But how do you know that the program actually uses this compareTo method as intended, by comparing the sign of the return value, as in a.compareTo(b) > 0? Maybe some programmer writes a.compareTo(b) == +1, and then it would have an impact. So you actually have to go and check all the uses of compareTo. I think if you grep for compareTo in the AspectJ code you end up with 4,000 different instances, so you have to be somewhat smarter than that to find all the relevant occurrences. However, it turns out that in the end this change from plus one to plus two has no impact whatsoever in AspectJ, and there's no way such a mutation can help you in improving your test suite. Now, we went through a number of such mutations and checked them. It turned out that in Jaxen, again this 12,000-line program, we took a sample of 20 mutants, and in this sample 40% of the mutants that were not detected were equivalent, which means that we have 40% false positives. Well, if you work on static program analysis to find bugs, 40% false positives may not necessarily seem big, but this takes time.
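To make the equivalence argument concrete, here is a minimal, hypothetical sketch; the classes below are made up for illustration and are not the actual AspectJ code.

    import java.util.Arrays;

    // Hypothetical stand-in for a class with a compareTo method.
    class Advice implements Comparable<Advice> {
        final int priority;

        Advice(int priority) {
            this.priority = priority;
        }

        // Mutation site: the "+1" below could be mutated into "+2".
        public int compareTo(Advice other) {
            if (priority < other.priority) return -1;
            if (priority > other.priority) return +1;   // mutant: return +2;
            return 0;
        }
    }

    public class EquivalentMutantDemo {
        public static void main(String[] args) {
            Advice[] advices = { new Advice(3), new Advice(1), new Advice(2) };
            // Arrays.sort only looks at the sign of compareTo's result, so the
            // original and the "+2" mutant behave identically here.
            Arrays.sort(advices);
            for (Advice a : advices) {
                System.out.println(a.priority);
            }
        }
    }

As long as every caller only inspects the sign of the result, as the compareTo contract intends, no test can distinguish the mutant from the original; only a caller written as a.compareTo(b) == +1 could expose it, and deciding whether such a caller can ever be reached is exactly the undecidable part.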
Every single mutation took us, on average, 30 minutes to figure out whether it was equivalent or not, because if it was not equivalent, we also had to write a proper test that would catch it. But 30 minutes per mutation is quite a lot, in particular if you realize that almost 2,000 mutations were not detected, which means that if you go and check every single mutation for whether it's equivalent or not, this will take you about 1,000 hours, or, for a Microsoft programmer, make that ten weeks. Ten weeks of looking at one mutation after another is not necessarily what you want to do. So we had thought that efficiency was the main issue, and we made everything run as fast as possible, but what does it mean if your check takes six hours and afterwards you have work for an entire quarter? Or worse even, this ratio of 40% grows as you improve the test suite, because if you think about it, a perfect test suite that catches every single bug would also catch every single nonequivalent mutant. Every single mutant that is not caught by your perfect test suite would be equivalent, and therefore, as your tests improve towards this 100%, the number of surviving nonequivalent mutants decreases, but the number of equivalent mutants stays the same, and therefore, as you improve your test suite, you get more and more false positives. And all these false positives are just worthless; they just waste your time. With statement coverage, if you find, here is a statement which I cannot reach by any test, you've found something, that's great. However, in mutation testing, if you find, here's a change which I cannot detect by any test, well, then, okay, you're back to zero. All of this is a serious problem in mutation testing. Actually, when we read the papers, we found a couple of sentences saying that the human effort needed to actually apply mutation testing was almost prohibitive. So mutation testing: in principle, a good idea, but in practice it takes lots of human effort. At that point we were slightly disappointed, because we had already invested a year, a man-year, in building an appropriate mutation testing infrastructure, which was very efficient. So the problem arises: how do we go and determine these equivalent mutants? How can we tell whether a mutant has impact on the program semantics or not? This word, impact, is actually what drove our next step. What we wanted to have is big impact: a tiny, tiny change to the program that creates lots and lots of differences in the program execution, like this snowboarder down here, differences that propagate through the entire program execution and make everything very, very different. And this avalanche model is actually what then became the center of our framework. So this is what we want to have: a tiny change that has lots and lots of impact all across the execution. How do we measure this? Our idea was to look for changes in what constitutes the semantics of individual functions, that is, pre- and postconditions. And where do we get pre- and postconditions from, given that normally they are not specified? We applied a technique called dynamic invariants to learn the pre- and postconditions of individual functions.
Dynamic invariants is a technique invented by Mike Ernst, who is now at the University of Washington, just across the pond. The idea is that you have a number of program runs, and from these runs you determine an invariant, that is, a property that holds for all the observed executions. How does this work? Well, once you have such an invariant, you can check it against individual runs and see whether the invariant holds or not; you can compare the two. Here is a little example, again a piece of Java code, which should take you about 30 seconds to figure out. If you run this program with 100 randomly generated arrays of different lengths, you will find that there is one specific precondition, namely that N is always the size of the array, B is never null, and N is between 7 and 13. This is the precondition determined by dynamic invariants, and this is the postcondition: B is unchanged, and the return value is the sum over all elements of the array. This set of invariants is obtained as follows: we get a trace from all these runs, we then filter candidate invariants against these traces, and the invariants that survive are reported as pre- and postconditions. This is how Daikon, the tool by Mike Ernst, works. We adapted it and also made Daikon much more efficient in order to make it scalable, and if you have ever worked with Daikon, yes, there is definitely a need for that. We were very happy because at the time Mike Ernst was on sabbatical at Saarland, so he was just around the corner and we could easily do that. So here's an example, another method from AspectJ: signatureToStringInternal. What this function does is encode the signature of a function, signature meaning the types of the arguments that go in and out. If we change this index here from minus one to zero, this means that we're going to change the way identifiers and signatures are encoded in the program. And it turns out that this little change which we make to the signatureToStringInternal method, this little mutation, actually changes the pre- and postconditions of methods all over the place. This little change up here has impact on, overall, 40 different methods; in 40 different methods the pre- or postcondition is now violated after applying this change, which means that the semantics of these individual functions has changed. Better yet, not only does this change impact overall 40 invariants, it is also not detected by any of the AspectJ unit tests, and to make matters really worse, the part which I have omitted here actually has a comment from the programmer which says this should be thoroughly tested, which unfortunately it is not. So you can change all this internal signature stuff and it is never caught by any test in the AspectJ test suite, which implies that if anything else were wrong in here, it would also not be found by the real tests. So we called our framework JAVALANCHE, because of the avalanche effect. It took us a year to build it. It implements all the nice, efficient techniques we've seen before, and it ranks the mutations by impact, that is, the mutations with the highest impact get the highest attention from the programmer. And this is how JAVALANCHE works when it does mutation testing.
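Before the workflow itself, here is a minimal sketch, in plain Java rather than actual Daikon output, of the kind of learned pre- and postcondition just described for the array-summing example, written out as runtime checks; the method and its invariants are hypothetical.

    public class SumWithInvariants {
        // Hypothetical method corresponding to the array-summing example above:
        // returns the sum of the first n elements of b.
        static int sum(int[] b, int n) {
            // Learned precondition (from the observed runs):
            //   b != null, n == b.length, 7 <= n <= 13
            assert b != null && n == b.length && 7 <= n && n <= 13
                    : "precondition violated";

            int total = 0;
            for (int i = 0; i < n; i++) {
                total += b[i];
            }

            // Learned postcondition: the return value equals the sum
            // over all elements of b (b itself is left unchanged).
            int check = 0;
            for (int x : b) {
                check += x;
            }
            assert total == check : "postcondition violated";

            return total;
        }

        public static void main(String[] args) {
            // Run with assertions enabled: java -ea SumWithInvariants
            System.out.println(sum(new int[] {1, 2, 3, 4, 5, 6, 7}, 7));
        }
    }

A mutation whose execution makes such a check fire has observable impact on that function's behavior, even if no test assertion ever notices the difference.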
Again, we have the program, but now we instrument the program with invariant checkers, which tell us whether the semantics of an individual function has changed. Again, we create copies, and we put our invariant checkers into these copies as well. Now we mutate these copies by applying the appropriate mutations. Then we run the tests, and if a test finds the mutant, all is fine; we don't worry about those. But if the tests do not find the mutation, and this happens, then we check whether individual invariants have been violated, whether there has been a change in the function semantics, in the function behavior. And if we have multiple mutations that are not found by the test suite, then we focus on the ones with the highest number of invariant violations, because these are the ones with the highest impact. So again: we learn the invariants from the test suite, we insert invariant checkers into the code, we detect the impact of mutations, and we focus on those where the most invariants have been violated, that is, where we find the highest impact. So this was our idea. But now, again, the question for us: does this actually work? Does this make sense? So we built an evaluation for that. First, we checked whether a mutation with impact is less likely to be equivalent. Second, we checked whether mutations with impact are more likely to be detected by the test suite. And third, we checked whether the mutations with the highest impact are the most likely to be detected. We did this on a number of evaluation subjects. We started with Jaxen, which is right in the middle with 12,450 lines of code, and went up to AspectJ, which is almost 100,000 lines of code. AspectJ is an AOP extension to Java. There is Barbecue, which is a barcode library; Commons, which is a collection of helper classes; Jaxen, which is an XPath engine; Joda-Time, which is a date and time library; JTopas, a set of parser tools; and XStream, which is a tool for object serialization to XML. Different sizes, different numbers of tests. We applied all the mutations there were; our mutation scheme yields a finite number of mutations, so we applied all of them, something between 1,500 mutations for JTopas and 47,000 mutations for AspectJ. You can see that the detection rate varies a lot from project to project. For XStream, almost all mutations were detected; for AspectJ, only roughly half of the mutations were detected. I'm not sure whether you can directly compare these measures and whether this says anything about the respective test suites, but the XStream test suite is really, really tough, whereas the AspectJ test suite is, let's say, great for research if you want to improve tests. If you do static analysis, AspectJ is also your friend, or if you do mining of bugs, it's really great. It's a wonderful subject in every single way.
>> Question: These numbers are for nonequivalent mutations?
>> Andreas Zeller: So far we don't care about whether they are equivalent or not; this is just the mutations that were detected. If you now applied mutation testing, you would focus on those that survived here, and you would check every single mutation: if it's not equivalent, then we should improve our test suite such that it finds it.
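The ranking step can be pictured with a rough sketch like the following; the data structures and the example numbers are made up for illustration and are not the actual JAVALANCHE implementation. For each mutation that survives the test suite, we count how many learned invariants its execution violated, and we present the survivors with the highest counts first.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public class ImpactRanking {
        // Hypothetical record of one mutation and its measured impact.
        static final class MutationResult {
            final String description;
            final boolean killedByTests;      // detected by at least one test
            final int violatedInvariants;     // impact measure
            MutationResult(String description, boolean killedByTests, int violatedInvariants) {
                this.description = description;
                this.killedByTests = killedByTests;
                this.violatedInvariants = violatedInvariants;
            }
        }

        // Keep only surviving (undetected) mutations and rank them by impact,
        // highest number of violated invariants first.
        static List<MutationResult> rankSurvivors(List<MutationResult> all) {
            List<MutationResult> survivors = new ArrayList<>();
            for (MutationResult m : all) {
                if (!m.killedByTests) {
                    survivors.add(m);
                }
            }
            survivors.sort(Comparator.comparingInt((MutationResult m) -> m.violatedInvariants).reversed());
            return survivors;
        }

        public static void main(String[] args) {
            List<MutationResult> results = List.of(
                new MutationResult("negate condition in Checker.compareTo", false, 0),
                new MutationResult("replace -1 with 0 in signatureToStringInternal", false, 40),
                new MutationResult("replace + with - in Parser.advance", true, 12));
            for (MutationResult m : rankSurvivors(results)) {
                System.out.println(m.violatedInvariants + " violated invariants: " + m.description);
            }
        }
    }

The survivors at the top of such a list are the ones with the highest impact and therefore the first candidates for writing new tests.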
>> Question: I'm wondering about XStream; it must imply that there are not very many equivalent mutants either. So --
>> Andreas Zeller: That's right, too.
>> Question: So I wonder what that says about it?
>> Andreas Zeller: This number of 92% of mutations being detected of course also means that the number of equivalent mutants cannot be more than 8%, because detected always implies changing the program semantics, and therefore nonequivalent. We will actually use this very relationship in a moment to show the relationship between impact and nonequivalence. Just a brief word about performance. I said that it's feasible in practice, meaning that AspectJ, the largest of our programs with 100,000 lines of code, took 40 CPU hours. Of course, if you have programs that are larger than that, 10 million or 20 million lines of code, you probably will not reach these numbers; I guess that there it takes 40 CPU hours for one single test run alone. But for these medium-sized programs, we are already pretty happy. Learning invariants is an expensive thing; it took 22 CPU hours, but this is something you have to do only once for the entire time you ever want to apply mutation testing. Creating the checkers is somewhat expensive, but learning is the most expensive part so far. So these are the results that we got. First question: are mutations with impact less likely to be equivalent? What we did here is take a sample of the Jaxen mutants. As I said, it takes lots of time to manually assess them, so we had to go for a sample; otherwise my PhD student would still be busy for the remainder of the year, or probably for five more years, to check all the undetected mutants. In this sample we checked whether they were equivalent. We had two people with the task of writing a test case; if they could not write a test case, then the mutant was deemed equivalent, and all in all this took 30 minutes per mutation. This is where the time of 30 minutes comes from. And what we found was that in the set of mutations that actually violated invariants, that is, mutations with impact, 10 out of 12 were nonequivalent, that is, they actually could be detected by a test, and only two of them were equivalent, whereas in the set of nonviolating mutants, those which did not violate any invariant, it turned out that eight were equivalent when it came to overall program behavior, and only four of them were nonequivalent. And this is a significant relationship: mutants with impact, in our sample, are significantly less likely to be equivalent. That is, if you generalize the results of the sample, you would generally like to focus on mutants that actually have an impact, because then you don't have to wade through all these equivalent ones. For our next experiment we wanted to go for something more automated, because, as I said, checking for equivalence manually takes lots of time. Our idea was to use exactly the detection rate of the test suite, because if I have a mutation that is detected by the test suite, it's not equivalent by definition, because it changed the program semantics. And if I have a general mutation scheme whose mutations are frequently detected by the test suite, then this means that this mutation scheme generates a small number of equivalent mutants; that is, the more mutations are detected, the better my generation scheme is at generating nonequivalent mutants.
In other words, on a more logical level, this is our reasoning: if a mutation is detected by a test, it must be nonequivalent by definition. And if we can show that a mutation that has impact is also frequently detected by a test, then we can show that mutations that have impact are also likely to be nonequivalent. This is the relationship we have to show, and this is what we wanted to test: whether mutations with impact are more likely to be detected. And this is what we found in all of our subjects. This was automated, so we could run it on as many programs as we wanted. Here again are our seven programs, and this is the detection rate for nonviolating mutants: a nonviolating mutant in AspectJ had a 60% chance to be detected, whereas in Commons Lang and XStream it was about 80%. However, if you take the subset of mutants that actually violate invariants, it turns out that in six of the seven projects they are detected more frequently than the nonviolating mutants, which shows that if a mutation has an impact on invariants, the chance of it being detected by an actual test is higher, and therefore such mutants are less likely to be equivalent. Again, you should focus on those mutations that have impact. Yes?
>> Question: And this is to be expected, because the invariants were dynamically inferred starting from the same test suite, right?
>> Andreas Zeller: Yes, you can think of it that way: since the invariants were determined from the test suite, anything that differs from the behavior in the test suite would also trigger these invariants. That's right. However, this was not necessarily something we expected at this point. Also, the same thing happens if you don't learn from the entire test suite but only from one single execution; if you learn from only one single execution, you will still find the very same effect. You can also think of these invariants simply as a means of detecting a difference in the execution, and the more invariants you violate, the wider the difference spreads across the program, and therefore the higher the chance of the difference actually reaching the end of the execution, or turning up in the overall result as something that is testable.
>> Question: (Inaudible) dynamic invariants are an abstraction --
>> Andreas Zeller: Uh-huh.
>> Question: -- of properties observed --
>> Andreas Zeller: Yes.
>> Question: -- through the execution of what is basically the test suite.
>> Andreas Zeller: That's right. Yes.
>> Question: Should we characterize this --
>> Andreas Zeller: Yes. The assumption, of course, is also that the test suite, which is designed by humans, makes some sense initially. It's not created randomly; it's created on purpose, such that it covers the most important properties of the program, except for those that are not tested there. So far this is just a binary property, whether a mutation had impact or not. We also looked at the mutations with the highest impact, because we said, well, if you have the choice between individual mutations, you go for those with the highest impact first. Again, we have the detection rate for nonviolating mutants, which is the same as before.
But now we compare this against the top 5% of violating mutants, that is, the mutants that violate the highest number of invariants, and you can see that the detection rate is always higher, in every single program, up to 100% in three of these projects, which means that if you look at the top 5% of mutants, every single one is detected. And if you had any mutant in here which was not detected, then this would be the very first thing to concentrate on, because this would be a mutant which changes everything in the entire execution but still is not found by the test suite. So the mutations with the highest impact are the most likely to be detected and therefore also the least likely to be equivalent. The detection rates, actually, for AspectJ, this is very neat: if you look at the top 1% of mutants, the detection rate is close to 100%, and it goes down as you go down the ranking. The same also applies to Commons, Jaxen, Joda-Time, and JTopas. Barbecue is an exception; it stays at the top, but there's no decrease here. The reason is that Barbecue has a relatively low number of mutants that are undetected, and therefore we don't see this nice behavior for Barbecue, but still, for six out of seven projects the detection rate is always highest for those mutations that are ranked at the top. So, in terms of evaluation: are mutations with impact less likely to be equivalent? Yes. Are mutations with impact more likely to be detected, and therefore nonequivalent? Yes. Are those with the highest impact the most likely to be detected? Yes as well. In practice this means that when you are doing mutation testing, you should focus on those mutations which have impact, and focus on those with the highest impact first, which at the same time addresses the issue of having to deal with lots and lots of equivalent mutants. So, the two essential problems of mutation testing: efficiency and inspection. Efficiency one can deal with for limited-size programs, and inspection, the second issue we faced, is also something you can deal with. What this also means in practice is that in the past, mutation testing was limited to very small programs, of a size where you could still easily determine whether a mutation was equivalent or not; this was the mid program, with its 16 lines of code. Now we can apply mutation testing to programs which are much larger than that, up to 100,000 lines of code. This is what we actually do at this point, and we are pretty happy to have made mutation testing this much more scalable. I know that some of you folks are working on the Windows (inaudible) things; yes, you have to add another dimension of scale to that, probably larger than this entire wall. I understand that. So far we are pretty happy with the advances in scalability we have made. So, speaking about future work: initially we came to this from a pure user perspective, but then we got hooked, and now we're doing research in this direction because it became interesting.
But we want to find out how effective mutation testing is on a large scale, and we are actually the first ones who can do that, because we are able to do this on a large scale without sacrificing the lives of thousands and thousands of interns who would need to classify all these mutants. We want to compare it to traditional coverage metrics and the like, and see whether mutation testing is better than traditional metrics, not only for small programs but also for large, or say medium-sized, programs. We're also looking into alternative measures of impact; in particular, we're looking into coverage these days, that is, checking not only whether the data has changed in terms of invariants, but also whether the coverage has changed. Checking whether the coverage has changed is something that can be done much more efficiently than learning invariants, so that overhead goes away. This is what we're looking into these days. One of my pet ideas is adaptive mutation testing. The idea here is that you put in a number of mutations, you check those that have survived, you keep those with the highest impact, and from this small set of highest-impact mutations you generate a new batch of mutations. You play this again and again: you select those with the highest impact and repeat, which is some sort of meta-mutation approach, or mutations that mutate into new mutations. The idea is simply that you want the mutations with the highest impact to survive, in order to be able to focus on those changes you can make to your program which tear everything apart but still are not caught by your test suite. And finally, this concept of looking at the impact of an individual change and measuring it also has a number of other applications. What we're thinking of, for instance, is to use this as a measure for predicting risk. You can think of it this way: if I simulate a change to a particular component and then measure the impact of this simulated change, and it turns out that changing this very component has so much impact all across the program execution, you could think of this component as being very crucial for the program to work, or to work correctly. And if you look at defect density like this, it may well be that components that have many, many defects are precisely those where changes have a great impact and which have been frequently changed in the past. If you can determine that a change in here has lots and lots of impact, changes the program behavior, introduces real failures, and you also find out that this component has been changed frequently or recently in the past, then you would assume that this makes a very good defect predictor for the future. So, coming to the conclusion, or to sum all this up: there are two major problems with mutation testing, a great idea in principle, but with issues of efficiency and issues of inspection. We built Javalanche, which focuses on the mutations with the highest impact on invariants; focusing on the mutations with the highest impact makes sense, because this actually reduces the number of equivalent mutants, and all in all this takes away the fear of the tester, as well as our own fear. I'm very happy to have gotten rid of this fear, and we're able to apply mutation testing to much, much larger programs than we were originally able to. And with that, thank you for your attention. I'm happy to take questions. (applause) Patrif?
>> Question: So the end game here is basically to be able to tell whether or not a given test suite is good or lousy?
>> Andreas Zeller: Yeah.
>> Question: So for the examples that you considered, it would be nice to show what traditional metrics, like just branch coverage or statement coverage --
>> Andreas Zeller: Uh-huh.
>> Question: -- what conclusion you could come up with, like: this test suite is probably lousy because it only covers like 25% of the code --
>> Andreas Zeller: Uh-huh.
>> Question: -- versus what you could come up with using this framework.
>> Andreas Zeller: Yes. This is on our agenda, and I'd really like to have an answer to that. The reason why this is still on our agenda and why we are still working on it is that, unfortunately, there's lots and lots of manual work in there, because the traditional way of comparing coverage metrics against each other is actually to use mutation testing: you put mutants into the program and then check whether one specific coverage criterion is better at finding these artificially seeded bugs than another one.
>> Question: Yes. But for large programs not necessarily, because the larger the program, the more dead code there is.
>> Andreas Zeller: Yes. That's right.
>> Question: And so -- so now I have a medium-size program.
>> Andreas Zeller: Uh-huh.
>> Question: And 50% of it is dead code, 25% is reading that code, and 25% is not exercisable, just basically testing interface.
>> Andreas Zeller: Uh-huh.
>> Question: Now I have two test suites, basically.
>> Andreas Zeller: Uh-huh.
>> Question: One good one that exercises 50% of the code --
>> Andreas Zeller: Uh-huh.
>> Question: -- and (inaudible) --
>> Andreas Zeller: Yes.
>> Question: -- coverage.
>> Andreas Zeller: Yes.
>> Question: And I have one that is lousy and just exercises 10% of the code.
>> Andreas Zeller: Uh-huh.
>> Question: So it still would be interesting, I mean, you could just show what coverage says. You see that exactly.
>> Andreas Zeller: Yes.
>> Question: It is (inaudible) to compute.
>> Question: The danger with an approach like this is that your prediction of, say, equivalent mutants is really driven by the (inaudible) itself. So if I have a lousy (inaudible), for instance, it may very well be that the mutant is not going to be detected by that suite. So it is not going to help me improve my test suite from 10% coverage to 50% coverage.
>> Andreas Zeller: Uh-huh. There are two things. One is the assumption that 50% coverage is always better than 10% coverage, because if your 10% coverage happens to be in these areas which are marked red, for instance, then you may eradicate 90% of the bugs, whereas your 50% test suite may work in the black areas and find only 0% or 10% of all bugs. That is one thing. And this is actually the reason we wanted to look into mutation testing, because mutation testing also takes into account whether a defect propagates across the program, how it propagates, and whether it actually can be tested. So this is a sort of built-in advantage of this bug seeding. Of course, you have to take into account whether these bugs are similar to real bugs, and there are a number of other issues that come with that. The second question is the dependency on these dynamic invariants.
Of course, the thing is, there is a risk of the test suite perpetuating itself in some way, because you learn invariants from the test suite, then you focus on those mutations that violate these invariants, and you improve your test suite again in a particular direction. For one thing, it may be that this direction is actually the right direction, because the test suite originally makes sense and you are completing the test suite along the same properties again and again. Number two is that if you can create lots and lots of impact by violating these invariants in the program, these may actually be the very places you want to look into. And number three is that we obtain very similar results if we just look for changes in coverage, or if we look at only one individual execution without taking the entire test suite into account. If we look for changes in coverage, for instance, and this is another experiment we ran, you also find that the more difference in coverage you have, and in particular the more widespread the difference in coverage is, the more this correlates with detectability, and in samples you find that it also correlates with being equivalent or nonequivalent. This is why invariants are preferable for now: we know that this works, and we've shown that. The problem with coverage, which we're working on right now, is that there are many, many ways to establish differences in coverage. In particular loops: you can count the number of loop executions, but does the number of loop executions have an impact on the program semantics? There are a couple of things we have to weed out here. For dynamic invariants, however, this works very well, and I'm happy to take the risk of the test suite perpetuating itself into account. Long answer. Yes, Rob?
>> Question: So it seems like JAVALANCHE has two major sources of heuristics in it. One is: what are the mutations that you choose to introduce. And the other is: what are the invariants that you choose to try out.
>> Andreas Zeller: Uh-huh.
>> Question: Kind of in the back of my mind I was thinking, okay, this is an interesting design space; how would I coordinate those two?
>> Andreas Zeller: Uh-huh. Yes.
>> Question: And it's kind of subtle. I mean, say you're going to check for equalities and inequalities when you introduce invariants.
>> Andreas Zeller: Yes.
>> Question: That doesn't necessarily mean you only want to mutate equalities and inequalities.
>> Andreas Zeller: Yes.
>> Question: I mean, that would be dumb. You mutate constants, and that might be good. So trying to find heuristics that kind of match and complement each other, that seems like a complicated little (inaudible) --
>> Andreas Zeller: Yes. We were conservative at this point in terms of mutations. These are not mutations that we chose ourselves, but mutations that have been shown in a different context to actually correlate with real faults made by programmers. The question again is: does this carry over to Java programs? Does it apply to really big programs and all that? This is also something that you can now evaluate with a framework like this. The second thing is the dynamic invariants. For the invariants, we essentially used Daikon out of the box, as it was designed by others; it was not up to us to decide on that again. So we simply took the Daikon tool out of the box and ran it.
Then, after a week or so, we stopped the execution and decided to make it more efficient, because it had been running for one week already. We simply retained those invariant patterns which cover 80% of all invariants, and we only checked for those, and lo and behold, Daikon ran 100 times faster from then on. Mike Ernst was very happy to hear that, because he's frequently confronted with people complaining about the poor performance of Daikon on real programs. And this is essentially what we did here. So these decisions make up a big design space, but we didn't make these decisions ourselves; we simply relied on what other researchers had determined. Of course, this is all stuff to investigate, and in order to investigate it you first need a framework, a framework that gets rid of all these equivalent mutants, because if you want to examine whether one specific set of mutations makes more sense than another, and you don't have this filter for equivalent mutants up front, you have to spend endless hours just analyzing whether a mutant is equivalent or not, because all of this flows into your results. If you come up with a set of mutations that generates lots and lots of equivalent mutants, for instance, simply replacing every constant by itself would make a great mutation scheme: you will find that it generates lots of mutants that are not detected by the test suite, but all of them are equivalent. So this is what you always have to get rid of first.
>> Question: I guess you could also go module by module and figure out historically what has changed most about that program.
>> Andreas Zeller: Yes.
>> Question: And then replay those as possible bugs.
>> Andreas Zeller: Yes. This is also part of our agenda. I already mentioned this idea of adaptive mutation testing. You can make mutation testing adaptive with respect to impact, but you could also make mutation testing adaptive with respect to the past history, that is, focus on those mutations that are most similar to past ones. What you need there, again also for adaptive mutation testing, is some sort of vocabulary from which you can compose appropriate mutations. What you will find, though, is that real bugs typically tend to have fixes that span sometimes just one line, sometimes a dozen lines, and showing that your mutations are equivalent to these is not that easy. But we want to have a framework that allows us to do that. Andrew?
>> Question: So the Daikon performance enhancements in some sense could be considered mutations to the Daikon code. Did you run Javalanche over that? (laughter)
>> Question: Did you accidentally change the semantics of Daikon?
>> Andreas Zeller: We actually did not change the semantics of Daikon. Daikon has a library of invariants, of invariant patterns, that it checks against, and what we did is simply reduce this library to the very basic invariants that cover, as I said, 80% of all invariants. So we first ran Daikon as a whole, not on AspectJ, because that doesn't work, at least not on our machine, which only has a paltry 32 gigabytes of RAM. But we ran Daikon as a whole on the smaller programs like Jaxen, then went for the reduction of the library, and we found that 80% of the invariants were the same. And therefore we went for the much smaller version. Yep?
>> Question: (Inaudible) -- so the user's model seems to be: find inadequacies in your tests, which means you add new tests.
How feasible is it to have a good test suite for something like AspectJ? Because it seems like you keep adding more and more tests. Is it feasible to actually do that if your goal is successful and (inaudible) --
>> Andreas Zeller: Well, in the long run, if you believe in mutation testing, what you want to have is this: you apply 1,000 mutations and every single mutation is found by the test suite. That's very nice; then you have a very good test suite that finds all these artificially seeded bugs. You can of course increase the smartness of your mutations, at some point coming up with very subtle things that would not be detected; okay, then you would show that your test suite is also able to find very subtle bugs. But for us, what we saw from looking at the mutations that were not detected is that there are simply places in your program which are covered by the test suite, which are being executed, so far so good, but in which a single change can create lots and lots of impact throughout the program. This means that the test suite, although it executes this part, obviously does not check properly for the results, or is not accurate enough, or is not strong enough at this point. Because this is one thing: you can create a test which simply executes the code but doesn't care about the results, or which simply checks whether it crashes or not. But if the test suite does not check the results, you achieve perfect coverage, yet your test suite may still be of very bad quality. And this is something that would not happen with mutation testing.
>> Question: (Inaudible) -- a good test suite is even feasible for the whole (inaudible)?
>> Andreas Zeller: Is building a good test suite actually feasible? Well, here at Microsoft Research you have people who are working on exhaustive testing techniques, model-based testing in particular, people who are building really, really extensive tests. This is also something I'd like to verify: how resilient are these super-duper nice test generation systems when it comes to artificially seeded bugs? You will probably find that these techniques are very, very good at testing your program exhaustively, so it's actually feasible, yes. And large parts of this test generation can be automated. Also, you can apply mutation testing to any kind of automated quality assurance: if you have some fancy static analysis that finds these bugs, or even if you have a whole army of theorem provers and formal specifications and all that, you can apply the very same technique. You seed the bug in there, and if your theorem prover doesn't find it, oh, you obviously need to work on your axioms at this point.
>> Question: (Inaudible) -- feasible, right? Ship stuff whether (inaudible), but it's testing because -- or it's good enough testing.
>> Andreas Zeller: Uh-huh.
>> Question: But the kind of cool thing about this is, especially if you're working with a lot of (inaudible) tests, the previous measure was code coverage under test, and you have no idea if it actually (inaudible) product code, but you don't know if it's actually doing confirmation in (inaudible).
>> Andreas Zeller: Uh-huh.
>> Question: And a lot of the early code is actually just all APIs and be sure it doesn't (inaudible) --
>> Andreas Zeller: Yeah. There is (inaudible).
>> Question: You don't even check the (inaudible) --
>> Andreas Zeller: Uh-huh.
>> Question: The (inaudible), they get kind of (inaudible) --
>> Andreas Zeller: Uh-huh.
>> Question: And they are kind of useless, and in fact they cover up from the --
>> Andreas Zeller: That's the big advantage of mutation testing: it simply simulates real bugs and forces you to really improve your tests to catch at least the simulated bugs. This does not necessarily mean it will also catch real bugs, but if real bugs have any resemblance to the mutations that we apply here, yes, this should work. And even if you improve your test suite just to catch these very simple bugs, like changing a constant from zero to one, or adding plus one or minus one, there are so many bugs that come from off-by-one errors, for instance, that could be caught this way, and as you said, if a return value is not checked, this will be found by mutation testing, but not by traditional coverage metrics.
>> Question: The (inaudible) mutations you were interested in, how did you decide on what that set was, and was there any evidence, perhaps (inaudible), to say that the mutations being picked were aligned with bug types that actually occur?
>> Andreas Zeller: Yes. The mutations that we picked came from an earlier study by, let me see, this was Briand and Labiche and James Andrews, who did a study on that. They compared this small set of mutations with the real bug history of a program called Space, a medium-size C program which had a total of 20 bugs in the past. It has been used by the European Space Agency to navigate satellites across the globe, so it's a real program. And they found that this small set of mutations was equivalent, or equivalent enough, to actually be compared with the real bugs that the Space program had. The Space program is 20 years old at this time; it is not object-oriented, and it may well be that object-oriented programs have their own set of mistakes that people make. At this point we simply relied on what we found in the literature, because initially we started simply as users of this technique. But with the tool that we now have at hand, we can actually examine all these questions, because we can go and check whether a specific selection of mutation operators is better than another, or whether mutation testing as a whole gives you better results than traditional coverage metrics. In some way this is just the beginning, and at this point the design space is restricted to what is already there, what others did in the past. But we can of course explore it from now on. Yes?
>> Question: So this is a little bit unrelated. You mentioned that it took about 30 minutes --
>> Andreas Zeller: Yes.
>> Question: -- to determine whether a mutant was equivalent.
>> Andreas Zeller: Uh-huh.
>> Question: Did knowing the invariants and the impact, when that was used during the study, make that faster? Or did people actually use this new information in determining whether (inaudible) --
>> Andreas Zeller: Oh, that's an interesting point. The question is whether having these invariants would make it easier for people to decide whether a mutant is equivalent or not. In the study that we did, invariants played no role, because when we did the study, we didn't have any invariants yet.
It may well be that if you do have invariants, this eases program comprehension, in particular figuring out what an individual function actually should do, and then I would expect that this is faster. However, these 30 minutes are just an average. The larger your program becomes, the more tedious it becomes to examine this. And as in the quote of Phyllis Frankl, which we had in the beginning, even for a small program, figuring out whether a random change has an impact or not is incredibly difficult. See, if you're working with changes made by real people and you want to figure out what their impact is, these are changes that were made on purpose. There's some intelligence behind them, at least that is what we expect, some intelligence and some meaning, and you expect them to stay in line with at least the implied semantics of the program. If you apply random changes, you can forget all about that reasoning; instead you have to go and really figure out what this particular mutation would have an impact on. You have to follow the dependencies, you have to infer the implicit semantics of the function, which is where invariants actually help, because they make it more explicit, and you have to check every single occurrence all over the program. It's like making a random change to the Windows code at some point. Oops. What will it have an impact on, and where will this spread? If you don't have a means to measure this, even if you go with, say, static analysis and think about where it would have an impact and which other places you have to look into, it's still very likely that you will spend more than just ten minutes on that.
>> Tom Zimmerman: Any more questions? Then let's thank Andreas again. (applause)
>> Andreas Zeller: Thank you very much.