>>: Okay. So it's my great pleasure to introduce Wes Weimer, who's going to be -- who is visiting us with Stephanie Forrest. And Wes is going to talk today about automatically repairing bugs in programs. >> Westley Weimer: Excellent. Okay. Thank you for the introduction. So I'm going to be talking about automatically finding patches using genetic programming, or, to put it another way, automatically repairing programs using genetic programming. And this is joint work with Stephanie Forrest from the University of New Mexico and our grad students Claire and Vu. So in an incredible surprise move for Microsoft researchers, I will claim that software quality remains a key problem. And there's an oft-quoted statistic from the past that suggests that over one half of 1 percent of the U.S. GDP goes to software bugs each year. It would appear that there are people in the audience familiar with this statistic. And that programs tend to ship with known bugs, often hundreds or thousands of them that are triaged to be important but perhaps not worth fixing with the resources available. What we want to do is reduce the cost of debugging, reduce the cost associated with fixing bugs that have already been identified in programs. Previous research suggests that bug reports that are accompanied by an explanatory patch, even if it's tool created, even if it might not be perfect, are more likely to be addressed rapidly; that is, that it takes developers less time and effort to either apply them or change them around a bit and then apply them. So what we want to do is automate this patch generation for defects in essence to transform a program with a bug into a program without a bug but only by modifying small, relevant parts of the program. So the big thesis for this work is that we can automatically and efficiently repair certain classes of defects in off-the-shelf, unannotated legacy programs. We're going to be taking existing C programs like FTP servers, sendmail, Web servers, that sort of thing, and try to repair defects in them. The basic idea for how we're going to do this is to consider the program and then generate random variants, explore the search space of all programs close to that program until we find one that repairs the problem. Intuitively, though, this search space might be very large. So we're going to bring a number of insights to bear to sort of shrink this search space to focus our search. The first key part is that we're going to be using existing test cases, both to focus our search on a certain path through the program, but also to tell if we're doing a good job with automatically creating these variants. We'll assume that the program has a set of test cases, possibly a regression test suite, that encodes required functionality, that makes sure that the program is doing the right thing. And we'll use those test cases to evaluate potential variants. In essence, we prefer variants that pass more of these test cases. Then we're going to search by randomly perturbing parts of the program likely to contain the defects. You might have a large program, but we hypothesize that it's unlikely that every line in the program is equally likely to contribute to defect. We're going to try to narrow down, localize the fault, as it were, and only make modifications in specific areas. So how does this process work out? You give us as input the source code to your program, standard C program, as well as some regression tests that you currently pass that encode required behavior. 
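At a high level, the search described in the rest of the talk can be sketched as a small loop over candidate variants. The following is a minimal, self-contained sketch and not the actual tool: the mutation step is stubbed out, the variant is assumed to have been pretty-printed to a hypothetical file variant.c, and the test cases are assumed to be shell scripts test1.sh through testN.sh that exit 0 on success. Fitness is zero for variants that fail to compile and otherwise the number of test cases passed, as described later in the talk; the population size of 40 and the generation bound are illustrative numbers only.

    #include <stdio.h>
    #include <stdlib.h>

    #define NTESTS   6   /* e.g., five regression tests plus the one bug-reproducing test */
    #define MAXGENS 10   /* give up after this many generations (illustrative bound)      */
    #define POPSIZE 40   /* population of program variants, as mentioned later in the talk */

    /* Stub: write variant i out as variant.c (the real tool pretty-prints an AST
       after mutating or crossing over along the weighted path). */
    static void emit_variant(int i) { (void)i; }

    /* Fitness: 0 if the variant fails to compile; otherwise the number of test
       cases it passes.  Test scripts test1.sh .. testN.sh are hypothetical. */
    static int fitness(void) {
        if (system("gcc -o variant variant.c 2>/dev/null") != 0)
            return 0;
        int passed = 0;
        char cmd[64];
        for (int t = 1; t <= NTESTS; t++) {
            snprintf(cmd, sizeof cmd, "sh test%d.sh ./variant", t);
            if (system(cmd) == 0)
                passed++;
        }
        return passed;
    }

    int main(void) {
        for (int gen = 0; gen < MAXGENS; gen++) {
            for (int i = 0; i < POPSIZE; i++) {
                emit_variant(i);                /* mutate and pretty-print variant i       */
                if (fitness() == NTESTS) {      /* passes every test, including the bug    */
                    printf("candidate repair found in generation %d (variant %d)\n", gen, i);
                    return 0;                   /* diff against the original is the patch  */
                }
            }
            /* ...retain the fitter half of the population, apply crossover, continue... */
        }
        printf("no repair found within the resource bound\n");
        return 1;
    }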
So, for example, if your program is a Web server, these positive regression test cases might encode things like proper behavior when someone asks you to GET index.html. Basic stuff like that. In addition, you have to codify or quantify for us the bug that you want us to repair. So there's some input upon which the program doesn't do the right thing: it loops forever, it segfaults, it's vulnerable to some sort of security attack. And by taking a step back, we can actually view that as another test case but one that you don't currently pass. So there's some input. We have some expected output, but we don't currently give the expected output. We then loop for a while doing genetic programming work. We create random variants of the program. We run them on the test cases to see how close they are to what we're looking for. And we repeat this process over and over again, retaining variants that pass a lot of the test cases and possibly combining or mutating them together. At the end of the day, we either find a program variant that passes all of the test cases, at which point we win. We declare moral victory, we take a look at the new program source code and perhaps take a diff of it and the original program and that becomes our patch. Or we might give up, since this is a random or heuristic search process, after a certain number of iterations or after a certain amount of time and say we're unable to find a solution. We can't repair the program given these resource bounds. So for this talk I will speak ever so briefly about genetic programming, what it is, what those words mean, and then talk about one key insight that we use to limit the search space: our notion of the weighted path. And this is where we're going to look for the fault or the defect in the program. I'll give an example of a program that we might repair, an example perhaps near and dear to the hearts of people in this audience. And then I'll present two sets of experiments: one designed in general to show that we can repair many different programs with large and different classes of defects; and one to try to get at the quality of the repairs that we produce. Would anyone trust them? Are they like what humans would do? Might they cause you to lose transactions or miss important work if you were to implement our proposed patches? Sir? >>: Do you have any guarantee, or at least statistical guarantees, or any kind of guarantee, that your patch will not create additional problems? >> Westley Weimer: Right. So the question was do we have any guarantees at all that our patch will not create additional problems. At the lowest level, we only consider patches that pass all of the regression tests that you've provided for us. So if there's some important behavior that you want to make sure that we hold inviolate, you need only write a test case for it. And in some sense, coming up with an adequate set of test cases is an orthogonal software engineering problem that we won't address here, but I will acknowledge that it's difficult. However, it is entirely possible that a patch that we propose might be parlayed into a different kind of error. For example, in one of the programs that we repair, there's a non-buffer-overflow-based denial of service attack, where people can send malformed packets that aren't too long that cause the program to crash, to fail some sort of assertion. We might repair this by removing the assertion, but now the program might continue on and -- yeah, and do bad things.
So it's possible that the patches that we generate could introduce new security vulnerabilities. And we will get to this later in the talk. It's an excellent question, and you should be wary. Yes. >>: [inaudible] making assumptions [inaudible] the impact of the [inaudible] assume that it affects local state or global state [inaudible]? >> Westley Weimer: The question was do we make any assumptions about the impact of the bug on program state, local, global. And in general, no. I'll give you a feel for the kind of patches and the kind of defects that we're able to repair. Many of them, such as buffer overrun vulnerabilities, integer overflow attacks, deal with relatively local state. But nothing in the technique assumes or requires that. Good question. Other questions while we're here? Okay. So we'll do some experiments on repair quality which will potentially address some of these concerns but not all of them. And then there'll be a big finish. So genetic programming formally is the application of evolutionary or genetic algorithms to program source code. And whenever you talk about genetic algorithms, at least in my mind, there are three important sort of concepts you want to get across: You need to maintain a population of variants, and you need to modify these programs perhaps with operations such as crossover and mutation, and then you need to be able to evaluate the variants, to tell which ones are closer to what you're hoping for. And here for a sort of Microsoft Research localized example -- Rustan Leino wasn't able to make it to the talk, but I did get a chance to speak with him yesterday -- here are a population of variants as sort of four copies of Rustan Leino, and then we make small modifications. So, for example, some are standing up, some are sitting down, some have a pitcher of apple juice. And at the end of the day we have to tell which we prefer. So, for example, if we're very thirsty, we might prefer the Rustan with the apple juice. Right? We're going to do this except that we're going to fix programs instead. Many of you might be wary of the application of genetic programming or genetic algorithms or machine learning to programming languages or software engineering tasks. You might have seen these things done in a less-than-principled manner in the past. Another way to view our work is as search-based software engineering, where we're using the regression test cases to guide the search. They guide both where we make the modifications and also how we tell when we're done. >>: What is that picture [inaudible]? >> Westley Weimer: The picture says: What's in a name? That which we call a rose by any other name would smell as sweet. So it's a quote from Shakespeare about the naming of things and whether or not that matters. But the font is small and hard to read. Good question. So how does this end up working? In some sense there are two secret sauces that allow us to scale. Our main claim is that in a large program not every line is equally likely to contribute to the bug. And since we require as input indicative test cases, regression tests to make sure that we're on the right track, the insight is to actually run them and collect coverage information. So we might do something as simple as instrumenting the program to print out which lines it visits. And we hypothesize that the bug is more likely to be found on lines that are visited when you feed in the erroneous input, lines that are visited during sort of the execution of the negative test case. Yes.
>>: Surprised you have a [inaudible] statistical studies and he found that the bugs are statistically in those places which are especially common, [inaudible] explanation that those places [inaudible] and there's a lot of [inaudible] they work again and again [inaudible] visited. >>: Endless source of bugs. >> Westley Weimer: And so there was a comment or a suggestion that even lines of code that are visited many times might still contain bugs. And we're not going to rule out portions of the code that are visited along these positive test cases, but we might weight them less. And while Tom Ball is not here, he and many other researchers have looked into fault localization. And to some degree what I'm going to propose is a very coarse-grained fault localization. We might imagine using any sort of off-the-shelf fault localization technique. One of the reasons that we are perhaps not tempted to do so is that we want to be able to repair sins of omission. So if the problem with your program is that you've forgotten a call, you've forgotten a call to free or forgotten a call to sanitize, we still want to be able to repair that. But many but not all fault localization techniques tend to have trouble with bugs that aren't actually present in the code. It's hard to pin sins of omission down to a particular line. So we'll start with this and then later on I will talk about areas where fault localization or other statistical techniques might help us. Good question. Yes. >>: The third bullet point is puzzling me. So you say the bug is more likely to be found when you're running the failed test case, so there are also passing test cases? >> Westley Weimer: Right. >>: So if you don't -- I mean, if you don't run the code, then you... >> Westley Weimer: So we assume that you give us a number of regression test cases that you pass currently, and also one new -- we call it the negative test case -- that encodes the bug, some input on which you crash or get the wrong output. And these existing positive regression test cases are assumed to encode desired behavior. So we can only repair programs for which a test suite is available. But there might be some techniques for automatically generating test cases. Yes. >>: [inaudible] confused about the deployment scenario again. So is this meant to be something you deploy during debugging the code or you observe a failure in the field, you collect some [inaudible] and then you use this for [inaudible]. Which of these? >> Westley Weimer: All right. So the question was about the deployment scenario. And in some sense, my answer to this is going to be both. Our official story, especially at the beginning of the talk, is that we want to reduce the effort required for human developers to debug something. And we're going to present to them a candidate patch that they can look at that might reduce the amount of time it takes them to come up with a real patch. So we assume maybe the bug is in the database, there are some steps to reproduce it, you run our tool and five minutes later we give you a candidate patch. However, our second set of experiments will involve a deployment scenario where we remove humans from the loop and assume that some sort of long-running service has to stay up for multiple days, and there we want to detect, perhaps with anomaly or intrusion detection, faults as they arrive and then try to repair them. So we will address that second scenario later. Excellent question, though. Other questions while we're here? Okay. So we have the test cases.
We run them. We notice what parts of the program you visit. We claim that when we're running this erroneous input that causes the program to crash or get the wrong answer that it's likely that those lines have something to do with the bug. However, many programs have -- let's say a lot of initialization code at the beginning that's common, maybe you set up some matrix the same way every time or read some user configuration file. And that might be code that you visit both on this negative test case, when you're experiencing the bug, but also on all other positive test cases. We want to or we hypothesize that those lines are slightly less likely to be associated with the bug. In addition, as we're going around making modifications to the program, we might sometimes delete lines, which is relatively simple, but we might also be tempted to insert or replace code. And when we do, if you're going to insert a line into a C program, there are a large number of statements or expressions you might choose from. Infinite, in fact. One of the other reasons that we scale is we're going to restrict the search space here to code that's already present in the program. In essence, we hypothesize that the program contains the seeds of its own repair; that if the problem with your program is, for example, a null pointer dereference or an array out-of-bounds access, then probably somewhere else in your program you have a null check or you have an array bounds check. Following, for example, the intuition of Dawson Engler, bugs as deviant behavior, it's likely that the programmer gets it right many times but not every time. So rather than trying to invent new code, we're going to copy code from other parts of your program to try to fix the bug. And I will revisit that assumption later. So we wanted to narrow down the search space and our formalism for doing this is a weighted path. And, relatively simply, for us a weighted path is a list of statement and weight pairs where conceptually the weight represents the likelihood or the probability that that statement is related to the bug, is related to the defect. In all of our experiments, we use the following weighted path. The statements are those that are visited during the failed test case, when we're running the buggy input. The weights are either very high, if that's a statement that's only visited on the bug-inducing input and not on any other test case, and relatively low if it's also visited on the other passing test cases, if it's that matrix initialization code that's global to all runs of the program. Sir? >>: [inaudible] >> Westley Weimer: The question was do the weights only take these values, 1.0 and 0.1. For the experiments that I will be presenting today, yes. We have investigated other weights. So another approach, for example, would be to make this 0.1 a 0.0, to just throw out lines in the program that are also visited on the positive test cases. And for some runs of our tool, we tried different parameter values. In practice, these are the highest performing parameters that we found, but with a very small search. It seems that our algorithm is in some sense relatively insensitive to what these values actually are. >>: You call this a weighted path rather than a list. So what makes it a path? >> Westley Weimer: Ah. >>: [inaudible] connection between [inaudible]. >> Westley Weimer: The question was about the terminology. It's a list of statement and weight pairs. Why do I call it a weighted path?
And the reason is that it's the path through the program, potentially around loops or through if statements, visited when we execute the failed test case. So it corresponds to a path through the program. But when we throw away the abstract syntax tree, a path is just a list of steps that you took forward and then left and whatnot. Does this answer your question? Yes. >>: So is a failed test as simple as the assignments to the inputs, or do you also -- does the user also specify why this test case fails, so somehow [inaudible] should have some [inaudible]? >> Westley Weimer: Right. So the question is what is a test case. And I will get to this at the end and show you a concrete example of a test case from ours. But officially we follow the oracle comparator model from software engineering where a test case is an input that you feed to the program and it produces some output, but you also have an oracle somewhere that tells you what the right output should have been. And in regression testing the oracle output is often running the same input on the last version. Then there's also a comparator, which could be something very strict, like diff for comparison, which would notice if even one byte were off. But for something like graphics, for example, your comparator might be more fuzzy. You might say, oh, if part of the picture is blurry, that doesn't matter as long as the sum of the squares of the distance is less than blah, blah, blah. For all of the test cases that I'll present experiments on, we followed a very simple regression test methodology. We ran a sort of vanilla or correct version of the program and wrote down the correct output, all of the output that the program produced. And in order to pass the test case, you had to give exactly that; no more, no less. But in theory, if you wanted to only care about certain output variables, our system supports that. >>: [inaudible] how many times a [inaudible]. >> Westley Weimer: Right. So the question is do we care how many times a statement is executed, is this really a list, or is it actually a set. And in our implementation, it's actually a list. If you go around a loop multiple times, we will record that you visited those statements more often than before, so we might be more likely to look there when we're performing modifications. I'm not going to talk about our -- I will skip over our crossover operator, although I will be happy to answer questions about it in this talk. But, in essence, it requires sort of an ordered list and potentially might care about duplicates. We have investigated, and there are options available for our tool, for our prototype implementation of this, treating this list as a set, not considering duplicates. The advantage of a set-based approach is that it reduces the search space. The disadvantage is that we might be throwing away frequency information about which statements are actually visited more, so the question is do you want sort of a larger search space but with a pointer in the right direction, or do you want a smaller search space that you're choosing from at random. >>: [inaudible] >> Westley Weimer: And we could implement this using a multiset, yes. But the actual sort of data structure bookkeeping of our algorithm ends up not being the dominant cost of this. And I'll get into that in a bit. Yes. >>: [inaudible] >> Westley Weimer: Ah. So initially -- so the question was this seems to only care about the passing test cases.
But the set of statements that we consider are only those visited during failed test cases. So you can imagine that statements visited during neither passed nor failed test cases implicitly get a weight of zero. And then of those statements during failed test cases, if they're only visited on the failed test case, they get high probability, otherwise they get low. So both the positive test cases and the negative test cases matter, and I have a graphical example to show in a bit. So once we've introduced the weighted path, I can now go back and fit everything we're doing into a genetic programming framework. I need to tell you what our population of variants is. And for this every variant program that we produce is a pair of a standard abstract syntax tree and also a weighted path. And the weighted path, again, focuses our mutations or our program modifications, tells us where we're going to make the changes. Once we have a bunch of these variants, we might want to modify or mutate them. And to mutate a variant, we randomly choose a statement from the weighted path only. Biased by the weights. So we're more likely to pick a statement with a high weight. Once we've chosen that statement, we might delete it, we might replace it with another statement, or we might insert another statement before or after it. When we come up with these new statements, rather than making them up out of whole cloth, we copy them from other parts of the program. This reduces the search space and helps us to scale. And, again, this assumes that the program contains the seeds of its own repair; that there's a null check somewhere else that we can bring over to help us out. We have a crossover operator that I'm not going to describe but will happily answer questions about. And in some sense it's a relatively traditional one-point crossover. Once we have come up with our variants, we want to evaluate them: are they any good, are they passing the test cases. We use the regression test cases and the buggy test case as our fitness function. We have our abstract syntax tree. We prettyprint it out to a C program, pass it to GCC. If it doesn't compile, we give it a fitness of zero. Now, since we're working on the abstract syntax tree, we're never going to get unbalanced parentheses. We'll never have syntax errors. But we might copy the use of a variable outside the scope of its definition, so we might fail to type check. If the program fails to compile for whatever reason, we give it a fitness of zero. Otherwise, we simply run it on the test cases and its fitness is the number or fraction of test cases that it passes. We might do something like weighting the buggy test case more than all of the regression tests, if there are many regression tests but only one way to reproduce the bug. But in general you can imagine just counting the number of test cases passed. Once we do that, we know the fitness of every variant. We want to retain higher-fitness variants into the next generation; those might be crossed over with others. And we repeat this process until we find a solution, until we find a variant that passes every test case or until enough time has passed and we give up. >>: [inaudible] failed to compile if you -- if the variables [inaudible]. >> Westley Weimer: Yes. >>: But it's also possible that the variable is redefined [inaudible] the same name or maybe it's [inaudible] some other object in the midterm [inaudible]. >> Westley Weimer: Yes. >>: You consider those cases -- >> Westley Weimer: No.
>>: [inaudible] [laughter] >> Stephanie Forrest: We just let them happen. >> Westley Weimer: We just let them happen. And presumably -- so let's imagine. We move a use of a variable into a scope that also uses a different variable with the same name. So now we're going to get in essence random behavior. Hopefully we'll fail a bunch of the test cases so we won't keep that variant around. On the other hand, that random behavior might be just what we need to fix the bug. Unlikely, but possible. So we'll consider it, but probably it won't pass the test cases, so it won't survive until the next generation. So going to give a concrete example of how this works using idealized source code from the Microsoft Zune media player. And many of you are perhaps familiar with this code or code like it. December 31st, 2008, some news sources say millions of these Zune players froze up and in many reports the calculation that led to the freeze was something like this. We have a method that's interested in converting a raw count of days into what the current year is, and in essence this is a while loop that subtracts 365 from the days and adds one to the years over and over again, but has some extra processing for handling leap years, where the day count is slightly different. To see what might go wrong with this, the infinite loop that causes the players to lock up, imagine that we come in and days is exactly 366. We're going to go into the while loop. Let's say the current year, 1980, is a leap year. If days is greater than 366 -- well, it's not greater than, it's exactly equal to. So instead we'll go into this empty else branch, which I've sort of drawn out for expository purposes. Nothing happens, we go around the loop, and we'll go around the loop over and over again, never incrementing or decrementing variables; we'll loop forever. We want to repair this defect automatically. So our first step is to convert it to an abstract syntax tree. And it should look similar to before: assignments here, we have this big while loop. And what I want to do is figure out which parts of this abstract syntax tree are likely to be associated with the bug. We want to find out weighted path. So our first step is to run the program on some bug-inducing input, like 10,593 which corresponds to December 31st, 2008. And I've highlighted in pink all of the nodes in the abstract syntax tree that we visit on this negative test case. We loop forever, so notably we never get to the printf at the end. We visit all of the nodes except the last one. So this is some information but not a whole lot. But now I also want to remove or weight less nodes that are visited on positive passing regression test cases. So let's imagine that we know what the right answer is supposed to be when days equal 1,000. We have a regression test for it. I rerun the program and also mark -- and here I'm marking in green -- all of the nodes that were visited on positive test cases. And in some sense, the difference between them is going to be my weighted path. I'm going to largely restrict attention to just this node down here, which is visited on negative test case, but on no positive test cases. So here's my weighted path. Once I have the weighted path, I might randomly apply mutation operations: insertions, deletions, replacements. But only to nodes along the weighted path. Let's imagine I get relatively lucky the first time. I choose to do an insertion. Where am I going to insert? I'm more or less stuck inserting here. 
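In C, the idealized loop being described looks roughly like the following reconstruction (variable and function names are guesses, not the shipped firmware). The comments mark the weighted path that was just computed: only the empty else branch is visited on the negative test case but on no positive test case, so it carries the high weight, and it is where the eventual repair, an inserted copy of days -= 366, will land.

    #include <stdio.h>

    /* Reconstruction of the idealized Zune date code from the talk, not the
       shipped source.  Weights shown follow the 1.0 / 0.1 scheme described later. */
    static int is_leap_year(int y) {               /* standard Gregorian rule */
        return (y % 4 == 0 && y % 100 != 0) || y % 400 == 0;
    }

    int days_to_year(int days) {
        int year = 1980;                           /* weight 0.1: visited on all tests */
        while (days > 365) {                       /* weight 0.1                       */
            if (is_leap_year(year)) {              /* weight 0.1                       */
                if (days > 366) {                  /* weight 0.1                       */
                    days -= 366;                   /* weight 0.1                       */
                    year += 1;                     /* weight 0.1                       */
                } else {
                    /* weight 1.0: only the negative test case (days == 366 here,
                       e.g. the input 10,593 for December 31st, 2008) reaches this
                       empty branch, so nothing changes and the loop never ends.
                       The generated repair inserts a copy of "days -= 366;" here. */
                }
            } else {
                days -= 365;                       /* weight 0.1                       */
                year += 1;                         /* weight 0.1                       */
            }
        }
        printf("current year is %d\n", year);      /* never reached on the negative test */
        return year;
    }

    int main(void) {
        days_to_year(1000);    /* a passing regression input; 10,593 would hang here */
        return 0;
    }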
And rather than inventing code whole cloth, I'm going to copy a statement from somewhere else in the program. So, for example, I might copy this 'days -= 366' statement and insert it into this block. Once I do, the final code ends up looking somewhat like this. I now have a copy of that node. I've changed the program. Slightly different abstract syntax tree. And this ends up being our final repair. It passes all of our previous test cases and it doesn't loop forever on the negative test case. But it might not be what a human would have done. And it might not be correct. Maybe our negative test case wasn't correct. So I'm going to talk in a bit about repair quality. Yes. >>: [inaudible] with exception handling. Suppose you're in an exception handler and those [inaudible] negative test case. Essentially you'd be trying to add code to the exception handler to make it repair the bug. But the bug might be somewhere else in the program. >> Westley Weimer: So okay. So this is a good point. For the implementation that we have here, we've been concentrating on off-the-shelf C programs where sort of standard C exception handling is not an option. We have previously published work for repairing vulnerabilities if a specification is given. And in that work, we've specifically described how to add or remove exception handling -- either try-catch or try-finally statements -- in order to fix up typestate problems, for example. If we were going to move this to, say, Java or some other language with exception handling, we would probably try to carry over some of those ideas. So we haven't implemented it yet. >>: [inaudible] >> Westley Weimer: Currently our implementation is C. There are no major problems preventing us from moving to other languages. But there are corner cases we would want to think about, such as exception handling. So, for example, if we were moving to a language with exception handling, we might want to consider adding syntax like try or catch or finally as atomic mutation actions. More on that in a bit. So before we get to repair quality, just sort of a brief show of how this happened over time. On this graph the dark bar shows the fitness we assign to our -- the average over our overall population of variants as time or as generations go by. And in this particular version, we had five normal regression test cases that were worth one point each, and we had two test cases that reproduced the bug, and passing those -- that is, not looping forever, getting the right answer -- was worth ten points each. And we start off at five. We pass all of the normal regression tests, but none of the buggy tests. After one generation, our score is now up at 10 or 11. We pass one of the buggy tests but have lost key functionality. We're no longer passing some of the regression tests. We're 12 instead of 15. And our favorite variant, the variant that eventually goes on to become the solution, stays relatively stable for a while. The others sort of generally increase -- until eventually either it's mutated or it combines part of its solution with something else that's doing relatively well. And in this seventh generation, it passes all of the regression test cases as well as the two bugs. So we think about this for a while. This is an iterative process, and then hopefully at the end we pass all the test cases. Yes. >>: If something doesn't compile you said [inaudible] zero, right? >> Westley Weimer: Yes.
>>: Is it possible that you generate a lot of stuff that has fitness zero to find one thing that actually compiles? >> Westley Weimer: That's a great question. And in fact we have -- we have these numbers, although -- >>: [inaudible] >> Westley Weimer: Let us say that less than half of the variants that we generate fail to compile. And I can get you the actual numbers right after this presentation. >>: But if you're going to operate on, say, source code as opposed to [inaudible]. >> Westley Weimer: If we were to operate on source code and it was possible that we could make mistakes with balanced parentheses or curly braces, then it would be a much higher fraction. And we'll return to that in a bit. >>: [inaudible] >> Westley Weimer: So we maintain a population of variants. Let's say 40 random versions of the same program. And we will mutate all of them and run them all on the test cases and then keep the top half, say, to form the next generation. And the average here is the average fitness of all 40 members of that generation, that population. >>: [inaudible] remains flat for [inaudible]. >> Westley Weimer: So I may defer some of this to the local expert, but my answer would go along these lines. Genetic programming or genetic algorithms are heuristic search strategies, somewhat like simulated annealing, that you would use to explore a large space. And one of the reasons you would be tempted to use something like this instead of, say, Newton's method is that you want to avoid the problems associated with hill climbing. You want to avoid getting stuck in a local maximum. And the crossover operation in genetic programming or genetic algorithms, which I haven't talked about, holds out the possibility of combining information from one variant that's stuck in a local hill over here, let's say in the second part of the path, with another variant that's stuck in a hill over here, say in the first part of the path. We might take those two things and merge them together and suddenly jump up and break out of the problems associated with local hill climbing. So the hope is that this is the sort of thing we would get out of using this particular heuristic search strategy. But I will defer to the local expert. >> Stephanie Forrest: [inaudible] you can see that there's still something happening in the population because the average is moving. So even though that individual was -- it was good enough to keep being copied unchanged into the next generation, but it was so obviously not the best individual left in the population. But then in the end [inaudible]. >>: [inaudible] nothing moved. >> Stephanie Forrest: You know, in the end, it was the individual that led to the winner. >> Westley Weimer: And the question was between three and four nothing moved. Between three and four, the averages didn't move. But the population size is, let's say, 40 or 80. >>: [inaudible] best individual [inaudible] how can the average be bigger than 11? >> Westley Weimer: So I must apologize for the misleading label here. In our terminology, the best individual is the one that eventually forms the repair. >>: [inaudible] >> Westley Weimer: Eventually. So we're tracking a single person or a single individual from the dawn of time to the end. >> Stephanie Forrest: So if you know about genetic algorithms, that's usually really hard to do -- to trace the evolution of a single individual. But it's an artifact of how we do our cost of operating [inaudible]. >> Westley Weimer: Yes. >>: How long did it take in seconds?
>> Westley Weimer: For this particular repair? Let's say 300 seconds. Three or four slides from now, I will -- I will present quantitative results. On average, for any of the 15 programs that we can repair, the average time to repair is 600 seconds. So once we do this, we apply mutation for a while. We come up with a bunch of variants, we evaluate them by running them on the test cases, and eventually we find one that passes all the test cases. This is our candidate for the repair. Our patch that we might present to developers is simply the diff between the original program and the variant that passes all the test cases. We just prettyprint it out and take the diff. However, random mutation may have added unneeded statements or may have changed behavior that's not captured in any of your test cases. We might have inserted dead code or redundant computation. And we don't want to waste developer time with that. The purpose of this was to give them something good to look at. So in essence we're going to take a look at our diff, or our patch, which is sort of a list of things to do to the program, and consider removing every line from the patch. If we did all of the patch except line 3, would that still be enough to make us pass all of the test cases? If so, then we don't need line 3. Then line 3 was redundant. And you can imagine trying all combinations, but that might be computationally infeasible or take exponential time. Instead we use the well-known delta debugging approach to find a 1-minimal subset of the diff in N squared time; where 1-minimal means that removing even a single line of the patch would cause us to no longer pass all of the test cases. And once again, rather than working at the raw concrete syntax level, rather than using standard diff, we use a tree-structured diff algorithm, diffX in particular, to avoid problems with balanced curly braces where removing one would cause the patch to be invalid and not compile. >>: [inaudible] I mean every line of the program rather than just the differences of the candidate with the original program? Because that seems to be [inaudible] look at lines that are actually the original program. >> Westley Weimer: Um... >>: You might now remove stuff that's -- happens to be not necessary to [inaudible]. >> Westley Weimer: Ah. >>: [inaudible] was in the original program. >> Westley Weimer: Okay. So this is a good question. Let me see if I can give an example of this. Let's say that the original program is A, B, C. And the fix is A, B, D. Something like that. Then our first patch will be something like delete C and insert D at the appropriate location. There are two parts of this patch. These are the only things that we're going to be considering when we minimize the set. So when we talk about minimizing the patch, the possibilities that we're considering are 1 alone; 1, 2; 2; or none of these. So we're -- we're never going to remove A. So we're not going to mistakenly hurt the original program. It's only among the changes that were made by the genetic algorithm. We ask: did we actually need all of those? >>: [inaudible] lines in the patch other than the performance over here. What's the [inaudible]? >> Westley Weimer: So this is a great idea. It's a great question. And we have a number of ideas related to it. The question was why are we trying to minimize the size of the patch and not some other objective. Our initial formulation of this was we wanted to present it to developers.
And often when developers make patches to working code, you want to do the least harm. You want to touch as little as possible. However, we have been considering applying this technique to other domains. For example, let's imagine that you're a graphics person interested in pixel or vertex shaders which are loaded onto a graphics processor. Those are small C programs. Rather than trying to repair them, might we try to use genetic programming or the same sort of mutation to create an optimized version of a vertex or pixel shader that is faster but perhaps produces slightly blurrier output. But since it will be viewed by a human, we might not care. So here our test cases would be not an exact match but instead the L2 norm could be off by 10 percent. And rather than favoring the smallness of the change, we would favor the performance of the resulting program. So that is a good idea. We're certainly considering things like that. For now, for the basic presentation, we want to get to the essence of the defect to help the developer repair it. >>: But even outside of the domain of graphics, say, the smallness of the change is not necessarily representative of how impactful it might be. So suppose you change function foo that is called on dozens of paths whereas function bar is only called on one, something like that. >> Westley Weimer: Yes. >>: And where you would favor [inaudible]. >> Westley Weimer: Right. So the question was a smaller change might influence a method that's called more frequently. Also at ICSE one of my students will be presenting work on statically predicting or statically guessing the relative dynamic execution frequency of paths. One of the other areas we've been considering for this is rather than breaking ties in terms of size, break ties in terms of which change is less likely to be executed in the real world. So we might have a static way of detecting that, or we might use the dynamic counts that we have from your existing regression test cases, assuming they're indicative, and use those to break the tie instead. Also a great idea. Other questions while we're here? So does it work? Yes. So here we have initial results. All in all, we've been able to repair about 15 programs totalling about 140,000 lines of code, off-the-shelf C programs. On the left I have the names of the programs. Some of them may well be familiar. Wu-ftpd is a favorite whipping boy for its many vulnerabilities. The php interpreter, a graphical Tetris game, and flex, indent and deroff, which are standard sort of UNIX utilities. Lighttpd and nullhttpd are Web servers. And openldap is a directory access protocol implementation. In the next column we've got the lines of code. And for a number of these -- openldap, lighttpd or php -- it's worth noting that we are not performing a whole-program analysis. For these, we want to show that we can repair just a single module out of an entire program. Openldap is relatively large. We're only going to concentrate on modifying io.c; perhaps someone has already been able to localize the error there. For the rest of these, the graphical Tetris game, the FTP server, we perform a whole-program analysis. We merge the entire program together into one conceptual file and do the repair from there. So here's the lines of code that we operated on. The next column over shows something that's perhaps more relevant to us, the size of the weighted path. This can be viewed as a proxy for the size of the search space that we have to look through in order to potentially find the repair.
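Concretely, choosing a mutation site "biased by the weights," as described earlier, can be implemented as simple roulette-wheel selection over the weighted path. A minimal sketch, using the 1.0 and 0.1 weights mentioned before; the real tool's selection details may differ:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Pick an index into the weighted path with probability proportional to its
       weight (1.0 for statements seen only on the negative test case, 0.1 for
       statements also seen on positive test cases). */
    static int pick_mutation_site(const double w[], int n) {
        double total = 0.0;
        for (int i = 0; i < n; i++)
            total += w[i];
        double r = ((double)rand() / RAND_MAX) * total;
        for (int i = 0; i < n; i++) {
            r -= w[i];
            if (r <= 0.0)
                return i;
        }
        return n - 1;                       /* guard against floating-point round-off */
    }

    int main(void) {
        /* A toy weighted path: statement 3 is the only one not on any positive test. */
        double path[] = { 0.1, 0.1, 0.1, 1.0, 0.1 };
        srand((unsigned)time(NULL));
        printf("mutate statement %d\n", pick_mutation_site(path, 5));
        return 0;
    }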
Recall that we only perform mutations, insertions, and deletions along this weighted path. As we go through and create variants, we have to evaluate them. We apply our fitness function. We run all of the test cases against each variant to see how well it's doing. This ends up being the dominant performance cost in our approach. Running your entire test suite 55 times might actually take a significant amount of wall clock time. Although it can be parallelized at a number of levels. So in this column we report the number of times we would have to run your entire test suite before we came up with this repair. And finally on the right, I have a description of the sorts of defects that we're repairing. Initially we started off by reproducing results from the original fuzz testing work which was on UNIX utilities. You feed them garbage input, they crash. So for many of these, like deroff or indent, we found a number of segmentation violations or infinite loops that we were able to repair. Then we moved more towards security. We went back through old security mailing lists, Bugtraq, that kind of thing, and found exploits that people had posted, say, against Web servers or against openldap. And for these sorts of things, php, atris, lighttpd, openldap, the negative test case is a published exploit by some black hat, and then we try to repair the program until it's immune to that. So defects include things as varied as format string vulnerabilities, integer overflows, stack and heap buffer overflows, segmentation violations, infinite loops, non-overflow denials of service. Yes. >>: I was just going to ask what [inaudible] in each case? >> Westley Weimer: Ah. This is a good question. So for many of these we were able to get away with relatively small test suites. Let's say five or six test cases that encode required behavior. However, in many cases, we expect there to be more test cases available rather than less. So I'll return to test case prioritization or test case reduction or our scaling behavior as test cases increase in just a bit. For right now we're only using five or six test cases, and we're still able to come up with decent repairs. >>: [inaudible] 15 programs -- >> Westley Weimer: Yes. >>: -- out of how many you're trying? >> Westley Weimer: Of the first 15 we tried, in some sense, everything we tried succeeded. Although recently we have found a number of programs that we're unable to succeed at. >>: [inaudible] >> Westley Weimer: We might be unable to succeed at. >>: You saw the test that the new [inaudible] still able to use the [inaudible]. >> Westley Weimer: This is such a great idea that it will occur two slides from now. We will test in just a bit whether or not we can use the software after applying the repair. It's an excellent idea. Yes. >>: How do we read the fitness numbers? >> Westley Weimer: How do we read the fitness numbers? This is the number of times I had to run your entire test case, your entire test suite, before I came up with the repair. So, for example, to fix the format string vulnerability in wu-ftpd, I had to run its test suite 55 times. >>: So are there lessons to be learned from looking [inaudible] which is [inaudible]? >> Westley Weimer: Yes. Yes. There are such lessons. I will get to them on the next slide and we'll talk about scalability in just a bit. Yes. >>: So what was the [inaudible] size of the patch for each of these? >> Westley Weimer: After minimization, the average patch size was four to five lines.
So we tended to be able to fix things or we tended to produce fixes that were relatively local. And I'll talk about that qualitatively in a bit. Often our repairs were not what a human would have done. >>: [inaudible] easy to take a negative case and make it not [inaudible] if you change, say -- make small changes [inaudible] chances are the black hat [inaudible]. So I'm not sure -- so what is the metric for success? Is it the continued developability or just the attack? >> Westley Weimer: All right. So the question was for a number of security attacks, they're relatively fragile and any small change might defeat them. To some degree, great. If a small change does defeat it. Less facetiously, for many of these -- for openldap or for php, for example -- we searched around until we found an exploit that was relatively sophisticated. For example, exploits that did automatic searching for stack offsets and layouts so that even if we changed the size of the stack they would still succeed. For the format string vulnerability one we used an exploit that searched dynamically for legitimate format strings, often finding ones that were 300 characters long or whatnot. So the relatively simple changes where you add a single character and now the executable is one byte larger and it's immune, we tried to come up with negative test cases that would see through that. Another way to approach that is we also manually inspected -- and we describe this in the ICSE paper -- all of the resulting repairs and are able to tell whether or not a similar attack would succeed. And in general what actually happens is we do defeat the attack, but only there. Let's imagine that you have a buffer overrun vulnerability rampant throughout your code, and your negative test case shows one particular read that can take over. We'll fix that one read, but not the three others. >>: [inaudible] trivial solution to the problem [inaudible] it's just if you -- they are so [inaudible] you do one [inaudible] you don't want, just block those and that's it. >> Westley Weimer: Right. >>: So such -- not only it's faster, it's 100 percent precise with respect to that specific test suite. And second if you know [inaudible] is the passing test, you [inaudible]. >> Westley Weimer: Right. >>: What's wrong with that? >> Westley Weimer: So Patrice [phonetic] is describing fast signature generation, an approach that we find orthogonal to ours. There is nothing wrong with signature generation. We like it a lot. In fact, we want to encourage you to use it to buy us time to come up with the real repair. So one of the tricks with fast and precise signature generation is that you actually often can't get all of those adjectives at the same time. If you are just blocking that particular input, then the black hat changes one character and they get past you or whatnot. If you're trying to block a large class of inputs, then often you're using regular expressions or some such at the network level and you also block legitimate traffic. We're going to end up evaluating our technique using the same sorts of metrics you would use to evaluate signature generation; to wit, the amount of time it takes us to do it and the number of transactions lost later. And you'll be able to compare for yourself quantitatively whether or not we've done a good job. But we do want to use signature generation. It's orthogonal. [inaudible]. >>: So how many negative test cases do you have for each of these examples? >> Westley Weimer: One each. >>: Okay. 
So one way to measure the robustness of your technique would be the following: You start with more than one negative test case [inaudible] technique [inaudible] one negative test case at a time. So you see if [inaudible] and would all of them end up with the same patch. >> Westley Weimer: This is a good idea. Officially we can actually support multiple negative test cases. We just keep going until you pass every test case. So having multiple there at the same time doesn't hurt us. But in some sense we might -- we have no reason to believe that our operation is distributive. It might be easier for us to fix two things alone than to fix them both at the same time. Your robustness measurement is intriguing. We should talk after the presentation. >>: [inaudible] negative test and positive test was quite different. So what was the negative test is going to become a positive test when you consider the next negative test. >> Westley Weimer: Yes. We -- you are correct. Yes. Other questions while we're here? So in general this worked out relatively well. For us, rather than the lines of code in the program, the size of the weighted path ended up being a big determining factor in how long it took to come up with a repair. Question from the front? No. >>: [inaudible] >> Westley Weimer: I should wrap up now? [multiple people speaking at once] >> Stephanie Forrest: No more questions till the end. >> Westley Weimer: Okay. So our scalability is largely related to the length of this weighted path. The length of the weighted path is sort of a proxy for the search space that we have to explore. Here we've plotted a number of things that we can repair: note the log-log scale. But in general this is one of the areas where we might use some sort of off-the-shelf fault localization technique from Tom Ball or whatnot for the cases that have particularly long weighted paths. >> Stephanie Forrest: [inaudible] >> Westley Weimer: Certainly. >> Stephanie Forrest: So if you measure the slope of that line, what you find out is that it's less than quadratic and a little more than linear in the length of the weighted path for a small number of datasets. >> Westley Weimer: So in general the repairs that we come up with are not what a human would have done. For example, I talked about this before, we might add bounds checks to one area in your code rather than rewriting the program to use a safe abstract string class, which is what humans might do when presented with evidence of the bug in one area. It is worth noting that any proposed repair we come up with must pass all of the regression tests that you give us. One way to view this: for one of the things that we repair, this Web server, the vulnerability is that it trusts the content length provided by the user. The user can provide a large negative number when posting information, when uploading images or whatnot, allowing the remote user to take control of the Web server. If we don't include any regression tests for POST functionality, the repair that we come up with, very fast, is to disable POST functionality. We just remove that. You're no longer allowed to do that. Which in some sense might actually be better than the standard approach of running the server in read-only mode because in read-only mode you could still get access to confidential information. But in general we require that the user provide us with a sufficiently rich test suite to ensure all functionality. And that's an orthogonal problem in software engineering.
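The minimization step described a few minutes ago can be sketched as follows: treat the candidate patch as a list of tree edits and repeatedly drop any single edit whose removal still lets every test case pass. The result is 1-minimal, and the worst case is on the order of N squared test-suite runs, consistent with the delta debugging bound mentioned before. The passes_all_tests predicate below is a toy stand-in for "apply the kept edits, compile, and run the whole suite"; here it simply pretends that edits 0 and 2 are the essential ones.

    #include <stdio.h>

    /* Toy oracle standing in for "apply the edits whose keep[] flag is set,
       compile the result, and run every test case".  For demonstration we
       pretend that edits 0 and 2 are the ones actually needed to pass. */
    static int passes_all_tests(const int keep[], int nedits) {
        (void)nedits;
        return keep[0] && keep[2];
    }

    /* Drop redundant edits until no single remaining edit can be removed. */
    static void minimize(int keep[], int nedits) {
        int changed = 1;
        while (changed) {
            changed = 0;
            for (int i = 0; i < nedits; i++) {
                if (!keep[i])
                    continue;
                keep[i] = 0;                      /* tentatively drop edit i          */
                if (passes_all_tests(keep, nedits))
                    changed = 1;                  /* it was redundant: leave it out   */
                else
                    keep[i] = 1;                  /* it was needed: put it back       */
            }
        }
    }

    int main(void) {
        int keep[] = { 1, 1, 1, 1, 1 };           /* start with the full 5-edit patch */
        minimize(keep, 5);
        for (int i = 0; i < 5; i++)
            printf("edit %d: %s\n", i, keep[i] ? "kept" : "dropped");
        return 0;
    }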
Our minimization process in some sense also prevents gratuitous deletions [inaudible] and, not shown here, adding more test cases, say moving up to 20 or a hundred, helps us rather than hurting us. So we still succeed about the same fraction of the time, if there are more test cases. So in brief limited time, you might also want to look at repair quality another way. Imagine you're in some sort of e-commerce security setting like Amazon.com or pets.com and you're selling dogs to people over the Internet. From your perspective, a high-quality repair might be one that blocks security attacks while still allowing you to sell as many dogs as possible without reducing transactional throughput. So to test this, to test our repair in sort of a self-healing setting, we integrated with an anomaly intrusion detection system. The basic idea here is we throw an indicative workload at you, and when the anomaly detection system flags a request as possibly an attack, we set it aside and treat it as the buggy test case and try to repair the program so that it's not vulnerable to that. The danger of this is that we can do this repair without humans in the loop. And -- >>: [inaudible] >> Westley Weimer: What? >>: [inaudible] >> Westley Weimer: Yes, just dogs in the loop. So let's see how that actually goes. Our experimental setup was to obtain indicative workloads. We then apply the workload to a vanilla server sped up so that the server is brought to its knees. And the reason we're doing this is so that any additional overhead introduced by the repair process or the patch shows up on measurement. We then send some known attack packet. The intrusion detection system flags it. We take the server down during the repair, measure how many packets we lose then. We apply the repair and then we send the workload again measuring the throughput after we apply the repair, which is sort of equivalent to measuring the precision of a filter. Does this work? In fact, it worked relatively well. We started out with workloads of about 140,000 requests for two different Web servers and also we considered a dynamic Web application written in php that ran on top of the Web servers as a sort of room and item reservation system. Each row here is individually normalized so that the vanilla version, before the repair, is considered to have 100 percent performance; however many requests it's able to serve is the right amount. We send the indicative workload once. And then we send the attack packet and try to come up with a repair for that attack. In each case, the attacks were again off-the-shelf security vulnerabilities from Bugtraq. We were able to generate one of these repairs. One of the first questions is how long did it take, are we as fast as fast signature generation. And assuming that there's one such attack per day, in general we lose between let's say zero and 3 percent of your transactional throughput if you take the server down to do the repair. In general, you might imagine having a failover or backup server in read-only mode, say, that might handle requests during this time. So for us the more important number is what happens after we deploy the repair. Might we deploy low-quality repairs that cause us to fail to sell dogs? And in essence the answer is no. Any transactions or any packets that we lost after applying the repair were totally lost in the noise. And our definition of success was relatively stringent.
Success here means not only that we got the output byte for byte correct, but also that we did it in the same amount of time or less than it took the vanilla unmodified program. And the reason for this, the interpretation, is that our positive test cases, the regression tests, prevent us from sacrificing key functionality. There was a question. >>: I use experiment a lot. I wonder if you've considered comparing the [inaudible] reduction with [inaudible]. >> Westley Weimer: This is a great idea. We had not considered it. But we will now. >>: I think you will do quite a bit better. >> Westley Weimer: The question was could we compare our throughput loss to that of a Web application firewall like -- and then you named one in particular. >>: [inaudible] of the requests [inaudible]. >> Westley Weimer: Yes, yes. The outputs -- we only count it as a successful request if it's exactly the same. >>: [inaudible] after you'd gone back [inaudible]. >> Westley Weimer: Yes. So yes. I will get to this in let's say a minute. It's a great question. Whenever we start using an intrusion or anomaly detection system, false positives become a fact of life. An intrusion detection system might mistakenly point out some innocuous request as if it were an error. So a second set of experiments we did here was let's take some false positives, let's take some requests like GET index.html that are not actually attacks, pretend they are attacks and try to repair them and see what happens. Maybe then we come up with a low-quality repair. So we did that three times. We took completely legitimate requests, considered them to be attacks, and tried to repair them. In two of the cases we were actually able to make some sort of repair, and I'll quantify for you in just a second what that might mean. In general it takes us a long time to come up with repairs for things that aren't actually problems because we have to search through the space until we find some sort of corner-case way of, in essence, defeating this particular input but allowing all others to get through. In fact, in our third false positive, it was in essence GET index.html, and there was no way to both fail this and also pass it as one of the regression tests. So we couldn't make a repair there. But even for these two times where we did come up with repairs that in some sense weren't actually valid or repaired non-bugs, any transactional throughput loss that we might have suffered at the end was in the noise. Because, again, we have these sort of positive regression test cases that make us keep important behavior around. What are examples of packets that we might lose? For one of these, the repair involved in some cases dropping the Cache-Control tag from the HTTP response header. Previously, yes, you were allowed to cache it; with our bad repair, in some of these cases, no, you weren't allowed to cache it. So maybe then you have to make extra requests of the server in order to get the same data, so you don't get everything done in time, so you drop a few packets there. >>: Can you clarify the -- seems to me you said earlier that you're using the workload [inaudible] problem that to -- is that your test cases that you -- >> Westley Weimer: No. This is a great question. >>: [inaudible] >> Westley Weimer: No. >>: [inaudible] >> Westley Weimer: No, no, no. We have five separate test cases: GET index.html, POST my name and copy it back to me, GET a directory, GET a GIF, and GET a file that's not found. Those are our five test cases. >>: Okay.
And those are the ones that you're using for the [inaudible]. >> Westley Weimer: Yes. Those are the ones we check every time. >>: [inaudible] >> Westley Weimer: Yes. So the repair never sees these 138,000 requests in our -- yes. >>: Sorry. So among this workload here, there are how many -- what's the percentage of malicious, so to speak -- >> Westley Weimer: One out of 138,000. The workload is entirely benign. We play the workload once. We send the attack, and then we play the workload again. >>: And the attack is what exactly? >> Westley Weimer: The attack is an off-the-shelf exploit from Bugtraq. It varies by which program. >>: [inaudible] the previous question different from the single basically [inaudible] failing test case that you use for generating the repair. >> Stephanie Forrest: No. >> Westley Weimer: Exactly the same; it becomes the negative test case that we use to generate the repair. >>: Exactly the same if there's only one of them, there are [inaudible] of them. That's what I'm ->> Westley Weimer: Uh... >>: What would defeat the trivial solution of just blocking that specific input? >> Westley Weimer: In this experiment, if you did the trivial solution of blocking that particular input, you would get a high score. We did not do the trivial solution of blocking that particular input. >>: Yes. But [inaudible]. >> Westley Weimer: Which you cannot tell from this chart, but which I assert and which we describe qualitatively in the text. >>: [inaudible] really well if that's -- for that's basically the experimental setup. >> Westley Weimer: Right. And in essence what we want to say is: the trivial solution would do really well, perfect filtering would do really well. We also do really well. But we're repairing the problem and not just putting a Band-Aid on top, or so we claim. >>: [inaudible] you are implicitly giving a lot of weight [inaudible]. >> Westley Weimer: We're not going to invent new code. Right. So, other limitations beyond those that have been brought out by astute audience members. One is that we can't handle non-deterministic faults. We run your program on all the test cases and assume that's going to tell us whether it's correct or not. And if you had something like a race condition, running it once might not be able to tell us what's going on. Now, we did handle multithreaded programs like Web servers, but those aren't really multithreaded, because the copies don't actually communicate with each other in any interesting way. A long-term potential solution to this might be to put scheduler constraints, perhaps related to controlling interleavings, into the variant representation, so that a repair would be both "change these five lines of code" and also "instruct the virtual machine not to do the following scheduler swap." Future work. We also assume that the buggy test case visits different lines of code than the normal regression test cases. For something like a cross-site scripting attack or a SQL injection vulnerability, that would not be the case. Then our weighted path wouldn't be a good way to narrow down the search space, and in such a case we might want to use an off-the-shelf fault localization approach like Tom Ball's. Finally, we also assume that existing statements in the program can help form the repair.
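For reference, the weighted-path assumption that this limitation refers to can be sketched roughly as follows; the coverage sets and the particular weight given to statements shared with passing runs are illustrative placeholders, not the published constants.

```python
def weighted_path(neg_coverage, pos_coverage, shared_weight=0.1):
    """Rough sketch of weighted-path fault localization: statements executed
    only by the failing (negative) test get full weight, statements also
    executed by passing (positive) tests get a much smaller weight, and
    statements off the failing path get no weight at all, so mutation effort
    is concentrated near the suspected defect."""
    weights = {}
    for stmt in neg_coverage:                       # statements the buggy run touches
        weights[stmt] = shared_weight if stmt in pos_coverage else 1.0
    return weights                                  # untouched statements: implicitly weight 0
```

If an attack such as cross-site scripting exercises exactly the same statements as the passing tests, every candidate statement ends up with the same small weight, which is why an off-the-shelf fault localizer would be needed instead.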
Currently we're working on some notion of repair templates, something like typed runtime code generation: the essence of a null check is "if some hole is not null, then some hole for statements." We can imagine either hand-crafting those templates or mining them from version control repositories. Maybe you work in a company where there have been a lot of changes related to switching over to wide character support, moving from printf to wsprintf. If we notice that other divisions are doing that, then maybe we can suggest that as repairs for you. Sir? >>: In connection to your dog example, so you're worried about whether dogs are sold or not. >> Westley Weimer: Yes. >>: Instead you may worry about how many dogs are sold. So in this connection, there is one generalization: instead of a test being yes/no ->> Westley Weimer: Tests could be numbers. >>: Numbers, how well do we pass the test. And then you can optimize ->> Westley Weimer: Yes. One -- so this is a good idea. One of the reasons that we haven't pursued it experimentally is that in the open source world it is hard to find test cases that are that fine grained, that have a real-number value rather than yes or no. But Microsoft may well have them, and before everyone leaves there will be sort of a beg-a-thon where I ask for ->> Stephanie Forrest: The real reason we're here. >> Westley Weimer: Yes. The real reason we're here: help from Microsoft. So the work I've presented here actually covers two papers. There's ICSE work that we're presenting in a few days which in essence describes our repair algorithm, the test cases that we use, and the repair quality. It answers the question: does this approach work at all? It's controversial, you know -- yes, it looks like it does. We have a second paper in the Genetic and Evolutionary Computation Conference coming out a bit later which answers questions related to our crossover and mutation operations: is this really evolutionary computation, or is this perhaps just random search? What are the effects of more test cases, what's the scaling behavior -- questions more related to why this works. In general, though, we claim that we can automatically and efficiently repair certain classes of bugs in off-the-shelf legacy programs. We use regression tests to encode the desired behavior and also to focus the search on parts of the program that we view as more likely to contain the bugs. So we use regression tests rather than, say, manual annotations. And I encourage difficult questions, and there have been a number already. Any last questions? >>: Let's take a few more minutes to ask particularly tough questions. [multiple people speaking at once] >>: I would love to -- I mean, ideally of course you would like to have a proof -- I mean, to have more verification. Because is the repaired program actually semantically equivalent -- functionally equivalent -- to the faulty one ->> Westley Weimer: Except for one thing, yeah. >>: So but how are you [inaudible] -- I mean, what are your thoughts about that challenge? >> Westley Weimer: Right. So we have some previous work for automatically generating repairs in the presence of a specification, where the repair provably adheres to the specification and doesn't introduce any new violations. One of the things we wanted to do here was see how far we could get without specifications. Without specifications it's harder to talk about things like proofs. One of the directions we've been considering instead is documentation.
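Returning to the repair-template idea mentioned at the start of this answer, here is one hypothetical way to encode the null-check example as a template with holes; the template string and helper function are illustrative, not the templates actually hand-crafted or mined in this work.

```python
# Hypothetical repair template with two "holes": which pointer to test and
# which statements to guard. Real templates could be hand-crafted or mined
# from version control history (e.g., recurring printf -> wsprintf edits).

NULL_CHECK_TEMPLATE = "if ({ptr} != NULL) {{\n{body}\n}}"

def instantiate_null_check(ptr_expr, guarded_stmts, indent="    "):
    """Fill the holes of the null-check template with concrete C fragments."""
    body = "\n".join(indent + s for s in guarded_stmts)
    return NULL_CHECK_TEMPLATE.format(ptr=ptr_expr, body=body)

print(instantiate_null_check("buf", ["len = strlen(buf);", "process(buf, len);"]))
# if (buf != NULL) {
#     len = strlen(buf);
#     process(buf, len);
# }
```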
And we have some previous work that was successful at automatically generating documentation for object-oriented programs. We would like to extend that work to automatically document what we think is going on in the exception -- or, sorry, what we think is going on in the repair, so that it might be even easier for the developer to say, no, the tool is way off, I don't want to do this, or, oh, this is the right idea, but I should add one more line or some such. We have also been considering -- imagine someone has some specifications, but not all of them [inaudible] -- I know, I know, it's hard to believe -- and then also some test cases. Is there some way we could incorporate information from specifications into our fitness function or to sort of guide the repair? So if we do have any sort of partial correctness specification, we can be certain that we never generate a repair that causes a tool or a model checker to think that the resulting program violates a partial correctness specification -- assuming those specifications are present. >>: [inaudible] more questions. >>: [inaudible] maybe I missed a part, but a critical part seems to be which tests you pick -- the positive test cases, because in reality I don't know any party going, yes, five test cases which are representative, and then multiply by the number of ->> Westley Weimer: Yes. >>: -- you see where I'm going, right? >> Westley Weimer: I can see where you're going. So it might take a very long time. So in fact we have current work, not described here, on using in essence either randomization or test suite reduction. There's been a lot of previous work in the software engineering community on, let's say, time-aware test suite prioritization, maybe using knapsack solvers, maybe using other approaches to perhaps order test cases by code coverage, maybe doing the first ones and then sort of bailing out in a short-circuit manner if you fail some obvious test at the beginning. We have a student currently working on -- let's imagine you have a hundred test cases -- subsampling for each fitness evaluation: just pick ten of them or just pick five of them, use that for your fitness score, but then, if you think you've won, we run you on all 100 of them. Thus far, using relatively simple techniques like that, we've been able to get 95 percent performance improvements. So we have reason to believe that we might be able to apply off-the-shelf software engineering test suite reduction techniques in the case where you have hundreds or thousands of test cases rather than five. That was the sigh of "I'm not convinced," or... >>: Well, then you should be running test cases -- I mean, you're assuming the test cases actually were written. You might just try to destroy the machine or -- you are assuming that your test cases [inaudible] correctly for this to work, in that they are really hard -- they are very low level and they qualify quite a bit [inaudible] output. So [inaudible] concrete input. I mean, first [inaudible] input, I get that completely defined output. You cannot have a test case say all the output has to be greater than zero first. >> Westley Weimer: You could. Totally could. >>: But then it's not going to work, because it's going to -- you could very well then generate basically a lot of different repairs that are going to basically -- and then the repaired program will be completely different, functionally speaking, from the original one. You see what I'm getting at? >> Westley Weimer: Yes.
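A minimal sketch of the subsampling idea described a moment ago, under the assumption that fitness is simply the number of tests passed; the sample size, the names, and the "run the full suite only for apparent winners" policy are illustrative, not the exact scheme the student is implementing.

```python
import random

def sampled_fitness(candidate, test_suite, passes, sample_size=10, rng=random):
    """Score a candidate variant on a small random subset of the test suite,
    and only pay for the full suite when the candidate passes every sampled
    test and therefore looks like a potential winner."""
    sample = rng.sample(test_suite, min(sample_size, len(test_suite)))
    sample_score = sum(1 for t in sample if passes(candidate, t))

    if sample_score < len(sample):
        return sample_score, False                    # cheap, approximate fitness only

    full_score = sum(1 for t in test_suite if passes(candidate, t))
    return full_score, full_score == len(test_suite)  # True means a candidate full repair
```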
So we have not studied that -- again, we claim that if what you really want the program to be doing is not just returning any number greater than zero, then your test case shouldn't pass it just because it returns a number greater than zero. It's your responsibility to make a more precise test case in that scenario. >>: So when you say you will be [inaudible] based on the fault, that might be very expensive by itself. >> Westley Weimer: But less expensive than a retest-all methodology, which is what we're currently doing. Right. It might still be expensive. But we are still in the early stages of dealing with programs with more test cases. And one of the reasons that we're still in the early stages is that they don't exist in the open source world. So if you in Microsoft have many test cases and you would like to contribute to a worthy goal, you might want to consider ->>: You're on the wrong side of the [inaudible]. >>: If we had an exhaustive test suite, we wouldn't need this to repair. We'd have a [inaudible] formula for the entire program [inaudible]. >> Westley Weimer: The converse of that is that if you had a formal specification, you also wouldn't need any repairs. >>: Other questions after this commercial break? >>: So, I mean, one way of trying to minimize the impact of a change you make is to look at how much of the global state gets affected because of the patch that you generate. >> Westley Weimer: Yes. >>: I mean, the reason [inaudible] suppose you pass the test cases but you introduce some behavior that only shows when you run [inaudible] ->> Westley Weimer: Yes. >>: -- changes to the file system or something like that. So ideally speaking you want to minimize the impact that the program has on ->> Westley Weimer: Yes. >>: -- [inaudible] so is that something you consider? >> Westley Weimer: This is a good idea. Would we notice if we added a memory leak? We probably would not, unless you already had a test case for it. Officially, as is standard test-case practice, all of these test cases should be run in a chrooted jail or in a virtual machine or whatnot, lest we introduce the command rm *.*. If we're already running under a virtual machine, then checking things like which memory areas you're touching is potentially more feasible. We have not thought about this, but this is a great idea. We should look more into it. Other questions? >>: I wanted to ask you one question before you go. So I think your premise has been that programs contain the seeds of their own repair [inaudible]. >> Westley Weimer: Yes. >>: Some programs are basically [inaudible]. >> Westley Weimer: Hmm. >>: So in those cases would it be possible to use the abundance of code and programs [inaudible] to introduce some sort of form of [inaudible]. >> Westley Weimer: This -- I think the answer to this is yes. This would be a good idea. We had not thus far considered it. We've been sort of stuck in our repair-templates mode, where you can see some of the seeds of this, as it were: bringing over repairs seen in a different division of your company. But in essence this is a different question, which is: when we're doing mutation to insert statements, might we draw from a different statement bank than the original program, maybe a better, high-quality statement bank or one that has been winnowed down to contain only good stuff or whatnot. This is a great idea. We have not considered it, but it is likely that it would do better than what we're doing now. >>: Okay. Well, let's thank the speaker. [applause]