>>: Okay. So it's my great pleasure to introduce Wes Weimer, who's going to be -- who
is visiting us with Stephanie Forrest. And Wes is going to talk today about automatically
repairing bugs in programs.
>> Westley Weimer: Excellent. Okay. Thank you for the introduction. So I'm going to
be talking about automatically finding patches using genetic programming, or, to put it
another way, automatically repairing programs using genetic programming.
And this is joint work with Stephanie Forrest from the University of New Mexico and our
grad students Claire and Vu.
So in an incredible surprise move for Microsoft researchers, I will claim that software
quality remains a key problem. And there's an oft-quoted statistic from the past that
suggests that over one half of 1 percent of the U.S. GDP goes to software bugs each year.
It would appear that there are people in the audience familiar with this statistic. And that
programs tend to ship with known bugs, often hundreds or thousands of them that are
triaged to be important but perhaps not worth fixing with the resources available.
What we want to do is reduce the cost of debugging, reduce the cost associated with
fixing bugs that have already been identified in programs.
Previous research suggests that bug reports that are accompanied by an explanatory
patch, even if it's tool created, even if it might not be perfect, are more likely to be
addressed rapidly; that is, that it takes developers less time and effort to either apply them
or change them around a bit and then apply them.
So what we want to do is automate this patch generation for defects in essence to
transform a program with a bug into a program without a bug but only by modifying
small, relevant parts of the program.
So the big thesis for this work is that we can automatically and efficiently repair certain
classes of defects in off-the-shelf, unannotated legacy programs. We're going to be
taking existing C programs like FTP servers, sendmail, Web servers, that sort of thing,
and try to repair defects in them.
The basic idea for how we're going to do this is to consider the program and then
generate random variants, explore the search space of all programs close to that program
until we find one that repairs the problem.
Intuitively, though, this search space might be very large. So we're going to bring a
number of insights to bear to sort of shrink this search space to focus our search.
The first key part is that we're going to be using existing test cases, both to focus our
search on a certain path through the program, but also to tell if we're doing a good job
with automatically creating these variants. We'll assume that the program has a set of test
cases, possibly a regression test suite, that encodes required functionality, that makes sure
that the program is doing the right thing. And we'll use those test cases to evaluate
potential variants. In essence, we prefer variants that pass more of these test cases.
Then we're going to search by randomly perturbing parts of the program likely to contain
the defects. You might have a large program, but we hypothesize that it's unlikely that
every line in the program is equally likely to contribute to the defect. We're going to try to
narrow down, localize the fault, as it were, and only make modifications in specific areas.
So how does this process work out? You give us as input the source code to your
program, standard C program, as well as some regression tests that you currently pass
that encode required behavior. So, for example, if your program is a Web server, these
positive regression test cases might encode things like proper behavior when someone
asks you to GET index.html. Basic stuff like that.
In addition, you have to codify or quantify for us the bug that you want us to repair. So
there's some input upon which the program doesn't do the right thing: it loops forever, it
segfaults, it's vulnerable to some sort of security attack.
And by taking a step back, we can actually view that as another test case but one that you
don't currently pass. So there's some input. We have some expected output, but we don't
currently give the expected output.
We then loop for a while doing genetic programming work. We create random variants
of the program. We run them on the test cases to see how close they are to what we're
looking for. And we repeat this process over and over again retaining variants that pass a
lot of the test cases and possibly combining or mutating them together.
At the end of the day, we either find a program variant that passes all of the test cases, at
which point we win. We declare moral victory, we take a look at the new program
source code and perhaps take a diff of it and the original program and that becomes our
patch. Or we might give up, since this is a random or heuristic search process, after a
certain number of iterations or after a certain amount of time and say we're unable to find
a solution. We can't repair the program given these resource bounds.
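To make the shape of that loop concrete, here is a minimal, runnable C sketch of the generate, evaluate, and retain cycle just described. The program variants and the genetic operators are reduced to toy placeholders (a struct holding only a fitness value and a random "mutate"), so this shows the structure of the search rather than the authors' actual mutation, crossover, or test execution.

    #include <stdio.h>
    #include <stdlib.h>

    #define POP_SIZE 40   /* population of program variants */
    #define MAX_GENS 10   /* resource bound before giving up */

    typedef struct { double fitness; } Variant;  /* stand-in for an AST plus weighted path */

    /* Toy placeholders: the real operators would mutate the AST along the
     * weighted path and compute fitness by running the test suite. */
    static Variant mutate(Variant v)   { v.fitness += (rand() % 100) / 100.0; return v; }
    static int     repaired(Variant v) { return v.fitness >= 3.0; }  /* "passes all tests" */

    static int by_fitness_desc(const void *a, const void *b) {
        double d = ((const Variant *)b)->fitness - ((const Variant *)a)->fitness;
        return (d > 0) - (d < 0);
    }

    int main(void) {
        Variant pop[POP_SIZE] = { { 0.0 } };

        for (int gen = 0; gen < MAX_GENS; gen++) {
            /* Modify every member, then check whether any variant now "passes". */
            for (int i = 0; i < POP_SIZE; i++) {
                pop[i] = mutate(pop[i]);
                if (repaired(pop[i])) {
                    printf("generation %d: variant %d passes all tests;"
                           " diff it against the original for the patch\n", gen, i);
                    return 0;
                }
            }
            /* Keep the fitter half, then refill by mutating the survivors again. */
            qsort(pop, POP_SIZE, sizeof pop[0], by_fitness_desc);
            for (int i = POP_SIZE / 2; i < POP_SIZE; i++)
                pop[i] = mutate(pop[i - POP_SIZE / 2]);
        }
        printf("no repair found within the resource bounds\n");
        return 1;
    }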
So for this talk I will speak ever so briefly about genetic programming, what it is, what
those word means -- what those words mean, and then talk about one key insight that we
use to limit the search space: our notion of the weighted path. And this is where we're
going to look for the fault or the defect in the program.
I'll give an example of a program that we might repair, an example perhaps near and dear
to the hearts of people in this audience. And then I'll present two sets of experiments:
one designed in general to show that we can repair many different programs with sort of
large -- large and different classes of defects; and one to try to get at the quality of the
repairs that we produce. Would anyone trust them? Are they like what humans would
do? Might they cause you to lose transactions or miss important work if you were to
implement our proposed patches? Sir?
>>: Do you have any guarantee, or at least statistical guarantees, or any kind of guarantee,
that your patch will not create additional problems?
>> Westley Weimer: Right. So the question was do we have any guarantees at all that
our patch will not create additional problems. At the lowest level, we only consider
patches that pass all of the regression tests that you've provided for us. So if there's some
important behavior that you want to make sure that we hold inviolate, you need only
write a test case for it. And in some sense, coming up with an adequate set of test cases is
an orthogonal software engineering problem that we won't address here, but I will
acknowledge that it's difficult.
However, it is entirely possible that a patch that we propose might be parlayed into a
different kind of error. For example, in one of the programs that we repair, there's a
non-buffer overflow based denial of service attack, where people can send malformed
packets that aren't too long that cause the program to crash, to fail some sort of assertion.
We might repair this by removing the assertion, but now the program might continue on
and -- yeah, and do bad things. So it's possible that the patches that we generate could
introduce new security vulnerabilities. And we will get to this later in the talk. It's an
excellent question, and you should be wary. Yes.
>>: [inaudible] making assumptions [inaudible] the impact of the [inaudible] assume that
it affects local state or global state [inaudible]?
>> Westley Weimer: The question was do we make any assumptions about the impact of
the bug on program state, local, global. And in general, no. I'll give you a feel for the
kind of patches and the kind of defects that we're able to repair. Many of them, such as
buffer overrun vulnerabilities, integer overflow attacks, deal with relatively local state.
But nothing in the technique assumes or requires that. Good question.
Other questions while we're here? Okay. So we'll do some experiments on repair quality
which will potentially address some of these concerns but not all of them. And then
there'll be a big finish.
So genetic programming formally is the application of evolutionary or genetic algorithms
to program source code. And whenever you talk about genetic algorithms, at least in my
mind, there are three important sort of concepts you want to get across: You need to
maintain a population of variants, and you need to modify these programs perhaps with
operations such as crossover and mutation, and then you need to be able to evaluate the
variants, to tell which ones are closer to what you're hoping for.
And here for a sort of Microsoft Research localized example -- Rustan Leino wasn't able
to make it to the talk, but I did get a chance to speak with him yesterday -- here are a
population of variants as sort of four copies of Rustan Leino, and then we make small
modifications. So, for example, some are standing up, some are sitting down, some have
a pitcher of apple juice. And at the end of the day we have to tell which we prefer.
So, for example, if we're very thirsty, we might prefer the Rustan with the apple juice.
Right? We're going to do this except that we're going to fix programs instead.
Many of you might be wary of the application of genetic programming or genetic
algorithms or machine learning to programming languages or software engineering tasks.
You might have seen these things done in a less-than-principled manner in the past.
Another way to view our work is as search-based software engineering, where we're
using the regression test cases to guide the search. They guide both where we make the
modifications and also how we tell when we're done.
>>: What is that picture [inaudible]?
>> Westley Weimer: The picture says: What's in a name? That which we call a rose by
any other name would smell as sweet. So it's a quote from Shakespeare about the naming
of things and whether or not that matters. But the font is small and hard to read. Good
question.
So how does this end up working? In some sense there are two secret sauces that allow
us to scale. Our main claim is that in the large program not every line is equally likely to
contribute to the bug.
And since we required as input indicative test cases, regression tests to make sure that
we're on the right track, the insight is to actually run them and collect coverage
information. So we might do something as simple as instrumenting the program to print
out which lines it visits. And we hypothesize that the bug is more likely to be found on
lines that are visited when you feed in the erroneous input, lines that are visited during
sort of the execution of the negative test case. Yes.
>>: Surprised you have a [inaudible] statistical studies and he found that the bugs are
statistically in those places which are especially common, [inaudible] explanation that
those places [inaudible] and there's a lot of [inaudible] they work again and again
[inaudible] visited.
>>: Endless source of bugs.
>> Westley Weimer: And so there was a comment or a suggestion that even lines of
code that are visited many times might still contain bugs. And we're not going to rule out
portions of the code that are visited along these positive test cases, but we might weight
them less.
And while Tom Ball is not here, he and many other researchers have looked into fault
localization. And to some degree what I'm going to propose is a very coarse-grained
fault localization. We might imagine using any sort of off-the-shelf fault localization
technique. One of the reasons that we are perhaps not tempted to do so is that we want to
be able to repair sins of omission.
So if the problem with your program is that you've forgotten a call, you've forgotten a call
to free or forgotten a call to sanitize, we still want to be able to repair that. But many but
not all fault localization techniques tend to have trouble with bugs that aren't actually
present in the code. It's hard to pin sins of omission down to a particular line.
So we'll start with this and then later on I will talk about areas where fault localization or
other statistical techniques might help us. Good question. Yes.
>>: The third bullet point is puzzling me. So you say the bug is more likely to be found
when you're running the failed test case, so there are also passing test cases?
>> Westley Weimer: Right.
>>: So if you don't -- I mean, if you don't run the code, then you...
>> Westley Weimer: So we assume that you give us a number of regression test cases,
that you pass currently, and also one new -- we call it the negative test case that encodes
the bug, some input on which you crash or get the wrong output. And these existing
positive regression test cases are assumed to encode desired behavior. So we can only
repair programs for which a test suite is available.
But there might be some techniques for automatically generating test cases. Yes.
>>: [inaudible] confused about the deployment scenario again. So is this meant to be
something you deploy during debugging the code or you observe a failure in the field,
you collect some [inaudible] and then you use this for [inaudible]. Which of these?
>> Westley Weimer: All right. So the question was about the deployment scenario.
And in some sense, my answer to this is going to be both. Our official story, especially at
the beginning of the talk, is that we want to reduce the effort required for human
developers to debug something. And we're going to present to them a candidate patch
that they can look at that might reduce the amount of time it takes them to come up with a
real patch. So we assume maybe the bug is in the bug database, there are some steps to
reproduce it, you run our tool and five minutes later we give you a candidate patch.
However, our second set of experiments will involve a deployment scenario where we
remove humans from the loop and assume that some sort of long-running service has to
stay up for multiple days, and there we want to detect faults as they arise, perhaps with
anomaly or intrusion detection, and then try to repair them. So we will address that second
scenario later. Excellent question, though. Other questions while we're here? Okay.
So we have the test cases. We run them. We notice what parts of the program you visit.
We claim that when we're running this erroneous input that causes the program to crash
or get the wrong answer that it's likely that those lines have something to do with the bug.
However, many programs have -- let's say a lot of initialization code at the beginning
that's common, maybe you set up some matrix the same way every time or read some
user configuration file. And that might be code that you visit both on this negative test
case, when you're experiencing the bug, but also on all other positive test cases. We
hypothesize that those lines are slightly less likely to be associated with the bug.
In addition, as we're going around making modifications to the program, we might
sometimes delete lines which are relatively simple, but we might also be tempted to insert
or replace code. And when we do, if you're going to insert a line into a C program, there
are a large number of statements or expressions you might choose from. Infinite, in fact.
One of the other reasons that we scale is we're going to restrict the search space here to
code that's already present in the program. In essence, we hypothesize that the program
contains the seeds of its own repair; that if the problem with your program is, for
example, a null pointer dereference or an array out of bounds access that probably
somewhere else in your program you have a null check or you have an array bounds
check. Following, for example, the intuition of Dawson Engler, bugs as deviant
behavior, it's likely that the programmer gets it right many times but not every time.
So rather than trying to invent new code, we're going to copy code from other parts of
your program to try to fix the bug. And I will revisit that assumption later.
So we wanted to narrow down the search space and our formalism for doing this is a
weighted path. And, relatively simply, for us a weighted path is a list of statement and
weight pairs where conceptually the weight represents the likelihood or the probability
that that statement is related to the bug, is related to the defect.
In all of our experiments, we use the following weighted path. The statements are those
that are visited during the failed test case, when we're running the buggy input. The
weights are either very high, if that's a statement that's only visited on the bug-inducing
input and not on any other test case, and relatively low if it's also visited on the other passed
test cases, if it's that matrix initialization code that's global to all runs of the program.
Sir?
>>: [inaudible]
>> Westley Weimer: The question was do the weights only take these values, 1.0 and
0.1. For the experiments that I will be presenting today, yes. We have investigated other
weights. So another approach, for example, would be to change this 0.1 to a 0.0, to just
throw out lines in the program that are also visited on the positive test cases. And for
some runs of our tool, we tried different parameter values.
In practice, these are the highest performing parameters that we found but with a very
small search. It seems that our algorithm is in some sense relatively insensitive to what
these values actually are.
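As a concrete illustration of that weighting rule, here is a small runnable C sketch. The coverage bits are made-up example data rather than output from the actual tool; statements that are never on the failing run are simply left out of the path.

    #include <stdio.h>

    #define NUM_STMTS 6

    int main(void) {
        /* neg_cov[i]: statement i executed on the bug-reproducing input.
         * pos_cov[i]: statement i executed on some passing regression test. */
        int neg_cov[NUM_STMTS] = {1, 1, 1, 1, 1, 0};
        int pos_cov[NUM_STMTS] = {1, 1, 1, 0, 1, 1};

        printf("weighted path (statement, weight):\n");
        for (int i = 0; i < NUM_STMTS; i++) {
            if (!neg_cov[i])
                continue;                       /* never on the failing run: excluded */
            double w = pos_cov[i] ? 0.1 : 1.0;  /* shared with passing runs: low weight */
            printf("  stmt %d  weight %.1f\n", i, w);
        }
        return 0;
    }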
>>: You call this a weighted path rather than a list. So what makes it a path rather than a list?
>> Westley Weimer: Ah.
>>: [inaudible] connection between [inaudible].
>> Westley Weimer: The question was about the terminology. It's a list of statement
weight pairs. Why do I call it a weighted path. And the reason is that it's the path
through the program potentially around loops or through if statements visited when we
execute the failed test case. So it corresponds to a path through the program. But when
we throw away the abstract syntax tree, a path is just a list of steps that you took forward
and then left and whatnot. Does this answer your question? Yes.
>>: So is a failed test as simple as the assignments to the inputs, or do you also -- does
the user also specify why this test case fails, so somehow [inaudible] should have
some [inaudible]?
>> Westley Weimer: Right. So the question is what is a test case. And I will get to this
at the end and show you a concrete example of a test case from ours. But officially we
follow the oracle comparator model from software engineering where a test case is an
input that you feed to the program and it produces new output, but you also have an
oracle somewhere that tells you what the right output should have been. And in
regression testing the oracle output is often running the same input on the last version.
Then there's also a comparator, which could be something very strict, like diff or a
byte-for-byte compare, which would notice if even one byte were off. But for something
like graphics, for example, your comparator might be fuzzier. You might say, oh, if part of the
picture is blurry, that doesn't matter as long as the sum of the squares of the distance is
less than blah, blah, blah.
For all of the test cases that I'll present experiments on, we followed a very simple
regression test methodology. We ran a sort of vanilla or the correct version of the
program and wrote down the correct output, all of the output that the program produced.
And in order to pass the test case, you had to give exactly that; no more, no less.
But in theory, if you wanted to only care about certain output variables, our system
supports that.
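As a sketch of that exact-match methodology, one regression test might be checked along the following lines. The file names and the variant executable are hypothetical, and the oracle file is assumed to hold the recorded output of the original program on that input.

    #include <stdio.h>
    #include <stdlib.h>

    /* Returns 1 if the two streams are byte-for-byte identical. */
    static int same_bytes(FILE *a, FILE *b) {
        int ca, cb;
        do {
            ca = fgetc(a);
            cb = fgetc(b);
            if (ca != cb) return 0;
        } while (ca != EOF);
        return 1;
    }

    int main(void) {
        /* Re-run the variant on the recorded input and capture its output. */
        if (system("./variant < test1.input > test1.actual") != 0)
            return 1;                               /* crash or nonzero exit: fail */

        FILE *oracle = fopen("test1.oracle", "rb"); /* recorded output of the original */
        FILE *actual = fopen("test1.actual", "rb");
        if (!oracle || !actual) return 1;

        int pass = same_bytes(oracle, actual);      /* no more, no less */
        printf(pass ? "PASS\n" : "FAIL\n");
        fclose(oracle);
        fclose(actual);
        return pass ? 0 : 1;
    }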
>>: [inaudible] how many times a [inaudible].
>> Westley Weimer: Right. So the question is do we care how many times a statement
is executed, is this really a list, or is it actually a set. And in our implementation, it's
actually a list. If you go around a loop multiple times, we will record that you visited
those statements more often than before, so we might be more likely to look there when
we're performing modifications.
I'm not going to talk about our -- I will skip over our crossover operator, although I will
be happy to answer questions about it in this talk. But, in essence, it requires sort of an
ordered list and potentially might care about duplicates.
We have investigated and there are options available for our tool, for our prototype
implementation of this: treating this list as a set, not considering duplicates. The
advantage of a set-based approach is that it reduces the search space. The disadvantage is
that we might be throwing away frequency information about which statements are
actually visited more, so the question is do you want sort of a larger search space but with
a pointer in the right direction, or do you want a smaller search space that you're choosing
from at random.
>>: [inaudible]
>> Westley Weimer: And we could implement this using a multiset, yes. But the actual
sort of data structure bookkeeping of our algorithm ends up not being the dominant cost
of this. And I'll get into that in a bit. Yes.
>>: [inaudible]
>> Westley Weimer: Ah. So initially -- so the question was this seems to only care
about the passing test cases. But the set of statements that we consider are only those
visited during failed test cases. So you can imagine that statements visited during neither
passed nor failed test cases implicitly get a weight of zero.
And then of those statements during failed test cases, if they're only visited on the failed
test case, they get high probability, otherwise they get low. So both the positive test
cases and the negative test cases matter, and I have a graphical example to show in a bit.
So once we've introduced the weighted path, I can now go back and fit everything we're
doing into a genetic programming framework.
I need to tell you what our population of variants is. And for this every variant program
that we produce is a pair of a standard abstract syntax tree and also a weighted path. And
the weighted path, again, focuses our mutations or our program modifications, tells us
where we're going to make the changes.
Once we have a bunch of these variants, we might want to modify or mutate them. And
to mutate a variant, we randomly choose a statement from the weighted path only,
biased by the weights. So we're more likely to pick a statement with a high weight.
Once we've chosen that statement, we might delete it, we might replace it with another
statement, or we might insert another statement before or after it.
When we come up with these new statements, rather than making them up out of whole
cloth, we copy them from other parts of the program. This reduces the search space and
helps us to scale. And, again, this assumes that the program contains the seeds of its own
repair; that there's a null check somewhere else that we can bring over to help us out.
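A rough, runnable sketch of that mutation step is below. The statements, weights, and donor code are invented for illustration, and the real tool manipulates abstract syntax trees rather than strings; the point is the weighted choice of a target statement and the three operations.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    typedef struct { const char *stmt; double weight; } PathEntry;

    /* Roulette-wheel choice: higher weight, higher chance of being picked. */
    static int pick_weighted(const PathEntry *path, int n) {
        double total = 0.0;
        for (int i = 0; i < n; i++) total += path[i].weight;
        double r = ((double)rand() / RAND_MAX) * total;
        for (int i = 0; i < n; i++) {
            r -= path[i].weight;
            if (r <= 0.0) return i;
        }
        return n - 1;
    }

    int main(void) {
        srand((unsigned)time(NULL));

        PathEntry path[] = {              /* hypothetical weighted path */
            { "days = days - 365;", 0.1 },
            { "year = year + 1;",   0.1 },
            { "/* empty else */",   1.0 },
        };
        const char *donors[] = {          /* statements copied from elsewhere in the program */
            "days = days - 366;",
            "year = year + 1;",
        };
        int npath  = sizeof path / sizeof path[0];
        int ndonor = sizeof donors / sizeof donors[0];

        int target = pick_weighted(path, npath);
        const char *donor = donors[rand() % ndonor];

        switch (rand() % 3) {
        case 0:  printf("delete:  %s\n", path[target].stmt);                   break;
        case 1:  printf("insert:  %s  after  %s\n", donor, path[target].stmt); break;
        default: printf("replace: %s  with  %s\n", path[target].stmt, donor);  break;
        }
        return 0;
    }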
We have a crossover operator that I'm not going to describe but will happily answer
questions about. And in some sense it's a relatively traditional one-point crossover.
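Purely for illustration, a traditional one-point crossover over two lists of statement identifiers looks like the runnable sketch below; the authors' operator works on program variants and their weighted paths, and its details are not reproduced here.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define LEN 6

    int main(void) {
        srand((unsigned)time(NULL));

        /* Two parent "paths", shown as statement ids for brevity. */
        int a[LEN] = { 1,  2,  3,  4,  5,  6};
        int b[LEN] = {11, 12, 13, 14, 15, 16};
        int child1[LEN], child2[LEN];

        int cut = 1 + rand() % (LEN - 1);  /* one cut point, away from the ends */
        for (int i = 0; i < LEN; i++) {
            child1[i] = (i < cut) ? a[i] : b[i];   /* prefix of a, suffix of b */
            child2[i] = (i < cut) ? b[i] : a[i];   /* prefix of b, suffix of a */
        }

        printf("cut = %d\nchild1:", cut);
        for (int i = 0; i < LEN; i++) printf(" %2d", child1[i]);
        printf("\nchild2:");
        for (int i = 0; i < LEN; i++) printf(" %2d", child2[i]);
        printf("\n");
        return 0;
    }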
Once we have come up with our variants, we want to evaluate them: are they any good,
are they passing the test cases.
We use the regression test cases and the buggy test case as our fitness function. We have
our abstract syntax tree. We prettyprint it out to a C program, pass it to GCC. If it
doesn't compile, we give it a fitness of zero.
Now, since we're working on the abstract syntax tree, we're never going to get
unbalanced parentheses. We'll never have syntax errors. But we might copy the use of a
variable outside the scope of its definition, so we might fail to type check.
If the program fails to compile for whatever reason, we give it a fitness of zero.
Otherwise, we simply run it on the test cases and its fitness is the number or fraction of
test cases that it passes. We might do something like weighting the buggy test case more
than all of the regression tests, if there are many regression tests, but only one way to
reproduce the bug. But in general you can imagine just counting the number of test cases
passed.
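As an illustration, a fitness function along those lines might look like the following C sketch. The test-script names, the one-point versus ten-point weighting, and the convention that each hypothetical script exits 0 when its test passes are all assumptions made for the example.

    #include <stdio.h>
    #include <stdlib.h>

    #define NUM_POS 5      /* positive (regression) test scripts, assumed to exist */
    #define NUM_NEG 1      /* negative (bug-reproducing) test script               */
    #define WPOS 1.0       /* illustrative weights: the bug test counts for more   */
    #define WNEG 10.0

    double fitness(const char *variant_exe) {
        char cmd[512];
        double score = 0.0;

        /* If the variant did not compile, the caller passes NULL: fitness zero. */
        if (variant_exe == NULL)
            return 0.0;

        for (int i = 0; i < NUM_POS; i++) {
            snprintf(cmd, sizeof cmd, "./pos-test-%d.sh %s", i + 1, variant_exe);
            if (system(cmd) == 0)          /* exit status 0 means the test passed */
                score += WPOS;
        }
        for (int i = 0; i < NUM_NEG; i++) {
            snprintf(cmd, sizeof cmd, "./neg-test-%d.sh %s", i + 1, variant_exe);
            if (system(cmd) == 0)
                score += WNEG;
        }
        return score;
    }

    int main(void) {
        printf("fitness = %.1f\n", fitness("./variant"));
        return 0;
    }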
Once we do that, we know the fitness of every variant. We want to retain higher-fitness
variants into the next generation; those might be crossed over with others. And we
repeat this process until we find a solution, until we find a variant that passes every test
case or until enough time has passed and we give up.
>>: [inaudible] failed to compile if you -- if the variables [inaudible].
>> Westley Weimer: Yes.
>>: But it's also possible that the variable is redefined [inaudible] the same name or
maybe it's [inaudible] some other object in the midterm [inaudible].
>> Westley Weimer: Yes.
>>: You consider those cases --
>> Westley Weimer: No.
>>: [inaudible]
[laughter]
>> Stephanie Forrest: We just let them happen.
>> Westley Weimer: We just let them happen. And presumably -- so let's imagine. We
move a use of a variable into a scope that also uses a different variable with the same
name. So now we're going to get in essence random behavior. Hopefully we'll fail a
bunch of the test cases so we won't keep that variant around.
On the other hand, that random behavior might be just what we need to fix the bug.
Unlikely, but possible. So we'll consider it, but probably it won't pass the test cases, so it
won't survive until the next generation.
So I'm going to give a concrete example of how this works using idealized source code from
the Microsoft Zune media player. And many of you are perhaps familiar with this code
or code like it.
December 31st, 2008, some news sources say millions of these Zune players froze up and
in many reports the calculation that led to the freeze was something like this. We have a
method that's interested in converting a raw count of days into what the current year is,
and in essence this is a while loop that subtracts 365 from the days and adds one to the
years over and over again, but has some extra processing for handling leap years, where
the day count is slightly different.
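For reference, the idealized loop being described is roughly the following; this is an approximate reconstruction of the code on the slide, not the actual Zune firmware.

    #include <stdio.h>

    static int is_leap_year(int year) {
        return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
    }

    static void zune_date(int days) {
        int year = 1980;
        while (days > 365) {
            if (is_leap_year(year)) {
                if (days > 366) {
                    days -= 366;
                    year += 1;
                } else {
                    /* empty else: when days == 366, nothing changes, so we loop forever */
                }
            } else {
                days -= 365;
                year += 1;
            }
        }
        printf("year = %d, days left = %d\n", year, days);
    }

    int main(void) {
        zune_date(1000);      /* a passing regression input: terminates            */
        /* zune_date(10593);     December 31st, 2008: would loop forever (the bug) */
        return 0;
    }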
To see what might go wrong with this, the infinite loop that causes the players to lock up,
imagine that we come in and days is exactly 366. We're going to go into the while loop.
Let's say the current year, 1980, is a leap year. If days is greater than 366 -- well, it's not
greater than, it's exactly equal to. So instead we'll go into this empty else branch, which
I've sort of drawn out for expository purposes. Nothing happens, we go around the loop,
and we'll go around the loop over and over again, never incrementing or decrementing
variables; we'll loop forever. We want to repair this defect automatically.
So our first step is to convert it to an abstract syntax tree. And it should look similar to
before: assignments here, we have this big while loop. And what I want to do is figure
out which parts of this abstract syntax tree are likely to be associated with the bug. We
want to find our weighted path.
So our first step is to run the program on some bug-inducing input, like 10,593 which
corresponds to December 31st, 2008. And I've highlighted in pink all of the nodes in the
abstract syntax tree that we visit on this negative test case. We loop forever, so notably
we never get to the printf at the end. We visit all of the nodes except the last one.
So this is some information but not a whole lot. But now I also want to remove or weight
less nodes that are visited on positive passing regression test cases. So let's imagine that
we know what the right answer is supposed to be when days equal 1,000. We have a
regression test for it. I rerun the program and also mark -- and here I'm marking in
green -- all of the nodes that were visited on positive test cases.
And in some sense, the difference between them is going to be my weighted path. I'm
going to largely restrict attention to just this node down here, which is visited on negative
test case, but on no positive test cases.
So here's my weighted path. Once I have the weighted path, I might randomly apply
mutation operations: insertions, deletions, replacements. But only to nodes along the
weighted path. Let's imagine I get relatively lucky the first time. I choose to do an
insertion. Where am I going to insert? I'm more or less stuck inserting here. And rather
than inventing code whole cloth, I'm going to copy a statement from somewhere else in
the program.
So, for example, I might copy this days minus equal 366 statement and insert it into this
block.
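Under the same assumptions as the reconstruction above, the candidate repair would leave the loop looking roughly like this, as a drop-in replacement for the loop in that sketch.

    while (days > 365) {
        if (is_leap_year(year)) {
            if (days > 366) {
                days -= 366;
                year += 1;
            } else {
                days -= 366;   /* inserted by the repair, copied from the branch above */
            }
        } else {
            days -= 365;
            year += 1;
        }
    }
    /* With days == 366 in a leap year, days drops to 0 and the loop now exits.
     * The reported year is right, but the leftover day count is not what a human
     * fix would produce -- it simply passes the test cases used here. */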
Once I do, the final code ends up looking somewhat like this. I now have a copy of that
node. I've changed the program. Slightly different abstract syntax tree. And this ends up
being our final repair. Passes all of our previous test cases and it doesn't loop forever on
the negative test case. But it might not be what a human would have done. And it might
not be correct. Maybe our negative test case wasn't correct. So I'm going to talk a bit in
a bit about repair quality. Yes.
>>: [inaudible] with exception handling. Suppose you're an exception handler and those
[inaudible] negative test case. Essentially you'd be trying to add code to the exception
handler to make it repair the bug. But the bug might be somewhere else in the program.
>> Westley Weimer: So okay. So this is a good point. For the implementation that we
have here, we've been concentrating on off-the-shelf C programs wherein sort of
standard -- standard C exception handling is not an option.
We have previously published work for repairing vulnerabilities if a specification is
given. And in that work, we've specifically described how to add or remove exception
handling -- either try catch or try finally statements -- in order to fix up typestate
problems, for example.
If we were going to move this to, say, Java or some other language with exception
handling, we would probably try to carry over some of those ideas. So we haven't
implemented it yet.
>>: [inaudible]
>> Westley Weimer: Currently our implementation is C. There are no major problems
preventing us from moving to other languages. But there are corner cases we would want
to think about, such as exception handling.
So, for example, if we were moving to a language with exception handling, we might
want to consider adding syntax like try or catch or finally as atomic mutation actions.
More on that in a bit.
So before we get to repair quality, just sort of a brief show of how this happened over
time. On this graph the dark bar shows the fitness we assign to our -- the average of our
overall population of variants as time goes by, as generations go by.
And in this particular version, we had five normal regression test cases that were worth
one point each, and we had two test cases that reproduced the bug, and passing those --
that is, not looping forever, getting the right answer -- was worth ten points each.
And we start off at five. We pass all of the normal regression tests, but none of the buggy
tests. After one generation, our score is now up at 10 or 11. We pass one of the buggy
tests but have lost key functionality. We're no longer passing some of the regression
tests. We're 12 instead of 15.
And our favorite variant, the variant that eventually goes on to become the solution, stays
relatively stable for a while. The others sort of generally increase -- until eventually
either it's mutated or it combines part of its solution with something else that's doing
relatively well. And in this seventh generation, it passes all of the regression test cases as
well as the two bugs. So we think about this for a while. This is an iterative process, and
then hopefully at the end we pass all the test cases. Yes.
>>: If something doesn't compile you said [inaudible] zero, right?
>> Westley Weimer: Yes.
>>: Is it possible that you generate a lot of stuff that has fitness zero to find one thing that
actually compiles?
>> Westley Weimer: That's a great question. And in fact we have -- we have these
numbers, although --
>>: [inaudible]
>> Westley Weimer: Let us say that less than half of the variants that we generate fail to
compile. And I can get you the actual numbers right after this presentation.
>>: But if you're going to operate on, say, source code as opposed to [inaudible].
>> Westley Weimer: If we were to operate on source code and it was possible that we
could make mistakes with balanced parentheses or curly braces, then it would be a much
higher fraction. And we'll return to that in a bit.
>>: [inaudible]
>> Westley Weimer: So we maintain a population of variants. Let's say 40 random
versions of the same program. And we will mutate all of them and run them all on the
test cases and then keep the top half, say, to form the next generation.
And the average here is the average fitness of all 40 members of that generation, that
population.
>>: [inaudible] remains flat for [inaudible].
>> Westley Weimer: So I may defer some of this to the local expert, but my answer
would go along these lines. Genetic programming or genetic algorithms are heuristic
search strategies, somewhat like simulated annealing, that you would use to explore a
large space.
And one of the reasons you would be tempted to use something like this instead of, say,
Newton's method is that you want to avoid the problems associated with hill climbing.
You want to avoid getting stuck in a local maximum.
And the crossover operation in genetic programming or genetic algorithms, which I
haven't talked about, holds out the possibility of combining information from one variant
that's stuck in a local hill over here, let's say in the second part of the path, with another
variant that's stuck in a hill over here, say in the first part of the path. We might take
those two things and merge them together and suddenly jump up and break out of the
problems associated with local hill climbing.
So the hope is that this is the sort of thing we would get out of using this particular
heuristic search strategy. But I will defer to the local expert.
>> Stephanie Forrest: [inaudible] you can see that there's still something happening in the
population because the average is moving. So even though that individual was -- it was
good enough to keep being copied unchanged into the next generation, but it was so
obviously not the best individual left in the population. But then in the end [inaudible].
>>: [inaudible] nothing moved.
>> Stephanie Forrest: You know, in the end, it was the individual that led to the winner.
>> Westley Weimer: And the question was between three and four nothing moved.
Between three and four, the averages didn't move. But the population size is, let's say, 40
or 80.
>>: [inaudible] best individual [inaudible] how can the average be bigger than 11?
>> Westley Weimer: So I must apologize for the misleading label here. In our
terminology, the best individual is the one that eventually forms the repair.
>>: [inaudible]
>> Westley Weimer: Eventually. So we're tracking a single person or a single individual
from the dawn of time to the end.
>> Stephanie Forrest: So if you know about genetic algorithms, that's usually really hard to
do, to trace the evolution of a single individual. But it's an artifact of how we do our
crossover operator [inaudible].
>> Westley Weimer: Yes.
>>: How long did it take in seconds?
>> Westley Weimer: For this particular repair? Let's say 300 seconds. Three or four
slides from now, I will -- I will present quantitative results.
On average, for any of the 15 programs that we can repair, the average time to repair is
600 seconds.
So once we do this, we apply mutation for a while. We come up with a bunch of
variants, we evaluate them by running them on the test cases, eventually we find one
that passes all the test cases. This is our candidate for the repair.
Our patch that we might present to developers is simply the diff between the original
program and the variant that passes all the test cases. We just prettyprint it out and take
the diff.
However, random mutation may have added unneeded statements or may have changed
behavior that's not captured in any of your test cases. We might have inserted dead code
or redundant computation. And we don't want to waste developer time with that. The
purpose of this was to give them something good to look at.
So in essence we're going to take a look at our diff for our patch, which is sort of a list of
things to do to the program, and consider removing every line from the patch. If we did
all of the patch except line 3, would that still be enough to make us pass all of the test
cases? If so, then we don't need line 3. Then line 3 was redundant.
And you can imagine trying all combinations, but that might be computationally
infeasible or take exponential time. Instead we use the well-known delta debugging
approach to find a 1-minimal subset of the diff in N squared time, where 1-minimal
means that if even a single line of the patch were removed, we would no longer pass all
of the test cases.
And once again, rather than working at the raw concrete syntax level, rather than using
standard diff, we use a tree structured diff algorithm, diffX in particular, to avoid
problems with balanced curly braces where removing one would cause the patch to be
invalid and not compile.
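To illustrate the minimization idea, though not the actual delta debugging algorithm or the diffX tree representation, here is a runnable brute-force sketch: keep dropping single edits while the remaining patch still passes every test, which also reaches a 1-minimal result in a quadratic number of test-suite runs. The edits and the pass/fail stub are made up.

    #include <stdio.h>

    #define NUM_EDITS 4

    static const char *edits[NUM_EDITS] = {
        "insert days -= 366 into empty else branch",   /* needed     */
        "insert redundant assignment",                 /* not needed */
        "delete unrelated dead statement",             /* not needed */
        "replace bounds check",                        /* needed     */
    };

    /* Stand-in for "apply exactly these edits and run the whole test suite":
     * here we pretend only edits 0 and 3 are actually required to pass. */
    static int passes_all_tests(const int keep[NUM_EDITS]) {
        return keep[0] && keep[3];
    }

    int main(void) {
        int keep[NUM_EDITS] = {1, 1, 1, 1};
        int changed = 1;

        while (changed) {                  /* at most NUM_EDITS passes: O(n^2) test runs */
            changed = 0;
            for (int i = 0; i < NUM_EDITS; i++) {
                if (!keep[i]) continue;
                keep[i] = 0;               /* tentatively drop edit i                 */
                if (passes_all_tests(keep))
                    changed = 1;           /* still passes: edit i was redundant      */
                else
                    keep[i] = 1;           /* needed after all: put it back           */
            }
        }

        printf("1-minimal patch:\n");
        for (int i = 0; i < NUM_EDITS; i++)
            if (keep[i]) printf("  %s\n", edits[i]);
        return 0;
    }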
>>: [inaudible] I mean every line of the program rather than just the differences of the
candidate with the original program? Because that seems to be [inaudible] look at lines
that are actually the original program.
>> Westley Weimer: Um...
>>: You might now remove stuff that's -- happens to be not necessary to [inaudible].
>> Westley Weimer: Ah.
>>: [inaudible] was in the original program.
>> Westley Weimer: Okay. So this is a good question. Let me see if I can give an
example of this.
Let's say that the original program is A, B, C. And the fix is A, B, D. Something like
that. Then our first patch will be something like delete C and insert D at the appropriate
location.
There are two parts of this patch. These are the only things that we're going to be
considering when we minimize sets. So when we talk about minimizing the patch, the
possibilities that we're considering are 1 alone; 2 alone; both 1 and 2; or neither of them. So
we're -- we're never going to remove A. So we're not going to mistakenly hurt the original
program. It's only among the changes that were made by the genetic algorithm. We say did we
actually need all of those.
>>: [inaudible] lines in the patch other than the performance over here. What's the
[inaudible]?
>> Westley Weimer: So this is a great idea. It's a great question. And we have a number
of ideas related to it. The question was why are we trying to minimize the size of the
patch and not some other objective.
Our initial formulation of this was we wanted to present it to developers. And often
when developers make patches to working code, you want to do the least harm. You
want to touch as little as possible.
However, we have been considering applying this technique to other domains. For
example, let's imagine that you're a graphics person interested in pixel or vertex shaders
which are loaded onto a graphics processor. Those are small C programs. Rather than
trying to repair them, might we try to use genetic programming or the same sort of
mutation to create an optimized version of a vertex or pixel shader that is faster but
perhaps produces slightly blurrier output. But since it will be viewed by a human, we
might not care.
So here our test cases would be not an exact match but instead the L2 norm could be off
by 10 percent. And rather than favoring the smallness of the change, we would favor the
performance of the resulting program.
So that is a good idea. We're certainly considering things like that. For now, for the
basic presentation, we want to get to the essence of the defect to help the developer repair
it.
>>: But even outside of the domain of graphics, say, the smallness of the change is not
necessarily representative of how impactful it might be. So suppose you change function
foo that is called on dozens of paths whereas function bar is only called on one, something
like that.
>> Westley Weimer: Yes.
>>: And where you would favor [inaudible].
>> Westley Weimer: Right. So the question was a smaller change might influence a
method that's called more frequently.
Also at ICSE one of my students will be presenting work on statically predicting or
statically guessing the relative dynamic execution frequency of paths. One of the other
areas we've been considering for this is rather than breaking ties in terms of size, break
ties in terms of this change is less likely to be executed in the real world. So we might
have a static way of detecting that, or we might use the dynamic counts that we have
from your existing regression test case, assuming they're indicative, and use those to
break the tie instead. Also a great idea.
Other questions while we're here?
So does it work? Yes. So here we have initial results. All in all, we've been able to
repair about 15 programs totalling about 140,000 lines of code, off-the-shelf C programs.
On the left I have the names of the programs. Some of them may well be familiar.
Wu-ftpd is a favorite whipping boy for its many vulnerabilities. The php interpreter,
graphical Tetris game, flex, indent and deroff are standard sort of UNIX utilities.
Lighttpd and nullhttpd are Web servers. And openldap implements a directory access and
authentication protocol.
In the next column we've got the lines of code. And for a number of these -- openldap,
lighttpd or php -- it's worth noting that we are not performing a whole program analysis.
For these, we want to show that we can repair just a single module out of an entire
program. Openldap is relatively large. We're only going to concentrate on modifying
io.c. Perhaps someone has already been able to localize the error to there.
For the rest of these, the graphical Tetris game, the FTP server, we perform a whole program
analysis. We merge the entire program together into one conceptual file and do the repair
from there.
So here's the lines of code that we operated on. The next column over shows something
that's perhaps more relevant to us, the size of the weighted path. This can be viewed as a
proxy for the size of the search space that we have to look through in order to potentially
find the repair. Recall that we only perform mutations, insertions, and deletions along
this weighted path.
As we go through and create variants, we have to evaluate them. We apply our fitness
function. We run all of the test cases against each variant to see how well it's doing. This
ends up being the dominant performance cost in our approach. Running your entire test
suite 55 times might actually take a significant amount of wall clock time. Although it
can be parallelized at a number of levels.
So in this column we report the number of times we would have to run your entire test
suite before we came up with this repair.
And finally on the right, I have a description of the sorts of defects that we're repairing.
Initially we started off by reproducing results from the original fuzz testing work which
was on UNIX utilities. You feed them garbage input, they crash. So for many of these,
like deroff or indent, we found a number of segmentation violations or infinite loops that
we were able to repair.
Then we moved more towards security. We went back through old security mailing lists,
Bugtraq, that kind of thing, and found exploits that people had posted, say, against Web
servers or against openldap. And for these sorts of things, php, atris, lighttpd, openldap,
the negative test case is a published exploit by some black hat, and then we try to repair
the program until it's immune to that.
So defects include things as varied as format string vulnerabilities, integer overflows,
stack and heap buffer overflows, segmentation violations, infinite loops, non-overflow
denials of service. Yes.
>>: I was just going to ask what [inaudible] in each case?
>> Westley Weimer: Ah. This is a good question. So for many of these we were able to
get away with relatively small test suites. Let's say five or six test cases that encode
required behavior.
However, in many cases, we expect there to be more test cases available rather than less.
So I'll return to test case prioritization or test case reduction or our scaling behavior as
test cases increase in just a bit. For right now we're only using five or six test cases, and
we're still able to come up with decent repairs.
>>: [inaudible] 15 programs --
>> Westley Weimer: Yes.
>>: -- out of how many you're trying?
>> Westley Weimer: Of the first 15 we tried, in some sense, everything we tried
succeeded. Although recently we have found a number of programs that we're unable to
succeed at.
>>: [inaudible]
>> Westley Weimer: We might be unable to succeed at.
>>: You saw the test that the new [inaudible] still able to use the [inaudible].
>> Westley Weimer: This is such a great idea that it will occur two slides from now. We
will test in just a bit whether or not we can use the software after applying the repair. It's
an excellent idea. Yes.
>>: How do we read the fitness numbers?
>> Westley Weimer: How do we read the fitness numbers. This is the number of times I
had to run your entire test case, your entire test suite, before I came up with the repair.
So, for example, to fix the format string vulnerability in wu-ftpd, I had to run its test
suite 55 times.
>>: So are there lessons to be learned from looking [inaudible] which is [inaudible]?
>> Westley Weimer: Yes. Yes. There are such lessons. I will get to them on the next
slide and we'll talk about scalability in just a bit. Yes.
>>: So what was the [inaudible] size of the patch for each of these?
>> Westley Weimer: After minimization, the average patch size was four to five lines.
So we tended to be able to fix things or we tended to produce fixes that were relatively
local. And I'll talk about that qualitatively in a bit. Often our repairs were not what a
human would have done.
>>: [inaudible] easy to take a negative case and make it not [inaudible] if you change,
say -- make small changes [inaudible] chances are the black hat [inaudible]. So I'm not
sure -- so what is the metric for success? Is it the continued developability or just the
attack?
>> Westley Weimer: All right. So the question was for a number of security attacks,
they're relatively fragile and any small change might defeat them. To some degree, great.
If a small change does defeat it. Less facetiously, for many of these -- for openldap or for
php, for example -- we searched around until we found an exploit that was relatively
sophisticated. For example, exploits that did automatic searching for stack offsets and
layouts so that even if we changed the size of the stack they would still succeed.
For the format string vulnerability one we used an exploit that searched dynamically for
legitimate format strings, often finding ones that were 300 characters long or whatnot. So
the relatively simple changes where you add a single character and now the executable is
one byte larger and it's immune, we tried to come up with negative test cases that would
see through that.
Another way to approach that is we also manually inspected -- and we describe this in the
ICSE paper -- all of the resulting repairs and are able to tell whether or not a similar
attack would succeed.
And in general what actually happens is we do defeat the attack, but only there. Let's
imagine that you have a buffer overrun vulnerability rampant throughout your code, and
your negative test case shows one particular read that can be used to take over. We'll fix that one
read, but not the three others.
>>: [inaudible] trivial solution to the problem [inaudible] it's just if you -- they are so
[inaudible] you do one [inaudible] you don't want, just block those and that's it.
>> Westley Weimer: Right.
>>: So such -- not only it's faster, it's 100 percent precise with respect to that specific test
suite. And second if you know [inaudible] is the passing test, you [inaudible].
>> Westley Weimer: Right.
>>: What's wrong with that?
>> Westley Weimer: So Patrice [phonetic] is describing fast signature generation, an
approach that we find orthogonal to ours. There is nothing wrong with signature
generation. We like it a lot. In fact, we want to encourage you to use it to buy us time to
come up with the real repair.
So one of the tricks with fast and precise signature generation is that you actually often
can't get all of those adjectives at the same time. If you are just blocking that particular
input, then the black hat changes one character and they get past you or whatnot. If
you're trying to block a large class of inputs, then often you're using regular expressions
or some such at the network level and you also block legitimate traffic.
We're going to end up evaluating our technique using the same sorts of metrics you
would use to evaluate signature generation; to wit, the amount of time it takes us to do it
and the number of transactions lost later. And you'll be able to compare for yourself
quantitatively whether or not we've done a good job. But we do want to use signature
generation. It's orthogonal. [inaudible].
>>: So how many negative test cases do you have for each of these examples?
>> Westley Weimer: One each.
>>: Okay. So one way to measure the robustness of your technique would be the
following: You start with more than one negative test case [inaudible] technique
[inaudible] one negative test case at a time. So you see if [inaudible] and would all of
them end up with the same patch.
>> Westley Weimer: This is a good idea. Officially we can actually support multiple
negative test cases. We just keep going until you pass every test case. So having
multiple there at the same time doesn't hurt us. But in some sense we might -- we have
no reason to believe that our operation is distributive. It might be easier for us to fix two
things alone than to fix them both at the same time. Your robustness measurement is
intriguing. We should talk after the presentation.
>>: [inaudible] negative test and positive test was quite different. So what was the negative
test is going to become a positive test when you consider the next negative test.
>> Westley Weimer: Yes. We -- you are correct. Yes. Other questions while we're
here?
So in general this worked out relatively well. For us, rather than the lines of code in the
program, the size of the weighted path ended up being a big determining factor in how
long it took to come up with a repair. Question from the front? No.
>>: [inaudible]
>> Westley Weimer: I should wrap up now?
[multiple people speaking at once]
>> Stephanie Forrest: No more questions till the end.
>> Westley Weimer: Okay. So our scalability is largely related to the length of this
weighted path. The length of the weighted path is sort of a proxy for the search space
that we have to explore. Here we've plotted a number of things that we can repair: note
log-log scale. But in general this is one of the areas where we might use some sort of
off-the-shelf fault localization technique from Tom Ball or whatnot for the cases that
have particularly long weighted paths.
>> Stephanie Forrest: [inaudible]
>> Westley Weimer: Certainly.
>> Stephanie Forrest: So if you measure the slope of that line, what you find out is that it's
less than quadratic and a little more than linear in the length of the weighted path for a
small number of datasets.
>> Westley Weimer: So in general the repairs that we come up with are not what a
human would have done. For example, I talked about this before, we might add bounds
checks to one area in your code rather than rewriting the program to use a safe abstract
string class, which is what humans might do when presented with evidence of the bug in
one area.
It is worth noting that any proposed repair we come up with must pass all of the
regression tests that you give us. One way to view this is for one of the things that we
repair, this Web server, the vulnerability is that it trusts the content length provided by
the user. The user can provide a large negative number when posting information, when
uploading images or whatnot, allowing the remote user to take control of the Web server.
If we don't include any regression tests for POST functionality, the repair that we come
up with very quickly is to disable POST functionality. We just remove that. You're no
longer allowed to do that. Which in some sense might actually be better than the
standard approach of running the server in read-only mode because in read-only mode
you could still get access to confidential information.
But in general we require that the user provide us with a sufficiently rich test suite to
ensure all functionality. And that's an orthogonal problem in software engineering.
Our minimization process in some sense also prevents gratuitous deletions [inaudible].
And, not shown here, adding more test cases, say moving up to 20 or a hundred, helps us
rather than hurting us. So we still succeed about the same fraction of the time, if there are
more test cases.
So in brief limited time, you might also want to look at repair quality another way.
Imagine you're in some sort of e-commerce security setting like your Amazon.com or
pets.com and you're selling dogs to people over the Internet. From your perspective, a
high-quality repair might be one that blocks security attacks while still allowing you to
sell as many dogs as possible without reducing transactional throughput.
So to test this, to test our repair in sort of a self-healing setting, we integrated with an
anomaly intrusion detection system. Basic idea here is we throw an indicative workload
at you, and when the anomaly detection system flags the requests as possibly an attack,
we set it aside and treat it as the buggy test case and try to repair the program so that it's
not vulnerable to that.
The danger to this is that we can do this repair without humans in the loop. And --
>>: [inaudible]
>> Westley Weimer: What?
>>: [inaudible]
>> Westley Weimer: Yes, just dogs in the loop. So let's see how that actually goes. Our
experimental setup was to obtain indicative workloads. We then apply the workload to a
vanilla server sped up so that the server is brought to its knees. And the reason we're
doing this is so that any additional overhead introduced by the repair process or the patch
shows up on measurement.
We then send some known attack packet. The intrusion detection system flags it. We
take the server down during the repair, measure how many packets we lose then. We
apply the repair and then we send the workload again measuring the throughput after we
apply the repair, which is sort of equivalent to measuring the precision of a filter.
Does this work? In fact, it worked relatively well. We started out with workloads of about
140,000 requests for two different Web servers, and also we considered a dynamic Web
application written in php that ran on top of the Web servers as a sort of room and item
reservation system.
Each row here is individually normalized so that the vanilla version, before the repair, is
considered to have 100 percent performance; however many requests it's able to serve is
the right amount.
We send the indicative workload once. And then we send the attack packet and try to
come up with a repair for that attack. In each case, the attacks were again off-the-shelf
security vulnerabilities from Bugtraq. We were able to generate each of these repairs.
One of the first questions is how long did it take, are we as fast as fast signature
generation. And assuming that there's one such attack per day, in general we lose
between let's say zero and 3 percent of your transactional throughput if you take the
server down to do the repair.
In general, you might imagine having a failover or backup server in read-only mode, say,
that might handle requests during this time. So for us the more important number is what
happens after we deploy the repair. Might we deploy low-quality repairs that cause us to
fail to sell dogs?
And in essence the answer is no. Any transactions or any packets that we lost after
applying the repair were totally lost in the noise. And our definition of success was
relatively stringent. Success here means not only that we got the output byte for byte
correct, but also that we did it in the same amount of time or less than it took the vanilla
unmodified program.
And the reason for this, the interpretation, is that our positive test cases, the regression
tests, prevent us from sacrificing key functionality. There was a question.
>>: I use experiment a lot. I wonder if you've considered comparing the [inaudible]
reduction with [inaudible].
>> Westley Weimer: This is a great idea. We had not considered it. But we will now.
>>: I think you will do quite a bit better.
>> Westley Weimer: The question was could we compare our throughput loss to that of
a Web application firewall like -- and then you named one in particular.
>>: [inaudible] of the requests [inaudible].
>> Westley Weimer: Yes, yes. The outputs -- we only count it as a successful request if
it's exactly the same.
>>: [inaudible] after you'd gone back [inaudible].
>> Westley Weimer: Yes. So yes. I will get to this in let's say a minute. It's a great
question.
Whenever we start using an intrusion or anomaly detection system, false positives
become a fact of life. An intrusion detection system might mistakenly point out some innocuous request as if it were an error.
So a second set of experiments we did here was: let's take some false positives, some requests like GET index.html that are not actually attacks, pretend they are attacks, try to repair them, and see what happens. Maybe then we come up with a low-quality repair.
So we did that three times. We took completely legitimate requests, considered them to be attacks, and tried to repair them.
In two of the cases we were actually able to make some sort of repair, and I'll quantify for
you in just a second what that might mean. In general it takes us a long time to come up
with repairs for things that aren't actually problems, because we have to search through the space until we find some sort of corner-case way of, in essence, defeating this particular input while allowing all others to get through.
In fact, our third false positive was in essence GET index.html, and there was no way to both fail that request and also pass it as one of the regression tests. So we couldn't make a repair there.
But even for these two times where we did come up with repairs that in some sense
weren't actually valid or repaired non-bugs, any transactional throughput loss that we might have suffered at the end was in the noise. Because, again, we have these sort of positive
regression test cases that make us keep important behavior around.
What are examples of packets that we might lose? For one of these, the repair involved in some cases dropping the Cache-Control tag from the HTTP response header. Previously, yes, you were allowed to cache the response; with our bad repair, in some of these cases, no, you weren't allowed to cache it. So maybe then you have to make extra requests of the server in order to get the same data, you don't get everything done in time, and you drop a few packets there.
>>: Can you clarify the -- seems to me you said earlier that you're using the
workload [inaudible] problem that to -- is that your test cases that you --
>> Westley Weimer: No. This is a great question.
>>: [inaudible]
>> Westley Weimer: No.
>>: [inaudible]
>> Westley Weimer: No, no, no. We have five separate test cases: GET index.html, POST my name and copy it back to me, get a directory, get a GIF, and get a file that's not found. Those are our five test cases.
>>: Okay. And those are the ones that you're using for the [inaudible].
>> Westley Weimer: Yes. Those are the ones we check every time.
>>: [inaudible]
>> Westley Weimer: Yes. So the repair never sees these 138,000 requests in our -- yes.
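For concreteness, a minimal sketch of what those five positive tests might look like against a local test server; the port and the specific paths (echo.cgi, images/, logo.gif) are illustrative guesses, not the actual suite:

    import urllib.error
    import urllib.request

    BASE = "http://localhost:8080"   # assumed address of the server under test

    def run_positive_tests():
        results = []
        # 1. GET a plain page.
        results.append(urllib.request.urlopen(BASE + "/index.html").status == 200)
        # 2. POST a name and check that it is copied back in the response.
        reply = urllib.request.urlopen(BASE + "/echo.cgi", data=b"name=fido").read()
        results.append(b"fido" in reply)
        # 3. GET a directory listing.
        results.append(urllib.request.urlopen(BASE + "/images/").status == 200)
        # 4. GET a GIF.
        results.append(urllib.request.urlopen(BASE + "/images/logo.gif").status == 200)
        # 5. GET a file that does not exist and expect a 404.
        try:
            urllib.request.urlopen(BASE + "/no-such-file.html")
            results.append(False)
        except urllib.error.HTTPError as err:
            results.append(err.code == 404)
        return results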
>>: Sorry. So among this workload here, there are how many -- what's the percentage of
malicious, so to speak --
>> Westley Weimer: One out of 138,000. The workload is entirely benign. We play
the workload once. We send the attack, and then we play the workload again.
>>: And the attack is what exactly?
>> Westley Weimer: The attack is an off-the-shelf exploit from Bugtraq. It varies by
which program.
>>: [inaudible] the previous question different from the single basically [inaudible]
failing test case that you use for generating the repair.
>> Stephanie Forrest: No.
>> Westley Weimer: It is exactly the same -- it becomes the negative test case that we use to generate the repair.
>>: Exactly the same if there's only one of them, there are [inaudible] of them. That's what I'm --
>> Westley Weimer: Uh...
>>: What would defeat the trivial solution of just blocking that specific input?
>> Westley Weimer: In this experiment, if you did the trivial solution of blocking that
particular input, you would get a high score. We did not do the trivial solution of
blocking that particular input.
>>: Yes. But [inaudible].
>> Westley Weimer: Which you cannot tell from this chart, but which I assert and which we describe qualitatively in the text.
>>: [inaudible] really well if that's -- for that's basically experiment setup.
>> Westley Weimer: Right. And in essence what we want to say is that the trivial solution would do really well, perfect filtering would do really well, and we also do really well. But we're repairing the problem and not just putting a Band-Aid on top, or so we claim.
>>: [inaudible] you are implicitly giving a lot of weight [inaudible].
>> Westley Weimer: We're not going to invent new code. Right.
So, other limitations beyond those that have been brought out by astute audience members. One is that we can't handle non-deterministic faults. We run your program on all the test cases and assume that's going to tell us whether it's correct or not. And if you had something like a race condition, running it once might not be able to tell us what's going on.
Now, we did handle multithreaded programs like Web servers, but those aren't really
multithreaded because the copies don't actually communicate with each other in any
interesting way.
A long-term potential solution to this might be to put scheduler constraints perhaps
related to controlling interleavings into the variant representation so that a repair would
be both change these five lines of code and also instruct the virtual machine not to do the
following scheduler swap. Future work.
We also assume that the buggy test case visits different lines of code than the normal regression test cases. For something like a cross-site scripting attack or a SQL code injection vulnerability, that would not be the case, and then our weighted path wouldn't be a good way to narrow down the search space; in such a setting we might want to use an off-the-shelf fault localization approach like Tom Ball's.
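For context, a minimal sketch of the weighted-path idea being referred to, assuming per-test statement coverage is available; the particular weights (1.0, 0.1, 0.0) are illustrative defaults, not necessarily the exact constants used:

    def weighted_path(negative_coverage, positive_coverage,
                      weight_negative_only=1.0, weight_both=0.1):
        # Statements executed only by the failing (negative) test case are
        # the most suspicious; statements also exercised by passing tests
        # get a small weight; statements never touched by the failing test
        # get zero weight and are never chosen for mutation.
        weights = {}
        for stmt in negative_coverage:
            if stmt in positive_coverage:
                weights[stmt] = weight_both
            else:
                weights[stmt] = weight_negative_only
        return weights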
Finally, we also assume that existing statements in the program can help form the repair. Currently we're working on some notion of repair templates, something like typed runtime code generation: sort of the essence of a null check is, if some hole is not null, then some hole for statements. We can imagine either hand-crafting those templates or mining them from version control repositories. Maybe you work in a company where there have been a lot of changes related to switching over to wide character support, moving from printf to wsprintf. If we notice that other divisions are doing that, then maybe we can suggest that as a repair for you.
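A minimal sketch of such a null-check template with two holes, represented simply as a string template over C fragments; the helper name and the example fragments are illustrative, not the tool's actual template machinery:

    # A "null check" repair template with two holes: a pointer expression
    # and the statement block to guard. Instantiation fills the holes with
    # fragments taken from (or mined for) the program under repair.
    NULL_CHECK_TEMPLATE = "if (({expr}) != NULL) {{ {stmt} }}"

    def instantiate_null_check(expr, stmt):
        return NULL_CHECK_TEMPLATE.format(expr=expr, stmt=stmt)

    # Example:
    #   instantiate_null_check("buf", "len = strlen(buf);")
    #   -> 'if ((buf) != NULL) { len = strlen(buf); }'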
Sir?
>>: In connection to your dog example, so you're worried about whether dogs are sold or
not.
>> Westley Weimer: Yes.
>>: Instead you may worry about how many dogs are sold. So in this connection, there is one generalization: instead of the test being yes/no --
>> Westley Weimer: The test could be numbers.
>>: Numbers -- how well do we pass the test. And then you can optimize --
>> Westley Weimer: Yes. So this is a good idea. One of the reasons that we haven't pursued it experimentally is that in the open source world it is hard to find test cases that are that fine-grained, that have a real-number value rather than yes or no. But Microsoft may well have them, and before everyone leaves there will be sort of a beg-a-thon where I ask for --
>> Stephanie Forrest: The real reason we're here.
>> Westley Weimer: Yes. The real reason we're here: help from Microsoft.
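A minimal sketch of how the fitness function might generalize from pass/fail counts to such graded test scores; the run_test callable and the relative weights for positive and negative tests are assumptions for illustration:

    def fitness(variant, positive_tests, negative_tests, run_test,
                positive_weight=1.0, negative_weight=10.0):
        # Each test returns a score: 0 or 1 for a yes/no test, or a value
        # in [0, 1] for a graded test such as "fraction of dogs sold",
        # so richer tests can feed the search directly.
        positive_score = sum(run_test(variant, t) for t in positive_tests)
        negative_score = sum(run_test(variant, t) for t in negative_tests)
        return positive_weight * positive_score + negative_weight * negative_score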
So the work I've presented here actually covers two papers. The ICSE work that we're presenting in a few days in essence describes our repair algorithm, the test cases that we use, and the repair quality; it answers the question, does this approach work at all? It's controversial, but, you know, yes, it looks like it does.
We have a second paper, in the Genetic and Evolutionary Computation Conference, coming out a bit later, which answers questions related to our crossover and mutation operations -- is this really evolutionary computing, or is this perhaps just random search? -- what are the effects of more test cases, and what's the scaling behavior: questions more related to why this works.
In general, though, we claim that we can automatically and efficiently repair certain classes of bugs in off-the-shelf legacy programs. We use regression tests to encode the desired behavior and also to focus the search on parts of the program that we believe are more likely to contain the bugs. So we use regression tests rather than, say, manual
annotations.
And I encourage difficult questions, and there have been a number already. Any last
questions?
>>: Let's take a few more minutes to ask particularly tough questions.
[multiple people speaking at once]
>>: I would love to -- I mean, ideally of course you would like to have a proof -- I mean, more verification. Because to test that the repaired program is actually semantically equivalent, functionally equivalent, to the faulty one --
>> Westley Weimer: Except one thing, yeah.
>>: So but how are you [inaudible] -- I mean, what are your thoughts about that
challenge?
>> Westley Weimer: Right. So we have some previous work for automatically
generating repairs in the presence of a specification where the repair provably adheres to
the specification and doesn't introduce any new violations.
One of the things we wanted to do here was see how far we could get without
specifications. Without specifications it's harder to talk about things like proof. One of
the directions we've been considering instead is documentation. And we have some
previous work that was successful at automatically generating documentation for
object-oriented programs. We would extend that work to automatically document what we think is going on in the repair, so that it might be even easier for the developer to say, no, the tool is way off, I don't want to do this, or, oh, this is the right idea, but I should add one more line or some such.
We have also been considering -- imagine someone has some specifications, but not all of them, and also some test [inaudible] -- I know, I know, it's hard to believe -- and then also some test cases. Is there some way we could incorporate information from the specifications into our fitness function, or use them to sort of guide the repair? So that if we do have any sort of partial correctness specification, we can be certain that we never generate a repair that causes a tool or a model checker to think that the resulting program violates that partial correctness specification. Assuming they're present.
>>: [inaudible] more questions.
>>: [inaudible] maybe I missed a part, but a critical part seems to be which tests you pick -- the positive test cases -- because in reality I don't know of anybody going, yes, five test cases which are representative, and then multiply by the number of --
>> Westley Weimer: Yes.
>>: -- you see where I'm going, right?
>> Westley Weimer: I can see where you're going. So it might take a very long time.
So in fact we have current work not described here on using in essence either
randomization or test suite reduction. There's been a lot of previous work in the software
engineering community on let's say time aware test suite prioritization, maybe using
knapsack solvers, maybe using other approaches to perhaps order test cases by code
coverage, maybe doing the first ones and then sort of bailing out in a short-circuit manner
if you fail some obvious test at the beginning.
We have a student currently working on -- let's imagine you have a hundred test cases -- subsampling for each fitness evaluation: just pick ten of them, or just pick five of them, and use that for your fitness score, but then if a variant looks like it has succeeded, run it on all 100 of them. Thus far, using relatively simple techniques like that, we've been able to get
95 percent performance improvements. So we have reason to believe that we might be
able to apply off-the-shelf software engineering test suite reduction techniques in the case
where you have hundreds or thousands of test cases rather than five.
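A minimal sketch of that subsampling idea -- score candidates on a random handful of tests and only run the full suite on a candidate that looks promising; run_test is a placeholder for actually executing one test against one variant:

    import random

    def subsampled_fitness(variant, tests, run_test, sample_size=10):
        # Cheap fitness signal: evaluate the variant on a random subset of
        # the suite instead of retesting everything on every evaluation.
        sample = random.sample(tests, min(sample_size, len(tests)))
        return sum(run_test(variant, t) for t in sample) / len(sample)

    def validate(variant, tests, run_test):
        # Full validation: a candidate repair must still pass every test
        # before it is accepted.
        return all(run_test(variant, t) for t in tests)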
That was the sigh of I'm not convinced or...
>>: Well, then you should be running test cases -- I mean, you're assuming the test cases
actually were written. You might just try to destroy the machine or -- you are assuming
that your test case [inaudible] correctly for this to work in that they are really hard -- they
are very low level and they qualify quite a bit [inaudible] output. So [inaudible] concrete
input. I mean, first [inaudible] input, I get that completely defined output. You cannot
have a test case that says all the output has to be greater than zero first.
>> Westley Weimer: You could. Totally could.
>>: But then it's not going to work because it's going to -- you could very well then
generate basically a lot of different repairs that are going to basically -- and then the repaired program will be completely different, functionally speaking, from the original one.
You see what I'm getting at?
>> Westley Weimer: Yes. So we have not studied that -- again, we claim that if what you really want the program to be doing is not just returning some number greater than zero, then your test case shouldn't pass it just because it returns a number greater than zero. It's your responsibility to make a more precise test case in that scenario.
>>: So when you say you will be [inaudible] based on the fault, that might be very
expensive by itself.
>> Westley Weimer: But less expensive than a retest all methodology, which is what
we're currently doing. Right. It might still be expensive. But we are still in the early
stages of dealing with programs with more test cases. And one of the reasons that we're
still in the early stages is that they don't exist in the open source world. So if you in
Microsoft have many test cases and you would like to contribute to a worthy goal, you
might want to consider --
>>: You're on the wrong side of the [inaudible].
>>: If we had an exhaustive test suite, we wouldn't need this to repair. We'd have a [inaudible] formula for the entire program [inaudible].
>> Westley Weimer: The converse of that is that if you had a formal specification, you
also wouldn't need any repairs.
>>: Other questions after this commercial break?
>>: So, I mean, one way of trying to minimize the impact of a change you make is to
look at how much of the global state gets affected because of the patch that you generate.
>> Westley Weimer: Yes.
>>: I mean, the reason [inaudible] suppose you pass the test cases but you introduce some behavior that only shows when you run [inaudible] --
>> Westley Weimer: Yes.
>>: -- changes to the file system or something like that. So ideally speaking you want to minimize the impact that the program has on --
>> Westley Weimer: Yes.
>>: -- [inaudible] so is that something you consider?
>> Westley Weimer: This is a good idea. Would we notice if we added a memory leak? We probably would not, unless you already had a test case for it.
Officially, as is standard test-case practice, all of these test cases should be run in a chroot jail or in a virtual machine or whatnot, lest we introduce the command rm *.*.
If we're already running under a virtual machine, then checking things like which memory areas you're touching is potentially more feasible.
We have not thought about this, but this is a great idea. We should look more into it.
Other questions?
>>: I wanted to ask you one question before you go. So I think your premise has been
that programs contain seeds of their own repair [inaudible].
>> Westley Weimer: Yes.
>>: Some programs are basically [inaudible].
>> Westley Weimer: Hmm.
>>: So in those cases would it be possible to use the abundance of code and programs
[inaudible] to introduce some sort of form of [inaudible].
>> Westley Weimer: This -- I think the answer to this is yes. This would be a good idea.
We had not thus far considered it. We've been sort of stuck in our repair-templates mode, where you can see some of the seeds of this, as it were: drawing from a different division in your company to bring over repairs.
But in essence this is a different question, which is: when we're doing mutation to insert statements, might we draw from a different statement bank than the original program -- maybe a better, higher-quality statement bank, or one that has been winnowed down to contain only good stuff or whatnot.
This is a great idea. We have not considered it, but it is likely that it would do better than
what we're doing now.
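A minimal sketch of an insertion mutation that draws from such an external statement bank; in the work as described, the bank would simply be the statements of the program being repaired, and the names here are placeholders:

    import random

    def insert_mutation(variant_statements, location_weights, statement_bank):
        # Pick a weighted insertion point (using the fault-localization
        # weights), then insert a statement drawn from the bank -- which
        # could be the program itself or a curated, higher-quality
        # collection of candidate statements.
        locations = list(location_weights)
        chosen = random.choices(
            locations, weights=[location_weights[l] for l in locations])[0]
        mutated = list(variant_statements)
        mutated.insert(chosen, random.choice(statement_bank))
        return mutated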
>>: Okay. Well, let's thank the speaker.
[applause]