>> Jim Larus: All right. It’s my pleasure today to welcome two visitors from
the University of Illinois. Swarup and Vikram are both here today and
tomorrow so if there are other people who are not on this schedule who want
to talk there is a little bit of time, but it’s my pleasure to welcome Swarup
who is going to be giving the talk today about some work he has been doing on
an automated debugging framework.
>> Swarup Kumar Sahoo: Thank you. Good afternoon everyone. The title of my
talk is Towards a General Automated Debugging Framework using Automated
Software Fault Localization by Filtering Likely Invariants. This work was
done along with my colleagues Vikram Adve, John Criswell and Chase Geigle at
the University of Illinois.
The goal of our work is to develop some kind of semi-automated system to
help programmers fix bugs. And let me give some motivation for our work.
According to a NIST report software failures cost nearly 60 billion dollars
every year. And widely used applications contain the largest number of bugs.
For example Mozilla gets nearly 300 bug reports every day, so they need some
way to prioritize and diagnose these bugs.
It’s also known that the cost of fixing bugs increases as the software
development life cycle progresses, so bug-fixing costs are very high during
the operational/maintenance phase. So it’s very important to fix as many
bugs as possible before the application ships and gets deployed. The process
of debugging involves 3 key steps. First is reproducing the failure, then
trying to locate and understand the root cause that is responsible for the
failure and then finally trying to fix the root cause.
And debugging is a complex job for many reasons. Reproducing the failures
may be very difficult in many cases. And the point of failure may be very
far off from the root cause. That kind of complicates the process of
debugging. And debugging is mostly a manual and time-consuming process now.
Automatic fault localization mainly focuses on the second step: it can
automatically identify the root causes, the program statements that
are responsible for a failure. It can also extract other valuable
information which may help the programmer in debugging or fixing the bug.
Automatic fault localization can reduce the debugging cost and time
significantly.
So our overall goal is to develop some kind of semi-automatic system to help
programmers fix bugs. One great example where such a system can be
applicable is during software testing. Any kind of automatic software
testing takes input programs, test inputs and expected outputs, and
produces failing tests. So they have all the ingredients that are required
for an automatic debugging framework; specifically, all the failing tests have
some kind of oracle which can detect whether the program fails, along with
inputs that fail and some inputs that do not.
So using all this information an automatic debugging tool can try to point
out possible locations of root causes and some other information like faulty
program values, faulty execution paths and the cause-and-effect chains which
produce the symptom. And all this information can be greatly valuable for
debugging and fixing bugs.
In particular currently we have worked on actually trying to localize the
faults. In one of our recent [indiscernible] papers we developed the
automatic system to identify the root causes of the failures. It was very
scalable and it reports very few false positives. And our tool takes the
program and a faulty input as input and tries to produce the faulty locations
in the program. It also provides some other valuable information
and we output it in a presentable way.
So our technique is based on a few basic techniques. One of them is the well
known popular Delta debugging strategy, which tries to compare the memory
states of two different runs to isolate the root causes. However, it’s
pretty expensive to do that. And likely program invariants can be a very
efficient way to summarize and compare memory states of different runs. And
what are likely program invariants? They are program properties which are
observed to hold in some set of successful runs, but unlike sound invariants
they might not hold true for all possible future runs.
For example we can say return value of some function is always positive, or
some stored value is between 0 and 100, or some load value is always 10.
These are some examples of likely invariants. And all the likely invariants
that fail during the failing run can give us a set of candidate root causes.
But, even after this step we still need lots of improvements for effective
localization.
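The idea sketched so far can be illustrated with a toy implementation. This is my own sketch, not the speaker's actual tool: it trains likely range invariants at observation points (loads, stores, function return values) from successful runs, then flags the points whose value in a failing run falls outside the trained range as candidate root causes. The run/point representation is an assumption for illustration.

```python
# Sketch (not the actual tool): train likely range invariants from good runs,
# then flag the invariants violated by a failing run.

def train_invariants(good_runs):
    """good_runs: list of dicts mapping observation point -> observed value."""
    inv = {}  # point -> (min, max) observed across all successful runs
    for run in good_runs:
        for point, value in run.items():
            lo, hi = inv.get(point, (value, value))
            inv[point] = (min(lo, value), max(hi, value))
    return inv

def failed_invariants(invariants, failing_run):
    """Points whose failing-run value falls outside the trained range
    form the initial set of candidate root causes."""
    candidates = []
    for point, value in failing_run.items():
        if point in invariants:
            lo, hi = invariants[point]
            if not (lo <= value <= hi):
                candidates.append(point)
    return candidates

good = [{"ret:weekday": 3, "load:daynr": 100},
        {"ret:weekday": 6, "load:daynr": 250}]
bad = {"ret:weekday": -5, "load:daynr": 120}

inv = train_invariants(good)          # ret:weekday trained to [3, 6]
print(failed_invariants(inv, bad))    # ['ret:weekday']
```

Training on fewer, closer inputs yields tighter ranges, which is exactly why the talk's auto-generated similar inputs matter: a broad range trained on dissimilar inputs would not flag the faulty value.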
Some important contributions of our work were that we used a novel mechanism
to train invariants. In particular we used auto-generated similar inputs,
which are close to the failing input to generate these invariants. And we
combined our approach with dynamic slicing, and we used two
novel heuristics for reducing the false positives further. And we used many
bugs in [indiscernible] applications like Squid, Apache, MySQL and Clang for
evaluation. And we got 5 to 28 locations as root causes even for programs of
100K-1M lines of code. After that we applied some trivial manual filtering
steps which give us only 2-14 program locations. And in many cases we had
only 2-4 locations. So the results were excellent.
>> So I have a question. What does root cause mean? Does it mean that you
can change these instructions?
>> Swarup Kumar Sahoo: I guess the root cause is any program statements which
are responsible for the failure. But in evaluation [indiscernible] we saw
the [indiscernible] which statements were changed to fix the failure. So for
evaluation we use only those statements. We call those statements root
causes.
>> [inaudible].
>> Swarup Kumar Sahoo: Yeah, some [indiscernible], although we don’t have
[indiscernible].
>> So how large are your final reports? Is it a program slice or just several
statements?
>> Swarup Kumar Sahoo: It’s several statements; maybe I will show the
statements.
Okay. So I have the motivation and contribution of our work and now I will
be talking about some problems with the existing fault localization work which
we tried to address. And after that I will give details of our bug diagnosis
framework. Then I will give some key experimental results. Then I will talk
about some of the future work we plan to do towards usable automatic
debugging tools.
So before going further I will give some definitions that I just talked
about. So we define all the faulty program statements that are responsible
for the failure as root cause of the software. And for experimental
evaluation purposes all the modified statements in the patches we call them
the locations of root cause of the software. And all the candidate root
causes which are not the true locations of the root cause, they are called
false positives.
>> So question: so you modify [indiscernible] locations and you find one of
them? Is that a success?
>> Swarup Kumar Sahoo: No.
>> You need to find [inaudible]?
>> Swarup Kumar Sahoo: Only [indiscernible], which are really actually.
Sometimes they actually try to fix other things and some other irrelevant
statements. So, if it finds all the relevant statements that need to be
changed then we call it a success, but we need to find all of them.
So there has been a lot of other work on automatic fault localization. We
have classified them into 6 categories here. I will talk about only the
first two which are the most relevant to our work. If any of you are
interested I can talk about anything else. So Delta debugging is very
popular work for automatic fault localization. It’s a smart approach which
compares memory states of different runs, but it doesn’t scale well. And
there have been some improvements to delta debugging so that it can handle
larger applications. And this aims to find cause-effect chains, but in many
cases, 55 percent of the cases, it can still miss the root cause.
And, as I said, invariants are a compact, though not precise, way to
compare different runs, but most of the previous work, I think all of the
previous work has many issues. First the test inputs they use to train the
invariants may not always be applicable. And [indiscernible] of test inputs
is often low for training. And they don’t have any solution to make the
likely invariants narrow or tighter. So when the invariants become very
broad it may miss the root cause.
So some of the key insights of our work, which tries to improve on the
previous work, were that we used likely invariants [indiscernible] to
summarize and compare different runs. And in this way we can quickly isolate
summarize and compare different runs. And in this way we can quickly isolate
the difference in behavior and give the programmer an initial set of
candidates of root causes. And instead of using existing test inputs we
automatically generated similar, close good inputs to train the invariants.
And because of this we can now use very few close good inputs to train the
invariants. And because of this we get much tighter and relevant invariants.
So we have very few false positives.
That means we don’t miss --.
>> False negatives.
>> Swarup Kumar Sahoo: Oh, false negatives. That means we don’t miss many
root causes. But this may result in many false positives. And hence we
develop a sequence of novel filtering techniques to reduce these false
positives to a much smaller set.
Okay, now let me give some more details about bug diagnosis framework. So
this is the overall architecture of our tool. Our tool takes the program,
the original bad input, and an optional input specification. It then uses
them to try to generate many similar inputs, many similar good inputs. And
these good inputs are used to generate the invariants. These invariants are
then instrumented back into the program. After that, all the failed
invariants give us the initial set of candidate root causes. Then
we apply a set of false-positive filters to reduce this set of initial
candidates to a much smaller set.
In particular we apply three filtering steps. The first one is dynamic
backward slicing, the second one is dependence filtering and the third one is
multiple faulty input filtering. And I will talk about them in more detail
later. So let me give a concrete example to explain some of the concepts
later on.
So this is a bug from MySQL, and this bug happens when MySQL uses a specific
date field with [indiscernible] zero. And this causes a segmentation fault in
MySQL. The segmentation fault happens at line 7, when the weekday value
becomes negative and this results in a buffer overflow. But the actual root
cause starts at line 3, where an unsigned year value is used. And because of
this, when the year is 0, year minus 1 becomes a very large value instead of
minus 1. And this value [indiscernible] through various [indiscernible]
values and then the daynr value. Then daynr is used and this value
[indiscernible] to weekday. Then finally weekday becomes negative and it
results in a segmentation fault.
Okay. What I showed is a kind of simplified version of the code. Actually
this code is split between three different functions. And this is where the
buffer overflow occurs, at line 16. And there is a function which computes the
weekday value I showed in the previous slide. And the other function which
computes the daynr value I showed earlier. And the faulty values flow
through the green arrows here. And I will use the example to illustrate some
concepts later on.
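The unsigned-arithmetic wraparound at the heart of this bug can be distilled into a tiny sketch. This is my own simulation of the described behavior, not MySQL's actual code; the 32-bit width is an assumption.

```python
# Hypothetical distillation of the MySQL bug described above: the year is
# held in an unsigned type, so `year - 1` wraps around to a huge value
# instead of -1 when year == 0. Simulated here with 32-bit masking,
# which is how a C unsigned int would behave.

U32 = 0xFFFFFFFF

def sub_unsigned(a, b):
    """32-bit unsigned subtraction, as C would compute it."""
    return (a - b) & U32

year = 0
print(sub_unsigned(year, 1))   # 4294967295, not -1
```

That huge value then flows through the daynr computation into weekday, which ends up negative and indexes past the start of a buffer, producing the segmentation fault at line 16.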
Now a diagnosis with invariants, so in this work we use likely range
invariants to find potential root causes. Likely range invariants are the
ranges of values computed by individual instructions in the program in the
correct runs. And when these invariants get [indiscernible] during the
faulty run they give us the set of candidate
locations. Currently we have invariants only on load values, stored values
and the function return values since they are the most critical locations.
These are some of the examples of the invariants. So here, the return value
of weekday is between 0 and 6. And here some load value is always positive.
And here the stored value is always 100. So these are some of the examples
of invariants we are going to use. In the source code example we have
invariants on the return value of these two, weekday and daynr functions in
lines 9 and 12. And here the invariants return value is always positive.
And these actually fail during the faulty run. And they give us kind of
initial set of candidate locations.
One important point to note here is we are not actually trying to observe
invariants on [indiscernible] values, like for example let’s say
[indiscernible] or a temporary value here. So the bug may actually be anywhere
in the [indiscernible] values which feed values to the function return values
of the invariant instructions. So when we present the results, we output all
these statements also and give them to the programmer, since the bug may be
anywhere in the expression. And we call this the Expression Tree of the
invariant’s return value.
As I said earlier we train invariants using very similar inputs which are
close to the failing input. And in this way we can capture the key relevant
differences between different runs. And because of this we use very few
inputs to train the invariants. We get much tighter and relevant invariants
and we are less likely to miss the root causes, though it might result in
many false positives. We have many false positive filters for them.
I will briefly talk about how we construct inputs, and one important point is
this might not be the best way to actually generate inputs, although these
techniques work. And one of the key reasons we [indiscernible] here is that
we want to collaborate with testing teams like [indiscernible] and
[indiscernible] who are doing dynamic symbolic execution and other tools. By
using them properly we can generate the inputs in a much more systematic
manner.
Currently we have three approaches to generate inputs. One is deletion-based
specification-independent approach, which is kind of a variation of the well
known ddmin algorithm and apply character-level deletion. And the second
approach is a replacement-based specification-dependent approach. And for
this we actually need some kind of input specification, like what are the
tokens that can appear in the input, and for each token, what is the
alternative set of tokens that we can replace it with. So depending upon
the token type we try and create many variations of each token. And then we
change one token at a time to create the inputs.
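The deletion-based, specification-independent approach can be sketched as a ddmin-style pass that deletes character chunks from the failing input and keeps the variants that still run successfully as "close good inputs". This is my own illustration, not the tool's implementation; `runs_ok` is a stand-in oracle I assume exists.

```python
# Sketch of deletion-based input generation: delete one character chunk at
# a time from the failing input; variants that pass the oracle become
# close good inputs for invariant training.

def deletion_variants(failing_input, chunk=1):
    """Yield inputs obtained by removing one chunk of characters."""
    for i in range(0, len(failing_input), chunk):
        yield failing_input[:i] + failing_input[i + chunk:]

def close_good_inputs(failing_input, runs_ok, chunk=1):
    """Keep only the variants on which the program runs successfully."""
    return [v for v in deletion_variants(failing_input, chunk) if runs_ok(v)]

# Toy oracle: the program "fails" whenever the input contains "00".
runs_ok = lambda s: "00" not in s
print(close_good_inputs("a00b", runs_ok))  # ['a0b', 'a0b']
```

Because each variant differs from the failing input by a single chunk, the good inputs stay close to the failure, which is what keeps the trained invariants tight.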
And the third one was, for compiler bugs, we used a C-Reduce-based approach.
C-Reduce is a tool which tries to automatically create minimal test cases
for compiler bugs. And while it does so it actually produces many similar
inputs along the way. So we actually modified the test scripts in the
C-Reduce tool to keep track of the good and the faulty inputs and classify
them accordingly. And then after that we can actually select a small set of
inputs which are close to the original failing input. I have some slides; if
anyone wants to know more details about them I can actually explain later.
So right now what we have is we have a set of similar inputs. We then select
a set of close good inputs from them. Then we generate invariants using
those good inputs. And then what we will do is we will insert those
invariants back into the code and run it with the bad input. Now the failed
invariants will give us the set of initial candidates. But we still have
hundreds of candidates after this stage. It’s a significant reduction, but
it’s still too much for the programmer [indiscernible].
Hence we applied three different filtering techniques. The first is Dynamic
Backward Slicing. We strived to remove any kind of candidate invariants
which may not be influencing the symptom. Then we applied something called
Dependence Filtering where we tried to discard the dependent failed invariant
if there is no intervening passing invariants between two failing invariants.
And the third is Multiple Faulty Input Filtering, where we run the technique
for many different similar faulty inputs and then take an intersection of the
candidate root causes from all such inputs. And I will talk a little bit
about them now.
The first is Dynamic Backwards Slicing. Here we try to build the Dynamic
Backwards Slicing starting from the failure symptom. And any [indiscernible]
instruction or initial candidate root cause which does not fall on the
backward slice, we remove it. We implemented the NpwC algorithm in 2
phases and we handled both the data flow and control flow dependences. At
run time we record all the memory locations accessed, all the basic
blocks that are traversed, and the function calls and returns.
We then build a dynamic program dependence graph using this trace and the SSA
form. We call this tool [indiscernible]; it is available as open source and
other people have started using it. And there is also a Google Summer of Code
project this year where we are trying to actually make it more general and
widely available. And we computed this Dynamic Backwards Slicing on the
original failing run, since the root cause is likely exercised during the
failing run. And in our example the two invariants, which were on the returns
of the daynr and weekday functions, lie on the dynamic backward slice, so
they are not filtered out.
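The slicing filter described above can be sketched as a backward reachability walk over the dynamic dependence graph, keeping only candidates reachable from the symptom. This is my own simplified sketch; the graph representation is an assumption, and the real tool works over dynamic LLVM instructions rather than named nodes.

```python
# Sketch of the dynamic backward slicing filter: walk the dynamic
# dependence graph backward from the failure symptom; candidates off
# the slice cannot have influenced the symptom and are dropped.
from collections import deque

def backward_slice(deps, symptom):
    """deps maps each node to the nodes it depends on (data + control)."""
    seen = {symptom}
    work = deque([symptom])
    while work:
        node = work.popleft()
        for pred in deps.get(node, ()):
            if pred not in seen:
                seen.add(pred)
                work.append(pred)
    return seen

def slice_filter(candidates, deps, symptom):
    """Keep only the candidate root causes that lie on the backward slice."""
    on_slice = backward_slice(deps, symptom)
    return [c for c in candidates if c in on_slice]

deps = {"crash": ["weekday"], "weekday": ["daynr"], "daynr": ["year"]}
print(slice_filter(["daynr", "weekday", "lex_token"], deps, "crash"))
# ['daynr', 'weekday'] -- lex_token is off the slice, so it is dropped
```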
Okay. Now let me talk about the next filtering step which is Dependence
Filtering. So the main idea here is the return value of the daynr function
is actually used by the return value of weekday function. In this case we
say that this one actually failed not because it’s faulty, but since it used
a faulty value from the previous dependent instruction. So, most likely the
root cause is here, not here. Hence we say that this is a possible root
cause, because the invariant’s return value is greater than or equal to 0, or
[indiscernible] here by a negative value. But this one is probably not a root
cause, so we can filter it out.
So in general the idea is that we go through the dynamic program dependence
graph and we check for invariant failures. If a failed invariant uses value
from another failed invariant we say that the dependent invariant is actually
probably not a root cause. It only failed because it used a faulty value
from the previous invariant so we can filter this out. In other cases where
there are passing invariants between two failing invariants, in those cases
we don’t filter this dependent invariant, because it uses the value from a
passing invariant. So our assumption is that this value is correct. So this
used the correct value and failed. So this is also a likely root cause. So
in this case we [indiscernible] the root causes.
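The dependence filtering rule just described can be sketched as follows. This is my own toy rendering of the idea, not the tool's code: a failed invariant whose direct producer also failed probably just propagated the faulty value, so it is dropped, while one fed only by passing invariants is kept.

```python
# Sketch of dependence filtering: a failed invariant that directly consumes
# the value of another failed invariant (with no intervening passing
# invariant) likely failed by propagation and is filtered out.

def dependence_filter(uses, status):
    """uses: invariant site -> sites whose values it consumed.
    status: invariant site -> 'fail' or 'pass'."""
    kept = []
    for site, stat in status.items():
        if stat != "fail":
            continue
        producers = uses.get(site, [])
        # Drop the site if some direct producer also failed: the faulty
        # value flowed straight in. Keep it if all producers passed.
        if any(status.get(p) == "fail" for p in producers):
            continue
        kept.append(site)
    return kept

status = {"ret:daynr": "fail", "ret:weekday": "fail", "load:x": "pass"}
uses = {"ret:weekday": ["ret:daynr"], "ret:daynr": ["load:x"]}
print(dependence_filter(uses, status))  # ['ret:daynr']
```

Here ret:weekday failed only because it consumed the bad daynr value, so it is filtered; ret:daynr used a passing value and failed anyway, so it stays a candidate. As the talk notes, this heuristic is unsound and can occasionally discard a true root cause.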
>> [inaudible].
>> Swarup Kumar Sahoo: Oh, eliminate the top one?
>> Yeah, because it looks like the code seems to have [inaudible].
>> Swarup Kumar Sahoo: Yes, it may have recovered, and one more thing is we
are actually only seeing one path; it might be going through another path to
the symptom. That’s one important reason. This is actually not a sound
technique; our filtering techniques are not sound. So it’s possible that
this may not be a root cause, and it’s possible that that is a root cause.
So our technique is currently not sound, so it’s possible that sometimes it
can filter out the true root cause. But in this case we don’t know; it may
be going through other values and may be affecting the symptom.
And now Multiple Faulty Inputs Filtering step. This is a very simple idea.
So we assume that root causes are the same for all the similar faulty inputs
which cause the same failure. So we assume that the root cause must be
present in the candidate root causes of all such inputs. So what we do is we
use the similar input methodology to create many similar inputs. And we
repeat the previous three steps to construct the candidate root causes for
each different input. Then we take an intersection of all those candidate
root causes which gives us the final set of candidate root cause locations.
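This last step is simple enough to write out directly. A minimal sketch (mine, not the tool's), assuming the per-input candidate sets have already been computed by the earlier steps:

```python
# Sketch of multiple-faulty-input filtering: intersect the candidate sets
# produced for several similar faulty inputs, on the assumption that the
# true root cause appears in every one of them.

def intersect_candidates(candidate_sets):
    """candidate_sets: list of sets of candidate locations, one per input."""
    if not candidate_sets:
        return set()
    result = set(candidate_sets[0])
    for s in candidate_sets[1:]:
        result &= s
    return result

per_input = [{"daynr", "weekday", "lex"},
             {"daynr", "weekday"},
             {"daynr", "io"}]
print(sorted(intersect_candidates(per_input)))  # ['daynr']
```

The assumption can fail: as reported later in the talk, one generated faulty input did not exercise the root cause, so the intersection dropped it.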
Any questions?
Okay. Now I have described some of the key details of our Bug Diagnosis
Framework.
So I will talk about some key experimental results now. So here I would like
to address two key important questions. The first is how effective is our
overall bug localization framework? The second is how effective are our
filtering techniques? So for experimental evaluation purposes we used
[indiscernible] bugs from four applications. And there were 5 bugs which were
missing code bugs. That means there were some parts of the code which were
missing. And we didn’t consider them because our framework currently can’t
handle them. So we need some additional kinds of invariants, like
[indiscernible], to
handle them. So we didn’t consider them for this evaluation. And we used
LLVM for compiling programs and running our passes.
This table gives some key characteristics of the 8 server bugs. We used 3
server applications: Squid, Apache and MySQL. The third column here gives
total number of static lines of code that are executed by the faulty run. So
we have thousands of lines of code that get executed. And
the fourth column gives the distance from the root cause to the symptom in
terms of dynamic number of LLVM instructions. And this column gives the
distance in terms of number of static number of lines of code. And this
gives the distance in terms of static number of functions from the root cause
to the symptom.
The important observation is that thousands of lines of code get executed in
the failing runs, and the distance along the slice from the root cause to the
symptom spans several functions. And this distance is especially high for the
incorrect output bugs. So for the incorrect output bugs there are tens of
functions between the root cause and the symptom. And for such bugs the
diagnosis process is more difficult.
>> So if you just did a dynamic slicing how close do those distances become?
>> Swarup Kumar Sahoo: This distance is showing along the dynamic slice I
think. Oh, sorry, this distance right, yeah this is along the dynamic slice
actually. Okay. I have not included the other instructions here. Let me
give the other relevant instructions. So if we take the slice from the
symptom to the root cause it will span through this many functions.
>> [inaudible].
>> Swarup Kumar Sahoo: Oh the bug input?
>> [inaudible].
>> Swarup Kumar Sahoo: Oh for this application, like MySQL is kind of some
kind of query, like I said example MySQL query. And for Squid and Apache
it’s [indiscernible]. So we take the inputs for the application. Okay.
So now let me talk about how effective was our overall bug localization
framework. Each of these bugs executed thousands of static invariants. And
when we ran our instrumented programs we had around hundreds of failed
invariants in each of those bugs. So we can see it’s a significant reduction
from thousands of invariants to hundreds of invariants, but I still think
it’s a lot for the programmers to analyze each of them and figure out the
root cause. And when we applied all three of the previous filtering
techniques we got around 5 to 28 program locations as candidate root causes.
So the filtering steps were quite effective.
And then what we did was we actually manually went through those root causes.
And we applied a [indiscernible] filtering step which I will talk about a
little bit later. We could then reduce it to only 2 to 14 program locations.
So the approach was pretty effective for these bugs. And we missed root
cause in one of the cases. And here the root cause was inside the Visit
function the skipProcessUses to false. And we called it the VisitExpr
function here. And here the condition in this branch was wrong, hence
[indiscernible] skipProcessUses to true. So it remained false and it comes
back and incorrectly caused the ProcessUses function and it results in all
sorts of violations.
So to handle these kinds of bugs there are several ways to tackle this.
First is we can do a better input generation for that [indiscernible] of the
failing runs from the good runs. Another kind of invariants may help here,
like [indiscernible] and also invariants on the intermediate values. Right
now, as I said, we have only [indiscernible], not on the temporary
intermediate values. Invariants on intermediate values can also help in such
cases. And that’s the kind of future work for us.
Yes?
>> [inaudible].
>> Swarup Kumar Sahoo: [indiscernible]. I mean any kind of general
invariance which takes into account which branches the program takes. One
example I can think of is kind of [indiscernible] invariants. So basically
if I have some use, which definitions it is using? That depends on the
[indiscernible] of the application, this kind of invariants. So for some
classes of bugs this may be pretty useful, for example missing code bugs.
Now okay, how effective were our individual filters? So if we see the
Slicing Filter is pretty effective. It was able to reduce nearly 80 percent
of the false positives. And the second was Dependence Filtering and it was
pretty effective reducing nearly 53 percent of the remaining false positives.
And the third one, Multiple Faulty Input Filtering, was somewhat less
effective; it reduced 14 percent. But I think we can still say it is an
effective filtering step because it’s still a significant improvement.
Now one important thing was for one of the bugs the last filtering step
actually missed the root cause, since for some of the faulty inputs we
generated the candidate set didn’t contain the root cause. And so after the
last step we actually had the root cause for 11 out of 13 bugs.
>> So did any of these [inaudible]?
>> Swarup Kumar Sahoo: Sorry, can you repeat?
>> Did it change the control flow of the [inaudible] of all these bugs?
>> Swarup Kumar Sahoo: After fixing? Um, the control flow? Yes, some of
them, not all, but some of them will change the control flow for sure. But,
I don’t understand, sorry.
>> I understand Slice and Dependence filter are dependent on each other.
>> Swarup Kumar Sahoo: Yes, Dependence Filtering actually uses the Slicing
scale actually. The Dynamic Program Dependence Graph is built --.
>> [inaudible].
>> Swarup Kumar Sahoo: [indiscernible].
>> You do it by itself?
>> Swarup Kumar Sahoo: No, okay, we applied these previous steps on all those
faulty inputs into [indiscernible]. So it can --.
>> [inaudible].
>> Swarup Kumar Sahoo: Oh, just by itself you mean; that we have not tried?
It may be much more effective then if we apply it at the end.
So I will talk about the manual filtering step we did. The programmer can
actually manually look into those root causes and quickly try and filter
false positives. For example, when we looked through the candidate root
causes we found out that advancing many --. We didn’t do any kind of
sophisticated processing; we just looked at the function names where those
failed candidate invariants were. By just looking at the function names we
could figure out they are very unlikely to affect the symptom.
For example, Lex and Parsing functions: many such candidates will fail if
there is a slight difference between the inputs, but they are very unlikely
to affect the symptom. Same is the case with the input/output functions.
Also, random number generators can get evaluated randomly without really
affecting the symptom. And we also observed that many time-related functions
fail; they can fail if you run them at different times, but they are less
likely to affect the root cause. This filtering actually did not eliminate
the true root causes.
So after applying this from 5 to 28 we could reduce them to 2 to 14
locations. This is actually one of the bugs, the candidate locations in one
of the bugs. And here I have simply [indiscernible] them to just include the
function names here. So first we can remove all the candidate root causes
due to the time function. Then I remove the candidate root cause for the
random --; my_ is some kind of random number [indiscernible]
in MySQL. Then there are two functions which were input/output functions.
Then finally there were functions which were Lex and Parsing functions.
After that we had only 3 candidate root causes here. And the root cause was
in this function.
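The manual step just walked through can be sketched as a name-based filter. This is purely illustrative and mine, not the tool's: the list of "noisy" name patterns and the example function names are my guesses at the kind of lex/parse, I/O, random-number and time functions the talk describes.

```python
# Sketch of the manual filtering heuristic, automated for illustration:
# discard candidates sitting in functions whose names suggest lexing/
# parsing, I/O, random-number, or time code, since those rarely affect
# the symptom. The patterns below are assumed, not from the tool.

NOISY = ("lex", "parse", "read", "write", "rnd", "rand", "time")

def manual_filter(candidates):
    """candidates: list of (function_name, location) pairs."""
    return [(fn, loc) for fn, loc in candidates
            if not any(pat in fn.lower() for pat in NOISY)]

cands = [("my_rnd", 10), ("yyLex", 22), ("calc_daynr", 31), ("vio_read", 40)]
print(manual_filter(cands))  # [('calc_daynr', 31)]
```

In the bug above, this is exactly how the 5-28 candidates shrank to the final few: the time, random, I/O and parsing candidates drop out, leaving the function containing the root cause.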
>> How come you are generating invariants on random number [inaudible] in the
first place? [inaudible].
>> Swarup Kumar Sahoo: Oh, okay you mean that can be any kind of invariant
there. Yeah, but since we are not using sound static analysis we are doing
it at runtime. So whatever [indiscernible] observes, it will
try and form some kind of invariants there.
>> Oh, so if you run long enough on good inputs [inaudible]?
>> Swarup Kumar Sahoo: Yeah, [inaudible].
>> [inaudible]. It’s more like an observed range of values in some small
number of runs.
>> Swarup Kumar Sahoo: Of the properties.
>> I get that, but I was surprised that you saw these dynamically generated
properties.
>> [inaudible].
>> Swarup Kumar Sahoo: Okay. I will talk about some of the --. So, one of
the bugs [inaudible] is the Squid len bug. It worked in our case, but if you
use kind of general test inputs it may not work. Why? It’s because of the
input field; the faulty input actually uses many special characters to
reproduce the failure. And if you have a larger user name in the training
set than in the faulty input, you will actually miss the root cause. For
example, in this case the failing input was something like this, where it had
many specific special characters in the input.
And if in the training set you include many different inputs and you have a
very large [indiscernible], it will fail, because there are some invariants
which are based on the lengths of parts of the input. And these will become
very broad if you use many different large inputs. But our approach is more
likely to find the root cause in this case.
So, some of the caveats of our work: the first thing is I talked about the
expression trees and how we output to the programmer. And the second is
input sensitivity. So we output to the programmer all the failing candidates
at each filtering step. And for each candidate we also include the
maximal local sub-expression tree rooted at that candidate, which includes
all the intermediate values that feed values to the invariant instruction.
And we do this since we only track the load values, stored values and
function return values, not the intermediate values, where the bug may
actually be.
As I had shown earlier, for the invariant on this return value the
expression tree will include all the statements that are marked in red. And
for the candidate root causes that we had after our last filtering step, the
total number of lines of code that the source expression trees map to is
shown here. These numbers are somewhat high for
some of the bugs; however I think that we can actually reduce the size of the
expression tree to a much smaller set by putting invariants on the
intermediate values and using some other invariants like address-based
invariants and control-flow invariants. When you form the expression tree
the values can actually escape through the addresses and make the expression
tree large for some of the candidate root causes.
And also we observed the bug behavior was somewhat sensitive to the inputs we
used. For example as you saw in one of the bugs we missed the root cause in
the last filtering step because of some similar faulty inputs we used. And
the [indiscernible] bug is a SQL [indiscernible] bug. In one of our
experiments, when we used manual general inputs to train the invariants it
missed the root cause, but in our automatic setup it didn’t miss the root
cause.
So why does our approach work so well? We initially had a few thousand static lines of code which were exercised by the faulty run, and we were able to reduce that to 2 to 14 locations. Some key reasons, I think: the likely range invariants were effective for comparing successful and failing runs of many bugs, and we used a few similar inputs to train the invariants. So we had a very tight and relevant set of invariants, which averted the false positives.
>> False negatives.
>> Swarup Kumar Sahoo: Sorry, I have been talking about false positives more. So, "which prevented false negatives". And we also had some very effective filtering steps to reduce the false positives. Now let me talk about some of the future work we are planning towards developing a really usable debugging tool.
First of all, our analysis already extracts a lot of useful information. For example, the failed invariant and its value can be very helpful in debugging. The bad inputs and the good inputs, and their differences, can also provide [indiscernible] clues about what the root cause might be. For example, if we observe that the year field is always 0 in all the bad inputs, that gives us a strong clue. In one of the bugs the parameter to the aggregate function was always negative to reproduce the failure. I think such [indiscernible] can be pretty helpful.
We also have the dynamic execution path from the invariant failure to the symptom, and all this information is pretty useful. We can additionally use some custom but simple static analyses to extract more information from the symptom, the invariant and the execution path. For example, we can do some kind of symptom-specific analysis: memory bugs have particular types of root causes, so if we do symptom-specific analysis we could probably pinpoint the root causes better.
Second, whitebox fuzz testing uncovers a large number of bugs, and today it is difficult to find out which bugs to fix, which bugs not to fix, and how. Diagnosis can significantly help decide which bugs to fix. In such a testing environment, for example a nightly testing environment, our tool can be much more practical, because one of the problems with our tool is that we need some kind of detector to find out whether an input causes the failure or whether it's a good input.
Especially for incorrect-output bugs we don't have such a detector, so we currently use some other [indiscernible] applications, some other similar application, to compare the output and detect whether it's a failing run or not. However, nightly testing automatically provides an easy detector for all the bugs, and it also gives us a buggy input. So this can make our tool really practical.
We also plan to pursue several future research directions to make this tool really usable and to really pinpoint the root cause. First, our current input generation is not really general, so we plan to do more robust input generation. Our filtering techniques are also not really sound, so we are planning to do more robust filtering steps. And we want to explore a broader category of invariants.
Currently there are many automatic testing tools, like [indiscernible], etc., which use these types of approaches, and they use constraint solvers to construct new good and bad inputs. Such techniques can be leveraged to build more robust, general and automatic input generation for our framework. However, there are some crucial differences. For example, we need a search strategy which can quickly find inputs with similar execution paths; in contrast, the existing testing tools try to explore different paths so that they can increase code coverage, and our goal is the opposite. We also need a selection strategy to increase the likelihood of finding the root causes. I have an example of how we can do this: in this bug the root cause was here, in the last step. There are two conditions here in this function, and C1 to Cn are the path constraints on the conditions taken to reach this function.
And we can combine these with the other two conditional statements here; now we have a complete path constraint up to this candidate root cause. If we negate a certain constraint and solve up to that constraint, we can get a new set of inputs. For example, if we negate the last constraint here, where each of these conditions is a branch condition through which we reach this particular point, and we solve these constraints, the solver may be able to give us a different input which is not a failing input, but is similar to the failing input. Next we can negate the previous constraint and try to solve, and the solver can give us another input which is similar to the previous one.
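The negate-last-constraint idea just described can be sketched as follows. A real system would hand the constraints to an SMT solver; here, purely for illustration, a brute-force search over a small integer domain stands in for the solver, and the three constraints are invented:

```python
# Toy sketch: given path constraints C1..Cn leading to the failure,
# keep C1..Cn-1, negate Cn, and search for a new input -- one that
# follows almost the same path but does not fail.

def solve(constraints, domain):
    """Stand-in for a constraint solver: first value satisfying all."""
    for x in domain:
        if all(c(x) for c in constraints):
            return x
    return None

# Hypothetical path to the failure: x > 0, x < 100, x % 7 == 0.
c1, c2, c3 = (lambda x: x > 0), (lambda x: x < 100), (lambda x: x % 7 == 0)
similar = solve([c1, c2, lambda x: not c3(x)], range(200))
print(similar)  # 1 -- satisfies C1, C2 but not C3: a similar, non-failing input
```

Negating the next-to-last constraint instead would give another input that shares an even shorter path prefix, which is how the sequence of similar inputs is generated.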
But, as I said, our goal is to explore similar paths, not to increase block coverage, so the search strategy will need to minimize the differences, not the other way around. The next direction is building a more robust filtering step. We are investigating dynamic symbolic execution to reduce the false positives. In particular, given a failure-inducing input and a particular statement, is there a different value of that statement which can avert the failure? We would like to ask this of the constraint solver.
Depending on the answer from the solver we can categorize the statements into three sets: candidate statements for which a different value can avert the failure, candidate statements for which no different value can avert the failure, and third, statements for which the solver cannot find any answer within a certain time bound. Such an approach may not be feasible if we apply it to the whole program, but if we already have a small set of candidate locations then it may be a feasible approach.
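A rough sketch of this filtering step, with the solver again emulated by bounded search over a small domain (the failure predicate and statement names are invented; the timeout category is omitted for brevity):

```python
# Toy sketch: for each candidate statement, ask whether some different
# value at that point averts the failure. Statements with an averting
# value stay candidates; statements where no value helps are filtered.

def classify(candidates, fails_with, domain):
    """candidates: {statement: value observed in the failing run}."""
    avertible, unavoidable = [], []
    for stmt, observed in candidates.items():
        if any(v != observed and not fails_with(stmt, v) for v in domain):
            avertible.append(stmt)      # keep: likely root cause
        else:
            unavoidable.append(stmt)    # filter out
    return avertible, unavoidable

# Hypothetical: the failure happens whenever 'idx' is negative,
# regardless of what 'tmp' holds.
def fails_with(stmt, v):
    return v < 0 if stmt == "idx" else True

print(classify({"idx": -3, "tmp": 7}, fails_with, range(-5, 6)))
# (['idx'], ['tmp'])
```

A real implementation would pose each query to an SMT solver with a time bound, which is where the third "no answer in time" category comes from.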
One issue here is scalability, because the execution can diverge from the point of [indiscernible] at the invariant location, and we can possibly use the successful and failing runs to control the scalability. For this example we can give some sort of constraint like 10 [indiscernible] daynr + 5.
Okay, sorry.
Yeah, so here it was computing daynr + 5, so this constraint models that statement. Then this statement models the cost from [indiscernible], and now we have a condition which says the [indiscernible] lies within the array bounds. Depending on whether the solver finds a solution, which it does in this case, we can say this is a possible root cause; if the solver can't find a solution we can filter it out.
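A toy model of this slide's query (the array size and value range are assumed, since the slide itself is not in the transcript):

```python
# Rough model of the daynr example: the candidate statement computes
# pos = daynr + 5, and pos then indexes an array of size N. The solver
# question: is there any daynr for which pos stays within bounds?

N = 12                                   # assumed array size

def in_bounds(daynr):
    pos = daynr + 5                      # models the candidate statement
    return 0 <= pos < N                  # models the array-bound condition

feasible = any(in_bounds(d) for d in range(-100, 100))
print(feasible)  # True -> keep this statement as a possible root cause
```

If the bounded search (standing in for the solver) found no such daynr, the statement would be filtered out, exactly as described above.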
>> [inaudible]?
>> Swarup Kumar Sahoo: I am sorry, pardon me.
>> This sounds like [indiscernible].
>> Swarup Kumar Sahoo: Okay.
>> They have a similar approach.
>> Swarup Kumar Sahoo: Similar approach, okay.
>> [inaudible].
>> Swarup Kumar Sahoo: Yeah, I don’t remember.
>> Yeah, we should take a look at it.
>> Swarup Kumar Sahoo: Yeah, yeah.
Okay. To summarize: the tool can automatically identify 5 - 28 candidate root causes, and the likely range invariants were effective for comparing runs. We used a few similar good inputs to get tighter invariants, and we had novel filtering techniques to effectively reduce the false positives. After we applied the manual filtering step we had only 2 - 14 candidate locations. One important question we would like to ask is: "Can concolic execution be used to make this approach more robust and pinpoint the root cause(s) of failure?"
Questions?
>> So I think in general it's interesting to think about how what's good for your technique is good for search. In things like CHESS and SAGE there is a search strategy by which you search paths or you search schedules, and in the space you are searching, essentially, as long as things are good you are doing some pruning and you're [indiscernible]. The next path or the next trace is often very similar to the previous one. So once you have crossed the threshold from good to bad, once you find the bad trace, it's very often the case that, since the traces are so similar, you already get a very good candidate cause because of the search strategy, because you are already close to a good trace when you find the bad trace. So I think it's interesting to think about search and what the notion of closeness is in similarity, because in some cases you are randomly generating tests, or you just leave it up to the users to generate good and bad tests, and then you don't necessarily have the notion of closeness. But in search you do; you get it sort of for free and it really helps. In our experience, I think, you get the root causes almost for free if your search strategy has nice properties.
>> Swarup Kumar Sahoo: Yeah, one way might be to see how the execution path differs between different runs to find out the closeness between the inputs.
>> Jim Larus: Any other questions?
Let’s thank our speaker.
>> Swarup Kumar Sahoo: Okay, thank you.
[clapping]