>> Tom Ball: Hi, I'm Tom Ball, and it's my pleasure this early Thursday morning
to welcome the Dilligs, Isil and Thomas Dillig from Stanford University, where
they work as a power duo on program analysis with Alex Aiken, and they're going
to describe to us some work on doing scalable precise analysis in the presence
of uncertainty and imprecision. Welcome. And we look forward to the talk.
>> Isil Dillig: Thank you, Tom. So this talk is about constraint-based analysis in
the presence of uncertainty and imprecision. And this is joint work with Tom,
who will give the second half of this talk, and also our advisor Alex Aiken.
First of all, when we try to reason about a program statically, it would be great if we
had perfect knowledge about the world, but unfortunately uncertainty and
imprecision come up all the time. First of all, uncertainty comes up because we
cannot model every aspect of the environment that a program executes in. And
similarly, imprecision comes up because any program verification technique is
necessarily based on some sort of abstraction of the program.
Now to convince you that uncertainty and imprecision really are recurring themes
in program analysis, let's walk through a few simple but realistic examples. So
one typical example of uncertainty is whenever we ask the user for some kind of
input. So here we have a function that returns true if the user input is Y and
otherwise returns false. Here, since we have no control over what the user
input is going to be, we have to assume that this function can
non-deterministically either return true or false, but we don't know which one.
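For concreteness, here is a minimal C sketch of the kind of function being
described; the names and exact shape are illustrative rather than the slide's
actual code:

    #include <stdbool.h>
    #include <stdio.h>

    /* The environment choice: whatever character the user types. */
    char get_user_input(void) {
        return (char)getchar();
    }

    /* The function described on the slide: true iff the user typed 'Y'. */
    bool user_input_is_yes(void) {
        return get_user_input() == 'Y';
    }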
Another situation where uncertainty might come up is whenever we receive data
over the network. So for example here I've opened some socket and then I'm
using the receive function to get data over the network and populate the contents
of this one kilobyte buffer that I just allocated on the stack. Now, after this call to
receive, I have to assume that the contents of the buffer could be anything
because I don't know what kind of data I'm receiving. So therefore,
for example, among other things it could mean that the cast on the next line
could be unsafe.
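A sketch of the pattern being described, assuming a POSIX-style socket API; the
struct and names are illustrative:

    #include <sys/socket.h>

    struct packet { int kind; int len; char payload[1016]; };

    void read_from_network(int sock) {
        char buf[1024];                           /* one-kilobyte stack buffer */
        recv(sock, buf, sizeof(buf), 0);          /* contents now arbitrary */
        struct packet *p = (struct packet *)buf;  /* this cast may be unsafe */
        (void)p;  /* any use of p must assume arbitrary field values */
    }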
A third situation where uncertainty comes up might be when something depends
on operating system state. So for example, since we don't typically reason about
the free list of the operating system, we have to assume that malloc can
non-deterministically either return null or non null. And we could think of many
more situations where uncertainty plays a role. For example, calling a function
for which we don't have the source code, or when the
scheduler gets to make a decision about which thread to run next, and so on.
So therefore, in all of these situations, there are certain values that appear as
non-deterministic choices made by the environment. Now, moving on to
imprecision, at first glance imprecision seems quite different from uncertainty
because imprecision arises from the abstraction that's chosen intentionally by the
analysis designer. However, as we'll see in the next couple of examples,
imprecision will have very similar consequences as uncertainty.
For instance, if my program analysis doesn't integrate sophisticated shape
analysis reasoning, then it will most likely end up smashing all elements of an
unbounded data structure or an abstract data type into a single summary node,
and if that's the case and I read a particular element out of this array, then I have
to assume that this array read could result in any one of the possible values that I
previously wrote to this array.
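A tiny hypothetical illustration of that consequence:

    int read_element(int x, int y, int i) {
        int a[100];
        a[0] = x;
        a[1] = y;
        /* With all elements smashed into one summary node, the analysis
           must assume this read can yield any value previously written. */
        return a[i];
    }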
Another source of imprecision could be due to not tracking complicated
arithmetic, for instance for the sake of scalability. So for example, if my analysis
doesn't reason about non-linear arithmetic, then it will have no idea what this
expression coefficient times A times B plus min size is going to evaluate to. And so
therefore, this expression will again look to my analysis as though it was a
non-deterministic environment choice.
Yet a third source of imprecision could be, for example, not knowing about
complicated loop invariants. So here we have a compute GCD function that
uses Euclid's algorithm for computing the greatest common divisor of two
elements, A and B. So here unless my analysis somehow knows about some
non-trivial number-theoretic axioms, then it will most likely have no clue what's
going on inside compute GCD. And therefore, again, the result of calling
compute GCD will have to be treated as a non-deterministic environment choice.
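A sketch of the function being described, using Euclid's algorithm (the slide's
code may differ in details):

    int compute_gcd(int a, int b) {
        while (b != 0) {   /* Euclid: gcd(a, b) = gcd(b, a mod b) */
            int t = b;
            b = a % b;
            a = t;
        }
        return a;
    }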
So therefore, to recap, in all of these situations sources of imprecision appear to
my analysis as non-deterministic environment choices even though there is no
true non-determinism anywhere here.
Now, to deal with these problems that arise from uncertainty and imprecision,
many sound program analysis systems, and in particular constraint-based
systems, will typically model environment choice by introducing fresh,
unconstrained variables. And in this talk, we're going to refer to such variables
as choice variables.
So to be more concrete, for instance whenever there is a call to a function like
get user input, we're just going to make up some fresh variable beta. And then if
someone asks, is it possible that beta is equal to the character Y? Well, the
answer is of course; I don't know what the user input [inaudible].
>>: [inaudible] what exactly is [inaudible].
>> Isil Dillig: Yes. We'll get to the more specific part like a little bit later. Yeah.
Similarly, if someone asks, is it possible that beta is some value other than the
character Y? Well, again, the answer's of course, because beta is an
unconstrained variable. On the other hand, if someone asks can we guarantee
that beta is equal to the character Y, well unsurprisingly the answer is of course
not. Now, while the introduction of these choice variables allows us to be sound
in the presence of uncertainty and imprecision, the use of these choice variables
will introduce two important classes of problems.
First of all on the theoretical side, whenever we have recursive constraints that
contain choice variables, it's far from clear how we can go about solving them.
And we'll see an example of this in just a second. Furthermore, on the practical
side, the number of these choice variables grows proportionally with the size of
the analyzed program, and this results in large formulas, which then
directly translates into poor scalability of the analysis.
Now to illustrate the theoretical problems that arise from these choice variables,
let's look at the simple but recursive query user function. So what this function
does is it takes a Boolean variable called feature enabled. If this particular
feature is not enabled, it returns false. If it is enabled, it asks the user for
some kind of input. Then if the user input is Y, it will return true. If the user input
is N, it returns false. And if the user can't follow instructions and enters some
invalid character, then it will invoke itself recursively to prompt the user for a valid
input character.
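Reconstructed from that description, the function looks roughly like this (a
sketch; details such as the helper get_user_input are assumptions):

    #include <stdbool.h>

    char get_user_input(void);   /* environment choice, as before */

    bool query_user(bool feature_enabled) {
        if (!feature_enabled)
            return false;                    /* the first-line return */
        char c = get_user_input();
        if (c == 'Y') return true;
        if (c == 'N') return false;
        return query_user(feature_enabled);  /* re-prompt on invalid input */
    }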
So suppose here we want to know when will this query user function return true?
Or stating this a little bit differently, given some arbitrary argument alpha that
denotes feature enabled, what can we say about the constraint pi(alpha, true)
under which query user will return true? Now, let's try to write this constraint
together. First of all, as we can see from this highlighted line, a prerequisite for
this function to return true is that feature enabled must be true. Otherwise, the
function returns false on the first line. So therefore we have alpha equals true as
part of this formula.
So one way in which this function will return true is if, in addition to alpha being
true, the user input that we denote by the choice variable beta is equal to the
character Y on the first invocation of the function. But obviously this isn't the only
way this function will return true. It will also return true if, in addition to alpha
being true, the user input beta is not equal to the character N on the first
invocation and, in addition, the result of the recursive call is true.
Here note that we've applied the substitution true replaces alpha, because we
know if we were able to make the recursive call then feature enabled must be
true, otherwise we would have returned on the first line.
>>: So what is beta [inaudible].
>> Isil Dillig: I'll get to that in just a second. Yeah. Okay. Furthermore, note that
we need a substitution that replaces the old choice variable beta with a fresh
choice variable beta prime, because there is a distinct user input on each distinct
recursive invocation. Otherwise, in the general case, it would not be sound to
leave this choice variable unrenamed.
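Putting the pieces together, the recursive constraint being built up is, as best it
can be written out in the talk's notation:

    pi(alpha, true) = (alpha = true) and
                      ( beta = 'Y'
                        or ( beta != 'N'
                             and pi(alpha, true)[true/alpha][beta'/beta] ) )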
>>: [inaudible] in the flow analysis, right, because feature enabled just flows
[inaudible].
>> Isil Dillig: Exactly. Right.
>>: [inaudible].
>> Isil Dillig: Yes.
>>: Okay.
>> Isil Dillig: So therefore this constraint here characterizes the exact condition
under which query user will return true. However, note that this constraint is
recursive which is not surprising, given that query user is a recursive function.
But for this constraint to be immediately useful to us so that we can issue
satisfiability and validity queries and so on, we have to solve it and bring it to
closed form.
Now if we try to solve this constraint [inaudible] using a standard fixed point
computation, then we're going to end up introducing an unbounded number of
choice variables, beta, beta prime, beta double prime and so on. And obviously
this simple fixed point computation won't terminate. So therefore the lesson to be
learned from this example is that when we have recursive constraints that
contain these choice variables it's at least not immediately clear how you can go
about solving them.
However, even if we did have some way of solving such recursive
constraints, these choice variables still cause us headaches with scalability,
even for reasonably small programs. And to see why, let's consider this key new
private function from OpenSSH [inaudible], and even though this function may at
first look a little bit scary, it's actually one of the smallest functions I could find in
all of OpenSSH.
So what this function does is it takes an integer that identifies the type of the
cryptographic key we want to allocate. And then depending on the type of the
requested cryptographic key, it tries to initialize various fields, and if in doing so
any of the memory allocations fail, it calls this exit function called fatal, which
aborts the program and also logs that the memory allocator function called
BN_new failed.
On the other hand, if all the memory allocations succeed, this function will return
a properly initialized cryptographic key at the end of the function. Now here let's
assume that key RSA1, key RSA and key DSA were #define'd as 1, 2, and
3 respectively. So therefore this is what the preprocessed source code will look
like.
So now suppose we're interested in knowing the constraint under which key new
private will successfully return a new key. Or in other words,
what's the constraint under which we will reach this line highlighted in red? So
now, as before, let's denote the argument to key new private by alpha. And to
make it easier on us to reason about this function, let's slice out the relevant part of
this function. And if we do that, this is what the slice will look like.
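The following is a from-memory sketch of the shape of the sliced function; it is
not the actual OpenSSH source, and the struct layout and the helpers key_new and
fatal are illustrative:

    #include <openssl/bn.h>   /* BIGNUM, BN_new */

    #define KEY_RSA1 1
    #define KEY_RSA  2
    #define KEY_DSA  3

    typedef struct { BIGNUM *d, *iqmp; } RSAPart;   /* illustrative */
    typedef struct { BIGNUM *priv_key; } DSAPart;   /* illustrative */
    typedef struct { int type; RSAPart rsa; DSAPart dsa; } Key;

    void fatal(const char *msg);   /* logs the failure and aborts */
    Key *key_new(int type);        /* hypothetical allocation helper */

    Key *key_new_private(int type) {
        Key *k = key_new(type);
        switch (type) {
        case KEY_RSA1:
        case KEY_RSA:
            if ((k->rsa.d = BN_new()) == NULL) fatal("BN_new failed");
            if ((k->rsa.iqmp = BN_new()) == NULL) fatal("BN_new failed");
            /* ...more BN_new calls for the other fields... */
            break;
        case KEY_DSA:
            if ((k->dsa.priv_key = BN_new()) == NULL) fatal("BN_new failed");
            break;
        default:
            break;
        }
        return k;   /* the success line highlighted in red */
    }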
Now, as I mentioned earlier, BN_new is a memory allocator, for instance it could be
a [inaudible], so therefore, as we've seen before, its return value should be treated
as a non-deterministic environment choice. So therefore, unsurprisingly, we're
going to replace each call to BN_new with a fresh choice variable beta i. Now, if we
do that, then we end up with this much simpler version of key new private that's
given on the slide. And if we stare at this for just a second and put all of this
together, then we can write the condition under which the function succeeds as
this constraint here, which I won't even attempt to read. And to say the very least,
this is a somewhat verbose way of stating that this function succeeds if all the
memory allocations succeed.
Now, if that wasn't convincing enough for you, now let's take a look at the call
[inaudible] function. Now, in this three line code snippet here, we're trying to
allocate three different kinds of cryptographic keys. One of type RSA1, one of
type RSA, and one of type DSA. And suppose we want to know the constraint
under which we will reach this line marked with success? So clearly the
condition under which we will reach this line will be the conjunction of all the
conditions under which every call to key new private successfully returns a new
key. So therefore if we take the constraint we have from the previous slide and
instantiate that with the correct type and conjoin them, we end up with this rather
large constraint here. And if we take into account the fact that this huge
constraint arises from this three-line code snippet, then one starts to have doubts
about how scalable this approach really is.
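The three-line snippet as described, in sketch form:

    Key *k1 = key_new_private(KEY_RSA1);   /* type 1 */
    Key *k2 = key_new_private(KEY_RSA);    /* type 2 */
    Key *k3 = key_new_private(KEY_DSA);    /* type 3 */
    /* success: reached only if none of the calls hit fatal() */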
>>: [inaudible].
>> Isil Dillig: Uh-huh.
>>: So is the problem that you will have [inaudible] quantifications not decidable or
something like that?
>> Isil Dillig: No. We'll get to it later.
>>: Okay.
>> Isil Dillig: Yeah. So maybe like if you hold that question until the end, I can
answer it better I think.
>>: So like you can't simply simplify the formula --
>> Isil Dillig: You can. You can. You can simplify it and get something simpler,
but the point remains that you end up with large formulas that
contain lots and lots of redundancies and these variables that you don't want to
have in the first place.
So I guess going back to our existential quantification question even if you did
existentially quantify all the betas here, you would still end up with this huge
formula that like -- and we want to get something much simpler that says
the same thing.
So now what conclusions can we then draw from these examples? First of all, as
we saw in the key new private function, these choice variables result in large
formulas which then in turn translates into scaleability problems. And
furthermore, as we saw in the query user function, when we have recursive
constraints that contain choice variables, it's not immediately clear how you can
solve those. So therefore, it seems like it might be very desirable to eliminate
these annoying choice variables from our constraints. And the first idea that
immediately comes to mind is to play the same trick that we always play in
program analysis and compute an overapproximation of the constraint not
containing any of these choice variables.
Now, this discussion brings us immediately to the topic of necessary conditions
because an overapproximation of a formula phi not containing any choice
variables is a necessary condition of this formula. In other words, it's implied by
the original formula. But here, rather than computing just any necessary condition,
we're specifically interested in computing the strongest necessary condition, if at all
possible. And what this formula here says is that the strongest necessary
condition, which we denote by the ceiling function, is stronger than any other
formula phi prime that's also implied by the original constraint and doesn't contain
the choice variables.
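In symbols, the two defining properties of the strongest necessary condition are:

    (1) phi implies ceil(phi)
    (2) for any phi' not containing the choice variables:
        if phi implies phi', then ceil(phi) implies phi'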
And the -- yes?
>>: [inaudible].
>> Isil Dillig: What do you mean?
>>: [inaudible] exist?
>> Isil Dillig: It depends on what theory you're trying to
compute the strongest necessary condition in. For example, in first order logic,
computing the strongest necessary condition is undecidable, but in propositional
logic, it certainly is computable. It depends on the theory. And we'll talk more about that
later.
And the reason we like strongest necessary conditions so much is because they
have this desirable property of being satisfiability preserving. So in other words,
the original constraint phi is satisfiable if and only if its strongest necessary
condition is satisfiable.
So again, to emphasize here, if we use the strongest necessary condition to
determine satisfiability, we're neither overapproximating nor underapproximating:
our answer to satisfiability is exact.
>>: How did you get that [inaudible] one direction is clear, right, because phi
implies ceiling of --
>> Isil Dillig: Yeah. The other direction, one simple way to see it is, for example, if
the original constraint is unsatisfiable, it's equivalent to false, right, and false doesn't
contain any choice variables or any set of variables.
>>: The definition that you have for strongest necessary condition does not refer
to choice variables in any way. So your explanation doesn't make sense to me.
>> Isil Dillig: No, it does. What's implicit here is that phi prime doesn't contain
the choice variables, just as the strongest necessary condition doesn't. So for
example, if beta is a variable that we want to eliminate, then the strongest
necessary condition needs to be stronger than any other condition that also doesn't
contain beta and that's also implied by the original formula.
>>: So it's the same as if in your set phi you quantify, you universally quantify phi
over all choice variables?
>> Isil Dillig: No, they are free. They're free variables.
>>: Is the strongest necessary condition semantically equivalent to quantifying
out the choice --
>> Isil Dillig: It is.
>>: [inaudible].
>> Isil Dillig: It is.
>>: All right.
>> Isil Dillig: So now to be more concrete, let's consider the constraint we had
from key new private. So to give you some intuition here the strongest
necessary condition will be just true. And this makes sense because there is no
particular requirement that needs to hold about the type of the requested
cryptographic key for this function to succeed. In other words, key new private
may successfully return a valid key, no matter what kind of cryptographic key the
programmer requests.
Now, as a second example, let's look at this recursive constraint we had from
query user. In this case, the strongest necessary condition is alpha equals true.
And again, this makes sense because the only condition that needs to hold is that
feature enabled must be true. So if feature enabled is true, query user may return
true. That's all we can say.
Now, so far we haven't talked at all about how we can compute these strongest
necessary conditions, but even if we did have some way of computing these
strongest necessary conditions our situation would still not be entirely
satisfactory. And the reason for that is that if we only compute strongest necessary
conditions, we have basically lost our ability to soundly negate
constraints. This is the case because the strongest necessary condition of not
phi is not logically equivalent to the negation of the strongest necessary condition
of phi. In fact, it turns out that the negation of the strongest necessary condition
of phi is not even a necessary condition of not phi, let alone being the strongest
one.
So therefore, this suggests that we need a dual and complementary notion to
strongest necessary conditions, which is weakest sufficient conditions. So now,
just like we denote strongest necessary conditions by the ceiling function,
we're going to denote weakest sufficient conditions by the floor function,
because they underapproximate the formula. And the weakest sufficient
condition needs to satisfy two properties. First of all, it needs to imply the
original formula, and second, it needs to be weaker than any other formula phi
prime that also implies the original formula and doesn't contain the
choice variables.
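In symbols, the two defining properties of the weakest sufficient condition are:

    (1) floor(phi) implies phi
    (2) for any phi' not containing the choice variables:
        if phi' implies phi, then phi' implies floor(phi)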
>>: [inaudible] correspond to [inaudible].
>> Isil Dillig: Exactly, yes.
>>: So if there are existing names like universal quantification and existential
quantification, why do you need to invent new names for those concepts?
>> Isil Dillig: Because the point -- we want to eliminate them, we don't want the
quantifiers.
>>: [inaudible] you don't want -- you don't -- you want to eliminate everything to
do with certain types of variables.
>> Isil Dillig: So the [inaudible].
>>: Like predicate abstraction: you only want the abstraction variables, you don't
want any program [inaudible].
>>: So is it equivalent to universally or existentially quantifying this formula?
>> Isil Dillig: And then eliminating the -- exactly. It is equivalent.
>>: Okay.
>> Isil Dillig: But the point is we don't want the -- we don't want these existentially
quantified variables, because we want to separate out what's key to, like, as we'll
see -- what's key to path sensitive analysis versus what's noise in the
background. So I -- that's kind of the intuition.
>>: [inaudible] eliminating, elimination of [inaudible].
>> Isil Dillig: You could, yeah.
>>: But you have the restriction that there is only one, like, top-level quantification
going on, there's no mix --
>> Isil Dillig: Right. Exactly. Yes. And just like strongest necessary conditions
were satisfiability preserving, unsurprisingly it turns out that weakest sufficient
conditions are validity preserving. So the original condition phi is valid if and
only if its weakest sufficient condition is also valid. Again, if we use weakest
sufficient conditions to determine validity, we're getting an exact answer, not an
over- or underapproximation.
Now, again to give some intuition, let's look at this constraint from key new
private. Here the weakest sufficient condition is just alpha is less than or equal to
zero or alpha is greater than or equal to four. And if we think about what this
function does, it makes sense because if the type is neither key RSA1, nor key
RSA, nor key DSA, then we'll just trivially hit the default case and won't even try
to allocate memory, so it will trivially succeed.
On the other hand, if we consider the constraint from query user, here the
weakest sufficient condition will be just false. And again, this makes sense
because there is no condition on feature enabled that will guarantee that query
user will return true. It all depends on what the user input is. So therefore the
weakest sufficient condition is false.
So now what have we achieved? So one thing we have achieved is by having
pairs of strongest necessary and weakest sufficient conditions, we can now
finally make negation work again. And the way we can do it is if we've computed
the strongest necessary and weakest sufficient conditions of phi, then we can
compute the strongest necessary condition of not phi by just negating the
weakest sufficient condition of phi.
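In symbols, this rule and its dual, stated next, are:

    ceil(not phi)  =  not floor(phi)
    floor(not phi) =  not ceil(phi)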
And similarly, we can compute the weakest sufficient condition of not phi by
taking the negation of phi's strongest necessary condition. So again the duality
of existential and universal quantifiers comes into play directly here. Now that
was a little bit wordy, so again let's look at the concrete example. So in the
previous examples we computed the strongest necessary condition for key new
private to succeed as true and the weakest sufficient condition for this function to
succeed as alpha is either less than or equal to zero or alpha is greater than
or equal to four.
Now, suppose I want to know the constraint under which this function will fail. So
clearly this is going to be the negation of the constraint under which it will
succeed. So now to compute the strongest necessary condition for failure, we
take the weakest sufficient condition for success and negate that, which gives us
alpha must be between one and three. And to compute the weakest sufficient
condition for failure, we'll just negate the strongest necessary condition for
success so that will give us false. And again, it makes sense that the weakest
sufficient condition is false because nothing that we know about the type of the
cryptographic key will ensure that key new private will fail.
And similarly, it's sensible that the strongest necessary condition says alpha must
be between one and three because otherwise if the type is neither key RSA1,
key RSA, nor key DSA, the function won't even get a chance to fail. Now, Tom
will actually tell you how we can go about computing these strongest
necessary and weakest sufficient conditions.
>> Thomas Dillig: So what have we done so far? So far we have really only told
you how we can identify that special class of variables, which we call choice
variables, to model uncertainty and imprecision in program analysis. And
we have argued that if we compute pairs of strongest necessary and weakest
sufficient conditions that do not contain these choice variables, we can overcome
the termination problems that arise from having these fresh beta substitutions on
each recursive invocation.
We can also mitigate some of the scalability problems, because we don't have to
drag these betas, which accumulate everywhere, through all the constraints in our
program. And perhaps most importantly, we can still negate our constraints in
a sound way, and actually in a way that's not just sound but also
preserves satisfiability and validity. However, we haven't really shown you at all
how to do any of this; so far we've only talked about the general high-level
overview.
So from now on, let's take a look at how we can actually compute the strongest
necessary and weakest sufficient conditions for a system of recursive constraints
that represent the exact path- and context-sensitive conditions under which
some property we're interested in holds. And more specifically, we will use
the strongest necessary and weakest sufficient conditions to perform a sound
and complete path-sensitive program analysis. And our goal here will be to
answer may and must queries about the program.
And obviously our completeness guarantee here assumes a user-provided finite
abstraction. So we are only complete with respect to some finite abstraction of
your program. It would obviously be undecidable otherwise.
And to sum up the rest of this talk in only three bullet points: we will remove
these choice variables from our constraints; we will therefore end up with
formulas that, we will argue, are very small in practice; which in turn will mean
this technique will scale better than existing techniques for path- and
context-sensitive program analysis.
>>: [inaudible].
>> Thomas Dillig: Yeah?
>>: [inaudible]. If you have a user-provided finite abstraction, then does it mean
that your recursive constraints are over propositional logic?
>> Thomas Dillig: Actually, if you hold on for like two or three minutes, I'll walk
you exactly through the details and you'll see. Your question is right to the
point. So before I get started on exactly what the algorithm is, let me just sort of
point out where this approach fits in with existing path- and context-sensitive
analyses.
So on one end of the spectrum, there are related tools that are based on model
checking ideas; these would be tools such as Bebop, BLAST, SLAM, and so on. And
on the other side we have sort of lighter static analysis type tools such as
SATURN or ESP, which take a more [inaudible] static analysis approach to the same
problem. And if you think of the apparent trade-off here, it's almost like the
static analysis based tools like SATURN have actually scaled to
millions of lines of code, while a
tool like Bebop hasn't scaled quite that far on its own.
On the other hand, Bebop has this really strong guarantee of being sound and
complete with respect to some finite abstraction, while SATURN and also ESP
certainly don't have any aspirations to be complete in any sense. So the
idea here is that we want to eliminate this trade-off, and
we really want to have a technique that can give the same or very similar
guarantees to a technique like Bebop while still scaling to these really large
programs.
And so the main contribution here is therefore an algorithm for sound and
complete path sensitive analysis that will actually scale to these really large real
world programs. And the key insight we're going to exploit here is that while
these choice variables are very useful within their scope and boundary, we can
safely eliminate them outside their scope as long as we are only interested in
answering may and must queries about the program.
So what do I mean by that? Let's be concrete and let's take a look at this
process file function here. So process file takes a file pointer F from the user. It
then asks the user if the user would like to open a new file. If the user says yes,
it reassigns F to the result of a new fopen call; it then calls
process file internal with that file pointer F; and if the user chose to actually open a
new file, it will go ahead and close that new file before returning, to be a well
behaved function.
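Reconstructed from that description (a sketch; process_file_internal is as
spoken, the rest, including the file name and mode, is illustrative):

    #include <stdbool.h>
    #include <stdio.h>

    char get_user_input(void);            /* environment choice beta */
    void process_file_internal(FILE *f);  /* assumed to dereference f */

    void process_file(FILE *f) {
        bool opened = false;
        if (get_user_input() == 'Y') {
            f = fopen("newfile", "r");
            opened = true;
        }
        process_file_internal(f);
        if (opened)        /* correlated with the earlier branch on beta */
            fclose(f);
    }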
So as before, the user input here will be represented by a choice variable. And
that makes sense, since we really can't predict what the user will input at
static analysis time. And note that this function has an interesting feature:
more specifically, a branch correlation on the choice variable user input. And if
you are, for example, interested in verifying whether the fopen and fclose calls
here match up, you'd better track this branch correlation. So it's clearly
useful within that function.
However, since that user input is not visible in the calling context of process file,
there's really no additional information we can communicate to the outside by
dragging this beta variable along. More specifically, assume we're
interested in answering some may and must query about process file, such as:
may this original file be de-referenced? And let's assume, for the sake of this
example, that process file internal here always unconditionally de-references
the file F.
So now really the best thing we can say if someone asks us that question is the
constraint true. In other words, yes, sure, it may be de-referenced. There's no
other information I can tell you that will make your analysis any better if you
call process file. Similarly, if you ask me, must that function de-reference this file
pointer F, really the best thing I can say is false, or no, I don't know. There's no
additional thing I can tell you. It's really not under my control.
>>: [inaudible].
>> Thomas Dillig: Yeah?
>>: [inaudible] that go back to [inaudible] so that the [inaudible] so this, the
question means, does there exist an input to the function such that, if you
executed the function on that input, there is some execution which will lead to
the file F being de-referenced?
>> Thomas Dillig: Exactly.
>>: And the must question is for all paths [inaudible] de-reference the file.
>> Thomas Dillig: Exactly. That's exactly [inaudible]. Of course, if you're
interested in files, you would most likely be interested, if we have a program
analysis, in the may part, right? But you know, you will
need the must part to make negations work and so on [inaudible].
>>: So the must business, right, the must business actually requires you to solve
the termination problem, right?
>> Thomas Dillig: Yes. But of course, we don't really attempt to reason
about non-terminating programs. The results are qualified: we assume all
program paths will terminate. As usual. You're absolutely right about that.
So now how are we actually going to go about generating these conditions and
solving these problems? Well, we will first set up this recursive system of
constraints, and using the same notation here, we will describe the
constraint under which each function F in our program will return some abstract
value, Ci. Yes?
>>: [inaudible] about must -- my suspicion is that the definition of must is with
respect to a [inaudible], then the question that you're asking is that if you
reach this program [inaudible] regardless of what it [inaudible].
>>: Yes. So I actually don't know. I mean, I don't know the definition [inaudible]
I was just conjecturing that this was the definition --
>>: But the definition that I conjectured has termination built into it.
>>: Yes. My feeling is that you also define must without initial [inaudible].
>> Thomas Dillig: I think -- I think -- I think you're absolutely right about this. The
correct way of looking at it is: if you get to this program point, then this property
must hold. But we're not saying whether you will get there or not.
>>: So for all [inaudible].
>>: If you reach this program point, the property P must hold. For all inputs, if
there exists a path on that input to this program point, then this property holds.
So I see. So there's a for-all on the input and a there-exists on the path, because
the function -- so it could be non-deterministic, right?
>>: Yes.
>> Thomas Dillig: So now after we've set up the system of recursive constraints
we will then convert that system to Boolean constraints, so we'll see how to do
that. And after we've done that, we will then remove all these choice variables
from the Boolean constraints; we will end up with two systems, one of them
a system describing our necessary conditions and one describing our
sufficient conditions. And then we will take these systems and twiddle them just
a little bit, for a few technicalities, to make sure that they will also preserve
strongest necessary and weakest sufficient conditions under syntactic
substitution. And then we are ready to basically solve them using a standard
fixed point computation.
So to set up this recursive system E of these initial constraints, we will again use
the same notation we've used before, where we want to express the fact that
some function F, given some input alpha, will return some abstract value Ci.
And for the purposes of this talk, we'll assume that the only side
effect the function has is its return value. This can easily be extended to the case
where that's not so, but it just makes the notation a lot cleaner.
And we will end up with this matrix E here. And each phi ij here is just a Boolean
constraint over atoms of the form alpha equal to some value Ci, beta equal
to some value Ci, return variables pi, and comparisons between two
constants. Again, as before, the alphas here represent the function inputs,
and they are obviously provided by the calling context of the function.
The betas serve [inaudible], and the scope of each beta will just be the function
body in which it's introduced. Yeah?
>>: [inaudible] constraints for a recursive program? It's not clear to me
[inaudible].
>> Thomas Dillig: That's a really good question. Actually, we just handle them by
treating them as recursive functions. So really, I'm only going to talk about
recursive functions, and the implementation just turns them into [inaudible]
recursive functions.
And, exactly as your question suggested, the pi's on the right hand side here, as
we have seen, result from function calls, and they have the usual
substitutions we discussed. So let's be specific and look at this very simple
function F here. So F takes an integer X; it then declares another integer Y and
queries the user for some other integer via get user input. If X is 1 or Y is
2, it returns 1, and otherwise it returns F of 1.
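A sketch matching that description; here get_user_input is assumed to return an
integer:

    int get_user_input(void);   /* environment choice beta */

    int F(int x) {
        int y = get_user_input();
        if (x == 1 || y == 2)
            return 1;
        return F(1);            /* recursive call with argument 1 */
    }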
Now, this is actually a really stupid function. If you look at this function for just a
second, you will see [inaudible] function always returns 1. It's just highly inefficient
at doing so. And let's assume, for the purposes of our example here, that we
have three abstract values. So more specifically, we'll have the abstract value C1
for the integer 1, C2 for the integer 2, and let's say C3 here stands for all other
integers not equal to 1 and 2. And then the constraint we would write would look
like this. And here we'll get alpha equals 1 or beta equals 2, because for this
function to actually return the constant C1, it will certainly do so if X is equal to 1,
that is alpha equals 1, or Y is equal to 2, that is beta equals 2.
Or the recursive call must return C1, that's pi of F, alpha, C1, and the condition
under which the recursive call happens is exactly the negation of the condition on
the virtual return at this return point marked in red.
recursive system of constraints, we now want to convert them to Boolean
constraints. And here we are going to really just do the most straightforward
thing possible first. So we are going to see expression like CI equals CI, we're
going to say true, that wasn't a surprise. If we say CI equals CJ, we'll say false.
That's also not a surprise. And anything else has to be of the form some variable
VI equals CJ and we'll make up some fresh of VIJ for that.
So for example, if you look at the constraint for function F from just a couple of
seconds ago, we'll just replace alpha equals 1 by some fresh Boolean variable
alpha 1, beta equals 2 by some fresh Boolean variable beta 2, and the substitution 1
replaces alpha, unsurprisingly, by true replaces alpha 1, false replaces alpha 2, and
false replaces alpha 3.
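After this conversion, the same constraint reads, schematically:

    alpha_1 or beta_2
    or ( not(alpha_1 or beta_2)
         and pi(F, alpha, C1)[true/alpha_1, false/alpha_2, false/alpha_3] )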
So while this was very simple, it's not quite correct yet. And it's easy to see why
it's not correct: because there is no condition that stipulates yet that each of
these variables has to have exactly one value at a time. It can't have two values,
and it can't have zero values, unless you're in some sort of [inaudible] constraint
world, and we don't want to be there.
And fortunately, we can very easily enforce these additional constraints. And we
can do that just by properly conjoining these existence and uniqueness constraints
when we check satisfiability and validity. And if you do that and go back to our
three initial abstract values from the beginning, C1, C2, C3 from the example, we
would for example conclude that alpha 1 and alpha 2 is unsatisfiable,
because clearly alpha can't be equal to 1 and 2 at the same time. And we would
also conclude that beta 1 or beta 2 or beta 3 has to be valid: if you enumerate
all possible abstract values, clearly one of them must hold, making the constraint true.
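For a variable alpha ranging over the three abstract values, the constraints being
conjoined are, schematically:

    existence:   alpha_1 or alpha_2 or alpha_3
    uniqueness:  not(alpha_1 and alpha_2), not(alpha_1 and alpha_3),
                 not(alpha_2 and alpha_3)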
So after this step, we're really left with Boolean constraints. And the first thing
we're going to do is recall that well-known result that states that if
you want to compute the strongest necessary condition of some formula phi, not
containing some choice variable beta, you can do that by taking phi with beta
replaced by true, or'ed with phi with beta replaced by false.
And similarly, if you want to compute the weakest sufficient condition, you can do the
same thing; you just have to conjoin the two parts of the formula. And it's --
>>: [inaudible].
>> Thomas Dillig: Yeah?
>>: [inaudible] an example where you had an integer variable X, so how did you
come up with the finite abstraction?
>> Thomas Dillig: So that's a very good point. For now we're just going to
assume someone gave it to you. So for example, you scanned your program
syntactically, and every integer you saw you put in the abstract set. Something like
that. Later on in the experiments, I'll elaborate a little bit on what you actually do
in practice and how it turns out not to be a very big limitation. But, yeah. So this
result was actually first given by Mr. George Boole, the picture here, in 1854 in a book
called On the Laws of Thought, and interestingly enough, this little lemma has
been re-proved maybe 10 times since then. There's a whole sequence of papers
[inaudible] some version of this lemma, and we actually first found some of the
later ones, too, and eventually we dug all the way down to this book. I'm pretty
sure it's the first one to state this result for propositional logic.
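For reference, the lemma being invoked is, in the notation of this talk, with beta
the choice variable being eliminated:

    strongest necessary condition:  ceil(phi)  = phi[true/beta] or  phi[false/beta]
    weakest sufficient condition:   floor(phi) = phi[true/beta] and phi[false/beta]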
And so if you think back to what this step actually achieves: the recursive
system had these beta choice variables here, and after we apply Mr. Boole's
method we are then left with two systems, one of necessary conditions and one
of sufficient conditions. And note that there are no more betas in these constraints,
which is very good. But of course they are still recursive, which means it's not
quite lunch time; we still have to keep going and actually work through the rest,
right?
So now let's see on an example how this actually turns out. So if we go back to our
small F function here, and let's say we want to start computing the strongest
necessary condition: as expected, we'll replace beta 2 by true and false
respectively, and we'll put an or between those two parts. So the first part
immediately simplifies to true, and true or anything is true, so we will end up with
the necessary condition under which this function will return C1 being just
true.
And let's do the same thing for the sufficient condition here. So again we replace
beta 2 by true and false. Now we just conjoin the two parts. The first part simplifies
to true, and true and anything is just that anything. So the weakest sufficient
condition under which this function returns C1 will just be alpha 1, or the
recursive call returns C1.
So now, to actually solve these recursive constraints, we obviously have to make
sure that these constraints preserve their strongest necessary and weakest
sufficient conditions under syntactic substitution. In their current form, there are two
small difficulties that prevent that. The first we've
sort of seen earlier already: it's just the fact that the negation of the necessary
condition of phi is not equivalent to the necessary condition of
the negation of phi. And the same thing also obviously holds for weakest
sufficient conditions.
The second problem here arises from the fact that contradictions and tautologies
have to be enforced explicitly when we apply these substitutions. So what do I
mean by that? Well, let's be concrete and look at this constraint in blue at the
bottom of the slide: pi of F, alpha, C1 and pi of F, alpha, C2. So if you think about
what this stands for, it really just says some function at the same call site returns,
for the same input alpha, abstract values C1 and C2. Clearly this constraint is
false; both can't happen at the same time. So since the strongest necessary
condition is expected to preserve satisfiability, the only condition that preserves
satisfiability of false is again false. It doesn't have any choice variables, so it's
[inaudible]. Now, if we assume that the strongest necessary conditions of pi of F,
alpha, C1 and of pi of F, alpha, C2 are true, which is perfectly possible, then if we just
substitute them in we would get true, which is a necessary condition but certainly
not the strongest one. So we have to make sure that this can't happen. We have to,
in other words, make sure that for necessary conditions there's no way that a
substitution can accidentally weaken our constraint.
So how do we get around these two difficulties? Well, to deal with the first
problem, we can either recall, since we are operating with a finite-constant
assumption, that negation isn't really as bad as it looks and we can always
replace it with some big [inaudible], or we can be slightly
more intelligent and use the property that the necessary condition of not phi is
equivalent to not the sufficient condition of phi, and the same with the sufficient
condition. And obviously, if we want to use this property, it will require us to
simultaneously compute fixed points of strongest necessary and weakest sufficient
conditions. But it's really important for a practical implementation.
So how can we deal with these contradictions and tautologies? Well, if we're
concerned about strongest necessary conditions, one very easy way of making
sure there's no way that a substitution can actually weaken our constraint is to
convert the constraint to disjunctive normal form and drop all contradictions,
which have to be of a very special syntactic form at this stage, so we can find them
really easily. And again, for weakest sufficient conditions, things are pretty much
the same, just upside down. So here what really gets us into trouble is tautologies:
we want to make sure the constraints can't inadvertently be strengthened by a
substitution. And we prevent that by converting to conjunctive normal form
and dropping all tautologies.
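A reconstruction of the kind of syntactic patterns meant, assuming the abstract
values C1, C2, C3 cover all cases:

    in DNF:  a disjunct containing both pi(F, alpha, C1) and pi(F, alpha, C2)
             is a contradiction and is dropped (necessary conditions)
    in CNF:  a clause containing pi(F, alpha, C1) or pi(F, alpha, C2) or
             pi(F, alpha, C3) is a tautology and is dropped (sufficient conditions)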
So after we've done that, the resulting constraints will now preserve strongest
necessary and weakest sufficient conditions under simple syntactic
substitution. So we'll go ahead and throw them into our fixed point computation, we
wait, and out comes a system of non-recursive constraints not containing any
choice variables, which we can use for may and must queries.
And let's go back to the F example and its original constraint here. Remember
that we computed the strongest necessary condition for this function to return C1
as true; that's non-recursive, so there's nothing left to solve. The weakest
sufficient condition is still recursive. So let's say we want to compute a
greatest fixed point: we get alpha 1 or false, which equals alpha 1; then alpha 1 or
alpha 1 with true replacing alpha 1, which gives us alpha 1 or true; and we reach
a fixed point at true.
And note that the sufficient condition here expresses
exactly that this function must always return 1, because it's
valid. So you can see how this technique was able to deduce that this
function F, which is really not a very intelligent function, is a function that must return 1.
So the main result is a sound and complete technique for
answering these may and must queries, again with respect to some finite
abstraction, of course.
And the claim here is that by eliminating these choice variables, we end up
with much smaller formulas in practice, which in turn means that we can scale
quite a bit better than existing approaches to similar problems. Now, of course,
to back up this claim that these conditions stay small and
this actually scales, we decided to do some experiments and see what things
look like in practice, because they often look very different than you think they do.
And to do that, we decided to compute the full interprocedural constraint for
every single pointer dereference in OpenSSH, Samba, and the entire Linux
kernel. And of course, when we say full constraint, we mean we're going to compute
what we showed you in this talk, so we're going to compute necessary and sufficient
conditions.
Now, we believe that this is a stress test for this technique, since we couldn't
really think of anything that's more ubiquitous in C than pointer
dereferences. We therefore hope that if this technique scales to pointer
dereferences, it should also scale to many other properties that you might be
interested in.
>>: [inaudible] for every pointer [inaudible] --
>> Thomas Dillig: So our goal here is basically to say, for each function, what's
the constraint under which you will de-reference your argument? Or one of your
arguments, or one of the fields of your arguments, and so on. So for example, you
might ask, what's the constraint under which you de-reference your first
argument's F field? And, you know, obviously this might potentially be
recursive, and it's interprocedural because of the calls you might make. So that's
the question you ask.
>>: [inaudible] we are going to compute a separate constraint.
>> Isil Dillig: Yes. And if you look at the graph here, this graph shows on the
X axis the size of these necessary and sufficient conditions, where necessary
conditions are marked in red and sufficient conditions are marked in green,
and on the Y axis it shows the frequency. And this graph is for Linux; it's just the
largest, so it gives the best sample size. But the others are very similar; there's no
significant difference.
And one thing to note is that the Y axis here is actually [inaudible] scale. So if
you look at this closely, you can see that more than 99 percent of all of these
constraints have fewer than 9 Boolean connectives in them. They're really, really
small. And it's exactly this main point that allows this technique to
scale to something like Linux: because it can separate out what matters for path
sensitivity from the noise, it doesn't have to carry around these huge constraints that
accumulate. The things it ends up with in practice are very small. And again,
this graph gives the whole explanation of why this [inaudible] Linux.
>>: [inaudible].
>> Thomas Dillig: That's a very good question. So the finite abstraction we used
-- as we have said in this talk, for this technique
to be complete, it's obviously with respect to that finite abstraction. But you can
still be sound and almost complete if you just, you know, use some sort of
sufficiently fine abstraction. So for example, you take all the integers that
syntactically appear in your program and all the things they're compared to, and stuff
like that.
So we use some technique like that to generate the set of things we track. And
that's basically the conditions we get
from the program. And [inaudible] unless you start computing arithmetic, like
returning twice your input, this technique is still complete.
>>: So [inaudible] I mean, if you have like nine Boolean variables per procedure,
[inaudible] scale to Linux no problem, and the thing that may blow up scaling is the
refinement loop in SLAM, right, where you start generating, you know, really big
BDDs. So fundamentally, like, I'm still trying to -- so fundamentally, if I have a fixed
finite abstraction, like Bebop does, it computes post iteratively, and it uses
Boole's law when it computes procedure summaries: all local variables of a
procedure get existentially quantified out, just like you -- just like you show there,
and then it does a post, so the image under post also has existential quantification.
>> Thomas Dillig: [inaudible].
>>: But it's using [inaudible] to suck it up. So I'm just trying to think what
[inaudible] is not doing -- it's not doing this dual thing.
>>: [inaudible].
>>: [inaudible].
>>: [inaudible].
>>: Can I finish my question?
>>: Yeah.
>>: So when I'm doing post, I'm doing strongest postcondition computation, so I'm
doing all this existential, everything's existential. So what does the universal buy? So
basically I have a transition system and I do an image computation, and I
existentially quantify out the intermediate state, and when I lift the summary to a
caller I existentially quantify out the locals. So I just want to understand -- if I
understand sort of what you've done in the context of Bebop, then you have an
extra step, which is this must, and I want to really try to understand what the
must buys, because that's something that Bebop doesn't do, but we could also do
the universal thing, because I could also do your rule. And I'm not quite sure --
>> Thomas Dillig: So the main thing, the main thing, the must -- first of all, if
you're interested in must properties, obviously it's important. But let's assume we're
interested in may properties. So if you're interested in may properties, what it
really buys you is that you can do a negation easily, and you can do the negation
without explicitly enumerating this disjunction of abstract values, which might be
huge. Like, for example, in Linux it might be like hundreds of thousands of
elements; it might be more. Not in one function, but overall. So if I want to take
every integer something is compared to anyplace in the program, I mean, I can
write down this disjunction in theory, but I can basically go home after I wrote it.
>>: Right. But [inaudible] in BDDs, a function and its negation have the same
representation complexity. If you represent the function using BDDs.
>> Thomas Dillig: So you're saying it would be possible to encode something
similar efficiently in BDDs as well?
>>: [inaudible] the problem but this problem that you --
[brief talking over]
>>: I mean, I have negation, I mean essentially -- essentially every Boolean
variable I also have its negation. So maybe we should take it offline.
>>: You want to negate without [inaudible].
>> Thomas Dillig: Exactly. Because what you're going to do, you're going to turn
it existential, you're going to [inaudible] it's not. Because if I say there exists a
beta that's equal to Y, the negation of this doesn't mean that for all -- you know,
they're really just free variables in my constraint. That's how I
want the negation to work, given what it's supposed to mean.
>>: [inaudible] basically what you're -- what you're saying is that if I have a may
query and I want to negate it then I can use some must information, right, to
refine that?
>> Thomas Dillig: Yes.
>>: And that can cut off a lot of search [inaudible] potentially.
>> Thomas Dillig: And note that in this technique there's no refinement loop,
right? So we --
>>: I understand.
>> Thomas Dillig: Yeah. But I think that's a good way of --
>>: [inaudible] is, so you're computing may information for a procedure, and this
procedure calls another procedure and returns a value, and you compare that
return value, and you can say things like, if that other procedure -- if you
have must information, you can use that to give may information
[inaudible] computing.
>>: Sure, sure, sure.
>>: And that way --
>>: I really haven't understood how they interact in this framework.
>>: [inaudible] is that existential quantification is used [inaudible] procedures for
computing [inaudible] information and universal quantification is used for some
random procedures.
>>: Yes.
>>: But I want to understand the interaction.
>> Thomas Dillig: The interaction really comes from the fact that if you have some
may fact and you negate it, it becomes a must fact. Similarly, if you have a must
fact and you negate it, it becomes a may fact.
>>: Once you negate.
>> Thomas Dillig: So whenever -- you don't know. So I'm computing something
for myself, right? I'm computing, say, the constraint under which I return some
integer. So now I do not know -- some of my calling contexts may be asking, am I
equal to 2? Others might ask, am I greater than 1? So there will be some sort of
negation in there. And basically what this technique gives you is
that I look at this procedure once, and I just compute both. I'm going to say whatever
you want: if you use my return value in a negation, I'm ready. If you don't use it in
a negation, fine with me. I have a formula for both. So by computing both, when
there's a negation you can just naturally pick the sufficient condition and flip it.
And that's what you take. Yeah.
>>: [inaudible] use it to compute must information you need must information in
each step.
>> Thomas Dillig: Yes.
>>: And [inaudible].
>> Thomas Dillig: Oh, yes.
>>: [inaudible].
>> Thomas Dillig: So they do interact at the negation points, right. So to
be really concrete, in this system we
implemented, if you talk about a constraint, it's a pair, right? It's
the necessary and the sufficient conditions, everywhere. So that's exactly
the way this works. So it is everywhere in that sense.
>>: [inaudible] so but you do get [inaudible] a lot of must information.
>> Thomas Dillig: You surprisingly do. Not always, but you do, because there's a
lot of --
>>: Because generally, in my experimentation with must, which has been very
small compared to what you've done, there's -- for predicate abstraction there's
very often this [inaudible] it's hard to get must. Right? Generally, generally you
need a lot more information. Because must is an underapproximation.
>> Thomas Dillig: Yes.
>>: And so generally you need a lot more predicates to get must. And it depends
on your abstraction.
>> Isil Dillig: And so, for example, [inaudible] we needed this must information:
suppose we're doing pointer analysis, right, and some [inaudible] has some
side effect that [inaudible] the conditions under which it may and must happen.
Now, in the calling context, the pointer had some other targets prior to calling this
function, right? Now, to determine the constraint under which it will still point to
its old targets, I have to use the must information from the callee and negate
that, so that it stays like -- do you see what I'm trying to say?
>> Thomas Dillig: So for example, you want to know, in the
analysis, whether [inaudible] I initialized that and passed the variable to [inaudible]
function. In this case, I'm really interested in must, right? Must is
whether it will be initialized. So that's the constraint under which
I can keep my old targets in the points-to set. That's where must comes in
everywhere. And sort of the [inaudible].
>>: I have a question. I find this whole terminology very confusing. Can I think
of must query as an assertion somewhere in the program?
>> Thomas Dillig: As an assertion that can never fail.
>>: I suppose you're asking is it true that this assertion never fails?
>> Thomas Dillig: Yes.
>>: [inaudible] you asking is it true that this assertion never fails?
>> Thomas Dillig: Yes.
>>: And a may query can also be captured by an assertion, but in this case
you're asking the reverse question, is it false that this assertion never fails?
>> Thomas Dillig: Yes. I think that's --
>>: It's like this [inaudible] right?
>> Thomas Dillig: Yeah.
>>: So then what I don't understand is that supposing I am only interested in
proving assertions in programs, right, then I won't be interested in must queries.
>>: I think must is equal [inaudible] not equal to [inaudible].
>>: So why can't -- give me an example of may query.
>> Thomas Dillig: For example, may is pointer [inaudible] dereference. So you want
to know, in this function, [inaudible] is it safe, right? So you want to say, may
someone de-reference that?
>>: [inaudible].
>>: [inaudible] what I'm really telling you is that X does not point to U, X does not
point to V.
>>: Can we do the following? Actually maybe you should finish your talk.
>> Thomas Dillig: Okay. Let me --
>>: Because I would like to understand may and must.
>> Thomas Dillig: Yes.
>>: In the case of straight line programs manipulating only Boolean variables.
Get away from all this [inaudible].
>>: But I think if -- I have a [inaudible] like the fact that you're not only looking at
Boolean values, it's important --
>>: Okay. So then I would like to understand that first.
>> Thomas Dillig: All right. Well, let me just -- so I have shown the graph.
You've seen that they stay very small. And I mean, this is again, absolutely as you
pointed out correctly, with respect to some abstraction; the abstraction here is very
fine grained, so we allow anything, any constant compared
to any constant that flows around. So you know, potentially, if you
would count them up, there would be hundreds of thousands of values easily on
something like Linux.
So now -- I mean, that's all nice and good, and I've shown you a graph and
they stay small and so on, and they are green and red, but maybe more
interesting is the question: how useful is this actually for a real program
analysis problem? So to see whether there's any real use from being fully path
and context sensitive, we decided to try a little null dereference analysis using these
techniques. And to be able to compare what difference this particular technique
makes, we implemented sort of two versions of this. And on the left hand side you
see the numbers for a fully path sensitive analysis that computes exactly these
strongest necessary and weakest sufficient conditions. And on the right hand
side, you see an analysis that's only intraprocedurally path sensitive. So it drops
all constraints at procedure boundaries and just says true or false, depending on
whether they're satisfiable or not.
And if you look at the report-to-bug ratio here, for example -- again, that's on the
same three applications, OpenSSH, Samba [inaudible], and the Linux kernel --
you can see that we were able to get close to an order of magnitude reduction in
the false positives, and we did that without resorting to any of the usual tricks, like
picking a few values that may be important for null and that sort of stuff; we just
used this general technique and plugged it right in. So, one caveat: the numbers I
have shown you on the previous slide do not track any null values that flow into
unbounded data structures such as linked lists. And the reason for that is
very simple and really orthogonal: it's just that the underlying framework which we
use to implement this prototype doesn't track any data structures; it just makes
one big summary blob, and everything goes in there. And we've found it to be
unacceptably imprecise for verifying memory safety. So for that part, this
technique isn't -- it's really an orthogonal issue we're trying to attack
with shape analysis.
And actually, the analysis of the contents of these unbounded data
structures such as arrays, linked lists, and so on is actually one of our current
projects, and it grew exactly out of the limitations of this previous prototype.
Now, a second question that you might ask, which is very interesting perhaps, is
that we've really only shown you how to compute these things in the simplest
of all theories, namely propositional logic. So what about doing this in
richer theories? For example, you might be interested in computing
necessary and sufficient conditions for the theory of uninterpreted functions, or
the combined theory of linear integer arithmetic and uninterpreted functions, and
so on.
And it turns out that many of the issues here are very closely related to this idea
of cover algorithms for existential quantifier elimination. And really, you can see a
cover algorithm as computing a necessary condition for a non-recursive
constraint.
So as far as related work is concerned, I'll be brief since we are a bit late. I've
already sort of talked about previous path- and context-sensitive analyses. Now,
just a couple of remarks on this idea of over- and under-approximation. This
idea has certainly been around for a while, in both model checking and abstract
interpretation. Probably the most related work to this particular work is David
Schmidt's work on over- and under-approximation in abstract interpretation. So
one of the main differences here is that we are not just interested in any over- and
under-approximation, but in one that actually preserves satisfiability and validity,
and we give a specific algorithm for a specific domain for doing that.
Another difference is that we also really want to handle these negations in
a fundamental way, by having these pairs of constraints which flip. You really
don't want to start enumerating things, since we can't really make the sort of
monotonicity assumptions that abstract interpretation can, at least not without
really ruining the scalability of this approach.
All right. Thank you very much.
[applause].
>> Tom Ball: Any questions? We can have some discussion. We can go to
lunch for that matter. [laughter].
>> Tom Ball: You can discuss [inaudible] you can discuss now, but maybe we
should just let them -- let them take their microphones off.
>> Thomas Dillig: Thank you.