>> Tom Ball: Hi, I'm Tom Ball, and it's my pleasure this early Thursday morning to welcome the Dilligs, Isil and Thomas Dillig from Stanford University, where they work as a power duo on program analysis with Alex Aiken, and they're going to describe to us some work on doing scalable, precise analysis in the presence of uncertainty and imprecision. Welcome. And we look forward to the talk. >> Isil Dillig: Thank you, Tom. So this talk is about constraint-based analysis in the presence of uncertainty and imprecision. And this is joint work with Tom, who will give the second half of this talk, and also our advisor Alex Aiken. First of all, when we try to reason about programs statically, it would be great if we had perfect knowledge about the world, but unfortunately uncertainty and imprecision come up all the time. Uncertainty comes up because we cannot model every aspect of the environment that a program executes in. And similarly, imprecision comes up because any program verification technique is necessarily based on some sort of abstraction of the program. Now, to convince you that uncertainty and imprecision really are recurring themes in program analysis, let's walk through a few simple but realistic examples. One typical example of uncertainty is whenever we ask the user for some kind of input. So here we have a function that returns true if the user input is Y and otherwise returns false. Since we have no control over what the user input is going to be, we have to assume that this function can non-deterministically return either true or false, but we don't know which one. Another situation where uncertainty might come up is whenever we receive data over the network. So for example, here I've opened some socket and then I'm using the receive function to get data over the network and populate the contents of this one-kilobyte buffer that I just allocated on the stack. Now, after this call to receive, I have to assume that the contents of the buffer could be anything, because I don't know what kind of data I'm receiving. So therefore, among other things, it could mean that the cast on the next line could be unsafe. A third situation where uncertainty comes up is when something depends on operating system state. For example, since we don't typically reason about the free list of the operating system, we have to assume that malloc can non-deterministically return either null or non-null. And we could think of many more situations where uncertainty plays a role: for example, calling a function for which we don't have the source code, or when the scheduler gets to make a decision about which thread to run next, and so on. So in all of these situations, there are certain values that appear as non-deterministic choices made by the environment. Now, moving on to imprecision: at first glance, imprecision seems quite different from uncertainty, because imprecision arises from an abstraction that's chosen intentionally by the analysis designer. However, as we'll see in the next couple of examples, imprecision has very similar consequences to uncertainty.
For instance, if my program analysis doesn't integrate sophisticated shape analysis reasoning, then it will most likely end up smashing all elements of an unbounded data structure or an abstract data type into a single summary node, and if that's the case and I read a particular element out of this array, then I have to assume that this array read could result in any one of the possible values that I previously wrote to this array. Another source of imprecision could be due to not tracking complicated arithmetic, for instance for the sake of scalability. So for example, if my analysis doesn't reason about non-linear arithmetic, then it will have no idea what this expression coefficient times A times B plus min_size is going to evaluate to. And so this expression will again look to my analysis as though it was a non-deterministic environment choice. Yet a third source of imprecision could be, for example, not knowing about complicated loop invariants. So here we have a compute_gcd function that uses Euclid's algorithm for computing the greatest common divisor of two numbers, A and B. Unless my analysis somehow knows about some nontrivial number-theoretic axioms, it will most likely have no clue what's going on inside compute_gcd. And therefore, again, the result of calling compute_gcd will have to be treated as a non-deterministic environment choice. So, to recap, in all of these situations sources of imprecision appear to my analysis as non-deterministic environment choices, even though there is no true non-determinism anywhere here. Now, to deal with these problems that arise from uncertainty and imprecision, many sound program analysis systems, and in particular constraint-based systems, will typically model environment choice by introducing fresh, unconstrained variables. And in this talk, we're going to refer to such variables as choice variables. So to be more concrete, whenever there is a call to a function like get_user_input, we're just going to make up some fresh variable beta. And then if someone asks, is it possible that beta is equal to the character Y? Well, the answer is of course. I don't know what the user input [inaudible]. >>: [inaudible] what exactly is [inaudible]. >> Isil Dillig: Yes. We'll get to the more specific part a little bit later. Yeah. Similarly, if someone asks, is it possible that beta is some value other than the character Y? Well, again, the answer's of course, because beta is an unconstrained variable. On the other hand, if someone asks, can we guarantee that beta is equal to the character Y? Well, unsurprisingly, the answer is of course not. Now, while the introduction of these choice variables allows us to be sound in the presence of uncertainty and imprecision, their use introduces two important classes of problems. First of all, on the theoretical side, whenever we have recursive constraints that contain choice variables, it's far from clear how we can go about solving them. And we'll see an example of this in just a second. Furthermore, on the practical side, the number of these choice variables grows proportionately with the size of the analyzed program, and this results in large formulas, which then directly translates into poor scalability of the analysis. Now, to illustrate the theoretical problems that arise from these choice variables, let's look at this simple but recursive query_user function. What this function does is take a Boolean variable called feature_enabled.
If this particular feature is not enabled, it returns false. If it is enabled, it asks the user for some kind of input. Then, if the user input is Y, it returns true. If the user input is N, it returns false. And if the user can't follow instructions and enters some invalid character, it invokes itself recursively to prompt the user for a valid input character. So suppose we want to know: when will this query_user function return true? Or, stating this a little bit differently, given some arbitrary argument alpha that denotes feature_enabled, what can we say about the constraint pi(alpha, true) under which query_user will return true? Now, let's try to write this constraint together. First of all, as we can see from this highlighted line, a prerequisite for this function to return true is that feature_enabled must be true. Otherwise, the function returns false on the first line. So we have alpha equals true as part of this formula. One way in which this function will return true is if, in addition to alpha being true, the user input, which we denote by the choice variable beta, is equal to the character Y on the first invocation of the function. But obviously this isn't the only way this function will return true. It will also return true if, in addition to alpha being true, the user input beta is not equal to the character N on the first invocation, and in addition the result of the recursive call is true. Here note that we've applied the substitution true replaces alpha, because we know that if we were able to make the recursive call, then feature_enabled must be true; otherwise we would have returned on the first line. >>: So what is beta [inaudible]. >> Isil Dillig: I'll get to that in just a second. Yeah. Okay. Furthermore, note that we need a substitution that says beta prime, where beta prime is a fresh choice variable, replaces the old choice variable beta, because there is a distinct user input on each distinct recursive invocation. Otherwise, in the general case, it would not be sound not to rename this choice variable. >>: [inaudible] in the flow analysis, right, because feature_enabled just flowed [inaudible]. >> Isil Dillig: Exactly. Right. >>: [inaudible]. >> Isil Dillig: Yes. >>: Okay. >> Isil Dillig: So this constraint here characterizes the exact condition under which query_user will return true. However, note that this constraint is recursive, which is not surprising, given that query_user is a recursive function. But for this constraint to be immediately useful to us, so that we can issue satisfiability and validity queries and so on, we have to solve it and bring it to closed form. Now, if we try to solve this constraint [inaudible] using a standard fixed point computation, then we're going to end up introducing an unbounded number of choice variables: beta, beta prime, beta double prime, and so on. And obviously this simple fixed point computation won't terminate. So the lesson to be learned from this example is that when we have recursive constraints that contain these choice variables, it's at least not immediately clear how you can go about solving them. However, even if we did have some way of solving such recursive constraints, these choice variables would still cause us headaches with scalability, even for reasonably small programs.
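To make the recursive constraint from query_user concrete before moving on, here is one way to write it out. This is a reconstruction from the spoken description above, not a reproduction of the slide:

```latex
\pi(\alpha, \mathrm{true}) \;=\; (\alpha = \mathrm{true}) \,\wedge\,
  \Big(\, \beta = \text{'Y'} \;\vee\;
     \big( \beta \neq \text{'N'} \,\wedge\,
           \pi(\alpha, \mathrm{true})\,[\mathrm{true}/\alpha]\,[\beta'/\beta] \big) \Big)
```

Unrolling the recursive occurrence once yields beta prime = 'Y' or (beta prime not equal to 'N' and another occurrence under a fresh substitution beta double prime for beta prime), and so on: every unrolling step introduces a fresh choice variable, which is exactly why the naive fixed point computation never terminates.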
And to see why, let's consider this key_new_private function from OpenSSH [inaudible], and even though this function may at first look a little bit scary, it's actually one of the smallest functions I could find in all of OpenSSH. What this function does is take an integer that identifies the type of the cryptographic key we want to allocate. Then, depending on the type of the requested cryptographic key, it tries to initialize various fields, and if, in doing so, any of the memory allocations fail, it calls this exit function called fatal, which aborts the program and also logs that the memory allocator function called BN_new failed. On the other hand, if all the memory allocations succeed, this function will return a properly initialized cryptographic key. Now, let's assume that KEY_RSA1, KEY_RSA and KEY_DSA were #define'd as 1, 2, and 3 respectively. So this is what the preprocessed source code will look like. Now suppose we're interested in knowing the constraint under which key_new_private will successfully return a new key. Or, in other words, what's the constraint under which we will reach this line highlighted in red? As before, let's denote the argument to key_new_private by alpha. And to make it easier on us to reason about this function, let's slice out the relevant part of this function. And if we do that, this is what the slice will look like. Now, as I mentioned earlier, BN_new is a memory allocator, for instance it could be a [inaudible], so as we've seen before, its return value should be treated as a non-deterministic environment choice. So, unsurprisingly, we're going to replace each call to BN_new with a fresh choice variable beta_i. Now, if we do that, then we end up with this much simpler version of key_new_private that's given on the slide. And if we stare at this for just a second and put all of this together, then we can write the condition under which the function succeeds as this constraint here, which I won't even attempt to read. And to say the very least, this is a somewhat verbose way of stating that this function succeeds if all the memory allocations succeed. Now, if that wasn't convincing enough for you, let's take a look at this calling [inaudible] function. In this three-line code snippet here, we're trying to allocate three different kinds of cryptographic keys: one of type RSA1, one of type RSA, and one of type DSA. And suppose we want to know the constraint under which we will reach this line marked with success. Clearly, the condition under which we will reach this line will be the conjunction of all the conditions under which every call to key_new_private successfully returns a new key. So if we take the constraint we have from the previous slide, instantiate it with the correct types, and conjoin them, we end up with this rather large constraint here. And if we take into account the fact that this huge constraint arises from this three-line code snippet, then one starts to have doubts about how scalable this approach really is. >>: [inaudible]. >> Isil Dillig: Uh-huh. >>: So is the problem that you will have [inaudible] quantifications not decided or something like that? >> Isil Dillig: No. We'll get to it later. >>: Okay. >> Isil Dillig: Yeah. So maybe if you hold that question until the end, I can answer it better, I think. >>: So like you can't simply find the formula -- >> Isil Dillig: You can. You can.
You can simplify it and get something simpler, but the point remains that you end up with large formulas that contain lots and lots of redundancies and these variables that you don't want to have in the first place. So I guess, going back to our existential quantification question: even if you did existentially quantify all the betas here, you would still end up with this huge formula, and we want to get something much simpler that says the same thing. So now, what conclusions can we draw from these examples? First of all, as we saw in the key_new_private function, these choice variables result in large formulas, which in turn translates into scalability problems. And furthermore, as we saw in the query_user function, when we have recursive constraints that contain choice variables, it's not immediately clear how you can solve them. So it seems like it would be very desirable to eliminate these annoying choice variables from our constraints. And the first idea that immediately comes to mind is to play the same trick that we always play in program analysis and compute an overapproximation of the constraint not containing any of these choice variables. Now, this discussion brings us immediately to the topic of necessary conditions, because an overapproximation of a formula phi not containing any choice variables is a necessary condition of this formula. In other words, it's implied by the original formula. But here, rather than computing just any necessary condition, we're specifically interested in computing the strongest necessary condition, if at all possible. And what this formula here says is that the strongest necessary condition, which we denote by the ceiling operator, is stronger than any other formula phi prime that's also implied by the original constraint and doesn't contain the choice variables. And the -- yes? >>: [inaudible]. >> Isil Dillig: What do you mean? >>: [inaudible] exist? >> Isil Dillig: It depends on what theory you're trying to compute the strongest necessary condition in. For example, in first-order logic, computing the strongest necessary condition is undecidable, but in propositional logic, it certainly is computable. It depends on the theory. And we'll talk more about that later. And the reason we like strongest necessary conditions so much is because they have this desirable property of being satisfiability preserving. In other words, the original constraint phi is satisfiable if and only if its strongest necessary condition is satisfiable. So again, to emphasize: if we use the strongest necessary condition to determine satisfiability, we're neither overapproximating nor underapproximating; our answer to satisfiability is exact. >>: How did you get that? [inaudible] one direction is clear, right, because phi implies ceiling of -- >> Isil Dillig: Yeah. For the other direction, one simple way to see it is, for example, if the original constraint is unsatisfiable, it's equivalent to false, right, and false doesn't contain any choice variables, or any variables at all. >>: The definition that you have for strongest necessary condition does not refer to choice variables in any way. So your explanation doesn't make sense to me. >> Isil Dillig: No, it does. What's implicit here is that phi prime must not contain the choice variables, just like the strongest necessary condition.
So for example, if beta is a variable that we want to eliminate, then the strongest necessary condition needs to be stronger than any other condition that also doesn't contain beta and that's also implied by the original formula. >>: So is it the same as if in your formula phi you universally quantify phi over all choice variables? >> Isil Dillig: No, they are free. They're free variables. >>: Is the strongest necessary condition semantically equivalent to quantifying out the choice -- >> Isil Dillig: It is. >>: [inaudible]. >> Isil Dillig: It is. >>: All right. >> Isil Dillig: So now, to be more concrete, let's consider the constraint we had from key_new_private. To give you some intuition here, the strongest necessary condition will be just true. And this makes sense, because there is no particular requirement that needs to hold about the type of the requested cryptographic key for this function to succeed. In other words, key_new_private may successfully return a valid key no matter what kind of cryptographic key the programmer requests. Now, as a second example, let's look at this recursive constraint we had from query_user. In this case, the strongest necessary condition is alpha equals true. And again, this makes sense, because the only condition that needs to hold is that feature_enabled must be true. If feature_enabled is true, query_user may return true. That's all we can say. Now, so far we haven't talked at all about how we can compute these strongest necessary conditions, but even if we did have some way of computing them, our situation would still not be entirely satisfactory. And the reason for that is, if we only compute strongest necessary conditions, we have basically lost our power to soundly negate constraints. This is the case because the strongest necessary condition of not phi is not logically equivalent to the negation of the strongest necessary condition of phi. In fact, it turns out that the negation of the strongest necessary condition of phi is not even a necessary condition of not phi, let alone the strongest one. So this suggests that we need a dual and complementary notion to strongest necessary conditions, which is weakest sufficient conditions. Now, just like we denote strongest necessary conditions by the ceiling operator, we're going to denote weakest sufficient conditions by the floor operator, because they underapproximate the formula. And the weakest sufficient condition needs to satisfy two properties. First of all, it needs to imply the original formula, and second, it needs to be weaker than any other formula phi prime that also implies the original formula and that doesn't contain the choice variables. >>: [inaudible] correspond to [inaudible]. >> Isil Dillig: Exactly, yes. >>: So if there are existing names like universal quantification and existential quantification, why do you need to invent new names for those concepts? >> Isil Dillig: Because the point is we want to eliminate them; we don't want the quantifiers. >>: [inaudible] you don't want -- you don't -- you want to eliminate everything to do with certain types of variables. >> Isil Dillig: So the [inaudible]. >>: Predicate abstraction, you only want invariant stuff, you don't want any program [inaudible]. >>: So is it equivalent to universally or existentially quantifying this formula? >> Isil Dillig: And then eliminating the -- exactly. It is equivalent. >>: Okay.
>> Isil Dillig: But the point is we don't want these existentially quantified variables, because we want to separate out what's key to path-sensitive analysis versus what's noise in the background, as we'll see. That's kind of the intuition. >>: [inaudible] eliminating, elimination of [inaudible]. >> Isil Dillig: You could, yeah. >>: But you have the restriction that there is only one top-level quantification going on, there's no mix -- >> Isil Dillig: Right. Exactly. Yes. And just like strongest necessary conditions were satisfiability preserving, it unsurprisingly turns out that weakest sufficient conditions are validity preserving. So the original constraint phi is valid if and only if its weakest sufficient condition is also valid. Again, if we use weakest sufficient conditions to determine validity, we're getting an exact answer, not an over- or underapproximation. Now, again to give some intuition, let's look at this constraint from key_new_private. Here the weakest sufficient condition is just alpha is less than or equal to zero, or alpha is greater than or equal to four. And if we think about what this function does, this makes sense, because if the type is neither KEY_RSA1, nor KEY_RSA, nor KEY_DSA, then we'll just trivially hit the default case and won't even try to allocate memory, so the function will trivially succeed. On the other hand, if we consider the constraint from query_user, here the weakest sufficient condition will be just false. And again, this makes sense, because there is no condition on feature_enabled that will guarantee that query_user will return true. It all depends on what the user input is. So the weakest sufficient condition is false. So now, what have we achieved? One thing we have achieved is that, by having pairs of strongest necessary and weakest sufficient conditions, we can now finally make negation work again. And the way we can do it is: if we've computed the strongest necessary and weakest sufficient conditions of phi, then we can compute the strongest necessary condition of not phi by just negating the weakest sufficient condition of phi. And similarly, we can compute the weakest sufficient condition of not phi by taking the negation of phi's strongest necessary condition. So again, the duality of existential and universal quantifiers comes into play directly here. Now, that was a little bit wordy, so again let's look at a concrete example. In the previous examples we computed the strongest necessary condition for key_new_private to succeed as true, and the weakest sufficient condition for this function to succeed as alpha is either less than or equal to zero or alpha is greater than or equal to four. Now, suppose I want to know the constraint under which this function will fail. Clearly, this is going to be the negation of the constraint under which it will succeed. So to compute the strongest necessary condition for failure, we take the weakest sufficient condition for success and negate that, which gives us alpha must be between one and three. And to compute the weakest sufficient condition for failure, we just negate the strongest necessary condition for success, which gives us false. And again, it makes sense that the weakest sufficient condition is false, because nothing that we know about the type of the cryptographic key will ensure that key_new_private will fail.
And similarly, it's sensible that the strongest necessary condition says alpha must be between one and three, because otherwise, if the type is neither KEY_RSA1, KEY_RSA, nor KEY_DSA, the function won't even get a chance to fail. Now, Tom will tell you how we can actually go about computing these strongest necessary and weakest sufficient conditions. >> Thomas Dillig: So what have we done so far? So far we have really only told you how we can identify that special class of variables which we call choice variables, to model uncertainty and imprecision in program analysis. And we have argued that if we compute pairs of strongest necessary and weakest sufficient conditions that do not contain these choice variables, we can overcome the termination problems that arise from having these fresh beta substitutions on each recursive invocation. We can also mitigate some of the scalability problems, because we don't have to drag these betas, which accumulate everywhere, through all the constraints in our program. And perhaps most importantly, we can still negate our constraints in a sound way, and actually in a way that's not just sound but that also preserves satisfiability and validity. However, we haven't really shown you at all how to do any of this; we've only talked about the general high-level overview. So from now on, let's take a look at how we can actually compute strongest necessary and weakest sufficient conditions for a system of recursive constraints that represents the exact path- and context-sensitive conditions under which some property we're interested in holds. More specifically, we will use these strongest necessary and weakest sufficient conditions to perform a sound and complete path-sensitive program analysis. And our goal here will be to answer may and must queries about the program. Obviously, our completeness guarantee here assumes a user-provided finite abstraction. So we are only complete with respect to some finite abstraction of your program. It would obviously be undecidable otherwise. And to sum up the rest of this talk in only three bullet points: we will remove these choice variables from our constraints, we will therefore end up with formulas that, we will argue, stay very small in practice, which in turn will mean this technique scales better than existing techniques for path- and context-sensitive program analysis. >>: [inaudible]. >> Thomas Dillig: Yeah? >>: [inaudible]. If you have a user-provided finite abstraction, then does it mean that your recursive constraints are over propositional logic? >> Thomas Dillig: Actually, if you hold on for like two or three minutes, I'll walk you exactly through the details and you'll see. Your question is right to the point. So before I get started on exactly what the algorithm is, let me just point out where this approach fits in with existing path- and context-sensitive analyses. On one end of the spectrum, there are related tools that are based on model checking ideas. These would be tools such as Bebop, BLAST, SLAM, and so on. And on the other side, we have lighter static-analysis-style tools, such as SATURN or ESP, which take a more static analysis approach to the same problem.
And if you think of the apparent trade-off here, it's almost like the static-analysis-based tools like SATURN have actually scaled to millions of lines of code, while a tool like Bebop hasn't scaled quite that far on its own. On the other hand, Bebop has this really strong guarantee of being sound and complete with respect to some finite abstraction, while SATURN and also ESP certainly don't have any aspirations to be complete in any sense. So the idea here is that we want to eliminate this trade-off: we really want a technique that can give the same or very similar guarantees to a technique like Bebop while still scaling to these really large programs. And so the main contribution here is an algorithm for sound and complete path-sensitive analysis that will actually scale to these really large, real-world programs. And the key insight we're going to exploit here is that while these choice variables are very useful within their scope and boundary, we can safely eliminate them outside their scope, as long as we are only interested in answering may and must queries about the program. So what do I mean by that? Let's be concrete and take a look at this process_file function here. So process_file takes a file pointer F from the user. It then asks the user whether the user would like to open a new file. If the user says yes, it reassigns F to the result of a new fopen call. It then calls process_file_internal with that file pointer F, and if the user chose to open a new file, it will go ahead and close that new file before returning, to be a well-behaved function. As before, the user input here will be represented by a choice variable. And that makes sense, since we really can't predict what the user will input at static analysis time. And note that this function has an interesting feature: more specifically, a branch correlation on the choice variable user_input. And if you are, for example, interested in verifying whether the fopen and fclose calls here match up, you'd better track this branch correlation. So it's clearly useful within that function. However, since that user input is not visible in the calling context of process_file, there's really no additional information we can communicate to the outside by dragging this beta variable along. More specifically, assume we're interested in answering some may or must query about process_file, such as: may this original file be de-referenced? And let's assume, for the sake of this example, that process_file_internal always unconditionally de-references the file F. Now, really, the best thing we can say if someone asks us that question is the constraint true. In other words: yes, sure, it may be de-referenced. There's no other information I can tell you that will make your analysis any better if you call process_file. Similarly, if you ask me, must that function de-reference this file pointer F, really the best thing I can say is false, or no, I don't know. There's no additional thing I can tell you. It's really not under my control. >>: [inaudible]. >> Thomas Dillig: Yeah? >>: [inaudible] that go back to [inaudible] so that the [inaudible] so this, the question means: does there exist an input to the function such that if you executed the function on that input, there is some execution which will lead to the file F being de-referenced? >> Thomas Dillig: Exactly.
>>: And the must question is: for all paths, [inaudible] de-reference the file. >> Thomas Dillig: Exactly. That's exactly [inaudible]. Of course, if you're interested in files, you would most likely be interested in the may part, right? But you will need the must part to make negations work and so on [inaudible]. >>: So the must business, right, the must business actually requires you to solve the termination problem, right? >> Thomas Dillig: Yes. But of course, we don't really attempt to reason about non-terminating programs. The results are qualified: we assume all program paths terminate, as usual. You're absolutely right about that. So now, how are we actually going to go about generating these conditions and solving these problems? Well, we will first set up this recursive system of constraints, and we will use the same notation here: we will describe the constraint under which each function F in our program will return some abstract value c_i. Yes? >>: [inaudible] about must -- my suspicion is that the definition of must is with respect to a [inaudible] then the question that you're asking is that if you reach this program [inaudible] regardless of what it [inaudible]. >>: Yes. So I actually don't know. I mean, I don't know the definition [inaudible] I was just conjecturing that this was the definition -- >>: But the definition that I conjectured has termination built into it. >>: Yes. My feeling is that you also define must without [inaudible]. >> Thomas Dillig: I think you're absolutely right about this. The correct way of looking at it is: if you get to this program point, then this property must hold. But we're not saying whether you will get there or not. >>: So for all [inaudible]. >>: If you reach this program point, the property P must hold. For all inputs, if there exists a path on that input to this program point, then this property holds. So I see. So there's a for-all on the input and a there-exists on the path, because the function -- so it could be non-deterministic, right? >>: Yes. >> Thomas Dillig: So now, after we've set up the system of recursive constraints, we will then convert that system to Boolean constraints, and we'll see how to do that. And after we've done that, we will remove all these choice variables from the Boolean constraints. We will end up with two systems: one of them will be a system describing our necessary conditions, and one will describe our sufficient conditions. And then we will take these systems and tweak them just a little bit, for a few technicalities, to make sure that they will also preserve strongest necessary and weakest sufficient conditions under syntactic substitution. And then we are ready to basically solve them using a standard fixed point computation. So to set up this recursive system E of these initial constraints, we will again use the notation we've used before, where we want to express the fact that some function F, given some input alpha, will return some abstract value c_i. And for the purposes of this talk, we'll assume that the only side effect a function has is its return value. This can easily be extended to the case where that's not true; it just makes the notation a lot cleaner. And we will end up with this matrix E here.
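As a rough reconstruction of the system being described (the slide itself isn't reproduced in the transcript), E can be pictured as one equation per function and per abstract value:

```latex
E \;=\; \big[\; \pi_{F_i}(\alpha, c_j) = \varphi_{ij} \;\big]_{i,j}
```

where row i ranges over the functions of the program, column j ranges over the abstract values, and each pi_{F_i}(alpha, c_j) denotes the constraint under which function F_i, given input alpha, returns abstract value c_j.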
And the phi_ij's here are just Boolean constraints, built from literals of the form alpha being equal to some value c_i, beta being equal to some value c_i, return variables pi, and comparisons between two constants. As before, the alphas here represent the function inputs, and they are obviously provided by the calling context of the function. The betas serve [inaudible], and the scope of each beta is just the function body in which it's introduced. Yeah? >>: [inaudible] constraints for a recursive program? It's not clear to me [inaudible]. >> Thomas Dillig: That's a really good question. Actually, we just handle loops by treating them as recursive functions. So really, I'm only going to talk about recursive functions, and the implementation just turns loops into [inaudible] recursive functions. And unsurprisingly, exactly as your question suggests, the pis on the right-hand side here, as we have seen, result from function calls, and they carry the usual substitutions we discussed. So let's be specific, and let's look at this very simple function F here. F takes an integer X; it then declares another integer Y and queries the user for some integer with get_user_input. If X is 1 or Y is 2, it returns 1, and otherwise it returns F of 1. Now, this is actually a really stupid function. If you look at this function for just a second, you will see that it always returns 1. It's just highly inefficient at doing so. And let's assume, for the purposes of our example here, that we have three abstract values. More specifically, we'll have the abstract value c1 for the integer 1, c2 for the integer 2, and let's say c3 here stands for all other integers not equal to 1 and 2. Then the constraint we would write would look like this. And here we get alpha equals 1 or beta equals 2, because for this function to return the constant c1, it will certainly do so if X is equal to 1, that is alpha equals 1, or beta equals 2. Or the recursive call must return c1, so pi of F, alpha, c1, and the condition under which the recursive call happens is exactly the negation of the condition on the virtual return at this return point marked in red. So after we have this recursive system of constraints, we now want to convert it to Boolean constraints. And here we are going to do the most straightforward thing possible first. When we see an expression like c_i equals c_i, we're going to say true; that isn't a surprise. If we see c_i equals c_j, we'll say false. That's also not a surprise. And anything else has to be of the form some variable v_i equals c_j, and we'll make up some fresh Boolean variable v_ij for that. So for example, if you look at the constraint for function F from just a couple seconds ago, we'll just replace alpha equals 1 by some fresh Boolean variable alpha_1, beta equals 2 by some fresh Boolean variable beta_2, and the substitution 1 replaces alpha, unsurprisingly, by true replaces alpha_1, false replaces alpha_2, and false replaces alpha_3. While this was very simple, it's not quite correct yet. And it's easy to see why it's not correct: there is no condition yet that stipulates that each of these variables has to have exactly one value at a time. It can't have two values, and it can't have zero values, unless you're in some sort of [inaudible] constraint world, and we don't want to be there. And fortunately, we can very easily enforce these additional constraints. And we can do that just by properly conjoining these existence and uniqueness constraints when we check satisfiability and validity.
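To make this encoding step concrete, here is a minimal Python sketch of the idea as just described. This is an illustration only, not the talk's implementation; the tuple representation of formulas is my own choice:

```python
from itertools import combinations

# Formulas are nested tuples: ("var", name), ("not", f), ("and", f, g),
# ("or", f, g), ("true",), ("false",).

def encode_eq(var, c, values):
    """Encode the literal var == c over the finite set of abstract values.
    Each predicate var == c_j becomes a fresh Boolean variable var_j."""
    return ("var", f"{var}_{c}") if c in values else ("false",)

def existence(var, values):
    """var must have at least one abstract value: var_1 | ... | var_n."""
    f = ("false",)
    for c in values:
        f = ("or", f, ("var", f"{var}_{c}"))
    return f

def uniqueness(var, values):
    """var has at most one abstract value: no two var_i, var_j hold at once."""
    f = ("true",)
    for c1, c2 in combinations(values, 2):
        f = ("and", f, ("not", ("and", ("var", f"{var}_{c1}"),
                                       ("var", f"{var}_{c2}"))))
    return f

# With values {1, 2, 3}: conjoining uniqueness("alpha", ...) makes
# alpha_1 & alpha_2 unsatisfiable, and conjoining existence("beta", ...)
# makes beta_1 | beta_2 | beta_3 valid -- the two checks from the talk.
values = [1, 2, 3]
alpha_eq_1 = encode_eq("alpha", 1, values)  # the Boolean variable alpha_1
side = ("and", existence("beta", values), uniqueness("alpha", values))
```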
And if you do that and go back to our three initial abstract values from the beginning, c1, c2, c3 from the example, we would, for example, conclude that alpha_1 and alpha_2 is unsatisfiable, because clearly alpha can't be equal to 1 and 2 at the same time. And we would also conclude that beta_1 or beta_2 or beta_3 has to be valid, because if you enumerate all possible abstract values, clearly their disjunction is true. So after this step, we're really left with Boolean constraints. And the first thing we're going to do is recall that well-known result that states that if you want to compute the strongest necessary condition of some formula phi with respect to some choice variable beta, you can do that by taking phi with true replacing beta, or phi with false replacing beta. And similarly, if you want to compute the weakest sufficient condition, you can do the same thing; you just have to conjoin the two parts of the formula. And it's -- >>: [inaudible]. >> Thomas Dillig: Yeah? >>: [inaudible] an example where you had an integer variable X, so how did you come up with the finite abstraction? >> Thomas Dillig: So that's a very good point. For now, we're just going to assume someone gave it to you. For example, you scanned your program syntactically, and every integer you saw you put in the abstract set. Something like that. Later on, in the experiments, I'll elaborate a little bit on what you actually do in practice, and how it turns out not to be a very big limitation. But, yeah. So this result was actually first given by Mr. George Boole, pictured here, in 1854 in a book called On the Laws of Thought, and, interestingly enough, this little lemma has been re-proved maybe 10 times since then. There's a whole sequence of papers [inaudible] some sort of lemma, and we actually first fell for some of the earlier ones too, but eventually we dug all the way down to this book. I'm pretty sure it's the first one to state this for propositional logic. And if you think back to what this step will actually achieve: the recursive system has these beta choice variables here, and after we apply Mr. Boole's method, we are then left with two systems, one with necessary conditions and one with sufficient conditions. And note that there are no more betas in these constraints, which is very good. But of course, they are still recursive, which means it's not quite lunch time; we still have to keep going and actually work through the rest, right? So now let's see an example of how this actually turns out. If we go back to our small F function here, and let's say we want to start computing the strongest necessary condition: as expected, we'll replace beta_2 by true and false respectively, and we'll put an or between those two parts. The first part immediately simplifies to true, and true or anything is true, so we will end up with the necessary condition under which this function returns c1 being just true. And let's do the same thing for the sufficient condition here. So again, we replace beta_2 by true and false. Now we just conjoin the two parts. The first part simplifies to true. True and anything is anything. So now we get that the weakest sufficient condition under which this function returns c1 will just be alpha_1, or the recursive call returns c1.
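Here is a small Python sketch of Mr. Boole's method as just described, reusing the tuple formulas from the earlier sketch. Again, this is an illustration under my own representation, not the actual tool:

```python
def substitute(f, name, value):
    """Replace the Boolean variable `name` in f by ("true",) or ("false",)."""
    if f[0] == "var":
        return value if f[1] == name else f
    if f[0] in ("true", "false"):
        return f
    return (f[0],) + tuple(substitute(g, name, value) for g in f[1:])

def snc(f, beta):
    """Strongest necessary condition w.r.t. beta: f[true/beta] | f[false/beta]
    (semantically, exists beta. f)."""
    return ("or", substitute(f, beta, ("true",)), substitute(f, beta, ("false",)))

def wsc(f, beta):
    """Weakest sufficient condition w.r.t. beta: f[true/beta] & f[false/beta]
    (semantically, forall beta. f)."""
    return ("and", substitute(f, beta, ("true",)), substitute(f, beta, ("false",)))

# Non-recursive part of the constraint under which F returns c1: alpha_1 | beta_2.
f = ("or", ("var", "alpha_1"), ("var", "beta_2"))
print(snc(f, "beta_2"))  # (alpha_1 | true) | (alpha_1 | false): simplifies to true
print(wsc(f, "beta_2"))  # (alpha_1 | true) & (alpha_1 | false): simplifies to alpha_1
```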
So now, to actually solve these recursive constraints, we obviously have to make sure that these constraints preserve their strongest necessary and weakest sufficient conditions under syntactic substitution. In their current form, there are two small difficulties that prevent them from having that property. The first reason we've sort of seen already: it's just the fact that the negation of the necessary condition of phi is not equivalent to the necessary condition of the negation of phi. And the same thing obviously also holds for weakest sufficient conditions. The second problem arises from the fact that contradictions and tautologies have to be handled explicitly when we apply these substitutions. So what do I mean by that? Well, let's be concrete and look at this constraint in blue at the bottom of the slide: pi F alpha c1 and pi F alpha c2. If you think about what this stands for, it really just says some function at the same call site returns, for the same input alpha, both abstract value c1 and abstract value c2. Clearly this constraint is false; both can't happen at the same time. So since the strongest necessary condition is expected to preserve satisfiability, the only condition that preserves satisfiability of false is again false. It doesn't have any choice variables, so it's [inaudible]. Now, if we assume that the strongest necessary conditions of pi F alpha c1 and pi F alpha c2 are true, which is perfectly possible, then if we just substitute them in, we would get true, which is a necessary condition but certainly not the strongest one. So we have to make sure that this can't happen. We have to, in other words, make sure that for necessary conditions there's no way that a substitution can accidentally weaken our constraint. So how do we get around these two difficulties? Well, to deal with the first problem, we can either recall that, since we are operating with a finite constant assumption, negation isn't really as bad as it looks and we can always replace it with some big disjunction, or we can be slightly more intelligent and use the property that the necessary condition of not phi is equivalent to the negation of the sufficient condition of phi, and the same for the sufficient condition. And obviously, if we want to use this property, it requires us to simultaneously compute fixed points of strongest necessary and weakest sufficient conditions. But it's really important for a practical implementation. So how can we deal with these contradictions and tautologies? Well, if we're concerned about strongest necessary conditions, one very easy way of making sure there's no way a substitution can accidentally weaken our constraint is to convert the constraint to disjunctive normal form and drop all contradictions, which have to be of a very special syntactic form at this stage, so we can find them really easily. And again, for weakest sufficient conditions, things are pretty much the same, just upside down. Here what really gets us into trouble are tautologies. So we want to make sure the constraints can't inadvertently be strengthened by a substitution, and we prevent that by converting to conjunctive normal form and dropping all tautologies. So after we've done that, the resulting constraints will now preserve strongest necessary and weakest sufficient conditions under simple syntactic substitution.
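Before Thomas walks through the fixed point for F next, here is a tiny Python sketch of that final iteration, mirroring the walk-through in the talk. It's illustrative only: since alpha_1 is the only variable left, I represent each candidate solution semantically by its truth table over alpha_1:

```python
def solve_wsc_F():
    """Iterate the talk's recursive equation for F's weakest sufficient
    condition: wsc = alpha_1 | wsc[true/alpha_1].
    A formula over the single variable alpha_1 is represented by the pair
    (value when alpha_1 is False, value when alpha_1 is True)."""
    wsc = (False, False)                     # start the iteration at false
    while True:
        call = wsc[1]                        # wsc with true substituted for alpha_1
        new = (False or call, True or call)  # alpha_1 | wsc[true/alpha_1]
        if new == wsc:
            return wsc                       # reached a fixed point
        wsc = new

# (True, True) means the condition is valid: F must always return 1.
print(solve_wsc_F())
```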
So we'll go ahead, we'll throw them into our fixed point computation, we wait, and out comes a system of non-recursive constraints not containing any choice variables, which we can use for may and must queries. And let's go back to the F example and its original constraint here. Remember that we computed the strongest necessary condition for this function to return c1 as true, and since that's non-recursive, there's nothing left to solve. The weakest sufficient condition, however, is still recursive. So let's say we want to compute a greatest fixed point: we get alpha_1 or false equals alpha_1, then alpha_1 or alpha_1 with true replacing alpha_1 gives us alpha_1 or true, and we reach a fixed point at true. And note that this sufficient condition here expresses exactly that this function must always return 1, because it's valid. So you can see how this technique was able to deduce that this function F, which is really not an intelligent function, is a function that must return 1. So the main result is a sound and complete technique for answering these may and must queries, again with respect to some finite abstraction, of course. And the claim here is that by eliminating these choice variables, we end up with much smaller formulas in practice, which in turn means that we can scale quite a bit better than existing approaches to similar problems. Now, of course, to back up this claim that these conditions stay small and this actually scales, we decided to do some experiments and see what things look like in practice, because they often look very different than you think they do. And for that, we decided to compute the full interprocedural constraint for every single pointer dereference in OpenSSH, [inaudible], and the entire Linux kernel. And of course, when we say full constraint, we're going to compute what we showed you in this talk, so necessary and sufficient conditions. Now, we believe that this is a stress test for this technique, since we couldn't really think of anything that's more ubiquitous in C than pointer dereferences. So we hope that if this technique scales to pointer dereferences, it should also scale to many other properties that you might be interested in. >>: [inaudible] for every pointer [inaudible] -- >> Thomas Dillig: So our goal here is basically to say, for each function, what's the constraint under which you will de-reference your argument? Or one of your arguments, or one of the fields of your arguments, and so on. So for example, you might ask: what's the constraint under which you de-reference your first argument's F field? And obviously this might potentially be recursive, and it's not just intraprocedural, because you might make calls. So that's the question we ask. >>: [inaudible] we are going to compute a separate constraint. >> Isil Dillig: Yes. And if you look at the graph here: this graph shows on the X axis the size of these necessary and sufficient conditions, where necessary conditions are marked in red and sufficient conditions are marked in green. And on the Y axis, it shows the frequency. And this graph is for Linux; it's just the largest, so it gives the best sample size, but the others are very similar; there's no significant difference. And one thing to note is that the Y axis here is actually log scale.
So if you look at this closely, you can see that more than 99 percent of all of these constraints have fewer than 9 Boolean connectives in them. They're really, really small. And that's exactly the main point that allows this technique to scale to something like Linux: because it can separate what's key to path sensitivity from the noise, it doesn't have to drag around these huge constraints that accumulate. The things it ends up with in practice are very small. And again, this graph gives the whole explanation of why this [inaudible] Linux. >>: [inaudible]. >> Thomas Dillig: That's a very good question. So the finite abstraction we used -- as we have said in this talk, for this technique to be complete, it's obviously with respect to that finite abstraction. But you can still be sound and almost complete if you just, you know, use some sort of heuristically chosen finite abstraction. So for example, you take all the integers that syntactically appear in your program and all the things they're compared to, and stuff like that. So we use some technique like that to generate the set of things we track. And that's basically the conditions we get from the program. And unless you start computing arithmetic, like returning twice your input, this technique is still complete. >>: So [inaudible] I mean, if you have like nine Boolean variables per procedure, [inaudible] scale to Linux no problem, and the thing that may mess up scale is the refinement loop in SLAM, right, where you start generating, you know, really big BDDs. So fundamentally, I'm still trying to -- so fundamentally, if I have a fixed finite abstraction like Bebop does, it computes post iteratively, and it uses Boole's law when it computes procedure summaries: all local variables of a procedure get existentially quantified out, just like you show there, and then it does a post, so the image under post also has existential quantification. >> Thomas Dillig: [inaudible]. >>: But it's using [inaudible] to suck it up. So I'm just trying to think: what [inaudible] is not doing is this dual thing. >>: [inaudible]. >>: [inaudible]. >>: [inaudible]. >>: Can I finish my question? >>: Yeah. >>: So when I'm doing post, I'm doing strongest computation, so everything's existential. So what does the universal buy? So basically, I have a transition system and I do an image computation and I existentially quantify out the intermediate state, and when I lift the summary to a caller, I existentially quantify out the locals. So I just want to understand: if I understand sort of what you've done in the context of Bebop, then you have an extra step, which is this must, and I want to really try to understand what the must buys, because that's something that Bebop doesn't do, but we could also do the universal thing, because I could also do your rule. And I'm not quite sure -- >> Thomas Dillig: So the main thing: first of all, if you're interested in must properties, obviously it's important. But let's assume we're interested in may properties. If you're interested in may properties, what it really buys you is that you can do negation easily, and you can do the negation without explicitly enumerating this disjunction of abstract values, which might be huge.
Like, for example, in Linux it might be hundreds of thousands of elements; it might be more. Not in one function, but overall. So if I want to take every integer something is compared to anyplace in the program, I mean, I can write out this disjunction in theory, but I can basically go home after I wrote it. >>: Right. But [inaudible] in BDDs, a function and its negation have the same representation complexity, if you represent the function using BDDs. >> Thomas Dillig: So you're saying it would be possible to encode something similar efficiently in BDDs as well? >>: [inaudible] the problem, but this problem that you -- [brief talking over] >>: I mean, I have negation, I mean essentially -- essentially every Boolean variable, I also have its negation. So maybe we should take it offline. >>: You want to negate without [inaudible]. >> Thomas Dillig: Exactly. Because what you're going to do, you're going to turn an existential, you're going to [inaudible] it's not. Because if I say there exists a beta that's equal to Y, the negation of this doesn't mean that for all -- you know, they're really just free variables in my constraint. That's how I want the negation to work, from what it's supposed to mean. >>: [inaudible] basically what you're saying is that if I have a may query and I want to negate it, then I can use some must information, right, to refine that? >> Thomas Dillig: Yes. >>: And that can cut off a lot of search [inaudible] potentially. >> Thomas Dillig: And note that in this technique there's no refinement loop, right. So we -- >>: I understand. >> Thomas Dillig: Yeah. But I think that's a good way of -- >>: [inaudible] is: so you're computing may information for a procedure, and this procedure calls another procedure and returns a value, and you compare that return value, and you can say things like, if that other procedure -- if you have must information, you can use that to give may information [inaudible] computing. >>: Sure, sure, sure. >>: And that way -- >>: I really haven't understood how they interact in this framework. >>: [inaudible] is that existential quantification is used [inaudible] procedures for computing [inaudible] information and universal quantification is used for some other procedures. >>: Yes. >>: But I want to understand the interaction. >> Thomas Dillig: The interaction really comes from the fact that if you have some may fact and you negate it, it becomes a must fact. Similarly, if you have a must fact and you negate it, it becomes a may fact. >>: Once you negate. >> Thomas Dillig: So you don't know ahead of time. I'm computing something for myself, right; I'm computing, say, the constraint under which I return some integer. Now, one of my calling contexts may be asking, is it equal to 2; another might ask, is it greater than 1; so there will be some sort of negation in there. And basically what this technique gives you is that I look at this procedure once, and I just compute both. I'm going to say whatever you want: if you use my return value under negation, I'm ready; if you don't use it under negation, fine with me. I have a formula for both. So by computing both, at a negation you can just naturally pick the sufficient condition and flip it. And that's what you take. Yeah. >>: [inaudible] use it to compute must information, you need must information at each step. >> Thomas Dillig: Yes. >>: And [inaudible]. >> Thomas Dillig: Oh, yes. >>: [inaudible].
>> Thomas Dillig: So they do interact, at the negation points, right. To be really concrete: in this system we implemented this, and whenever you talk about a constraint, it's really a pair. There are may and must -- necessary and sufficient -- conditions everywhere. So that's exactly the way this works. It is everywhere in that sense. >>: [inaudible] so but you do get [inaudible] a lot of must information. >> Thomas Dillig: You surprisingly do. Not always, but you do, because there's a lot of -- >>: Because generally, in my experimentation with must, which has been very small compared to what you've done, there's -- for a perfect abstraction there's very often this [inaudible] it's hard to get must, right? Generally, you need a lot more information, because must is an underapproximation. >> Thomas Dillig: Yes. >>: And so generally you need a lot more predicates to get must. And it depends on your abstraction. >> Isil Dillig: And so, for example, [inaudible] where we needed this must information: suppose we're doing pointer analysis, right, and some [inaudible] has some side effect, and [inaudible] the conditions under which it may and must happen. Now, in the calling context, the pointer had some other targets prior to calling this function, right? Now, to determine the constraint under which it will still point to its old targets, I have to use the must information from the callee and negate that, so that it stays -- do you see what I'm trying to say? >> Thomas Dillig: So for example, in the analysis [inaudible] I initialize that and pass that variable to [inaudible] function. In this case, I'm really interested in must, right? Must it be initialized. That's the constraint under which I can keep my old targets in the points-to set. That's where must comes in everywhere. And sort of the [inaudible]. >>: I have a question. I find this whole terminology very confusing. Can I think of a must query as an assertion somewhere in the program? >> Thomas Dillig: As an assertion that can never fail. >>: I suppose you're asking, is it true that this assertion never fails? >> Thomas Dillig: Yes. >>: [inaudible] you asking, is it true that this assertion never fails? >> Thomas Dillig: Yes. >>: And a may query can also be captured by an assertion, but in this case you're asking the reverse question: is it false that this assertion never fails? >> Thomas Dillig: Yes. I think that's -- >>: It's like this [inaudible] right? >> Thomas Dillig: Yeah. >>: So then, what I don't understand is that, supposing I am only interested in proving assertions in programs, right, then I won't be interested in must queries. >>: I think must is equal [inaudible] not equal to [inaudible]. >>: So why can't -- give me an example of a may query. >> Thomas Dillig: For example, may a pointer be de-referenced. So you want to know, in this function, [inaudible] is it safe, right? So you want to ask, may someone de-reference that. >>: [inaudible]. >>: [inaudible] what I'm really telling you is that X does not point to U, X does not point to V. >>: Can we do the following? Actually, maybe you should finish your talk. >> Thomas Dillig: Okay. Let me -- >>: Because I would like to understand may and must. >> Thomas Dillig: Yes. >>: In the case of straight-line programs manipulating only Boolean variables. Get away from all this [inaudible].
>>: But I think -- I have a [inaudible] -- the fact that you're not only looking at Boolean values is important -- >>: Okay. So then I would like to understand that first. >> Thomas Dillig: All right. Well, let me just -- so I have shown the graph. You've seen that the constraints stay very small. And again, this is absolutely, as you pointed out correctly, with respect to some abstraction; the abstraction here is very fine-grained, so we allow anything, any constant compared to any constant that flows around. So potentially, if you would count them up, there would easily be hundreds of thousands of values on something like Linux. So now, that's all nice and good, and I've shown you a graph and the conditions stay small and so on, and they are green and red, but maybe the more interesting question is: how useful is this actually for a real program analysis problem? So to see whether there's any real use from being fully path- and context-sensitive, we decided to try a little null dereference analysis using these techniques. And to be able to compare what difference this particular technique makes, we implemented two versions of this. On the left-hand side, you see the numbers for a fully path-sensitive analysis that computes exactly these strongest necessary and weakest sufficient conditions. And on the right-hand side, you see an analysis that's only intraprocedurally path-sensitive: it drops all constraints at procedure boundaries and just says true or false, depending on whether they're satisfiable or not. And if you look at the report-to-bug ratio here, for example -- again, that's on the same three applications, OpenSSH, [inaudible], and the Linux kernel -- you can see that we were able to get close to an order of magnitude reduction in the false positives, and we did that without resorting to any of the usual tricks, like finding a few values that may be important for null and that sort of stuff; we just used this general technique and plugged it right in. One caveat: the numbers I have shown you on the previous slide do not track any null values that flow into unbounded data structures such as arrays and linked lists. And the reason for that is very simple and really orthogonal: it's just that the underlying framework which we use to implement this prototype doesn't track any data structures; it just makes one big summary blob, and everything goes in there. And we found that to be unacceptably imprecise for verifying memory safety. So that part is really an orthogonal issue to what we're trying to attack here; it's more of a shape analysis problem. And actually, the analysis of the contents of these unbounded data structures, such as arrays, linked lists, and so on, is one of our current projects, and it grew exactly out of the limitations of this previous prototype. Now, a second question that you might ask, which is perhaps very interesting, is that we've really only shown you how to compute these things in the simplest possible of all theories, namely propositional logic. So what about doing this in richer theories? For example, you might be interested in computing necessary and sufficient conditions in the theory of uninterpreted functions, or the combined theory of linear integer arithmetic and uninterpreted functions, and so on. And it turns out that many of the issues here are very closely related to the idea of cover algorithms for existential quantifier elimination.
And really, you can see a cover algorithm as computing a necessary condition for a non-recursive constraint. As far as related work is concerned, I'll be brief, since we are running late. I've already talked about previous path- and context-sensitive analyses. Now, just a couple of remarks on this idea of over- and underapproximation. This idea has certainly been around for a while, in both model checking and abstract interpretation. Probably the most related work to this particular work is David Schmidt's work on over- and underapproximation in abstract interpretation. One of the main differences here is that we are not just interested in any over- and underapproximation, but in one that actually preserves satisfiability and validity, and we give a specific algorithm for a specific domain for doing that. Another difference is that we also really want to handle these negations in a fundamental way, by having these pairs of constraints which flip. You really don't want to start enumerating things, since we can't really make the sort of monotonicity assumptions abstract interpretation can, at least not without really ruining our scalability. All right. Thank you very much. [applause]. >> Tom Ball: Any questions? We can have some discussion. We can go to lunch, for that matter. [laughter]. >> Tom Ball: You can discuss [inaudible] you can discuss now, but maybe we should just let them take their microphones off. >> Thomas Dillig: Thank you.