>> Francesco Logozzo: Good morning. It's my pleasure to introduce Xavier Rival. I have known Xavier for a very long time, more than ten years -- longer than I have known my wife. I usually say silly things when I introduce someone, so I'll stop with the silly things and let Xavier talk about serious things. Xavier did his Ph.D. thesis with [indiscernible] at ENS, then did a postdoc at Berkeley, and he is now a researcher at INRIA. He has been working on many interesting projects, many of which he can talk about himself, so I'll keep it short. Today he's going to talk about the work he has done on discovering properties of complex data structures.

>> Xavier Rival: Okay. Thank you very much, Francesco. So indeed, today we'll talk about MemCAD, which is a modular abstract domain for reasoning about memory states. Many people have been involved in this work, so let me cite them on this slide. Bor-Yuh Evan Chang has been working with me on this since the beginning of this work, in 2006. And Vincent Laviron, Pascal Sotin, and Antoine Toubhans did their internships, Ph.D.s, or postdocs on this topic.

In this work, I am interested in the safety of software: I want to verify properties such as the absence of run-time errors, the absence of undesired behaviors, and some functional properties as well. These are all undecidable properties, so we have to make a compromise, and our compromise is to not be complete. We use the abstract interpretation approach, which is sound: we will discover all bugs of a certain kind -- for instance, for run-time errors, we will find all run-time errors of a given class. This results in automatic tools which, as I wrote, are not complete.

So the three main steps are these. First, choose an abstraction of the concrete semantics: we have to define a concrete semantics and an abstraction, with which we can do conservative approximation. Then we derive from it abstract transfer functions, which preserve conservative approximation when computing post-conditions, for instance. And the last thing is that we want to ensure that the analysis always terminates, so we use widenings for that, which are basically over-approximations of concrete joins. An abstract domain is the data of those three things: an abstraction, plus a set of transfer functions, plus a widening [indiscernible].

So before we talk about memory properties, let us look at the status of numerical properties. The status there is actually quite good. Many people have been working on this topic for over 30 years, and we have many abstractions of functions from variables to [indiscernible] integers or floating points. In particular, we have a lot of available abstractions such as intervals, octagons, and convex polyhedra. We even have implementations of those domains as libraries, with one interface for transfer functions and widening; you can change just one line in the analyzer and switch from the cheapest domain to the most expensive one in a very short time. It's also possible to combine many of those numerical abstractions in quite efficient ways. The Astree analyzer was obtained by combining over 40 numerical abstract domains, and it is very scalable: it can work on programs of over one million lines of code, which are embedded [indiscernible] software. It's also very precise; that is, it can prove the correctness of entire families of such programs, which is quite, quite nice.
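To make that recipe -- an abstraction, transfer functions, and a widening -- concrete before moving to memory, here is a minimal sketch of an interval domain in C. This is only an illustration: the types and names are mine, not those of any of the libraries just mentioned.

    #include <limits.h>

    /* A minimal interval abstract value [lo, hi], with LONG_MIN and
       LONG_MAX standing for unbounded ends. A sketch, not a library API. */
    typedef struct { long lo, hi; } itv;

    /* Join: over-approximates the union of two sets of integers. */
    itv itv_join(itv a, itv b) {
        itv r = { a.lo < b.lo ? a.lo : b.lo,
                  a.hi > b.hi ? a.hi : b.hi };
        return r;
    }

    /* Widening: like join, but any unstable bound jumps to infinity,
       which guarantees termination of the iteration sequence. */
    itv itv_widen(itv a, itv b) {
        itv r = { b.lo < a.lo ? LONG_MIN : a.lo,
                  b.hi > a.hi ? LONG_MAX : a.hi };
        return r;
    }

    /* One transfer function: abstract addition, assuming no overflow. */
    itv itv_add(itv x, itv y) {
        itv r = { x.lo + y.lo, x.hi + y.hi };
        return r;
    }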
On the other hand, if we look at memory abstractions, there are many difficulties which have not been resolved yet, so we cannot quite reuse such constructions and such libraries; it's not so easy to pick up and use a memory abstraction. A memory abstraction has to abstract states which are more complex than functions from variables to values, because we have to take into account the layout of the memory, with the heap, the stack, and the environment. The concrete operations we want to analyze are also more complex: for instance, we have to deal with pointer arithmetic and memory management, which are nontrivial operations. And the properties we want to verify can also be very [indiscernible] -- for instance, the preservation of some data structures.

So there are several difficulties beyond the fact that the concrete semantics is complex. The second point is that there is a huge variety of data structures which we may want to abstract, like linked structures, arrays, structures involving relations between pointers, or between pointers and numerical values, and combinations of all of them. So it's a very challenging task. For many of those structures, you can find some specific algorithms -- for instance, algorithms to analyze [indiscernible] or just arrays -- and in many cases, those can be quite expensive, so it is also quite challenging to achieve scalability.

My long-term goal is to set up a general-purpose library of memory abstract domains, and in this talk, I will present three main contributions. First, we choose a concrete semantics which is not limiting, so we'll be able to analyze a wide variety of structures; I'm going to give some examples at the end of the talk. I will also show an abstraction which is mostly based on inductive structural properties but which can be extended in a number of ways, and which can be parameterized with a choice of numeric abstraction and with a choice of the set of structures to consider. And I will show static analysis algorithms. I will also give a demo of the tool which I am implementing, which is called MemCAD. It's an OCaml implementation of this shape domain together with an analyzer for [indiscernible]. It's still a work in progress, so of course it's not a finished tool.

So let's look at the parametric abstraction first. On this slide, I'm giving an overview of the abstraction. At the top, you can see what a concrete state looks like: we have a concrete memory, which is partitioned into regions. It's a very concrete view of the memory, where we have cells with addresses, and variables which are mapped to some addresses. We have blocks which contain either pointers, which are numerical values corresponding to addresses, or other sequences [indiscernible] that could be interpreted as integers, like this one.

In the first step, we make a first abstraction of these concrete states with shape graphs. In a shape graph, basically, we have nodes which denote values, and edges which correspond to memory regions. We should notice here that many shape analyses make the opposite assumption, that is, nodes correspond to memory cells and edges correspond to values; in our case, we do the opposite. The reason why we do this will become clear a little bit later. And in the second step of the abstraction, we can summarize regions.
In this case, I assume that we have some [indiscernible], which is supplied by the user, for instance. And what we can see here is that this node which I am pointing at -- I'm sorry, I don't have a pointer -- corresponds to an address; as I said, a node corresponds to a value, and this one corresponds to the address of a list. And the purple region here -- all those purple edges corresponding to the purple region here -- can be summarized with this [indiscernible] edge departing from the node corresponding to the address of that list. We can notice that these steps preserve [indiscernible]: I have a green cell here, which corresponds to this green points-to edge here, and the same for the red and blue ones; and the purple region here was abstracted into a bunch of points-to edges in the first step, and in the second one it gets summarized according to the inductive definition. So this is an overview of the abstraction we are using for [indiscernible] data structures.

Now, what about the concrete semantics? Well, we start from a very concrete semantics, where the heap is divided into blocks, which are marked either as allocated or as free. Memory cells all have a numeric address and a numeric size, so it's very concrete. And pointers are also numeric values, which can be decomposed into a base and an offset.

So now I will describe in a slightly more formal way how the abstraction works and comment on its features. As I said, edges denote memory regions, whereas nodes denote values. The most important kind of edge we'll have to deal with is called a points-to edge, and it denotes a contiguous memory region. We denote a points-to edge like this: we have a source node, a field F, and a destination beta, and what it denotes is a memory cell at some address corresponding to [indiscernible], containing the value corresponding to beta, with some offset corresponding to the field F. So what is this nu here? It denotes the physical mapping of symbolic values, that is, of nodes into values. This plays a role in the abstraction which I'm going to comment on a little bit more later. But basically, when we concretize a shape graph, we take a set of predicates like these, which we can also write as a mathematical formula like that -- alpha dot F points to beta -- and we get a set of pairs, each made of a concrete memory together with what we call a valuation, which maps abstract addresses into physical addresses. So this is the concretization formula; it just does what we said, so there is no need to comment on it more.

We use separation logic; that is, a graph is made of a set of edges, and each edge denotes a disjoint part of the memory, so distinct edges denote pairwise disjoint memory regions. The advantage is that this allows local reasoning, so computing transfer functions will be quite easy. On the other hand, computing a join of abstract states is more challenging than if we were using regular conjunction.

One thing we should note is that we use a so-called field-splitting model. In the [indiscernible] community, there are several approaches; in our approach, separation constrains edges, not the values of pointers. So for instance, if we look at this shape graph, we have one node here and two points-to edges for fields F and G, and it stands both for stores like the one on the left, where the two fields are different and contain addresses of different cells, and for stores like the one on the right, where both fields actually contain the same pointer value.
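As a reconstruction of the concretization just described -- the notation is mine -- a single points-to edge concretizes into pairs of a memory sigma and a valuation nu, and separating conjunction splits the memory. Note that only the cells are disjoint, so two edges may share the same destination value, which is exactly the field-splitting behavior:

    \gamma(\alpha \cdot F \mapsto \beta) =
      \{ (\sigma, \nu) \mid \sigma = [\, \nu(\alpha) + \mathrm{offset}(F) \mapsto \nu(\beta) \,] \}

    \gamma(E_0 * E_1) =
      \{ (\sigma_0 \uplus \sigma_1, \nu) \mid (\sigma_0, \nu) \in \gamma(E_0),\ (\sigma_1, \nu) \in \gamma(E_1) \}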
So far, we have only considered points-to edges, and with those we cannot quite summarize unbounded memory regions, so we are going to do that now. Let's consider the set of concrete states we have here. In all those states, we have a variable X, which is allocated in the green cell here -- this is variable X -- and it contains the address of something: here a null pointer, here a list of one element, two, three, and so on. What we want to do is abstract all those states, and all those where we have a list of any length, into just one abstract state. To do this, we add some additional predicates to our logic: inductive predicates. Here we assume that we have a list predicate, which is defined somewhere and which basically expresses that some address alpha is the address of a list. Then all those states can be abstracted into this formula, where we say that the address of X contains alpha, and alpha is the address of a list, which is disjoint from the cell corresponding to X. In the graph representation, we use [indiscernible] edges for those predicates.

So how do we write this definition in separation logic? It's quite simple, as you'll see: we have a list predicate, alpha dot list, which means either that alpha is equal to zero and we have an empty memory fragment, or we have a memory fragment which contains two fields, alpha dot next and alpha dot data, and a region at the address given by the value of field next from alpha, which is [indiscernible]; in this case, alpha is not [indiscernible]. It's just a classical inductive definition.

In practice, we either use handwritten definitions or definitions built into the analyzer -- that is the case in Smallfoot or Space Invader -- or we let the user supply them, which was the case in Xisa. TVLA uses another approach that is not based on separation logic, but you can hand-write in TVLA something which corresponds to this definition. Letting the user supply definitions is also the case in the analyzer I'm going to demo. It's also possible to infer part of those definitions; we have done this in Xisa with [indiscernible], in a POPL paper two years ago. That was done under some specific assumptions: we were inferring inductive definitions of some restricted kind, and that was achieved thanks to an improved way to [indiscernible]. In general, it's very hard to infer the definitions at this time.

How about the concretization? Well, the concretization, again, is very intuitive: it's simply a least fixed point of a formula over memories. When we have such an inductive definition, there is a notion of syntactic unfolding, which comes up by just [indiscernible] the formula, and we can compute the join of the concretizations of all the sequences of unfoldings in one or several steps. This is a least fixed point of some function, so we can compute it -- well, we cannot quite compute it in the sense of computability theory, but on a sheet of paper, it's well defined. So this defines the concretization of inductive formulas.

Here are some examples of more involved structures that we can define with similar definitions. In those cases, I just give the most significant formula.
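For reference, here is the plain list definition just described, written out in separation-logic notation; this is my transcription of the definition on the slide:

    \alpha \cdot \mathsf{list} \;:=\;
        (\mathsf{emp} \;\wedge\; \alpha = 0)
      \;\vee\;
        (\alpha \cdot \mathtt{next} \mapsto \beta
           \,*\, \alpha \cdot \mathtt{data} \mapsto \delta
           \,*\, \beta \cdot \mathsf{list}
         \;\wedge\; \alpha \neq 0)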
I won't give the full formal definition like I did for the list. So first, trees: a tree is very similar to a list, except that we have two sons, which both hold [indiscernible]. We can also capture relations between pointers; this is the case for doubly-linked lists. In that case, we extend the definition with a parameter which tells where the prev field should point to. Basically, when we unfold address alpha, we get a next field and a prev field; the prev field should point to the parameter, and the tail should be a doubly-linked list whose parameter is alpha. So we get the pointer relations through parameters of the inductive definition. And we can do the same trick with numerical values: here is the definition of sorted lists, where the parameter denotes a lower bound on the first value. When we unfold it once, we get a data field, which should be at least the parameter, and the lower bound for the tail of the list is the value of the data field.

Now, when we start looking at programs, we see that in many cases where we have such structures, we may have pointers into the middle of them. We may have states like this one, where we have a variable X and a variable Y which both point into some list, but X points to the head of the list and Y points somewhere inside the list. This would occur, for instance, if you do an insertion inside some sorted list; you would encounter such states. We cannot quite say that X points to a list and Y points to a list [indiscernible] -- it's not possible because of separation -- and we also have to capture the relation that Y points somewhere into the list. But it's okay, because we can split the state with colors, like I did a few slides ago: here we have a blue area and here some red area. The red area is simply a list, and the blue area can also be expressed with an inductive definition, which I give here: a list with some end point. However, this definition derives in a very systematic manner from the list definition, so it does not quite make sense to add more such definitions to the parameters of the abstraction. It's more interesting to just keep the knowledge that here we have a segment of a list, and this is what this new blue predicate does. What this predicate expresses is exactly what the list-with-end-point definition expresses. The only difference is that we now have a single segment predicate, which is parameterized by the inductive definition that you want to consider. So we just have one segment predicate, which can also be used, for instance, for [indiscernible] and other data structures.

So now, let's look at numerical information. If we want to [indiscernible] some real data structures, we often have to take care of the data which is stored inside them. Remember that we decided to let each node, which has a name, stand for some numeric value: in a concretization, a node will always concretize into some sequence of bits, which could be either the address of a cell or the contents of a cell. This is very convenient for taking the product with a numerical abstraction: our abstract values will be made of a shape graph plus a numeric abstract value. For instance -- let me just switch to the next slide -- if we look at sorted lists, we just take this definition.
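In the same notation as before -- again my transcription, with pi standing for the pointer parameter and m for the lower bound -- the two parameterized definitions just sketched could be written:

    \alpha \cdot \mathsf{dll}(\pi) \;:=\;
        (\mathsf{emp} \wedge \alpha = 0)
      \;\vee\;
        (\alpha \cdot \mathtt{next} \mapsto \beta
           \,*\, \alpha \cdot \mathtt{prev} \mapsto \pi
           \,*\, \beta \cdot \mathsf{dll}(\alpha)
         \;\wedge\; \alpha \neq 0)

    \alpha \cdot \mathsf{slist}(m) \;:=\;
        (\mathsf{emp} \wedge \alpha = 0)
      \;\vee\;
        (\alpha \cdot \mathtt{next} \mapsto \beta
           \,*\, \alpha \cdot \mathtt{data} \mapsto \delta
           \,*\, \beta \cdot \mathsf{slist}(\delta)
         \;\wedge\; \alpha \neq 0 \wedge m \leq \delta)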
If we unfold it a little bit, we get a shape graph like this one. This is partially unfolded: here we have the data, the contents of the data fields of the first [indiscernible] which I exposed. If we look at the constraints they should satisfy for this to be a sorted list, the first data value, for instance, should be at most the second one, and such constraints can be expressed in the numerical domain of octagons. So we just have to pair a numeric abstract value together with our graph.

However, there is one subtle difficulty when we want to compare that abstract value with this shape graph. What we can see is that this one is included in that one: this one basically denotes any sorted list of length at least two, and that one denotes any sorted list of length at least one, so this one is included in that one. However, comparing them is not [indiscernible]. In particular, you can notice that here we have this node, which is called [indiscernible], which does not exist there. So the comparison is not trivial, and this is due to the fact that the numerical abstract value we have here depends on the graph: we have different graphs with different sets of nodes, the numeric values are [indiscernible], and comparing them is not quite trivial. The construction we use was introduced by Venet in 1996. It's not a reduced product; it's slightly more complex: it's a product where the lattice in which the second member lives depends on the first member of the product. So here we have a shape graph, and basically our abstract values will be made of the shape graph and some numeric value which is taken in the lattice defined by the shape graph -- that is, the dimensions of the numeric lattice are the nodes of the graph.

If we look at the shape of the whole abstract lattice, it will be a bit like this: here we have shape graphs on the left, and here we have numerical lattices corresponding to those shape graphs. One abstract value is given by one shape graph and one value in the corresponding lattice on the right. And when we do comparisons, or when we compute transfer functions, we often have to compare two shape graphs which are different, and we use the relation between those shape graphs to convert the numerical values -- converting this one, for instance, into a format which lives in that lattice, by doing some over-approximations -- and then the comparison can be done.
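In symbols -- this is my reconstruction of the construction being described, not a formula from the slides -- the combined domain is a dependent sum rather than a plain product: each shape graph g selects its own numeric lattice, whose dimensions are the nodes of g:

    \mathcal{D}^{\sharp} \;=\;
      \{\, (g, \nu^{\sharp}) \mid g \in \mathbb{G},\
           \nu^{\sharp} \in \mathcal{N}^{\sharp}(\mathrm{nodes}(g)) \,\}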
Okay, I've presented the abstraction, and now I can talk a little bit about the static analysis. I don't know what the tradition is -- should I just ask for questions?

>>: You will be interrupted.

>> Xavier Rival: I will be interrupted, so there is no need for me to ask. Okay.

>>: [indiscernible] in this abstract [indiscernible]?

>> Xavier Rival: Yes, there is.

>>: Is it constraints?

>> Xavier Rival: Sorry?

>>: What kind of representation -- I'm just trying to understand what kind of representation you have for --

>> Xavier Rival: For the data structure, it's very simple: you just have pairs. But if you look at the implementation, it's not just a data structure; you have to look at the structure of the code. I don't know if you know how we normally implement such code in ML. Basically, the way we implement this is that we have a module corresponding to the numeric lattice, and the abstraction for the whole memory is a functor which takes this numeric lattice and lifts it into an abstraction of the memory, including both shape and --

>>: I would think that you have your shape graph with summarizations, and some nodes are associated with elements from a numerical abstract domain?

>> Xavier Rival: Basically, yes. I mean, the fact that the numerical domain lives under the shape abstraction can be seen in the implementation. If you implement a reduced product, basically you have, say, a numeric abstract domain number zero and a numeric abstract domain number one, and above these you apply a product functor which takes those two modules and builds a new numerical abstraction for you. In this case, we cannot quite do that; if we do that, it's a disaster. So what we do is take a numeric domain as an input, add the shape on top, and get a memory abstraction. It's a very significant property of the implementation of the analyzer. Okay, other questions? Okay.

So let's look at the static analysis algorithms now. Here is an overview of the analysis, on a very simple list insertion function. The analysis takes such code as input. It also takes as input some inductive definitions for the structures we care about -- that is, here, the list definition; we may give other structures as parameters to the domain, but it would not do anything with them. And we also assume that we have an abstract precondition: here, we assume that we are given a list, so this is a whole shape graph. We may also drop this assumption and analyze a list construction routine, but here I just make this assumption to keep the analysis simpler. What the analysis computes is an over-approximation of the reachable states at each control point. For instance, in this case, we get this abstract state, which says that L points to a list segment, the tail of which is pointed to by [indiscernible]. This is because C was used to traverse the list in earlier iterations. Actually, this is not at the loop exit; I think it's after the insertion, because here we have the element T, which has been inserted right after C, and here we have the rest of the list. So this is what this abstract state says.

So what are the main algorithms in the analysis? Well, there are two main ones, which are unfolding and widening. Unfolding performs a case analysis on summaries to enable abstract post-conditions to be computed in a precise manner. For instance, when we insert an element in the list here, with this abstract representation of a list, before we can analyze an insertion at position Y, we actually need to make a case analysis on this structure, so we get two cases. The first case is the one where Y points to a list of at least one element, which is made concrete here. And the second case corresponds to the case where Y is actually a null pointer. Of course, if we look back at the code, we test the next field before [indiscernible], so in the first case we'll go on and analyze the insertion in the list, because we know that it's not going to be equal to zero.
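As a point of reference, the insertion routine under analysis might look roughly like this; this is my sketch, with invented names, not the exact code from the slide:

    #include <stdlib.h>

    struct list { struct list *next; int data; };

    /* Insert a new element right after the cell that c points to.
       Before the two writes at the end, the analysis unfolds the
       summary reached from c, so the impacted cells are materialized. */
    void insert_after(struct list *c, int v) {
        struct list *t = malloc(sizeof(struct list));
        if (t == NULL) return;
        t->data = v;
        t->next = c->next;
        c->next = t;
    }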
Then abstract post-conditions are actually not so hard to compute, because they are always computed on states which have been, to use a [indiscernible], partially concretized: the part of the memory which will be impacted is already completely concrete here, so we don't have to reason about [indiscernible] predicates at this stage. This is what we do here. And to ensure termination -- to avoid abstract states growing forever, like here, for instance -- we have a widening operator, which folds graphs back into simpler graphs.

The unfolding principle is simple: it just stems from the concretization of the inductive definitions. There is one subtle point, though. If we look at the unfolding operation in the analysis, it actually computes an over-approximation of the [indiscernible] unfolding, because when we unfold a predicate like this one, we also get some numeric side predicates which may not be accounted for exactly in the numerical abstraction; the numerical abstraction may make some approximation here. Here those are very simple predicates, but we could imagine losing precision. This is why the soundness of unfolding is expressed that way: the concretization of the input is included in the join of the concretizations of the outputs, and in some cases it's not an equality; it may be [indiscernible].

So after we do unfolding, we can do our local reasoning, like I said. I think I can skip this slide; it's very simple. One issue, though, is that unfolding is not as easy as it looks. Unfolding in itself is not hard; what may be hard is to discover what part of the shape graph needs to be unfolded. Let's look at this example. Here we are starting with the [indiscernible]; we are traversing it up to some point, and then we stop, and then we decide to go back a number of steps, after we've checked that we're allowed to, so we won't [indiscernible] here. Let's look at what the analysis will compute. Here is the state we get after the [indiscernible]; there is no surprise here, we have [indiscernible] here. Now, the question is what we should unfold to materialize this edge. The first thing we see is that we want to materialize fields of C, so the first thing [indiscernible] tells us to do is to unfold this node right now. It's actually a good choice, because after we do it, we can see that here the field of C has been materialized, so we can materialize this value. Now the problem is that we want to materialize one more [indiscernible], a prev field, and we are in trouble here, because there is no prev field starting from that node.

But all is not lost, actually. Beta prime happens to be a parameter of the [indiscernible] segment, and what this means is that beta prime is actually the value of the last prev field encountered before alpha prime. So this suggests unfolding the segment, and we can actually do this, because a segment is just some form of inductive predicate. In particular, we can prove a segment-splitting property -- oh, I think I had some problem with the figures, so I have two figures which are the same; I will redraw it here. If we have a segment of length i plus j, we can always split it into a segment of length i plus a segment of length j.
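Writing listseg(alpha, alpha') for a segment from alpha to alpha' and making the lengths explicit -- again my notation -- the splitting property just stated reads:

    \mathsf{listseg}_{\,i+j}(\alpha, \alpha') \;\iff\;
      \exists \gamma.\ \mathsf{listseg}_{\,i}(\alpha, \gamma)
        \,*\, \mathsf{listseg}_{\,j}(\gamma, \alpha')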
And in particular, if we make j equal to 1, a segment of length one unfolds into just one element of the structure, which is what we need. This is what we have done in this graph: we have unfolded this segment into a segment plus a segment of length 1. Actually, there is another case which I have not represented here, which is the case where the segment is of length zero. Normally this case is ruled out, just because the prev field of C cannot be equal to null here, whereas the prev field of alpha, that is beta, was equal to null in the beginning. So in that case, the segment-of-length-zero case goes away. Again, the issue is not doing the unfolding -- unfolding steps are not hard to compute. What is hard is to find out which edge to unfold. For this, we have implemented some strategies, but they are not complete, and actually, I think it's not possible to decide in general which definition should be unfolded.

So now let's look at termination, with the analysis of this simple program here, which traverses a list with a cursor. At iteration zero, the cursor points to the head of the list; after one iteration, it points one step farther, then two steps, and so on. Of course, we can see that if we keep running the analysis like this, it will not terminate; we never reach a fixed point unless we do some kind of widening.

So let's look at the principles of the widening. Again, the widening is based on the separation principle, and it does two things at the same time. There is a process where we split both inputs into regions in a pairwise manner: if those are the two inputs of the widening, the algorithm will have to compute a blue region here matching the blue region there, and a red region here matching the red region there. And at the same time as it does this region matching, it applies some semantic rules for region weakening. That is, for instance, if we have on one side a segment, and on the other side a segment plus one element of the structure corresponding to the segment, then we can always over-approximate the latter with a segment; the former already is a segment, so we can say that the widening of this region together with that one will be a segment. And the widening will actually discover the matching at the same time as it applies those semantic rules, so it's a fairly complex process. I'm going to show a couple of rules to give you an idea.

So here is the segment introduction rule. What the segment introduction rule does is this: if we notice that we have two nodes in one input whose symbolic variables should approximate the same thing as the symbolic variable corresponding to one single node in the other input, and the region between those two nodes can be over-approximated by a segment, then we can introduce one segment in the result. This actually occurs when we do the widening after one iteration over our list, for example. Here we have variables storing the same value, denoted by alpha zero; there we see that C corresponds to beta 1, which is the address of a list. The first rule will simply say that here we have the [indiscernible]: we know that in both cases, C points to a list, and we keep that.
And then we are left with this blue region here, and we see that this node here corresponds to that node there, and C corresponds to that node over there. So we can try to check whether the leftover region here can be over-approximated by a list segment. We have an inclusion checking algorithm for this -- another algorithm, which I won't present; its principle is very similar to the widening's -- which will check that. And after it proves that this region corresponds to a list segment [indiscernible], we just introduce a segment here. And actually, after we get this, in the next iteration, we'll exactly apply the rule from before, and this will allow us to prove that this graph is a fixed point.

So this widening operator is based on a rewrite system, where both the region matching and the output are computed at the same time. Those rules can all be proved sound, of course. And it's a terminating rewrite system, which means that the widening computation itself terminates. However, the big issue is that those rules are not confluent. What this means is that depending on the strategy -- the order in which we apply those rules -- the algorithm may get stuck, so we risk imprecise results. So a lot of work goes into the strategy for applying those rules. It's not very interesting to discuss today, but it's a lot of work, actually, when designing the analyzer, to make sure we don't apply the wrong widening rule in the beginning and then fail to compute a precise result later on.

So, the properties of the widening, as a consequence of those rules: first, it is sound, so it returns an over-approximation of the concrete join. We can also prove that sequences of widening iterates always terminate. Typically, the way we prove this is by first considering segment introduction: depending on the number of variables, we can introduce only finitely many segments, so after a given number of iterations, this rule cannot apply anymore. And when the segment introduction rule cannot apply anymore, all the other rules at best preserve the number of edges or decrease it, and this allows us to prove termination.

So now we have a widening in the shape abstract domain. The structure that I talked about -- the cofibered structure which allows us to make the product with the numerical domain, even though the number of dimensions in the numeric abstract values depends on the graph -- also comes with a widening construction. What Venet basically proved is that if you have a widening in the [indiscernible] domain, and if you have a widening on the base domain, such as the graph domain, then you get a widening for free in the combined domain. Let me see -- I thought I had a picture; I will just come back to this figure. The way the widening in the combined domain works is: first, it will reach a stable value in the left domain; and once this S1 is stable, all iterates will be pairs of S1 and some numerical abstract value in the domain corresponding to S1, and since we have a widening on that domain, we converge [indiscernible].

So if we compare with other shape analysis tools: we decided to use a widening, which relies on the history of abstract computations in order to guess a fixed point.
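The kind of cursor loop driving this whole discussion might look as follows; a minimal sketch with invented names:

    struct list { struct list *next; int data; };

    /* At the loop head, after one widening, the inferred invariant is:
       a list segment from l to c, followed by a list starting at c. */
    void traverse(struct list *l) {
        struct list *c = l;
        while (c != NULL) {
            /* ... read or update c->data here ... */
            c = c->next;
        }
    }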
This is quite different from what is done in many other tools. In many tools, people use what they call canonicalization: a unary operator that maps abstract values, possibly from an infinite lattice, into a [indiscernible] lattice, which is finite, and this allows computing a fixed point in the abstract domain. I think our approach is rather different. First, the requirement of a finite lattice can be [indiscernible] a limitation. On the other hand, I think it's good to have the ability to weaken abstract values at points other than widening points and loop iteration points. In our case, we started with a widening operator, and we observed that this allowed us not only to deal with an infinite lattice, but also to get rather quick convergence and to manipulate typically low numbers of disjuncts compared to other frameworks. But on the other hand, I think we should also have a look at a canonicalization-like operator, so I'm currently working on one which would share some of the principles of the widening, [indiscernible] not being as [indiscernible] as canonicalization operators have to be, because those have to target a finite lattice to ensure termination.

Okay. So now, I will talk about two applications of the analysis. The first one will be the analysis of low-level C software. So far, what I've described looked somewhat like the analysis of a Java program: I did not show any C-specific features in the examples, so I will show C-specific features now. And in the next part, I will discuss some very specific forms of embedded software we have to analyze, and how we extended [indiscernible] to do so. Before I do that, is there any question on the analysis algorithms? I have a question: how much time do I have?

>>: Theoretically, you have 'til noon.

>> Xavier Rival: Okay. And practically?

>>: You can finish earlier, if you want.

>> Xavier Rival: Okay. So this means I cannot finish later, I guess. Okay, so let's continue. What happens when we look at C code? Well, typically, we have to deal with many low-level features. For instance, a typical programming pattern is to use nested aggregates: structures inside structures, even unions inside structures. We may manipulate pointers to fields. And there is also memory management, that is, allocation and deallocation, which needs to be analyzed and verified in a sound manner.

So let's look at contiguous regions first: how about nested aggregates? Well, we found that we just had to change offsets. In all the slides I showed so far, I was only using fields as points-to edge labels; that is, offsets were very simple. Now we'll allow some more complex offsets. The first extension we need to make, in order to deal with nested structures, is to have sequences of fields: one offset will be a sequence, so in this case, inside A we'll have offsets like B dot X and B dot Y. Once we do that, nothing else changes; we just need to adapt the way we deal with offsets. The other thing is arrays. Again, the first step is to change the notion of offsets and just say that offsets can be numeric values. It's not quite enough, as I'm going to show on the next slide, but in principle, again, we just need to change the notion of offsets and go from the symbolic offsets we had to some numerical notion of offsets. I will discuss this some more on the next slide.
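As an invented illustration of those field-sequence offsets, consider a nested aggregate like this:

    struct point { int x, y; };

    struct shape {
        struct point b;   /* nested aggregate */
        int tag;
    };

    /* Each write below touches one cell of *a, addressed by a
       sequence-of-fields offset: first .b, then .x (or .y). In the
       shape graph, this is a points-to edge labeled "b.x" or "b.y". */
    void reset(struct shape *a) {
        a->b.x = 0;
        a->b.y = 0;
        a->tag = 0;
    }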
But what I want to stress is that when we made those changes, we did not have to revise the algorithms: the widening and unfolding remain the same.

So now, how about array abstraction? Say we have an array of length 20, storing 32-bit integers. The first approach is to just use one single points-to edge and let this edge have size 80: we never required that a points-to edge denote a single memory cell, so it can cover several cells. The advantage of this approach is that we can delegate the abstraction of the contents of the array -- that is, beta -- to some array-specific abstraction: we can just say that beta is a very, very long numeric value, which is the content of the array, and use some array abstraction; many have been proposed, I think by people in this room, and you can imagine using one of them. The second approach, which we have actually implemented in the analyzer recently, consists in putting one of those array abstractions inside our own analyzer, using segmentations of blocks. This is what we have here: the segments correspond to numeric offsets, which are symbolic. For instance, here I expose one fragment of this array, where we have one cell which is materialized, and here we have two array segments, which may be of length zero, or one, or up to 19 in this case, because the array is of length 20. We implemented those array segmentations in the analyzer, and again, this change did not require changing the unfolding or the widening of the shape domain; those just remain the same. I'm going to talk some more about how we introduced this abstraction with segmentations in the last section, because it plays a role in the analysis of the [indiscernible] I will tell you about.

So now, how about pointer models? In our analyzer, a pointer is abstracted by a points-to edge, from a node corresponding to an address to a node corresponding to a value. However, as I said, in C you may have to deal with pointers to fields. So how do we deal with pointers to fields? A first solution would be to rely on the numerical abstraction to capture the [indiscernible] between symbolic variables -- several domains can do that -- but it's a poor solution, because in that case we would lose, in the shape abstraction, important information about the pointer. Our solution is to label the destination of points-to edges with offsets. For instance, if we want to describe this concrete state, where we have two blocks, one with address beta, and field F points to field G, what we do is simply extend the points-to edge and add the G field at the target. What it says is that the content of the cell we described corresponds to the value denoted by beta, plus the offset corresponding to G. What is quite interesting to notice is that when we introduced this in our abstraction, we met Jan Kreiker, who was doing an analysis with TVLA at the same time, and he was using techniques which are very similar. It's quite interesting that starting from different frameworks, we can get to [indiscernible] using actually very similar techniques.
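A small invented example of the pattern in question -- a pointer into the middle of a block -- which those destination offsets are meant to track:

    struct pair { int f; int g; };

    /* The returned pointer is not the base address of *p: it is
       beta + offset(g), where beta denotes the address of *p. The
       abstraction records the "+ g" offset on the points-to edge. */
    int *addr_of_g(struct pair *p) {
        return &p->g;
    }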
A couple of years ago, Pascal Sotin and Bertrand Jeannet looked at how we did [indiscernible], and they noticed that this change, too, did not require any change to the [indiscernible] algorithms. So we made a classification of pointer models: offsets can be either numeric or symbolic, and pointers to fields can be either allowed or disallowed, so we get four models. For instance, in [indiscernible] you have symbolic offsets and no pointers to fields; in C, you have numeric offsets and pointers to fields. And the nice thing is that our framework covers basically all four models. In my current implementation, I just take numeric offsets with pointers to fields allowed; in this model, you can usually encode all the [indiscernible] anyway.

So now, how about union structures? As I said, in C you have aggregates with unions, where you basically have multiple views on memory regions. Antoine Mine proposed a solution in 2006, which basically introduced non-separating conjunction in the [indiscernible], and we have implemented this technique as well. So we extended our abstraction with non-separating conjunctions: our abstract states have the form of a big separating conjunction of many sub-terms, each of which is either a points-to predicate or a local non-separating conjunction, and under a local conjunction, we only have points-to predicates. The fact that the conjunctions are local makes it quite easy to preserve the analysis efficiency, and if you want to reason about unions, you never need to have one of those non-separating conjunctions above [indiscernible].

Okay. The last thing about the analysis of C software is dynamic memory management. The question is: how do we verify that free of P is safe? If we look at the [indiscernible]: if we do free of P, where P is a valid pointer but not one [indiscernible] which was allocated by [indiscernible], the behavior is undefined. So before we do the free, we have to make sure that P is indeed the address of something which is allocated. And the funny thing is that to capture this behavior, we have to look at a very, very concrete view of the program semantics, where the allocation table is actually taken into account. It's not the model in which most people reason about programs, but in our case, we did use this as the concrete semantics, and it was not a problem at all for our model. In the analysis, we track the allocation table in a very simple way, by marking each node of the shape graph which corresponds to the address of an allocated region, together with the size of that region; those are additional small bits of information we need to add to the shape graphs. Again, this doesn't require any change to the widening and unfolding algorithms, except that we have to take into account, of course, that on both sides the allocation markers should agree, and so on. But [indiscernible]. When we look at inductive structures, we should also summarize those [indiscernible]: for instance, if we have a list living in the heap, then we should record on each node that it corresponds to an allocated region of length [indiscernible]. In that case, we benefited a lot from the fact that we started from a very concrete base semantics; it made the extension of the analysis to handle free in a sound [indiscernible].
So in that case, we did see that we did benefit a lot from the fact we start the from very concrete base semantics. It made the extension of the analysis to handle free in a sound [indiscernible]. So now I'm going to talk about the analysis of some family of C-embedded softwares which is work in progress. Actually, we have an implementation of a prototype which will analyze some code taken from some embedded code. But the analysis of the whole embedded code is still a work in progress. It's also a synchronous code so most of the work on that code is done by Antoine Mine, in fact. So I think he published at least one paper on this work. So before I talk about the problem in itself, I want to give just make a few words about the context. First of all, it's a follow-up of the Astree project. So in follow up, in Astree, we have some synchronous embedded codes which is very contiguous fly-by-wire C softwares. And Astree worked quite well on those examples, but some points we had -- at some point, proposed to the group to [indiscernible] other codes which are more complex, which present different sets of problems. So including some asynchronous codes, so typically monitoring of aircraft systems. And those systems always also have to comply with a number of [indiscernible]. So [indiscernible] is called [indiscernible]. It's a big book which says how people should write aircraft software. So what [indiscernible] so in this level, you cannot quite use much data structures 21 besides arrays. So it's very, very limiting. So the fact that we can use more complex structures is not written in [indiscernible]. But what is written in the [indiscernible] would make it very painful to analyze, to use list of trees or whatever. And those monitoring applications are level C so it's still quite critical. So in that case, what the software has to do, it has to manipulate lists of messages. So those would typically be implemented with a dynamic structure. But in DO-178, even level C, so C is a middle level in terms of criticality. So levels are from A to E. And in level C, still quite hard to justify using malloc, so people did not want to do that. So what they did, they used what I would refer to as a free pool later on. So we have a static array. So here are lengths 5. And each cell of the array will start a structure. So it's just a contiguous [indiscernible]. And what the program will do is it will do its own memory allocation inside this block. So, for instance, it will have some sets of invalid messages which are not pointed to by the system. It will have a pointer, say, to the first element of the array and then in the beginning of the array, there would be a number of elements which are linked as this element. So in this case, we just have a straightforward case where the second element is just pointed by the first one. So we just have elements appearing in order. But in this case, the messages, the case of that program, the messages may [indiscernible] order. Like here. But the program would always make sure that the data are stored at the beginning of the static region. So the program will do its own -- it manages, it implements its own malloc. So this is actually quite challenging code to analyze because in that case, what we have is a structure which is accessed also not only as an array, but also accessed to as a single link list, and it's no option to -- sorry. See, I'm going to discuss this later on. Let me see. No, actually, I did move that slide. 
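In code, the free-pool pattern just described might look roughly like this; a hypothetical sketch, with invented names, field layout, and sizes:

    #define POOL_SIZE 5

    struct msg {
        struct msg *next;          /* in-pool chaining */
        int payload;
    };

    /* One static block: no malloc at run time (DO-178 style). */
    static struct msg pool[POOL_SIZE];
    static struct msg *free_list;  /* cells currently unused */
    static struct msg *messages;   /* singly linked list of messages */

    void pool_init(void) {
        for (int i = 0; i < POOL_SIZE - 1; i++)
            pool[i].next = &pool[i + 1];
        pool[POOL_SIZE - 1].next = NULL;
        free_list = pool;
        messages = NULL;
    }

    /* The program's own "malloc": carve a cell out of the pool.
       Successive messages may end up chained out of array order. */
    struct msg *msg_alloc(void) {
        struct msg *m = free_list;
        if (m != NULL) {
            free_list = m->next;
            m->next = messages;
            messages = m;
        }
        return m;
    }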
So what we could do is keep the representation fully concrete and rely on disjunctions to represent all the possible chainings, but that would not work: we would get an exponential blow-up here. What we need is to abstract the singly linked list, and to abstract it together with the knowledge that it lives inside some static array. So we are going to use abstraction at two levels.

Here is a concrete state on the left. In the first step, we can say that this concrete state is abstracted by this shape graph here. What this shape graph says is that we have two variables, T and L, which point into a region whose address corresponds to that node -- in the concrete example, it would be 0xA0. T points to the base address -- it is a pointer to the array, so here the offset is zero -- and L points somewhere inside the array, not necessarily the beginning. Now, if we look at the contents of the array, it is split into two regions; here comes the segmentation I introduced a little bit earlier. What this segmentation says is that we first have this blue region, corresponding to node beta zero, and this green region, corresponding to node beta two. What we want to do next is to also relate beta zero to another predicate, which says that beta zero actually contains a singly linked list. If we zoom inside this block, what we have here is a structure which is simply a singly linked list, so we can use a [indiscernible] shape graph to [indiscernible]. This is what we have here; this is what we call a sub-memory abstraction. In this case, we have two sub-memories. The first one, corresponding to beta zero, is a singly linked list, because the whole area corresponding to beta zero is one singly linked list. And here we just have one [indiscernible], which is of length 16, because here we have two structures with no knowledge about their contents; in fact, this node -- sorry, I cannot read it, it's small -- does not carry any information at all. It just says that here we have a raw sequence of bits.

So what we do is introduce a new "numeric" lattice. I put "numeric" in quotes here, because it is numeric in the sense that it constrains symbolic nodes corresponding to numeric values, but it's not numeric at all in its definition. What it does is map some symbolic variables to sub-memory predicates, which consist of [indiscernible]: the symbolic variable corresponding to the contents of the memory; a range, which is defined by symbolic offsets from a base address alpha; a shape graph, which corresponds to the sub-memory contents; and a special kind of environment, which maps offsets from alpha into nodes of this shape graph. This environment is actually a binding between the view from the main memory and the sub-memory.

So what operations can we do on those predicates? Well, the first one is introduction: whenever we have a points-to edge, we can say that we have some sub-memory corresponding to its contents, but we don't know anything yet about the layout of this memory; it's just yet another points-to edge. But the shape graph here is constrained by the length of beta.
Actually, this one is not introduction; this is fusion. The [indiscernible] is fusion: if we have two contiguous sub-memories, then we can compute their fusion, and what is quite interesting is that the shape graphs just get merged together as a separating conjunction, as normal. We get the two sets of edges put together, which gives another shape graph, and this is the merged sub-memory shape graph.

So how about higher-level operations? Join is actually the most interesting operation, because this is the one which causes sub-memories to be introduced and extended. Here, I just give one example: it's a case where we extend one sub-memory after adding one element to the list, which is being constructed inside the static block. If we look again at the program, here we had this code which was extending this sub-memory area by allocating more and more list elements and inserting them in the right position. So let's look at the first argument of the join: we have a sub-memory at beta zero, which corresponds to some list. In the second argument, we have added one more element to the list, and this element corresponds to this symbolic variable, which is also known to be equal to the base address of the block plus some offset, [indiscernible] prime, which is the offset corresponding to the head of the list, because we added the element at the head of the list. What the widening will do is introduce a new sub-memory corresponding to that single cell here, and merge it together with the one which existed before -- so there are two operations here; it was this one. And this gives us a new sub-memory here, where we have a segment of a list between delta zero and delta one, and we still have the [indiscernible], which is the list. The transfer functions are not very interesting, and I think I'm running a little bit out of time, so I'm going to skip this one.

I think I can go to the demo now, unless we have questions. Any question? Okay. So, demo. I selected three examples. The first one is a list example: here we have a few list functions, one which allocates a list, another which frees one, and another which iterates over a list and does something -- this could be a printing function, for instance. And in the main function, we first allocate a list, then we reverse it, then we [indiscernible] over it, and we deallocate it.
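The first demo example, as just described, might look like this; a hypothetical reconstruction rather than the actual test file:

    #include <stdlib.h>

    struct list { struct list *next; int data; };

    struct list *alloc_list(int n) {          /* builds a list */
        struct list *l = NULL;
        while (n-- > 0) {
            struct list *c = malloc(sizeof(struct list));
            if (c == NULL) break;
            c->next = l;
            c->data = n;
            l = c;
        }
        return l;
    }

    struct list *reverse(struct list *l) {
        struct list *r = NULL;
        while (l != NULL) {
            struct list *n = l->next;
            l->next = r;
            r = l;
            l = n;
        }
        return r;
    }

    void visit(struct list *l) {              /* e.g., a printing function */
        for (struct list *c = l; c != NULL; c = c->next) { /* ... */ }
    }

    void free_list(struct list *l) {
        while (l != NULL) {
            struct list *n = l->next;
            free(l);
            l = n;
        }
    }

    int main(void) {
        struct list *l = alloc_list(10);
        l = reverse(l);
        visit(l);
        free_list(l);
        return 0;
    }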
What I'm running is a version of [indiscernible] which just runs my three examples; normally it's useful for testing. I'm sorry, it's a pure ASCII tool, so the output is just a text file which gives us the invariants at all points. Let me check; I forgot which point I wanted to show. So after the allocation, we have -- actually, this is a kind of assert, which will verify that LL is a list here. So let's check line 48. Okay, this is the invariant here. We have variable LL corresponding to node 5 -- this is the node corresponding to the address of this variable -- and we see that node 5 contains one points-to edge of size 4, with content node 7, and node 7 corresponds to some [indiscernible] predicate.

In the case of this example, I forgot -- I omitted to say that I used a file as a parameter, which collects a number of [indiscernible] definitions; it contains not only the list definition, but also a set of other [indiscernible] definitions, which I use for the other examples. But what happened here is that the tool basically computed this list predicate as a least fixed point over the routine which does the allocation: in this code, we never give the tool any information about what we are building here; it is discovered by the widening that it is building a list.

So now, let's see what happens later on. For instance, I will look at the [indiscernible] function. The loop is at line 6; let me search for the invariants we get. Okay, so this is the first invariant we get. In the beginning, we see that we have variable LL and the cursor -- this is a local variable of the function -- all pointing to the same node 5, which is the address [indiscernible]; this is because we have not yet started iterating over the structure. And afterwards, what we see here is that we obtain a segment: we still have [indiscernible] variables L and LL pointing to node 4, which is the beginning of the segment, and the end of the segment is node 5, pointed to by the cursor here. What this means is that the widening which was computed at the loop here did materialize the segment: it inferred that there was a segment. And since I don't have too much time, I will move on to the next example.

The second example is a tree example: binary trees with [indiscernible] pointers. In this case, I did not include the construction routine. The example is a little bit contrived, but what it does is it assumes that it starts with some well-formed binary tree, and then it follows a search path through the tree and does various transformation operations along that path, including swaps and rotations, left-to-right and right-to-left. And then there is also the [indiscernible]. In order to make the example a little bit simpler, I just abstracted the conditions using [indiscernible] variables. But this is just an example to make it a little bit more [indiscernible].

So let's look. Okay, here is the log; it's quite long. So what is the state? I'm just going to show you one state, at the end of the loop, at point 59 -- okay, this is the very last line in the loop. Maybe I should just show the first one, at 19. That won't help here. Yes, I did not prepare the examples well enough.

>>: That's okay.

>> Xavier Rival: So this is what we get at the second point. This is the beginning of the loop; we have just one disjunct. We have a disjunction domain -- for now, a very simple implementation of the trace partitioning domain -- and at the beginning of the loop, you have only one disjunct. And here again, what we can note is that the analysis did infer that we could have this tree segment predicate here. In the case of the tree, what the tree segment means is this: we have a tree like this, this is the root of the tree, which is pointed to by variable T here, and we have our cursor, which is C, which points somewhere inside the tree. What the tree segment predicate describes here is whatever is not below C -- this [indiscernible] -- and the tail of the segment here corresponds to the sub-tree pointed to by C. So this is what the segment inductive predicate looks like when applied to a tree.
So this is what we obtain at the loop head. And if we reach -- I think it was at 59, the tail of the loop -- we had a number of disjuncts which were introduced, corresponding to all -- sorry? We're running out of time. Should I stop right now?

>>: Yeah, you'll be here all the week.

>> Xavier Rival: I will be here all the week, so if you have more questions... Sorry for talking for too long. I have a conclusion, but I think it's less interesting than questions; questions are always more interesting.

>>: You can do the conclusion.

>> Xavier Rival: Okay, I conclude. This is what it does, and this is what it does not.

>>: You can read it.

>> Xavier Rival: That's right, exactly -- you see, I did not talk that long in the introduction. So now, do you have questions?

>>: So you [indiscernible]. [indiscernible]. Do you have something like a disjunction [indiscernible]?

>> Xavier Rival: No. We have trace partitioning; however, I think that we have not figured out yet the right criteria for trace partitioning in this setting. What I mean here is that in Astree, we do trace partitioning too, using information about the control-flow history -- that is, we record which branch was taken here and there -- and it seems that this is not the best we can do in the case of shapes. I think it's useful to do this in the case of shapes too, but in other cases, you also need to remember which [indiscernible] were applied to obtain which disjunct, and sometimes to merge disjuncts later on. This we don't do yet.

>>: [inaudible].

>> Xavier Rival: It should be added here, to the list of what we don't do; the list is very long. Other questions?

>>: We can conclude here.

>> Xavier Rival: Okay. We can conclude here.