>> Francesco Logozzo: Good morning. So it's my pleasure to introduce Xavier
Rival. I've known Xavier for a very long time, more than ten years -- longer
than I've known my wife. So usually, I always say silly things when I
introduce someone, so I'll stop with the silly things now. Let's talk about
serious things.
So Xavier did his Ph.D. thesis with [indiscernible] at ENS, then he did a
post-doc at Berkeley, and now he's a researcher at INRIA and has been working
on many interesting projects, many of which he can talk about, so I can make it
short. Today he's going to talk about the work he's done for [indiscernible]
to discover the properties of complex data structures.
>> Xavier Rival: Okay. Thank you very much, Francesco. So indeed, today
we'll talk about MemCAD, which is a modular abstract domain for reasoning about
memory states. So many people have been involved in this work. So let me cite
them on this slide. So Bor-Yuh Evan Chang has been working with me. I mean,
we have been working on this together since the beginning of this work in 2006.
And then Vincent Laviron, Pascal Sotin, and Antoine Toubhans have been doing
their internships, Ph.D.s, or post-docs on this topic.
So in this work, I am interested in safety of software, so I want to verify
properties such as the absence of run time errors, the absence of non-desired
behaviors, and some functional properties as well. These are all undecidable
properties, so we have to make some compromise here. Our compromise is to not
be complete: we'll use the abstract interpretation approach, which is sound.
We will discover all bugs of a certain kind. For instance, if we look for run
time errors, we will find all run time errors of some class.
This will result in automatic tools which are, however, not complete. So the
three main steps are: first, to choose an abstraction of the concrete
semantics, so we have to define a concrete semantics and an abstraction which
performs a conservative approximation. Then we derive from it abstract
transfer functions, which allow us to preserve conservative approximations
when computing post-conditions, for instance. And the last thing is that we
want to ensure that we always terminate, so we use widening operators for
that, which are basically over-approximations of concrete joins.
And an abstract domain is the combination of those three things: an
abstraction, plus a set of transfer functions, plus a widening [indiscernible].
So before we talk about memory properties, let us look at the status in
numerical properties. The status is actually quite good. Many people have
been working on this topic for over 30 years. And we have many abstractions of
functions from variables to [indiscernible] integers or floating points. And
in particular, we have a lot of available abstractions such as intervals,
octagons, and convex polyhedra, among many others.
So actually, we even have implementations of those domains, which are libraries
so you can just use one interface for transfer functions and widening, and
actually, just change one line in the analyzer and you can switch from the
cheapest one to the most expensive one in a very short time.
So it's also possible to combine many of those numerical abstractions in quite
efficient ways. So the [indiscernible] Astree analyzer has been obtained by
combining over 40 numerical abstract domains, and it's very scalable: it can
work on programs of over one million lines of code, which are embedded
[indiscernible] softwares. It's also very precise; that is, it allows us to
prove the correctness of some families of such programs, which is quite nice.
On the other hand, if we look at memory abstractions, there are many
difficulties which have not been resolved yet, so we cannot quite have such
constructions, such libraries. It's not so easy to use and [indiscernible]
memory abstraction. A memory abstraction should abstract states which are
more complex than functions from variables to values, because we have to take
into account the format of the memory, with the heap, the stack, the
environment.
The concrete operations which we want to analyze are also more complex. For
instance, we have to deal with pointer arithmetic and memory management, which
are nontrivial operations. And the properties we want to verify can also be
very [indiscernible]. For instance, we look at the preservation of some data
structures. So there are several difficulties beyond the fact that the
concrete semantics is complex. The second point is that there is a huge
variety of data structures which we may want to abstract, like linked
structures, arrays, structures involving relations between pointers, between
pointers and numerical values, and combinations of all of them. So it's a very
challenging task.
And for many of those structures, you can find some specific algorithms. For
instance, some algorithms to analyze [indiscernible] or just arrays. And in
many cases, those can be quite expensive, which also makes it quite
challenging to achieve scalability.
So my long-term goal is to set up a general purpose library of memory abstract
domains and in this talk, I will present three main contributions. First,
we'll choose concrete semantics, which is not limiting. So we'll be able to
analyze a wide variety of structures, and I'm going to give some examples at
the end of the talk.
I will also show an abstraction, which is mostly based on inductive structural
properties but which can be extended in a number of ways, which can be
parameterized with a choice of numeric abstraction, with a choice of a set of
structures to consider.
And I will show static analysis algorithms. I will also give a demo of the
tool which I am implementing, which is called MemCAD. It's an OCaml
implementation of this shape domain together with an analyzer for
[indiscernible]. So it's still a work in progress; of course, it's not a
finished tool.
So let's look at the parametric abstraction first. So on this slide, I'm
giving an overview of the abstraction. So at the top, you can see what a
concrete state looks like. So we have a concrete memory, which is partitioned
into regions. It's a very concrete view of the memory, where we have cells
with addresses, and variables which are mapped to some addresses. We have
blocks which contain either pointers, which are numerical values corresponding
to addresses, or sequences [indiscernible] which could be interpreted as
integers, like this one.
In the first step, we make a first abstraction of this concrete state with
shape graphs. In a shape graph, basically, we have nodes which denote values,
and edges which correspond to memory regions. We should notice here that in
many cases, shape graphs are drawn with the opposite convention, that is,
nodes correspond to memory cells and edges correspond to values. In this
case, we do the opposite. The reason why we do this will become clear a
little bit later.
And in the second step of the abstraction, we can summarize regions. In this
case, I assume that we have some [indiscernible], which is supplied by the
user, for instance. And what we can see here is that this node which I am
pointing at -- I'm sorry, I don't have a pointer -- corresponds to an address;
as I said, a node corresponds to a value. So it corresponds to the address of
a list, and the purple region here, all those purple cells, can be summarized
with this [indiscernible] edge departing from the node corresponding to the
address of that list.
So we can notice that in those steps, I did preserve [indiscernible]. That
is, I have some green cell here, which corresponds to this green points-to
edge here, and the same for the red and blue ones. And the purple region here
was abstracted into a bunch of points-to edges in the first step; in the
second one, it gets summarized according to the inductive definition.
So this is an overview of the abstraction we are using for [indiscernible]
data structures.
So now, how about the concrete semantics? Well, we start from a very concrete
semantics, where the heap is divided into blocks, which are marked either as
allocated or as free. Memory cells all have a numeric address and a numeric
size, so it's very concrete. And pointers are also numeric values, which can
be decomposed into a base and an offset.
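A minimal OCaml sketch of such a concrete model (the types are illustrative
assumptions, not the analyzer's actual definitions):

    (* A very concrete memory model, as just described: heap blocks marked
       allocated or free, cells with numeric addresses and sizes, and
       pointers as numeric base-plus-offset values. *)
    type addr = int
    type block = { base : addr; size : int; allocated : bool }
    type pointer = { pbase : addr; poffset : int }  (* value = pbase + poffset *)
    type cell = { address : addr; cell_size : int; contents : int }
    type memory = { blocks : block list; cells : cell list }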
So now I will describe in a slightly more formal way how the abstraction works
and comment on its features. As I said, edges will denote memory regions,
whereas nodes denote values. The most important kind of edge we'll deal with
is called a points-to edge, and it denotes a contiguous memory region. So we
typically note a points-to edge like this: we have [indiscernible], a field F,
and a destination beta, and what it denotes is a memory cell at some address
corresponding to [indiscernible], containing the value corresponding to beta,
at some offset corresponding to the field F.
So what is this nu here? It's actually implementing -- well, it's denoting
the physical mapping of symbolic values, that is, nodes, into values. So it
plays a role in the abstraction which I'm going to comment on a little bit
more later. But basically, when we concretize a shape graph, we take a set of
predicates like these, which we can also write as a mathematical formula like
that: alpha dot F points to beta. And we'll get a set of pairs made of a
concrete memory together with what we call a valuation, which maps abstract
addresses into physical addresses.
So this is a concretization formula. I think it just does what we said so
there is no need to comment on it more.
So we use separation logic; that is, a graph is made of a set of edges, and
each edge denotes a disjoint part of the memory. So distinct edges denote
pairwise disjoint memory regions. The advantage is that this allows local
reasoning: when we compute transfer functions, it will be quite easy. On the
other hand, computing a join of abstract states is more challenging than if we
were doing regular conjunctions.
One thing we should note is that we use a so-called field splitting model.
Again, in the [indiscernible] community, there are several approaches, and in
our approach, we always split at the level of edges and not at the level of
pointer values. So for instance, if we look at this shape graph, we have a
node here and two points-to edges for fields F and G. And it stands both for
stores like the one on the left, where the two fields are different and
contain addresses of different cells, and for stores like the one on the
right, where actually both fields contain the same pointer value.
So far, we have only considered points-to edges, and we cannot yet summarize
unbounded memory regions, so we are going to do that now.
So let's consider the set of concrete states we have here. In all those
states, we have a variable X, which is allocated into the green cell here.
This is variable X, and it contains the address of something: here we have a
list of length one, two, three, and so on.
And so what we want to do is to abstract all those states, and all those where
we have a list of length [indiscernible], into just one state. To do this, we
add some additional predicates to our logic: inductive predicates. Here we
assume that we have a list predicate, which is defined and which basically
expresses that some address alpha is the address of [indiscernible]. And
basically, all those states can be abstracted into this formula, where we say
that the address of X contains alpha, and alpha is the address of a list,
which is disjoint from the cell corresponding to X.
And in a graph representation, we use [indiscernible] edges for those
predicates.
So how do we write this definition in separation logic? It's just the usual
definition: we have a list predicate, alpha dot list, which means either that
alpha is equal to zero and we have an empty memory fragment, or we have a
memory fragment which contains two fields, alpha dot next and alpha dot data,
and some region at the address given by the value of field next from alpha,
which is [indiscernible]; in this case, alpha is not [indiscernible]. It's
just a classical inductive definition. And in practice, we either use
handwritten definitions or definitions built into the analyzer -- this is the
case of Smallfoot or Space Invader -- or we let the user supply them, and that
was the case in Xisa. TVLA uses another approach that is not based on
separation logic, but we can view TVLA as a way to hand-write -- I mean, you
can hand-write in TVLA something which corresponds to this definition.
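Written out, the list definition just described reads (a reconstruction in the
usual separation-logic style):

    alpha.list :=    (alpha = 0  /\  emp)
                  \/ (alpha <> 0  /\  alpha.next |-> beta
                      *  alpha.data |-> delta  *  beta.list)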
And it is also the case in the analyzer I'm going to demo. It's also possible
to infer part of those definitions. We have done this in Xisa with
[indiscernible], in POPL two years ago, and actually, it was with some
specific assumptions: there, we were inferring some inductive definitions of
some kind, and that was achieved thanks to an improved way to [indiscernible].
In general, it's very hard to infer the definitions at this time.
How about the concretization? Well, the concretization, again, is very
intuitive. It's simply a least fixed point over formulas and memories.
Basically, when we have such an inductive definition, there is a notion of
syntactic unfolding, which comes up by just [indiscernible] the formula. And
what we can do is compute the join of all the concretizations of those
sequences of unfoldings in one or several steps. This is a least fixed point
of some function, so we can compute it. Well, we cannot quite compute it in
terms of computer computability, but on a sheet of paper, it's computable. So
this defines the concretization of inductive formulas.
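In other words (again a reconstruction of the slide), the concretization of an
inductive predicate is the union over all finite unfolding sequences:

    gamma(alpha.list) = Union over n >= 0 of gamma(unfold^n(alpha.list))

which is indeed a least fixed point of the unfolding operator.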
So here are some examples of more involved structures that we can define with
similar definitions. In those cases, I just give the most significant
formula; I don't give the full definition like I did for the [indiscernible].
So first, a tree. Well, it's very similar to a list, except that we have two
sons which both hold [indiscernible]. We can also capture relations between
pointers: this is the case of the doubly-linked list. In this case, we extend
the definition with a parameter which tells where the prev field should point
to. So basically, when we unfold address alpha with parameter delta, the prev
field gets delta, and the next field should point to a doubly-linked list, the
prev parameter of which is alpha. We just thread the pointer relations
through the parameters of the inductive definition. And we can also do the
same trick with numerical values. Here is an inductive definition of a sorted
list: in that case, the parameter denotes a lower bound on the first value.
So when we unfold it once, we get a data field, which should be greater than
the delta parameter here, and the lower bound of the tail of the list is the
value of the data field.
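In the same style as the list definition, the three structures just sketched
might be written as follows (reconstructions; the exact field and parameter
names are assumptions):

    alpha.tree :=    (alpha = 0 /\ emp)
                  \/ (alpha <> 0 /\ alpha.left |-> beta * alpha.right |-> gamma
                      * alpha.data |-> delta * beta.tree * gamma.tree)

    alpha.dll(delta) :=    (alpha = 0 /\ emp)
                        \/ (alpha <> 0 /\ alpha.next |-> beta
                            * alpha.prev |-> delta * beta.dll(alpha))

    alpha.sortll(delta) :=    (alpha = 0 /\ emp)
                           \/ (alpha <> 0 /\ delta <= delta'
                               /\ alpha.data |-> delta'
                               * alpha.next |-> beta * beta.sortll(delta'))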
Now, when we start looking at programs, we see that in many cases when we have
such structures, we may have pointers into the structures. So we may have
states like this one, where we have variable X and variable Y, which both
point to some list. But actually, X points to the head of the list and Y
points somewhere inside the list.
This would occur, for instance, if you do an insertion inside a sorted list;
you would encounter such states. And we cannot quite say that we have X
points to a list and Y points to a list, where the [indiscernible]. It's not
possible because of the separation, and also we have to capture the relation
that Y points somewhere into the list. But it's okay, because we can actually
split the state with colors like I did a few slides ago.
So here we have a blue area and here some red area. The red area is simply a
list, and the blue area can also be expressed using some inductive definition,
which I give here, which is a list with some end point. However, that
definition derives in a straightforward manner from this definition, so it
does not quite make sense to add more definitions to the parameters of the
abstraction. It's more interesting to just keep the knowledge that here we
have a segment of a list, and this is what this new blue predicate does.
What this predicate expresses is exactly what this list-with-end-point
definition expresses. The only difference is that we have a single, generic
segment predicate, which is parameterized by the inductive definition that you
want to consider. So we just have one single predicate, which can also be
used, for instance, for [indiscernible] and other data structures.
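A reconstruction of the segment predicate for lists, and its generic form:

    alpha.list *= beta :=    (alpha = beta /\ emp)
                          \/ (alpha.next |-> gamma * alpha.data |-> delta
                              * gamma.list *= beta)

The same schema, alpha.iota *= beta, applies to any inductive definition iota,
which is why one single segment predicate suffices.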
So now, let's look at the numerical information. If we want to
[indiscernible] some real data structures, we often have to take care of the
data which is stored inside them. We should notice that we decided to let
each node, which holds a name, stand for some numeric value. In the
concretization, a node will always concretize into some sequence of bits,
which could be either the address of a cell or the contents of a cell. So
this is very convenient to do the product with a numerical abstraction.
For instance, our abstract values will be made of a shape graph plus a numeric
value. So, for instance, let me just switch to the next one. If we look at
sorted lists, for instance, we just take this definition. If we unfold it a
little bit, we get a shape graph like this one. This is partly unfolded, so
here we have the data, the contents of the data fields of the first
[indiscernible] which I exposed. And if we look at the constraints they
should satisfy for it to be a sorted list, this one should also be
[indiscernible] beta one. And this constraint can be expressed in the
numerical [indiscernible] of octagons.
So we just have to pair a numeric abstract value together with our graph.
However, there is one subtle difficulty when we want to compare that abstract
value with this shape graph. Actually, what we can see is that this one is
included in that one, because this one basically denotes any sorted list of
length at least one, and this one denotes any sorted list of length at least
two. So this one is included in that one. However, comparing them is not
[indiscernible]. In particular, you can notice that here we have this node,
which is called [indiscernible], which does not exist here. So the comparison
is not trivial. And this is due to the fact that the numerical abstract value
we have here depends on the graph.
So we have different graphs with different sets of nodes, [indiscernible], and
comparing them is not quite trivial. So actually, the construction we use was
introduced by Venet in 1996. It's not a reduced product; it's slightly more
complex. It's like a reduced product where the lattice of the second member
depends on the first member of the product.
So here we have a shape graph, and basically, the abstract values will be made
of a shape graph and some numeric value which is taken in the lattice defined
by that shape graph; that is, the dimensions of the numeric lattice are the
nodes of the graph.
So if we look at the shape of the whole abstract lattice, it will be a bit
like that. Here, we have shape graphs on the left, and here we have numerical
lattices corresponding to those shape graphs. So one abstract value will be
given by one shape graph and one value in the lattice on the right.
And when we do comparisons, or when we compute transfer functions, we often
have to, for instance, compare two shape graphs which are different, and we
are going to use the relations between those shape graphs to map the numerical
values, by converting one -- for instance, this one -- into a format which is
in that lattice, doing several approximations, and then the comparison can be
done.
So okay. I've presented the abstraction and now I can talk a little bit about
the static analysis. I don't know what is the tradition. Should I just ask
for questions?
>>: You will be interrupted.
>> Xavier Rival: I will be interrupted so I don't need to -- there is no need
for me to just -- okay.
>>: [indiscernible] in this abstract [indiscernible].
>> Xavier Rival: Yes, there is.
>>: Is it constraints?
>> Xavier Rival: Sorry?
>>: So what kind of representation -- I'm just trying to understand what kind
of representation do you have for --
>> Xavier Rival: So for the data structure, it's very trivial: you just have
pairs. But if you look at the implementation, it's not just a data structure;
you have to look at the structure of the code. So I don't know if you know
how we implement code in ML. But basically, the way we implement this is that
we have some modules corresponding to the numeric lattice. We have a numeric
lattice module, and the abstraction for the whole memory is a functor which
takes -- I mean, which takes this numeric lattice and lifts it into an
abstraction of the memory, including both shape and --
>>: I would think that you have your shape graph with summarizations, and some
nodes are associated with elements from a numerical abstract domain?
>> Xavier Rival: Basically, yes. So I mean, the fact that the numerical
domain lives under the shape abstraction can be seen in the implementation. I
mean, if you are implementing a reduced product, basically you will have, say,
a numerical abstract domain D0, a numerical abstract domain D1, and above
these you make a product functor which takes those two modules and builds a
new numerical abstraction for you. In this case, we cannot quite do that; if
we do that, it's a disaster. So what we do is we take a numeric domain as an
input, we add the shape, and we get an abstraction. So it's a very
significant property for the implementation of the analyzer.
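A minimal OCaml sketch of the module structure described in this answer (the
names are illustrative assumptions, not MemCAD's actual interface); the point
is that the memory domain is a functor taking the numeric domain as input,
because the numeric lattice depends on the shape graph and cannot be obtained
by a plain product:

    module type NUM_DOM = sig
      type t                          (* numeric abstract values *)
      val join  : t -> t -> t
      val widen : t -> t -> t
      val is_le : t -> t -> bool
    end

    module MakeMem (N : NUM_DOM) = struct
      type node = int                 (* symbolic values: graph nodes *)
      type edge =
        | Points_to of node * string * node   (* alpha.f |-> beta       *)
        | Inductive of node * string          (* alpha.list, alpha.tree *)
      (* An abstract value pairs a graph with a numeric value whose
         dimensions are exactly the nodes of that graph. *)
      type t = { graph : edge list; num : N.t }
    end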
Okay. Other questions? Okay. So let's look at static analysis algorithms
now. Here is an overview of the analysis. Here is a very simple list
insertion function. What the analysis will do is take such code as input. It
will also take as input some inductive definitions for the structures we care
about -- that is, here, the list definition. We may have other structures
given as parameters to the domain, but it will not do anything with them. And
also, we assume that we have an abstract precondition. So here we assume that
we have a list which is already given to us. This is a whole shape graph; we
may also drop this assumption and analyze a list construction routine instead,
but here I just make the assumption to keep the analysis simpler.
And what the analysis will compute is an over-approximation of the reachable
states at each control point. So, for instance, in this case, at the loop
exit, we get this abstract state, which says that L points to a list segment,
the tail of which is pointed to by [indiscernible]. This is because C was
used to traverse the list. Actually, it's not at the loop exit; I think it's
after the insertion, because here we have the element T, which has been
inserted right after C, and here we have the tail of the list. So this is
what this abstract state says.
So what are the main algorithms of the analysis? Well, there are two main
ones, which are unfolding and widening. What unfolding will do is perform a
case analysis on summaries, to enable abstract post-conditions to be computed
in a precise manner. So, for instance, when we insert the element in the list
here, when we have this representation here, this abstraction of a list,
before we can analyze an insertion at position Y, we actually need to make a
case analysis on this structure, so we get two cases here.
The first case is the case where Y points to a list of at least one element,
which is made concrete here. And the second case corresponds to the case
where Y is actually a null pointer. So, of course, if we look back at the
code, we test Y before reading the next field, before [indiscernible]. So in
this case, we'll actually go on and try to analyze the insertion in the list,
because we know that it's not going to be equal to zero.
Then abstract post-conditions are actually not so hard to compute, because
they are always going to be computed on states which have been, to use a
[indiscernible], partially concretized. The part of the memory which will be
impacted is already completely concrete here, so we don't have to do reasoning
on [indiscernible] predicates at this stage. This is what we do here.
And to ensure termination -- you know, to avoid abstract states growing
forever, like here, for instance -- we have a widening operator which folds
back graphs into simpler graphs. So the unfolding principle is trivial: it
just stems from the concretization of the inductive definitions.
There is one subtle point, though. If we look at the unfolding operation in
the analysis, it will actually compute an over-approximation of the
[indiscernible] unfolding, because when we unfold a predicate like this one,
we also get some numeric side predicates, which may not be accounted for
exactly in the numerical abstraction. The numerical abstraction may make some
approximation here. So here those are very trivial predicates, but we could
imagine that we lose precision here.
So this is why the soundness of unfolding is expressed that way: the
concretization of the input is included in the join of the concretizations of
the outputs. And in some cases, it's not an equality; it may be
[indiscernible].
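Schematically (a reconstruction of the statement on the slide): if unfolding
turns an abstract state m# into the disjuncts m1#, ..., mn#, then soundness
says

    gamma(m#)  is included in  gamma(m1#) U ... U gamma(mn#)

and the inclusion may be strict when the numeric domain loses precision on the
side predicates.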
So after we do unfolding, we can do local reasoning, like I said. I think I
can pass over this slide; it's very trivial. One issue, though, is that
unfolding is not as trivial as it looks. Unfolding in itself is not hard, but
what may be hard is to discover what part of the shape graph needs to be
unfolded. So let's look at this example.
So here we are starting with the [indiscernible]. We are traversing it up to
some point and then we stop, and then we decide to go back a number of steps
after we've checked that we're allowed to so we won't [indiscernible] here.
So let's look at what the analysis will compute. Here is the state we get
after the [indiscernible]; there is no surprise here. We have [indiscernible]
here. Now, the question is: what should we unfold to materialize this edge?
So the first thing we see is that we want to materialize fields of C. So the
first thing [indiscernible] tells us to do is to unfold this right now. It's
actually a good choice, because after we do it, we can see that here we have
the C field which has been materialized.
So we can materialize this value. Now the problem is that we want to
materialize one more [indiscernible] of a prev field, and we are in trouble
here, because there is no prev field starting from that node. But all is not
lost, actually. Beta prime happens to be a parameter of the [indiscernible]
segment. What this means is that beta prime is actually the last prev field
encountered before alpha prime. So this suggests unfolding the segment.
And we can actually do this, because a segment is just some form of
[indiscernible] predicate. So in particular, we can prove a simple lemma: if
you have a segment of length I plus J, it can always be split -- oh, I think I
had some problem with figures, so I have two figures which are the same; I
will rewrite here. So if we have a segment of length I plus J, we can always
split it into a segment of length I plus a segment of length J. And in
particular, if we take J equal to 1, a segment of length one [indiscernible]
into just one element. So this is what we have done in this graph: we have
unfolded this segment into a segment, of the length of that segment as well,
plus a segment of length 1.
So actually, there is another case, which I have not represented here, which
is the case where the segment is of length zero. Normally, this is ruled out
here, just because the prev of C cannot be equal to null, and the prev field
of alpha, that is beta, was equal to null in the beginning. So in that case,
the segment of length zero, the case goes away.
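The splitting lemma used here, written out (a reconstruction, for a generic
inductive definition iota):

    alpha.iota *= gamma   can be unfolded into
    (exists beta)  alpha.iota *= beta  *  beta.iota *= gamma

so a segment of length i + j splits into a segment of length i followed by a
segment of length j; taking j = 1 materializes the last element of the
segment, which is exactly what the analysis does in this example.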
So the issue is not to do the unfolding -- the unfolding steps are not hard to
compute. What is hard is to find out which edge to unfold. To do this, we
have implemented some strategies, but they are not complete. And actually, I
think it's not possible to decide in general which definition should be
unfolded.
So now let's look at termination. Let's look at the analysis of this simple
program here, which traverses a list with a cursor. Before any iteration, we
have the cursor pointing to the head of the list. After one iteration, it
points one step farther, then two steps. Of course, we can see that if we
keep running the analysis like this, it will not terminate; we never reach a
fixed point unless we do some kind of widening.
So let's look at the principles of the widening. Again, the widening is based
on the separation principle, and it will do two things at the same time.
There is a process where we split both inputs into regions in a pairwise
manner; that is, if those are the two inputs of the widening, the algorithm
will have to compute a blue region here matching the blue region there, the
red region here matching the red region there. And at the same time as it
does this region matching, it will apply some semantic rules for region
weakening. That is, for instance, if we have on one side a segment, and on
the other side a segment plus one element of the structure corresponding to
the segment, then we can always over-approximate this one with a segment.
This one already is a segment, so we can say that the widening of this region
together with that one will be a segment.
And the widening will actually discover the matching at the same time as it
progresses, as it applies those semantic rules. So it's a fairly complex
process.
I'm going to show a couple of rules to give you an idea. So here is the
segment introduction rule. What the segment introduction rule does is this:
if we notice that we have two nodes in one input whose symbolic variables
should approximate the same thing as the symbolic variable corresponding to
one single node in the other input, then we can always say that this can be
over-approximated by an [indiscernible] segment. So if the region between
those two nodes can be over-approximated by a segment, then we can just
introduce one segment in the result.
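Schematically (a reconstruction of the rule):

    input 1:  two nodes alpha and alpha', whose in-between region can be
              over-approximated by the segment  alpha.iota *= alpha'
    input 2:  one single node standing for both alpha and alpha'
    result:   the segment  alpha.iota *= alpha'  (which admits the empty case)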
This actually occurs when we do the widening after one iteration on our list,
for example. So here we have variables storing the same value, denoted by
alpha zero. Here we see that C corresponds to beta 1, which is the address of
a list. So the first rule will simply say that here we have the
[indiscernible]; we know that in both cases, C points to a list. We'll keep
that. And then we are left with this blue region here, and we see that this
value here corresponds to that value there, and C corresponds to that value
over there. So we can just try to check whether we can over-approximate the
leftover here by a list segment.
And we have an inclusion checking algorithm here -- this is another algorithm,
which I won't present; its principle is very similar to the widening,
actually -- which will check that. And after it proves that this corresponds
to a list segment of [indiscernible] 1, we'll just introduce a segment here.
And actually, after we get this, in the next iteration, what we'll see is that
we'll apply exactly this rule, and this will allow us to prove that this graph
is a fixed point.
So this widening operator is based on a rewrite system where both the region
matching and the output are computed at the same time. Those rules can all be
proved sound, of course. And it's a terminating rewrite system, which means
that the widening computation itself terminates. However, the big issue is
that those rules are not confluent. What this means is that depending on the
strategy, the order in which we apply those rules, the algorithm may get
stuck, so we would risk imprecise results. So there's a lot of work that goes
into the strategy for applying those rules. It's not very interesting to
discuss today, but it's a lot of work, actually, when designing the analyzer,
to make sure we don't apply the wrong widening rule in the beginning and then
fail to compute a precise result later on.
So the properties of the widening: as a consequence of the rules being sound,
it returns an over-approximation of the concrete join. We can also prove that
the widening iterates terminate. Typically, the way we prove it, given all
the rules, is by first considering the segment introduction rule. Depending
on the concrete states we are over-approximating, depending on the number of
variables, we can introduce only finitely many segments, and so after a given
number of iterations, this rule cannot apply anymore. And when the segment
introduction rule cannot apply anymore, all the rules will at least preserve
the number of edges or decrease it, and this allows us to prove termination.
So since we have a widening in the shape abstract domain, and the structure
that I did talk about -- the cofibered structure, which allows us to make the
product with the numerical domain even though the number of dimensions in the
numeric abstract values depends on the graph -- this construction also comes
with a widening construction. So basically, what [indiscernible] proved is
that if you have a widening in the [indiscernible] domain, and if you have a
widening on the base domain, such as the graph domain, then you just get a
widening for free in the combined domain. So let me see. I thought I had a
picture.
So the way it works -- I will just come back to the figure. The way the
widening in the combined domain works is: first, it will reach a stable value
S1 in the left domain. And then, when this one is stable, all iterates will
be pairs of S1 and some numerical abstract value in the domain corresponding
to S1. And since we have a widening on the lattice corresponding to S1, we'll
converge [indiscernible].
So if we compare with other shape analysis tools: we decided to use a
widening. What the widening does is rely on the history of abstract
computations in order to guess a bound, to guess a [indiscernible] point.
This is quite different from what is done in many other tools. In many tools,
people use what they call canonicalization. That is a unary operator that
will turn some abstract value, in a possibly infinite lattice, into an element
of a [indiscernible] lattice, which is finite. And this allows to compute a
fixed point in that abstract domain.
I think that this approach is kind of different. First, the use of a finite
lattice can be a limitation. But on the other hand, I think it's good to have
the ability to weaken abstract values at other points than widening points and
loop iteration points.
So in our case, we started with a widening operator. We observed that this
allowed us not only to deal with an infinite lattice, but also to have rather
quick convergence and to manipulate typically low numbers of disjuncts
compared to other frameworks. But on the other hand, I think we want to have
a look at canonicalization too, so I'm currently working on one which would
share some of the principles of widening while not being as [indiscernible] as
canonicalization operators have to be, because they have to work in a finite
lattice to ensure termination.
Okay. So now, I will talk about two applications of the analysis. The first
one will be the analysis of low-level C softwares. So far, what I've
described looked somewhat like an analysis of Java programs; I did not show
any C-specific features in the examples, so I will show C-specific features
now. And in the next part, I will discuss some very specific forms of
embedded softwares we have to analyze and how we extended [indiscernible] to
do so.
So before I do that, is there any question on the analysis algorithms?
I have a question. How much time do I have?
>>: Theoretically, you have 'til noon.
>> Xavier Rival: Okay. And practically?
>>: You can finish earlier, if you want.
>> Xavier Rival: Okay. So this means I cannot finish later, I guess. Okay.
So let's continue. So what happens when we look at C code? Well, typically,
we have to deal with many low-level features. For instance, a typical
programming pattern is to use nested aggregates: structures inside structures,
even unions inside structures. In this case, we have nested aggregates.
We may manipulate pointers to fields, and also there is memory management --
that is, allocation and deallocation -- which needs to be analyzed and
verified in a sound manner. So let's look at contiguous regions first. How
about nested aggregates?
Well, we found that we just had to change offsets. In the beginning -- I
think in all the slides I did show so far -- all I was using was fields as
points-to edge labels. That is, offsets were very simple. Well, now we'll
add some more complex offsets. So the first extension we need to make, in
order to deal with nested structures, is to have sequences of fields. So one
offset will be a sequence: in this case, we'll have A dot B dot X, A dot B dot
Y. Once we do that, nothing changes; we just need to change the way we deal
with offsets.
The other thing is arrays. So the first thing is, well, again, we can change
the notion of offsets and just say offsets will be numeric values. It's not
quite enough, as I'm going to show on the next slide, but in principle, again,
we just need to change the notion of offsets and go from the symbolic offsets
we have here to some numerical notion of offsets. So I will discuss this some
more on the next slide.
But what I want to stress is that when we made those changes, we did not
revise the algorithms: the widening and unfolding remain the same.
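A small OCaml sketch of the generalized offsets (illustrative types; as said
above, only the notion of offset changes, the algorithms do not):

    type offset =
      | Field of string        (* .f      : a plain field             *)
      | Path  of string list   (* .b.x    : nested aggregates         *)
      | Num   of int           (* numeric : array cells, byte offsets *)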
So now, how about array abstraction? Say we have some array of length 20,
storing 32-bit integers. The first approach we can have is to just use one
single points-to edge and let this edge have size 80. We never asserted that
a points-to edge should have a single memory cell as its size, so it could be
several cells. The advantage of this approach is that we can delegate the
abstraction of the content of the array, that is beta, to some array-specific
abstraction. So we can just say that beta will be a very, very long numeric
value, which is the content of the array, and we will use some array
abstraction -- there are many; I think people in this room have proposed some.
So you can just imagine using one of them.
In the second approach, which we have actually implemented in the analyzer
recently, we put into our own analyzer one of those array abstractions, where
we use segmentations of blocks. So this is what we have here. And the
segment bounds correspond to numeric offsets, which are symbolic.
So, for instance, here I expose one fraction of this array, where we have one
cell which is materialized here, and here we have two array segments which may
be of length zero, or could be of length one, could be of length, say, 19 in
that case, because the array is of length 20. And we implemented those array
segmentations in the analyzer. Again, this change did not require changing
the unfolding or the widening of the shape domain; those just remain the same.
So I'm going to talk some more about how we introduced this abstraction with
segmentations in the last section, because it did play a role in the analysis
of those [indiscernible] I will tell you about.
So now, how about the pointer model? In our analyzer, a pointer can be
abstracted by some points-to edge, from one node corresponding to an address
to another node corresponding to an address. However, as I said, in C you may
have to deal with pointers to fields. So how do we deal with pointers to
fields?
A first solution would be to rely on the numerical abstraction to capture some
[indiscernible] between symbolic values. Several can. But it's a poor
solution, because in that case, we would lose, in the shape abstraction,
important information about the pointer.
So our solution is actually to label the destinations of points-to edges with
some offsets. So, for instance, if we want to describe this concrete state,
we have two blocks, one with an address beta, and here we have field F, which
points to field G. What we'll do is simply extend the points-to edge, adding
the G field as a target. And what it says is that the contents of the cell we
are describing correspond to the value denoted by beta plus the offset
corresponding to G.
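A sketch of the extension (illustrative types): both ends of a points-to edge
now carry an offset, so that the cell at alpha + off(F) contains the value
beta + off(G):

    type node = int
    type offset = Field of string | Num of int
    (* (source node, source offset) |-> (destination node, destination offset) *)
    type edge = Points_to of (node * offset) * (node * offset)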
What is quite interesting to notice is that when we introduced this in our
abstraction, we met Jan Kreiker, who was doing some analysis with TVLA at the
same time, and I think he was using techniques which are very similar. So
that's quite interesting: starting from different frameworks, we can get to
[indiscernible] using actually very similar techniques.
So a couple of years ago, Pascal Sotin and Bertrand Jeannet did look at how we
did [indiscernible]. We noticed that when we made this change, it also did
not require any change to the [indiscernible] algorithms. So we made a
classification of pointer models, where we have either numeric or symbolic
offsets, and pointers to fields can be either allowed or disallowed. So we
get four models. For instance, in [indiscernible] we have symbolic offsets
and no pointers to fields. In C, you have numeric offsets and you have
pointers to fields. And the nice thing is that our framework covers basically
all four models. But in my current implementation, I just take numeric
offsets with pointers to fields allowed. Basically, in this model, you can
usually encode all the [indiscernible] anyway.
So now, how about union structures? As I said, in C you have aggregates with
unions, where you have basically multiple views on memory regions. So Antoine
Mine proposed a solution in 2006, which basically introduces non-separating
conjunction in the [indiscernible]. And we have implemented this technique as
well.
So we extended our abstractions with non-separating conjunctions. That is, if
we look at our abstract predicates, what they represent is a form where we
have a big separating conjunction of many sub-terms, which can be either
points-to predicates or local conjunctions. But under a local conjunction, we
only have points-to predicates. The fact that the conjunctions are local
makes it quite easy to preserve the analysis efficiency. And if you want to
reason about unions, you never need to have one of those non-separating
conjunctions above [indiscernible].
Okay. So the last thing about the analysis of C softwares is dynamic memory
management. The question is: how do we verify that free of P is safe? If we
look at the [indiscernible]: if we do free of P, where P is a valid pointer
but not the [indiscernible] which was allocated by [indiscernible], the
behavior is undefined. So we have to make sure, before we do it, that P is
indeed the address of something which is allocated.
And the funny thing is that to actually capture this behavior, we have to look
at a very, very concrete view of the program semantics, where we actually take
the allocation table into account. It's not the model in which most people
usually reason about programs. But in that case, we did have to use this as a
concrete semantics, and it was not a problem at all with our model.
In our analysis, we track the allocation table in a very trivial way, by
marking each node of the shape graph which corresponds to the address of a
region, together with the size it corresponds to. Those are additional small
bits of information we need to add to the shape graphs. Again, this doesn't
require any change to the widening and unfolding algorithms, except that we
have to over-approximate, of course -- check that on both sides the allocation
markers are the same, and so on. But [indiscernible].
When we look at the inductive structures, we should also summarize those
[indiscernible]. For instance, if we have a list living in the heap, then we
should record on each node that it corresponds to an allocated region of
length [indiscernible] data field. So in that case, we did benefit a lot from
the fact that we start from a very concrete base semantics: it made the
extension of the analysis to handle free sound [indiscernible].
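A sketch of the extra bits of information tracked per node (illustrative field
names, following the description above):

    (* For each node that denotes the base address of a heap block, record
       the block's allocation status and size, so that free(p) can be
       checked soundly. *)
    type alloc_info = {
      is_block_base : bool;        (* node is the base address of a block *)
      block_size    : int option;  (* size of that block, when known      *)
    }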
So now I'm going to talk about the analysis of some family of embedded C
softwares, which is work in progress. Actually, we have an implementation of
a prototype which will analyze some code taken from some embedded code, but
the analysis of the whole embedded code is still a work in progress. It's
also an asynchronous code; most of the work on that code is done by Antoine
Mine, in fact. So I think he published at least one paper on this work.
So before I talk about the problem in itself, I want to say just a few words
about the context. First of all, it's a follow-up of the Astree project. In
Astree, we had some synchronous embedded codes, which are very critical
fly-by-wire C softwares. And Astree worked quite well on those examples, but
at some point, it was proposed to the group to [indiscernible] other codes
which are more complex, which present different sets of problems.
So including some asynchronous codes, typically monitoring of aircraft
systems. And those systems also have to comply with a number of
[indiscernible]. So this [indiscernible] is called DO-178. It's a big book
which says how people should write aircraft software. So what
[indiscernible] -- at this level, you cannot quite use many data structures
besides arrays. So it's very, very limiting. So the fact that we cannot use
more complex structures is not written in [indiscernible], but what is written
in the [indiscernible] would make it very painful to use lists or trees or
whatever. And those monitoring applications are level C, so it's still quite
critical. So in that case, what the software has to do is manipulate lists of
messages. Those would typically be implemented with a dynamic structure. But
in DO-178, even at level C -- C is a middle level in terms of criticality;
levels are from A to E -- it's still quite hard to justify using malloc, so
people did not want to do that.
So what they did is they used what I will refer to as a free pool later on.
We have a static array, here of length 5, and each cell of the array will
store a structure. So it's just a contiguous [indiscernible].
And what the program will do is its own memory allocation inside this block.
So, for instance, it will have some set of invalid messages which are not
pointed to by the system. It will have a pointer, say, to the first element
of the array, and then at the beginning of the array, there will be a number
of elements which are linked, like this element.
So in this case, we just have a straightforward case where the second element
is pointed to by the first one, so we just have elements appearing in order.
But in the case of that program, the messages may [indiscernible] order, like
here. The program will always make sure, though, that the data are stored at
the beginning of the static region.
So the program does its own -- it manages, it implements its own malloc. This
is actually quite challenging code to analyze, because what we have is a
structure which is accessed not only as an array, but also as a singly linked
list, and it's no option to -- sorry, I'm going to discuss this later on. Let
me see. No, actually, I did move that slide.
So what we could do is keep the representation fully concrete and rely on
disjunctions to represent all possible chainings, but that would not work; it
would have an exponential blow-up here. So what we need is to abstract the
singly linked list here, and we need to abstract it together with the
knowledge that it lives in some static array.
So what we do is use abstraction at two levels. Here is a concrete state on
the left. What we can do in the first step is say that this concrete state
can be abstracted with this shape graph here. What this shape graph says is
that we have two variables, T and L, which point to some region at the address
corresponding to that node -- in this concrete example, it would be 0xA0. And
T points to the base address -- here it's a pointer to the array, so the
offset is zero -- and L points somewhere inside the array, maybe not at the
beginning.
Now, if we look at the content of the array, it will be split into two
regions. So here come the segmentations I introduced a little bit earlier.
And what this segmentation says is that we first have this blue region,
corresponding to node beta zero, and this green region, corresponding to node
beta 2. And what we want to do next is to also relate beta zero to some other
predicate, which says that beta zero actually stores a singly linked list.
So if we zoom inside this block, what we have here is a structure which is
simply a singly linked list, so we can use [indiscernible] shape graph to
[indiscernible]. So this is what we have here. This is what we call a
sub-memory abstraction. In this case, we have two sub-memories. The first
one, corresponding to beta zero, is a singly linked list, because the whole
area corresponding to beta zero is one singly linked list.
And here we just have one [indiscernible], which is of length 16, because here
we have two structures with no knowledge about their contents. In fact, this
node -- sorry, I cannot write, it's small -- this node does not carry any
information at all. It just says that here we have a raw sequence of bits.
So what we do is introduce a new numeric lattice. I put numeric in quotes
here, because it is numeric in the sense that it will constrain symbolic nodes
corresponding to numeric values, but it's not numeric at all in its
definition.
What it will do is map some symbolic variables to some sub-memory predicates,
which comprise [indiscernible]: the symbolic variable corresponding to the
contents of the memory; a range, which is defined by symbolic offsets from the
base address alpha; then a shape graph which corresponds to the sub-memory
contents; and some kind of environment -- a special kind of environment,
because what it maps is offsets from alpha into nodes of this shape graph. So
this is a binding between the view from the main memory and the sub-memory.
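A sketch of a sub-memory predicate following this description (illustrative
field names):

    type node = int
    type edge =
      | Points_to of node * string * node
      | Inductive of node * string
    (* A region of a block, delimited by symbolic offsets from its base
       address alpha, whose raw contents are re-abstracted by an inner
       shape graph. *)
    type submem = {
      base      : node;                (* base address alpha of the block      *)
      lo        : node;                (* symbolic offset: start of the region *)
      hi        : node;                (* symbolic offset: end of the region   *)
      contents  : node;                (* symbolic value of the raw contents   *)
      sub_graph : edge list;           (* shape graph over the sub-memory      *)
      binding   : (node * node) list;  (* offsets from alpha -> inner nodes    *)
    }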
So what operations can we do on those predicates? Well, the first one is
introduction. Whenever we have a points-to edge, we can say that we have some
sub-memory corresponding to its contents, but we don't know anything about the
layout of this memory; it's just yet another points-to edge. But this shape
graph, actually, is constrained here; I was describing the length of beta. So
actually, this is not introduction; this is fusion. The [indiscernible] is
fusion: if we have two contiguous sub-memories, then we can compute their
fusion, and what is quite interesting is that the shape graphs just get merged
together as a separating conjunction, as normal. We just get the two sets of
edges put together, we get another shape graph, and this is the merged
sub-memory shape graph.
So how about high-level operations? Join is actually the most interesting
operation, because this is the one which causes sub-memories to be introduced
and to be extended. So here, I just give one example. It's a case where we
extend one sub-memory after adding one element to the list which is being
constructed into the static block. If we look again at the program, here we
had this code which was building, which was extending this sub-memory area, by
allocating more and more list elements and inserting them in the right
position.
So what the join will do here -- first let's look at the first argument. In
the first argument, we have a sub-memory at beta zero, which corresponds to
some list. In the second argument, we have added one more element to the
list. And this element corresponds to this symbolic variable, which is also
known to be equal to the base address of the block plus some offset,
[indiscernible] prime, which was the offset corresponding to the list, because
we did add the element at the head of the list. And what the widening will do
is introduce a new sub-memory corresponding to that single cell here, and
merge it together with the one which existed before.
So there are two operations here. So it was this one. And this will give us a
new sub-memory here where we have a segment of a list between delta zero and
delta one and we still have the [indiscernible] which is the list.
The transfer functions are not very interesting, and I think I'm running a
little bit out of time, so I'm going to skip them. I think I can go to the
demo now, unless we have questions. Any questions? Okay. So, demo.
So I did select three examples. The first one is a list example. Here we
have a few list functions: one which will allocate a list, another which will
free one, and then another which will iterate over a list and do some things
-- this could be a printing function, for instance. And in the main function,
we first allocate a list, we reverse it, then we [indiscernible] over it, and
we deallocate it. So what I have made is a version of [indiscernible] which
just runs my three examples; it's useful for testing, normally.
So I'm sorry, it's a pure ASCII tool, so it is just a text file which will
give us all the invariants at all points. Let me check; I forgot which point
I wanted to show. So after allocation, we have -- actually, well, this is a
kind of assert which will verify that [indiscernible] is a list here. So
let's check line 48.
Okay. So this is the invariant here. We have variable LL corresponding to
node 5 -- this is the node corresponding to the address of this variable.
And we see that node 5 contains one points-to edge of size 4, whose content is
node 7. And the content, node 7, corresponds to some [indiscernible]
predicate.
In that case -- in the case of this one -- I forgot, I omitted to say that I
used a file as a parameter, which contains a number of [indiscernible]
definitions. It's not just one [indiscernible] definition; it's a set of
[indiscernible] definitions, which I use for the examples.
But in this case, what happened here is that the tool basically computed this
list predicate as a least fixed point over the routine which is allocating.
So in this code, we never give the tool any information about what we are
building here; it is discovered by the widening that it is building a list.
So now, let's see what happens later on. For instance, we can look at the --
well, I will look at the [indiscernible] function. So the loop is at line 6.
Let me search for the invariants we get. Okay. So this is the first
invariant we get. In the beginning, we see that we have variable LL and
variable L -- this is the cursor, a local variable of the function -- all
pointing to the same node 5, which is the address [indiscernible]. This is
because we have not yet started iterating over the structure.
And after, what we see here is that we obtain a segment: we still have the
[indiscernible] variable L and LL, which point to node 4, which is the
beginning of the segment. The end of the segment is node 5, pointed to by the
cursor here. So what this means is that the widening which was computed at
the loop here did materialize the segment; it did infer that there was a
segment. And since I don't have too much time, I will move on to the next
example.
So the second example is a tree example: binary trees with [indiscernible]
pointers. In this case, I did not include the construction routine, so the
example is a little bit contrived. What it does is assume that it starts with
some well-formed binary tree; then it will search a path through the tree and
do various transformation operations along that path, including swaps and
rotations from left to right and right to left. And then there is also the
[indiscernible]. In order to make the example a little bit simpler, I did
just abstract conditions using [indiscernible] variables. But this is just to
make it a little bit more [indiscernible].
So let's look. Okay. So here is the log. It's quite long. So what is the
state? I'm just going to show you one state, at the end of the loop, at point
59. Okay, so this is the very last line in the loop. Maybe I should just
show the first one, 19. It won't help here. Yes, so I did not prepare the
examples well enough.
>>: That's okay.
>> Xavier Rival: So this is what we get at the second one. So this is the
beginning of the loop. We have just one disjunct -- so we have a disjunct
domain; it's a very trivial implementation, for now, of the trace partitioning
domain. At the beginning of the loop, you have only one disjunct. And here
again, what we can note is that the analysis did infer that we could have this
tree segment predicate here.
So in the case of the tree, what the tree segment means is that we have a tree
like this, and this is the root of the tree, which is pointed to by variable T
here. And we have our cursor, C, which points somewhere inside the tree,
okay. So what the tree segment predicate denotes here is whatever is not
below C -- this is this [indiscernible]. And the tail of the tree here
corresponds to the subtree pointed to by C. So this is what the segment
inductive predicate looks like when looking at a tree. This is what we obtain
at the loop head. And if we reach -- so I think it was 59, the tail of the
loop -- we had a number of disjuncts which were introduced, corresponding to
all -- sorry? We're running out of time. Should I stop right now?
>>: Yeah, you'll be here all the week.
>> Xavier Rival: I will be here all the week, so if you have more questions...
Sorry for talking for too long. I have a conclusion, but I think it's less
interesting than questions. Questions are always more interesting.
>>: You can do the conclusion.
>> Xavier Rival: Okay. I conclude. This is what it does and this is what it
does not.
>>: You can read it.
>> Xavier Rival: That's right, exactly, because, you see, I did not talk that
long in the introduction. So now, do you have questions?
>>: So you have questions?
>>: So you [indiscernible]. Do you have something like a decision to
[indiscernible]?
>> Xavier Rival: No. So we have trace partitioning. However, I think that we
have not figured out yet the right criteria for trace partitioning. What I
mean here is that in Astree, we do trace partitioning too, using information
about the [indiscernible] history -- that is, we record which branch was taken
here and there. And it seems that this is not the best we can do in the case
of shapes. So I think it's useful to do this in the case of shapes, but in
other cases, you need to remember which [indiscernible] were applied to obtain
which disjunct, and sometimes to merge stuff later on. So this we don't do
yet.
>>: [inaudible].
>> Xavier Rival: It should be added here, what we don't do. The list is very
long. Other questions?
>>: We can conclude here.
>> Xavier Rival: Okay. We can conclude here.