>> Francesco Logozzo: So good morning, everyone. Happy New Year. Thank you for
being there. So it's my honor and pleasure to introduce Mark Marron who is a postdoc at
Imdea Software and today is going to talk to you about the work he has done on
modeling the heap in a practical way for the debugger or the static analysis. Thank
you.
>> Mark Marron: All right. Thank you for that introduction and thanks, everybody, for
showing up.
As Francesco said, my interest is in developing tools and techniques for modeling the
program heap. And just to sort of start out, I mean, the motivation for this, I think all of
us are interested in techniques for optimizing programs, helping programmers understand
their programs, finding bugs, testing et cetera. And all of these things that we want to
build require some understanding of the semantics of the program that we're talking
about.
And so as these two quotes sort of illustrate, that in order to understand the semantics of a
program, we need to understand both the algorithms and control flow structures as well
as the data structures that these algorithms are manipulating.
And one of the interesting things about these two quotes is that they come from a time
when procedural programming was sort of the state of the art so that the algorithms were
the fundamental organizing principle of the program that manipulated data structures.
Now, as we've moved on and techniques like object-oriented programming have become
more popular, sort of data and this abstraction around objects has become sort of the
central organizing principle of program development, and so this has made the notion of
data and the organization of data a more important concept in understanding the
semantics of the program.
Additionally, with the popularity of garbage-collected languages, it makes it much easier
to sort of build rich, complex data structures on the heap, much more so than if you had
to manage all the memory yourself.
So as time has gone on, understanding the program heap has become a more important
and a more difficult task in understanding the overall semantics of these programs that
we're interested in.
Now, just to sort of focus this a little more on the problems that I'm interested in and what
I'm focusing on, I'm not particularly convinced that there's one sort of heap analysis that
subsumes all others. I think in order to get the best results you're probably going to have
to focus on specific classes of problems a little more.
And what I'm interested in doing is sort of supporting applications that we'd like to see in
sort of the development environment of the future, a development environment that helps
programmers query the behavior of their program, understand what's going on if they're
refactoring bug fixing, adding new features, the compilation environment that would be
in the system.
We want to produce fast code, we want to understand what's going on in the program so
we can do, you know, interesting, sophisticated transformations. We want to support
automated test development, debuggers. We want to help the programmer really test all
these data structures that they're building and understand what data structures they are
building at runtime.
And in particular this is where sort of the practicality comes from. We're interested in --
since we're building these tools for a development environment, we're going to assume
that most of the code that we're looking at and trying to figure out the semantics of are
going to be sort of userland applications; that these programs are going to be perhaps
fairly large, some parts of them might not be particularly well written. They're going to
use lots of libraries, IO, this sort of stuff.
And so we really wanted to come to the development of our heap analysis technique
thinking about how we're going to deal with these in sort of a manageable way, so that is
we have to ensure that we can analyze large programs fairly computationally efficiently.
If the programmer is doing something, writing some code in a very strange way that's
hard to understand, we're going to lose information, we need to make sure that this
doesn't destroy all the other information we have about the program. We need to be able
to cope with these things.
So sort of just to get a little more precise, some of the client applications that we're
interested in supporting and we've looked at as we've developed this analysis technique,
we have some sort of optimization examples, thread-level parallelization or redundancy
elimination; better memory management, garbage collection. Some program
understanding tools, automatically extracting pre and post conditions for methods or
computing class invariants. From the testing standpoint, maybe automated test case
generation for heap data structures, debugger visualization to help the programmer
understand the runtime heap that's present when they're figuring out what a bug is.
So in order to support all these things, we need a number of components across sort of a
range of parts of the program lifecycle. We need some sort of model that represents these
properties that we're interested in that we want to query or expose to the program or other
tools.
This model needs to be amenable to efficient static analysis. That's a key thing. And
we'd like it to be understandable by nonexperts. We'd like something that we can show to
the end user, they can quickly understand, they can give us feedback on whether or not
it's precise or not. So we'd like to try to have something where we can not just automate but
also involve the end user if possible.
From the static analysis standpoint, as we mentioned we want to utilize userland codes,
so this needs to be scalable, at least to the module level. We might be willing to do some
other stuff for inter-module analysis, but we want to at least do, you know, whole module
analysis in a context-sensitive manner efficiently enough to run for the developer on their
machines, right?
I mean, ideally we'd like this to be able to run in the background interactively with the
rest of the development environment. But they should at least be able to run this analysis,
this static analysis when they do their other regression testing.
We also need of course some dynamic support. If we want to talk about debugger
visualizations, we need debugger hooks, hooks for doing specification testing. And of
course there's runtime needs for this heap structure generation.
Now, most of the work that I've done to date has focused on the development of this
program heap model in the static analysis. We're moving into some more of this work on
dynamic support, but it's not so well developed yet. So I'm not going to really talk about
this too much in the talk. But if you're interested in this, you can just imagine all the
things that I talk about in the heap model applying to the dynamic support as well as the
static analysis. And of course I'd be happy to answer any questions anybody has on that.
So the first general topic I wanted to talk about is how we actually model the heap; that
is, the properties we're going to extract and represent how we're going to do it and how it
all sort of fits together.
Now, the first question is what properties we're actually interested in understanding. And
these sort of break down into two categories. One is sort of the obvious, if we have a
heap, it has some structural properties, right?
And what we'd like to do is rather than view the heap as some uncollected set of objects
with pointers between them, we'd kind of like to try and break it up into sort of a more
hierarchical structure that hopefully mirrors what the programmer has in mind when he's
been defining these data objects.
So what we're going to do is we're going to initially try and take the heap and partition it
into logically related regions in some sense. So maybe we'd like to say rather than we
have just a bunch of objects, we'd like to say this set of objects makes up some recursive
data structure, this set of objects makes up another recursive data structure. This set of
objects is all stored in a given array, some stuff like that. And that way we can sort of
keep unrelated sections of the heap separate, we can more precisely model the
connectivity between these things.
Now, once we've broken everything down into these regions, we can then talk about sort
of connectivity both within the region, so the classic sort of shape property: this set of
objects make up a recursive data structure and they're linked together like a [inaudible]
tree or they're linked together like a singly linked list.
And then we can also have the connectivity between regions. So starting from variable X
I can reach part of the heap that's shaped like a binary tree, I can then reach part of the
heap that's shaped like a link list, I can then reach some other vector.
Now, sort of an important special case here is that if we have a region in the heap that
represents an array and a region in the heap that represents all of the objects stored in that
array, and we'd like to know sort of the special case, does this array point to a unique
object in each index of the array or does it -- maybe there's some sharing in this array.
And similarly for recursive data structure. So we're going to sort of handle this special
case a little specifically since it seems to be quite important.
The second class of information that we want to model is what I refer to as intrinsic
information. And so this is information that's sort of about objects themselves and might
actually not be visible to the executable program but nonetheless is very important to
understanding the overall behavior of a program.
So the first idea is this notion of object identity; that is, we might like at some point in the
program, say, statement 1 to say, okay, X refers to a given object, then we'd like to ask,
you know, at statement SK where has this object ended up, has it been moved into a
different data structure, how have the data structures been arranged around it, is it even
still alive. And we'd like to have a fairly compact naming scheme across program
locations.
And once we have this kind of information, then we can also start asking about a very
useful thing. So use-mod information on heap-based locations. So given a program
statement we'd like to say can you tell me all the locations that are read and written by
this statement.
And similarly given a method call, we'd like to know all the locations that are read or
written across the invocation of this method. So this is sort of the -- often called the
method frame or footprint.
Then we also have some information of course on allocation and escape sites. This is
pretty standard. Field nullity is useful. And we also tracked some information on
collection sizes and iterator positions within the collections, which can be interesting -- I
mean, to a more limited set of applications. But still useful from time to time.
So the technique that we're going to use to represent all of these properties is based on the
classic storage shape graph approach. Now, in this approach we're going to take an
actual graph model and each of the nodes is going to represent a set of objects or the
regions that we get from our decomposition.
And each edge represents a set of pointers between the regions. And so this is very nice.
As you can see it has a very natural representation for a lot of the properties we're
interested in. It's easy to visualize and sort of understand even by nonexpert developers.
There's sort of a natural map between the graph structure and the graph structure of the
actual concrete heap.
From sort of a cognitive understanding point, it's a very navigable structure, it's not prone
to a lot of sort of radical rearrangements when the heap changes a little bit. It also is very
efficient and compact for computation in terms of static analysis.
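As a rough illustration only, a minimal storage shape graph skeleton might look like the following Java sketch; the names here (ShapeGraph, Node, Edge) are assumptions, not the actual implementation.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Each Node abstracts a region (a set of objects); each Edge abstracts the set of
    // pointers stored in one field (or one class of array indices) between two regions.
    final class Node {
        final Set<String> types = new HashSet<>();  // types the objects in this region may have
        // instrumentation properties (layout, identity tags, use-mod info) are attached here later
    }

    final class Edge {
        final String field;   // which field the abstracted pointers are stored in
        final Node from, to;  // source and target regions
        Edge(String field, Node from, Node to) { this.field = field; this.from = from; this.to = to; }
    }

    final class ShapeGraph {
        final List<Node> nodes = new ArrayList<>();
        final List<Edge> edges = new ArrayList<>();
    }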
So just sort of a couple of technical issues that would be interesting to people who do a
lot of static analysis and shape analysis is in the storage shape graph disjointness
properties are pretty natural and mostly free.
So, for instance, if we have two disjoint sets of objects that are sort of different in some
way, they'll be represented by two distinct nodes. So this is implicitly encoding a lot of
the disjointness information. So we get nice frame rules and that type of stuff for free.
The second important point is the majority of computation is local. Since the transitive
closure of a lot of relationships is implicit in the graph structure, we don't need to update
a lot of transitive relations explicitly to do each operation. We just change given edges in
the graph where they point to, and we get all the transitive closure implicitly basically for
free for most operations.
And the third thing which I found is very, very important is the graph structure does a
very nice job of isolating imprecision that arises when we're analyzing a program.
As we said, we're sort of expecting in these large programs that we're looking at for the --
you know, the programmer might not write a particular algorithm in a very
straightforward way. They might mangle data structures in just a weird way before they
clean them up. And we're not going to expect our analysis to be able to figure out the
clever thing the programmer's doing. So we're going to make very crude approximations
about what happened to some data structure. We might say we have no idea what
happened here, this data structure could be a cycle.
But we don't want the fact that we were unable to understand a specific section of code or
what happened in a specific data structure to destroy all the useful information we have
about the rest of the program heap.
I'm sure this is an issue that's come up with pretty much anybody who's written a static
analysis. You know, you have this beautiful expression language, you try and run it on a
program and you're like I should be able to discover all sorts of useful things and it comes
out and it says I couldn't figure anything out and you find out there was one point where
your transfer function was slightly imprecise and this just snowballed and wiped out all
the useful information you had.
And so this is a very important thing since we pretty much expect that we're going to
have these places where there's imprecision and we don't want this snowballing to occur.
So I thought that was an interesting technical aside, but not critical.
Okay. So the only downside with this storage shape graph is that there are some
properties such as use-mod information that it's not at all clear how you would encode
that in the connectivity structure of the graph, right? And there's some properties even in
the connectivity and structural information that you can't encode as precisely as we'd like
just based on the graph structure.
So what we're going to do is we're going to annotate the nodes and edges with some
additional instrumentation properties to precisely encode this information and encode all
the information we're interested in.
Okay. So the first thing I want to -- going to describe these properties and look at a
little -- how we build this model in a little more detail.
And so sort of the first issue that comes up is how do we group objects into nodes, how
do we do this partitioning. And this is sort of a fundamental issue that comes up when
you're using a storage shape graph approach.
Clearly, if you have too many nodes, this graph is large, it's expensive to compute with.
When you try and display this information to an end user, they see this huge graph, they
have no idea where they are or what they're looking for. So too many nodes is not going
to be good. If we have too few nodes, then what we're going to end up doing is we're
going to end up partitioning the heap too coarsely.
So a classic version is if we partition objects by their allocation site or type, then if we
have a linked list, then all linked lists end up in the same partition, and that's, you know,
clearly not the best way to do things.
So what we're going to do is we're going to have a dynamic partitioning scheme for the
heap; that is, at each point in the program, we're going to look at sort of the abstract heap,
we're going to identify recursive data structures and we're going to group those together
around objects stored in the same equivalence class of memory location. So same
collection, array, or recursive data structure.
And this -- like I said, this partitioning is dynamic, so if we have a method that takes two
linked lists which are represented by two nodes originally and appends them together,
we're going to say, okay, now these what were two distinct partitions are now one
partition and create that new partition.
Or if we partition a linked list, we'll have -- you know, the original linked list was
represented by one partition and at the end we have two partitions which are the two --
are two nodes which represent the two partitions of the linked list.
So just as a little example here to show how this works, we have this -- an expression
evaluation program, let's say. And it's a nice object-oriented program, so it has a base
expression class. And then we derive from it add class, multiply class, negate class,
constants and variables.
And so we have this nice expression tree here pointed to as the variable exp. And just for
fun we've decided to intern all the variables in some environment. So we have this
environment array that points to all the variables and every variable only occurs once, so
it might be shared by certain expressions in the tree.
So if we're going to -- we're going to apply sort of our partitioning rule in this to figure
out what the logically related regions are, well, the first obvious one is the recursive data
structure that's the tree. So this add expression has a left multiply expression which has a
negate expression and two multiplication expressions.
The other equivalence class we have is everything that's stored in these same equivalent
storage locations, so the variables that are interned in the symbol table are all stored in
equivalent storage locations.
So then when we apply our abstraction function here, we get -- this is our abstract storage
shape graph. So we have the expression variable points to a node which represents some
adds, some multiply and some negate objects. There's some internal connectivity on the
left, right, and, you know, whatever the negate field is, A. We have a region of the heap
which represents a variable array pointing to a bunch of variable objects, and then these
are pointed to again by this recursive expression structure.
So this sort of segues a little nicely into the first of these instrumentation properties I
wanted to talk about, because when you look at this, it looks like this captured sort of the
fundamental structures of the heap pretty well.
The unfortunate thing is that although we've sort of grouped this set of objects into the
right recursive data structure, we lost information, a lot of information, more than we'd
like, on actually what that structure was. All we know is that there's some sort of internal
connectivity. We have no idea if it's a tree or a cycle or anything else.
So this is where the first of our instrumentation predicates comes in useful, this layout.
And this is fairly common to a lot of sort of heap analysis techniques, so notion of shape.
And so we're going to associate with each node the layout which is going to be the most
general way that you could traverse the set of objects that this node represents.
And so we have a couple of sort of hard-coded classes that we use. So singleton, there's
no internal connectivity. Lists, there might be a linear list or some simpler structure, so it
might be a singleton, but we don't know. Tree, it can contain a tree. Or cycle, which is
sort of any structure. It's our top.
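A minimal way to picture this layout lattice, assuming a simple linear ordering from singleton up to cycle (the real abstraction may be richer):

    // Singleton <= List <= Tree <= Cycle; joining two layouts picks the more general one.
    enum Layout {
        SINGLETON,  // no internal connectivity among the region's objects
        LIST,       // at most a linear list (possibly something simpler)
        TREE,       // at most a tree (no DAG or cycle structure inside the region)
        CYCLE;      // any structure at all; this is the top element

        static Layout join(Layout a, Layout b) {
            return a.ordinal() >= b.ordinal() ? a : b;
        }
    }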
So when we go ahead and annotate this model with this information, we now have that
we have this summary region which has add, multiply and negate objects, it had some
internal connectivity, but we know that it's laid out like a tree. So that is there is no DAG
or cycle structures in there. So that captures our concrete heap much better.
I should say that I'm sort of going to -- I'm cruising over these in sort of a high level, and
I just -- so if anybody has any questions or wants any more detail, feel free to interrupt
me.
>> Why can you say that's a tree and not a cycle? And I didn't see a self-loop on that
node [inaudible].
>> Mark Marron: So that's basically what this property is for. Right? Because as I -- let
me go back two slides. So here we just have a self-loop, but there's no information in the
actual structure of the graph to tell us what that loop means. Right? Like you say, it
could be a cycle. Right? And so what we're going to do is we're going to associate this
additional property --
>> I understand, but how did you know that there wasn't a cycle?
>> Mark Marron: How did I know there wasn't a --
>> Could be a cycle and then --
>> Mark Marron: Oh.
>> -- in that structure.
>> Mark Marron: In the static -- when I'm performing the static analysis you mean.
>> Right.
>> Mark Marron: Well, so basically what we're doing is as the analysis goes along, it's
going to -- anytime it sees a new statement, it's going to create a new node to represent
the single object there. Then at some point in time it says, you know, we need the
analysis to terminate.
So if I kept doing that I would build a graph of infinite size. So at a control flow join
point, it says let's look at the graph and let's see if we're building something that looks
like a recursive data structure. And if we do, we'll compact it down.
And at that point in time we sort of have a number of sort of -- I mean, it's basically a case
statement that says if you have this sort of connectivity and this sort of structure property,
then you're building a linked list. And if you have this set of structure properties and
connectivity, then you're building a tree.
So it just sort of aggregates as the program executes. Does that answer your question?
Okay. Good.
All right. So the next property that I want to talk about that's related to structure is kind
of interesting. So this is the notion of sharing. So, again, if we look at the storage shape
graph, each edge abstracts some set of pointers.
But there's no information in the structure to describe how this set of pointers in this edge
share or alias or anything. All we know is we have a set of pointers. We don't know if
the pointers that this edge abstracts maybe point to the same object, point to different
objects, point to different objects but within the same recursive data structure.
And we'd really like to know that. For example, we have this array of objects, we'd like
to know if, you know -- do the pointers stored in this array refer to aliasing objects or not.
Now, this sharing could occur between two references abstracted by the same edge or
two pointers abstracted by different edges. So sort of these two dual versions,
interference and connectivity. Since they're basically the same, I'm just going to describe
interference here and you can imagine the same thing for pairs of edges in the
connectivity property.
So the question here is we have an edge E. It represents some set of pointers. And we
want to know if the edge is noninterfering; that is, every pair of references, R1 and R2,
abstracted by the edge refer to disjoint data structures. That is, in the region the edge
points to, there's no way to get to the same object from these two pointers.
We're going to say the edge represents interfering pointers if there may exist a pair of
references that refer into the same data structure; that is, they point into a data structure
where they could get to the same object.
So this is a little sort of strange definition, so I think a quick example is sort of the easiest
way to understand it.
So in this figure, we have two concrete program states. We have the array A referring to
an array with -- or the variable A referring to an array with three elements. In the first
case, the first and the second index both alias on the same data object. In the second
case, all the indices have pointers to distinct data objects.
Now, if we look at the standard storage shape graph model of this, all of these pointers in
either case are going to be abstracted by some edge, that is, the pointers stored in the data
array pointing to data objects.
And so there's no way with the structural information of the graph to distinguish whether
this abstract state represents this concrete state or this concrete state.
So what we're going to do is we're going to use these interfering properties, so we're
going to have in this case the pointers may interfere, which allows either there exists
some pair that alias or they -- none of them alias.
But the noninterfering property is going to explicitly exclude this, because it says for all
pairs they point to disjoint objects. And so you can think of this as being useful if you
wanted to perhaps parallelize a traversal of this array.
Interfering says it's not safe to do this, there might be a concrete state where there's some
loop-carried dependencies. Noninterfering says it's safe to do this. Or if you wanted to,
you know, somehow refactor some processing of the array or how data elements are
stored in the array, maybe you wanted to inline them, if the array contains all
noninterfering data elements, then you know each index sort of owns its corresponding
data element.
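One way to see what noninterference promises is to check it directly on a concrete heap snapshot: no two slots of the array may reach a common object. The sketch below does that check for a toy single-field object shape; it illustrates the definition rather than the analysis itself.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    final class InterferenceCheck {
        static final class Node { Node next; }   // stand-in for any heap object shape

        // Everything reachable from root by following 'next' pointers.
        static Set<Node> reachable(Node root) {
            Set<Node> seen = new HashSet<>();
            Deque<Node> work = new ArrayDeque<>();
            if (root != null) work.push(root);
            while (!work.isEmpty()) {
                Node n = work.pop();
                if (seen.add(n) && n.next != null) work.push(n.next);
            }
            return seen;
        }

        // True if every pair of array slots reaches disjoint sets of objects,
        // i.e. the edge from the array into the data region is noninterfering.
        static boolean nonInterfering(Node[] arr) {
            Set<Node> claimed = new HashSet<>();
            for (Node slot : arr) {
                for (Node reached : reachable(slot)) {
                    if (!claimed.add(reached)) return false;  // two slots share an object
                }
            }
            return true;
        }
    }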
Okay. So that's sort of the extent of the structural properties. And now, again, I want to
turn to something I think is a little more interesting, these intrinsic properties. So the first one is this data
structure identity.
And so as you saw in the storage shape graph, we have these nodes which represent
partitions of the heap. And we can sort of split these partitions, we can merge them back
together, we can rearrange them. We can do lots of things.
And really what we'd like to know is if at sort of the start of the entry -- excuse me, the
entry of a method, I have some given partition. At the exit of the method, I've moved
these all around. And I'd like some quick way to compare the initial partition with the
final partition or some partition in an intermediate point in the method.
And sort of a classic way to do this is based on access paths. Right? But I'm sure, as a lot
of you have experienced or seen, these can quickly become very complicated to reason
about, to propagate in a model and so on.
And so what we'd like to do is we have this storage shape graph where each object is sort
of explicitly represented or each class of objects is explicitly represented by a node. So
we'd really like to just tag these nodes with some notion of identity, and then we can just
read off the identities the node represents and compare those to some sort of canonical
identity partitioning.
And, as I said, we fortunately have a nice canonical identity partitioning, the one at the
start of the method. And I'm going to talk here about the use-mod, and then we'll see an
example of how the two work together.
So once we have this notion of identity, right, we can then add use-mod information quite
easily to it, you know -- we're going to add a tag to each node that records the last statement
at which each field in the node was used or modified.
And then it's really easy at a statement level to just basically look at the use-mod tags for
that statement and say, okay, well, we can read off all the identities that were used and
modified at this statement. That gives us the use-mod sets of the heap locations for the
statement.
And, similarly, across the entire method we could just say, okay, initially in the method, we
have an identity partition and nothing has been used or modified in this method. At the
exit of the method we can read off the identity tags and the use-mod field for each of
those, and so we can compute the footprint relative to the initial partition.
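A small sketch of what such per-node tags might look like; the names (RegionNode, lastUse, lastMod) are placeholders for illustration.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    final class RegionNode {
        // Which of the entry-state regions the objects summarized here came from.
        final Set<Integer> identities = new HashSet<>();
        // Per-field tags: the last statement that read or wrote each field.
        final Map<String, Integer> lastUse = new HashMap<>();
        final Map<String, Integer> lastMod = new HashMap<>();

        void markUse(String field, int stmt) { lastUse.put(field, stmt); }
        void markMod(String field, int stmt) { lastMod.put(field, stmt); }

        boolean usedInMethod()     { return !lastUse.isEmpty(); }  // drawn green in the figures
        boolean modifiedInMethod() { return !lastMod.isEmpty(); }  // drawn red in the figures
    }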
And so that probably sounds a little confusing and it's a little weird, so here's a nice
example. So we have this function mangled. Takes a pair P, reads the first dot value of
the object in that pair P.
So this is just a standard pair object with a first field and a second field. Then it swaps
the first and the second. It nullifies the first and it returns the result value that we read.
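A hedged reconstruction of the kind of method being described, in Java; the class and field names are assumptions based on the description.

    class Data { int value; }

    class Pair { Data first; Data second; }

    class PairExample {
        // Reads p.first.value, swaps the two fields, nullifies first, returns the value read.
        static int mangled(Pair p) {
            int result = p.first.value;  // read of the object originally in 'first' (ID 2)
            Data tmp = p.first;
            p.first = p.second;          // the pair itself (ID 1) is read and written
            p.second = tmp;              // ID 2 now hangs off 'second'
            p.first = null;              // the object originally in 'second' (ID 3) becomes unreachable
            return result;
        }
    }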
So we can see here that even in this really simple straight line bit of code, we rearrange a
few things, it takes a little thought to sort of keep track of what actually gets read and
written, who gets returned, what gets freed and so on.
But if we look at the storage shape graph, so here's the storage shape graph at the entry to
the method. And so we're going to say this is our canonical storage shape graph and our
canonical partitioning. We have a pair object. It's one region of the heap. We've given it
ID 1. We have some data object that's stored in the first field. We've given it ID 2. And
we have some data object that's stored in the second field. We've given it ID 3.
Then at the exit of the method, we say, okay, well, we can see how this graph, this storage
shape graph has been rearranged. All the regions have been rearranged. P still points to
region 1. So it hasn't moved at all. We can see that region 3 no longer exists. So it's
dead. It's been freed. And we can see that 2 now is pointing in the second and 2 was
originally the first. So it's quite easy to see where things ended up without having to
think through each step of this rearrangement code.
The second thing that's quite easy to read off, so for the read-write use-mod information
in the heap, we do a little color coding. If -- now, in the previous description, I said we
do this in a field-sensitive way, the use-mod.
For the figures, trying to display the individual fields that were used and modified becomes
rather difficult to see. So we just simply color a node green if any field has
been used, and if any field has been used and written it's colored red.
So we basically color this pair object red because the first and the second fields are both
read and written, and we color this object green because we see that in the first line it was
read.
Okay. Oh, yes.
>> Where did the diagram on the left come from? Is that from the analysis of just this
function or for the whole program?
>> Mark Marron: We actually -- this is actually from the whole program. The model is
designed to be fairly compact. So this is almost the complete semantics of this method.
There's also -- if you run the analysis on this sort of with a program that does all sorts of
crazy things, you'll also end up with a heap graph, a second state where the first and the
second both refer to the same data object. But those two models basically will define the
complete semantics for this method.
All right. Are there any questions on that? Or does that make a reasonable amount of
sense? Okay. All right.
So, then, the same technique we use to track field nullity and allocation site, we just add
tags to the nodes. Collection sizes, we track whether they're empty, must be
nonempty, or we don't know. And field nullity is tracked per tag. These are pretty straightforward
and the techniques that were used to model them don't differ too much from what we
used for the heap use-mod information.
So now that I've sort of humbled everybody with this barrage of properties and stuff and
claimed that it all does something useful, I want to just take a minute to look at this case
study of Barnes-Hut that hopefully puts it all in a slightly larger context so you can see
why we do this and maybe it will be a little more intuitive.
So this program, Barnes-Hut, is taken from the JOlden benchmark suite, and it's basically
sort of a high-performance computational kernel. It does an n-body simulation of the
gravitational interaction of a bunch of planets or something in three dimensions.
Now, the easiest way to do this is basically you have a bunch of time steps. At each time
step you compute the gravitational interaction between each pair of bodies. You compute
how this affects their acceleration. You update everybody's position based on this and
you go to the next iteration and you keep doing this.
The problem with that is if you compute the pairwise interaction of each of the bodies,
you get an n-squared algorithm which can get kind of expensive when you have a large,
large number of bodies.
So what this algorithm does is it takes advantage of the fact that gravity drops off rapidly
as distance increases and that you could sort of -- it's additive. So for faraway sets of
bodies, it computes a point mass and just computes the interaction with that point mass
for a body. So you get a much more efficient algorithm. I think it's N log N. Is that --
anybody disagree with me? I think. Okay. I'm not going to claim that. But it's much
better than n-squared.
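For contrast, here is a hedged sketch of the naive O(n^2) time step just described, with plain arrays standing in for the benchmark's body objects.

    final class NaiveNBody {
        // pos[i], vel[i], acc[i] are 3-element vectors; mass[i] is a scalar.
        static void step(double[][] pos, double[][] vel, double[][] acc, double[] mass, double dt) {
            int n = pos.length;
            for (int i = 0; i < n; i++) {
                double[] a = new double[3];
                for (int j = 0; j < n; j++) {       // pairwise interaction: n-squared work
                    if (i == j) continue;
                    double dx = pos[j][0] - pos[i][0];
                    double dy = pos[j][1] - pos[i][1];
                    double dz = pos[j][2] - pos[i][2];
                    double d2 = dx * dx + dy * dy + dz * dz + 1e-9;
                    double f = mass[j] / (d2 * Math.sqrt(d2));   // gravitational constant omitted
                    a[0] += f * dx; a[1] += f * dy; a[2] += f * dz;
                }
                acc[i] = a;
            }
            for (int i = 0; i < n; i++) {           // apply accelerations, advance one time step
                for (int k = 0; k < 3; k++) {
                    vel[i][k] += acc[i][k] * dt;
                    pos[i][k] += vel[i][k] * dt;
                }
            }
        }
    }

Barnes-Hut replaces the inner pass over all bodies with a walk of the space-decomposition tree, substituting a point mass for far-away subtrees.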
So the main part of this computation is the loop that actually computes the interaction
between pairs of bodies. And so I'm going to show the loop invariant heap state that we
compute for that loop.
Now, this is just showing the graph structure that we compute. I'll show some of the
instrumentation predicates in a minute.
So the heap has a B tree object, and this object, it just sort of encapsulates all the
computation structure present in the program. And it has a root that points to the space
decomposition tree.
This tree is made up of cell objects which contain vectors which either have pointers to
other cell objects, parts of the subtree, or contain pointers to the actual body objects that
represent the planets that we're doing the n-body simulation on.
We also have this field called body tab reverse, which is a vector of body objects that
we're going to use to iterate over the set of body objects.
So each body has a position, a velocity, an acceleration and a new acceleration field that
it uses to store the new acceleration computed on each update loop.
These positions are represented as mathematical vectors in three space, which are these
math vector objects, and each one of them has a double array representing the vector, basically
the values of the vector.
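A sketch of the heap shapes this figure describes: each body owns math vectors for position, velocity, acceleration, and new acceleration, and each math vector owns a small double array. The names follow the description and may differ from the real benchmark.

    final class MathVector {
        static final int NDIM = 3;                // dimensionality parameter
        final double[] data = new double[NDIM];   // each math vector owns its own small array

        void clear() { for (int i = 0; i < NDIM; i++) data[i] = 0.0; }
        void addScaled(MathVector v, double s) {
            for (int i = 0; i < NDIM; i++) data[i] += s * v.data[i];
        }
    }

    final class Body {
        final MathVector pos = new MathVector();
        final MathVector vel = new MathVector();
        final MathVector acc = new MathVector();
        final MathVector newAcc = new MathVector();  // written on each update loop
    }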
Okay. So -- and I think this is interesting as well as it shows that this model can sort of
nicely represent this hierarchical construction, right? Bodies are made up of sets of math
vector objects which represent their positions and their velocities, you know, the overall
computation structure is a -- you know, space decomposition tree of bodies that are also
stored in the vector.
So this is a nice -- I think this emphasizes the compositionality of this model pretty well.
>> Can I --
>> Mark Marron: Yeah.
>> The doubles, this is Java, right?
>> Mark Marron: Yeah, this is Java.
>> The doubles are actually boxed [inaudible]?
>> Mark Marron: I think double arrays, you actually get the doubles in the array.
>> Oh, you do?
>> Mark Marron: Yeah, I think they're primitives. But if you were to do -- if you were to
allocate like a vector of doubles, then they would be boxed, I believe.
So the first thing that's kind of interesting about this is if you look at this figure, you see
that there are an awful lot of math vectors relative to the rest of the objects in the program. And
our analysis is able to extract some interesting properties.
First off, whoever implemented this went and wrote the math vector in what I think is
sort of the canonical, nice object-oriented way. They wrote a sort of robust encapsulated
notion of what a mathematical vector is.
So it has a parameter that you can set for what dimensionality you want the vector to
be, a static final int. It has a bunch of loops that iterate over the length of the -- whatever vector
you're doing. And it's sort of nice from a software engineering standpoint.
But, as you can see, these math vectors take up a lot of space on the heap. And when
you're doing these operations on them, you have a lot of these loops with fixed loop
bounds, and in particular the analysis knows that each of these arrays gets allocated of
size three, so we can actually unroll those loops entirely and turn it into straight line code.
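A before-and-after sketch of the unrolling the analysis enables once it knows every such array has length three; the method names are invented for illustration.

    final class VectorOps {
        // Original, general form: a loop over the fixed NDIM = 3 bound.
        static void addLoop(double[] a, double[] b) {
            for (int i = 0; i < 3; i++) a[i] += b[i];
        }

        // After the transformation: straight-line code, no loop control flow.
        static void addUnrolled(double[] a, double[] b) {
            a[0] += b[0];
            a[1] += b[1];
            a[2] += b[2];
        }
    }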
Furthermore, as I mentioned previously, we have these interference/noninterference
properties. The analysis has determined that this set of the pointers represented by this
edge are all noninterfering.
So that means that there are a bunch of body -- or math vector objects in this region. And
each of them refers to a unique and distinct double array in that region. So they own, in
the ownership-type sense, their respective double arrays. So that basically means that we
could sort of inline this entire double array into the math vector object as well as
unrolling the loops.
And so by doing this we get about a 23 percent serial speedup and drastically reduce the
memory use of the program by about 37 percent.
So another sort of application that we played around with this information is looking at
parallelizing sort of the main computation loop in this program. This loop takes about I
think like 90 percent of the time of the program's run.
So what it's going to do is it's going to take that body table reverse, which is a vector of
all the bodies that we're doing the computation on, it's going to iterate over each of them,
grab the body out of there and call this hack gravity method.
And this hack gravity method is what computes the gravitational interaction with all the
other bodies that we're working on. And so it takes the space decomposition tree, it takes
the body we're interested in and walks around and computes the new acceleration for the
body given by B.
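Roughly, the loop being described looks like the following sketch; hackGravity and the types here are placeholders following the talk's naming.

    import java.util.List;

    final class GravityLoop {
        interface BodyRef { }
        interface SpaceTree { void hackGravity(BodyRef b); }  // walks the tree, writes only b's newAcc

        static void computeForces(List<BodyRef> bodyTabRev, SpaceTree root) {
            for (BodyRef b : bodyTabRev) {
                root.hackGravity(b);  // reads the whole decomposition tree, writes only this body's newAcc
            }
            // Each body appears exactly once in bodyTabRev (the noninterfering edge) and only
            // its own newAcc is written, so the iterations are independent and could be run
            // in parallel once the programmer confirms the remaining dependence is harmless.
        }
    }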
And so if we look at the use-mod information of this, we see that although it's really
accessing the whole heap and all different parts of it, it reads pretty much from
everything, but it only writes to this one double array that's the new acceleration field.
And in particular it writes to the new acceleration field of the body that B refers to.
And since we know that this vector here has all the objects -- excuse me -- this edge here
contains all noninterfering pointers, we know that this -- each body object appears once
in that vector so that each body object's new acceleration field is written only once in that
loop when it's referred to by B.
So we know that since all these locations are read and never written, there are no
read-write dependencies. And since we know each of these new acceleration fields is
written on exactly one loop iteration, we know there are no write-write dependencies. So
we can just -- yeah. Go ahead.
>> So the write-write depends like -- I can see the read-write. I'm not sure. I mean, it
could be that in a different iteration you read that thing that you're writing from the --
from some other vector from the edge going to the left [inaudible] maybe that's not the
case.
>> Mark Marron: Well, because -- well, okay. So we know that this is the only place
where there could be a read-write dependence. All this is just read-only. And so we
know this is -- there are no read-write dependencies here because this thing is actually
only accessed through B. So if we looked at --
>> It's not clear from this graph, right?
>> Mark Marron: No, no. You need a little bit more information. You need to actually
look in the hack gravity method. There's one line that says basically B dot new
acceleration dot data equals blah, right?
>> So you know there's no read [inaudible].
>> Mark Marron: Well, you know that there's a read of this, but only through B. So you
do have to do a little more work than just look at the graph. You know, this -- I mean,
this is under the assumption that you have a pretty clever compiler. Right? I mean, you
have to put some effort into this.
On this one, on the first example, I think it's something that you could do fully in a
compiler. With this sort of thread-level parallelization, I mean, this benchmark is
specifically picked from high-performance kernels designed as a challenge problem for
thread-level parallelization. And so this is sort of -- I mean, this is sort of a friendly
example.
So with this parallelization, I think this falls more into the case where you wouldn't want
to have your compiler say yes, just go through and thread-level parallelize everything you
can and get back to me when you're done. It's the kind of thing that you would want to
have some user feedback.
So you might prefer to say -- tell the user, look, I looked at this loop and I know from
profiling you spend an awful lot of time here, and I looked at this loop, and you know
what, there's only one data dependence right here. And I can't be sure that this data
dependence does or doesn't exist. But if you want to speed your program up, and you can
actually assure me that this data dependence doesn't exist or maybe just put a little lock in
here, then I can parallelize this.
So this is sort of the thing I think that makes more sense to have some level of user
interaction on. And so, I mean, that's sort of the concern of having a model that -- you
know, okay, so this developer's probably dedicated, he cares about his program, he wants
to make it run fast, but he doesn't have a lot of training in formal logic and he wants a
tool that he can sit down, spend an afternoon with, get comfortable with and after a
couple weeks be pretty proficient with. So we want something that we can interact with
the user on. Yeah.
Okay. Any other questions on that? Okay.
Then I'll brag about my speedup of a factor of three on the quad-core machine. We're
fast. All right. So that's sort of -- it's a general description of the model, and now I'm
going to talk briefly about some of the work we did on actually doing static analysis on
interesting programs with this model.
Now, some of the real challenges that come up when trying to do -- analyzing the
program heap statically are, one, the computational costs. In order to do shape analysis
with any degree of precision, we need to do it in a flow-sensitive manner. That means
we're going to have to use some disjunctive representations of the program.
So we're going to have to use sets of these graphs. We're not just going to have one
graph at each program point. And we're going to do the analysis in a context-sensitive
manner.
So that means we're going to sort of start at main, we're going to start analyzing. When
we see a call, we're going to analyze that call with the abstract state we have at that
program point. If we see another call to that method with a different abstract state, we're
going to reanalyze the body and whatnot.
And you can see that the combination of these things, if you're not careful, can quickly
lead to sort of an explosion in the number of models you're pushing around in the local
analysis, and, you know, analyzing a given method body many, many, many times, and
the analysis just will run unfathomably slow. It would be useless.
The other issue that we encountered that was kind of interesting was if you look at the
literature, a lot of the focus is placed on analyzing recursive data structures, doing strong
updates and this sort of stuff.
When we want to talk about analyzing userland code, a big problem is that they're going
to use lots of standard libraries. They're going to use collections, files, IO, you know,
graphical user interfaces. And so we need to deal with these.
But in particular collections are pretty important since the user is sort of organizing data
and changing a lot of the heap structure and properties through collections. They're
doing -- they're unioning two sets, they're computing the intersection of two sets, they're
copying things from sets to lists. So we need to make sure that we can handle collections
in a fairly reasonable way.
So on the first problem, fortunately there's been a fair bit of nice work on this notion of
partially disjunctive domains and flow sensitivity. And particularly I like this paper, Partially
Disjunctive Domains, from SAS '04. And this really helps out with the costs a lot.
And basically the idea in this paper is rather than keeping a set of let's say graphs where
if two graphs are different in any way then we keep them separate, we have some
similarity function. And if they're maybe not equal but they're similar, then we join them
together, we union them. And so we sort of cut down on the set size we have to push
around.
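The idea reduces to something like the following sketch, where the similarity test and the join are parameters; this is a paraphrase of the approach, not the paper's exact formulation.

    import java.util.List;
    import java.util.function.BiPredicate;
    import java.util.function.BinaryOperator;

    final class PartialDisjunction {
        // G is the abstract-state type, e.g. a storage shape graph.
        static <G> void addState(List<G> states, G g,
                                 BiPredicate<G, G> similar, BinaryOperator<G> join) {
            for (int i = 0; i < states.size(); i++) {
                if (similar.test(states.get(i), g)) {   // close enough: merge rather than keep both
                    states.set(i, join.apply(states.get(i), g));
                    return;
                }
            }
            states.add(g);                              // genuinely new: keep it as a separate disjunct
        }
    }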
We also spend a good chunk of time, well, over the entirety of this project, but
particularly in the last six months on dealing with some of the disjunctive issues.
So particularly calls and returns, when you pass in a model set and you get back a model
set, you can have multiple [inaudible] growth in the number of models you're pushing
around. And context sensitivity as in how do we keep from having -- reanalyzing a
method body many, many times and building a huge memo table of method analysis
input and output pairs.
And I'm not going to go into detail on this. I mean, I'm really happy to talk about it for
people who are interested, but it's just a little technical, so I don't want to spend a whole
lot of time on it.
But some of the key ideas are that at each method call we ensure that we only pass in one
model and out one model. So that is we don't get in sort of exponential growth by
analyzing a method body. All the method body control flow is pretty much captured
within the callee and isn't exposed back to the caller.
Particularly when we're analyzing object-oriented code, handling virtual method calls is
really important. We're going to have large object hierarchies with lots of overrides. One
of the benchmarks I talk about later, we have a class hierarchy with 17 classes that all
override a base class in several methods that are part of a recursive call.
So if we were to handle virtual calls naively, even if we only return one heap graph from each of
those calls, if we have to analyze 17 possible targets, we get a factor of 17 growth just
by analyzing each call. So we have to handle that efficiently as well.
In general, we do a very good job with the number of contexts using some heuristics on
determining when a context is new and unique and when it's just sort of redundant.
So we end up with only 2.2 contexts per model in the end. We analyze it a few more
times than that, but it's still a very small number. And, critically, even though we're using
some heuristics to manage this and potentially losing information in the benchmarks
we've looked at, the results are still near optimal with what we can get. And I'll explain
exactly what this means a little bit later.
So with libraries, well, one nice thing is that most libraries aren't interesting from the
perspective of the user space code. This is the nice thing about saying I'm going to
analyze user space code. If I create a file stream, if I call a new file stream, I really don't
care about the internal representation of the file stream or how complex a heap structure
it has. I just know I get a file stream object back. It has some API. I make some calls on
it. Maybe I pass it strings, maybe I get strings back from it. But whatever happens in the
internal structure, I don't care. And this is -- seems to hold for a lot of library code.
So we can just, say, have a special semantic operation that says give me some
uninterpreted region of the heap. And you don't know anything about what happens in
there and you don't care. And so most of the libraries that we analyze actually say either
return an uninterpreted section of the heap or I did something to the uninterpreted section
of the heap. It's very simple.
Collections are a little harder. Obviously we need to analyze them more precisely
because they're not entirely opaque. The programmer can see objects he puts in, he wants
to see objects he gets out. He might have references to the objects he still puts in. So we
can't just say there's an -- you know, you've put something in an
opaque-and-otherwise-invisible-to-you section of the heap.
Also, the semantics are a little more I guess visible from a heap perspective, right? If you
add something to a collection, you expect it to be there. If you, you know -- union two
collections, you expect the results to sort of have some sharing relations to each other. So
we need to be able to model these things.
Now, the obvious way to do this or the simplest way to do this I should say would just be
to say well, we're going to analyze the bodies of these collection libraries. But I'm sure,
as a lot of you know, these libraries are specialized, highly optimized, someone designed
these to be performant over a wide range of usage scenarios. It's a fair amount of code.
And so it's very expensive and difficult to model correctly and efficiently.
Also, these are standard libraries, so they really shouldn't change that often, hopefully. And
so doing this analysis every time we're analyzing some userland 1500-line-of-code
program just doesn't make a lot of sense.
So, again, we introduce some simple fairly high-level semantics for a common collection
operation. So if you're going to add, remove, union, find on lists and sets, look up values in
dictionaries, we have specialized semantics that also can take into account the specialized
semantics of the collection library. So they are precise, efficient, and, you know, require
some effort for us to implement because we have to implement these special transfer
functions in the analysis, but not too bad as far as we've seen.
So -- sorry. Yeah.
>> How does that interact when a user [inaudible] overrides? So, for example, if
[inaudible] it derived from dictionary or from this list or that list, then we no longer know
whether they have that special semantics or whether we should attribute the special semantics to it.
>> Mark Marron: Right. That's been -- that's been a really annoying issue. While I
understand why programmers love it, it drives me up the wall.
So what we do right now is we say, okay, so the user can pretty much override equals and
compareTo, which seem to be the things that are important. We're -- we don't have as complete
an implementation as we'd like right now.
So what we -- the claim we make is that the user overrides of equals have to be pure.
They can't write impure overrides of equals. And if they do that, then we can turn the
equals method into basically an access path of locations that are read because it's pure.
So only read-only.
Now, when we do -- let's say we want to do a find in a collection that has a user override
of the equals method, what we do is we say, okay, well, we know we're going to cheat a
little bit on the semantics. Find, it might return -- I forget if it -- does it return a bool or
does it return the value? All right. Let's say it returns a bool. We say we don't know if
you found it or not. So it's an overapproximation of what happens.
But then we take this read set that we computed from the equals method and we apply it
to every object that's stored in that collection. So the use-mod information is a safe
approximation. And the resulting semantics are a safe if rather conservative
approximation.
We could look at being more clever and figuring out how to do the equals operation and
maybe have a special stub for the case where the user overrode the equals operation and
used the built-in semantics for the special transfer function. If they didn't, it's not clear to
us yet if that's actually worth the additional precision and where it would be useful. But,
yeah, that's a definite issue.
>> So that's one problem. I was actually thinking of a different one, if I derive a
dictionary myself from picturing class or whatever picture classes you have in Java, right,
then if I didn't do that, when I derive dictionary, do you associate it with the semantics of
the original dictionary that you've hard coded, or do you now analyze the collection?
>> Mark Marron: Okay, well -- okay. So for right now just for simplicity sake we
assume that no one's overriding these.
But if you were to do that, the way I see that working out is you've either provided an
implementation of the add method in your override, in which case we'll analyze that
directly. I mean, we do the correct method resolution. So we see the add, we'll say,
okay, we're adding to user-derived dictionary, go analyze that method. Right?
Whereas, if you didn't override it, then you would say, okay, the implementation is in the
base class, go to the base class and use the built-in semantics there.
>> But if it's a virtual call, you might pick either one.
>> Mark Marron: It might be either one. And that's one of the issues that we have to do
is how do we resolve those. And then we'll get different results for each of those, and
how do we sort of get those back in a way that's meaningful into the call site.
I'd be happy to talk about that more, but, yeah, it -- yeah. There are a lot of sort of
technical annoyances with that when you sort of start dealing with something real world.
It's much nicer to say I have perfect collections and the user never overrides equals and
it's good. But yeah.
So okay. So this is a little bit on the implementation here. Our initial work, we were
really concerned about performance, and so we spent a fair bit of effort implementing the
analysis in C++. We wrote our own front end for Java so that we could do as much as
possible in the compiler and offload work there.
And we built a pretty nice prototype of that. And the timing results I report
on the next slide are from that implementation.
We're currently working on an implementation for the CLI bytecode, in particular
bytecodes generated by C#. We've got a pretty good working implementation of that. I'll
show a demo in a little bit. It handles most of what gets generated by the C# compiler.
Currently some function pointers, pass-by-reference and exceptions aren't handled. We
know how to do these in theory, but they're just sort of -- we want to get the core working
first before we add more complexity.
And I wanted to cry when I saw function pointers appear in my nice language that didn't
have function pointers. So, you know, that required some rethinking of how we do stuff.
We support most of the sort of key portions of the system collections and IO library.
So if you want to interact with the console, you want to read files, you want to use list
tree or list sets and dictionaries, that's all there.
Again, we in theory know how to support graphical libraries and threads at least in a
relatively simple manner, but we haven't implemented that as it -- as I said, it adds more
complexity.
So really our goal here is we want to hit an analysis that's scalable to a module-level size
code bases and then think about using other techniques for inter-module analysis and how
to compose the results.
So for our results here we have benchmarks from a couple different places. We have
these first four, TSP, EM3D, Voronoi, and there's Barnes-Hut, that was our example, are
from the JOlden suite, or our version of it, rewritten from Java into C#.
We have DB and Raytrace, which came from [inaudible] 98. And then we have these two
programs, Expression and Interpreter, which I sort of wrote as challenge problems to sort
of stress test the shape analysis as well as the inter-procedural analysis.
I think Interpreter's particularly interesting. It's basically an interpreter for sort of a
simplified object-oriented Java-like language. So it has an XML parser that reads in an
[inaudible] to representation, builds an abstract syntax tree, builds intern symbol tables,
static variable tables, et cetera. It then runs an interpreter over this program, so it
manages the runtime call stack. It has some internal model of the heap.
So we have a lot of classes. We have a lot of different types of heap structures,
everything from a sort of well-behaved sort of abstract syntax tree that has a very nice
structure to, you know, the Interpreter's internal model of the heap which obviously we
can never interpret precisely, so it's this very ambiguous object that the analysis has to
deal with efficiently and make sure that the ambiguity here doesn't creep in and destroy
other useful information that we extracted.
So this is I think a very challenging program from both a shape analysis and abstract
interpretation standpoint.
But, anyway, these programs range from pretty small, you know, 910 lines of code. And
this is normalized lines of code. So one expression per line. We don't have a bunch of
nested expressions on any given line, up to about 15,000 lines of code for the Interpreter
program. So I think the Interpreter is right about 10,000 source lines of code before we
normalize it to one expression per line.
The analysis time is really fast for these smaller benchmarks, less than a second. You can
see that even for Interpreter we're sitting at just under two minutes. So this is something
that's quite, quite fast for this sort of shape analysis and definitely doable on a developer
machine.
Even more importantly I think is the analysis memory, which, you know, if it takes twice
as long, well, then you get twice as long for your coffee break, which is nice. I don't
think anybody complains about that.
But really if you start burning lots of memory -- if you get over 4 gigs -- that's
not something that you can really talk about running on a developer machine anymore.
It's something that you have to have dedicated hardware for. And we really don't want
that.
But our Interpreter benchmark sits at 122 megabytes. And it seems to scale, you know,
not too badly with the size of the program. These are all very small, less than 30
megabytes. Most of it is just infrastructure setup code. So it's quite efficient
memory-wise.
Now, we have these percentage numbers for region shape and sharing. And I'm going to
take a minute to explain what we mean by this.
So in most of the work on shape analysis, when you see people attempt to evaluate
whether or not their heap analysis is good or is better than something else, they'll pick
some application domain, they'll run their analysis, then they'll run their target application
and they'll say, okay, we were able to find three more null pointer bugs or we were able
to improve the performance of our optimization by 10 percent or whatever.
And, I mean, this is of course useful, and it's nice to have some evidence that your
analysis actually does something concrete in practice, but it adds a dependency on the
particular application you picked, right? You know, perhaps you could have found lots
more null pointer bugs, but there were only three in the program, or you found no null
pointer bugs but it didn't matter because there were no null pointer bugs left for you to
find.
And so we really wanted to come up with something that's a little less dependent on the
particular client application and the benchmarks you're using.
And so what we did, as I mentioned before, we've been doing a fair bit of work on a
runtime support. And one of the things we've done is developed a debugger visualization
and specification mining tool.
So what we can do is we'll take a program and at any point in time we can take a
snapshot of the concrete heap that's running. So we have this object graph that's an image
of the runtime heap. We can then apply our abstraction function and we'll get one of
these abstract storage shape graphs. And we can of course take multiple snapshots and
accumulate them as the program executes.
And these abstract storage shape graphs are now going to be an underapproximation of
the true semantics of a given method or methods in the program. We can then take the
storage shape graph computed by our static analysis, which we know is an upper
approximation of the true abstract semantics of the -- or the true semantics of the
program, and we can compare the two.
We can basically look for a graph isomorphism, right, an equality between them. And
if they're equal, then we know that at least with respect to the properties that we can
encode in our model that we have gotten the most precise semantics for that method
possible.
And so that's what this region information is: is there an isomorphism between the lower
approximation and our static upper approximation. And basically all the time we get results
that are, at least structure-wise, as precise as possible.
The shape number is: under this structural isomorphism, we look at the nodes that map to
each other and whether we assigned the correct shape property to each node. And you can see
that in general we do a pretty good job. Not quite as good as the region numbers, but we're
pretty good.
And the sharing number is, equivalently, for pairs of edges that are mapped under this
isomorphism between the upper approximation and lower approximation graphs, whether we got
the interference/noninterference property correct.
And in general you can see we do pretty well. The one outlier is DB, where we can't
model the shell sort that's implemented on an array correctly. So the upper approximation
says the pointers might interfere when in fact they never do. And this array shows up quite
frequently, so it drives the precision down. But
in general we do a pretty good job.
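As a sketch of how this kind of scoring could work (hypothetical, simplified types, not the actual tool's API): the dynamic snapshots give a lower bound, the static analysis gives an upper bound, and once a structural correspondence between the two graphs is in hand, the shape and sharing percentages are just the fraction of corresponding nodes and edges whose labels agree.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-ins for storage shape graph elements.
enum Shape { Singleton, List, Tree, Dag, Cycle }

class SsgNode { public Shape Shape; }          // a region of the heap
class SsgEdge { public bool MayInterfere; }    // sharing/interference label

static class PrecisionCheck
{
    // Given node and edge pairs matched up by the structural isomorphism
    // (dynamic lower bound vs. static upper bound), report the percentage of
    // pairs whose shape and sharing labels agree.
    public static (double shapePct, double sharingPct) Score(
        IReadOnlyList<(SsgNode dyn, SsgNode stat)> nodePairs,
        IReadOnlyList<(SsgEdge dyn, SsgEdge stat)> edgePairs)
    {
        double shapePct = nodePairs.Count == 0 ? 100.0
            : 100.0 * nodePairs.Count(p => p.dyn.Shape == p.stat.Shape) / nodePairs.Count;

        double sharingPct = edgePairs.Count == 0 ? 100.0
            : 100.0 * edgePairs.Count(p => p.dyn.MayInterfere == p.stat.MayInterfere) / edgePairs.Count;

        return (shapePct, sharingPct);
    }
}
```

The region number in the table is, analogously, whether the structural correspondence exists at all.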
All right. So now is time for an exciting demo. And so we'll see sort of a little bit about
this actually working -- yeah.
>> So one comparison that I've seen which is really useful for like compiler guys, and
DSA people do it, they compare alias results from their shape analysis with some
classical, you know, alias analysis like [inaudible].
>> Mark Marron: Yeah. I mean, that's sort of -- you know, that sort of gets back to the
issue of you're comparing an upper bound with another upper bound. So you say I've
reduced it below somebody else's upper bound, right?
And the question is, well, if their upper bound was already as good as you can get and
you can't reduce it any more, does that mean your analysis is bad? Not necessarily. Or if
you reduce their upper bound significantly, well, you still don't know how close you are to
optimal. Maybe your analysis isn't so good either; you're just better than something that
was really bad before.
So we like our approach because it compares against the best you could obtain. Right?
Since we're taking a specific run of the program, we know that you can actually get a
result like this.
I mean, I can -- you know, that makes sense and we've sort of done that thing in the past,
but we happen to like this a little more. Just personal preference.
So here is my favorite program, Barnes-Hut, again. And this is in Visual Studio. And so
what we're going to do is I'm just going to fire up, run the analysis on this here, and I
have it pop up the debug window so you can see something happen while the analysis is
running. And what it's going to do is I start it running here, it's just going to print out as it
goes down each method it encounters. And so you'll see it sort of move to the right as it
goes down the call graph and then back up as it gets toward the end and as it encounters
various calls.
And so basically analysis is going to start at main, it's going to run until it sees a method,
it's going to go in and analyze that method body and come back out and -- oh, it's done
already. So I guess I needed to talk faster.
Okay. But basically it started at main, it just analyzed each method as it saw it. It's not
doing anything clever -- I mean, our concern here was really developing an analysis that
fundamentally was scalable.
So we don't do anything clever like preprocessing and saying, oh, this method is pure, so
we don't have to analyze it directly, we can compute a summary for it beforehand -- or
doing abductive analysis on methods that don't take any heap parameters or anything like
that.
So I think there's definitely a lot of room for performance improvement by doing some
good engineering. We were just focused on having something that was fundamentally
scalable to begin with. But, anyway, it's completed analyzing the program.
And so something that I thought was kind of interesting here from sort of an extracting
pre and post conditions and program understanding point of view, we have this method
loadTree. And this occurs when we're actually building the space decomposition tree for
the given body.
And so what we can do is we can say, okay, would you please include the contracts that
you computed as likely contracts or pre and post conditions for this method.
And so you can see that we've encoded this more or less in Code Contracts language, and
we can see that it's discovered a number of nice things. So this is not equal to null, which
is always good and always true. It also requires P is not equal to null, and the tree that's
passed in is not null, so that's a little more useful.
It has some other interesting stuff. For instance, it says that this is the unique owner of
the object it points to, so nobody else -- in this list of arguments, in static variables,
or in other call frames -- refers to it.
So it's basically inferred some sort of ownership-type information. It's figured out that
there's some path where you can go from this to get to the object that's pointed to by P,
so there's some reachability relation. And it's also determined that the entire section of
the heap that's reachable from this is not modified over the execution of this method, as
a post condition.
So you can see that it's able to extract some interesting information, and similar stuff for
some of these other properties.
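For concreteness, here is a hand-written sketch of roughly what those mined conditions could look like if emitted in Code Contracts syntax. The method signature and parameter names are assumptions for illustration; the ownership, reachability, and not-modified facts have no executable contract form, so they appear only as comments.

```csharp
using System.Diagnostics.Contracts;

public class Body { }
public class MathVector { }

public class Tree
{
    public void LoadTree(Body p, MathVector xpic, int level, Tree root)
    {
        // this != null: always true for an instance method, but reported anyway.
        Contract.Requires(p != null);       // the body passed in is not null
        Contract.Requires(root != null);    // the tree passed in is not null
        // xpic: unique owner of the object it points to -- no other argument,
        //       static variable, or caller frame refers to it (comment only).
        // reachability: some path from 'this' reaches the object pointed to by p.
        // not-mod (post condition): the heap reachable from 'this' is not
        //       modified by this call -- not checkable at runtime, comment only.

        // ... body elided ...
    }
}
```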
Now, some of these, such as not-mod, don't actually have any sort of semantics that can
be checked at runtime, but they would nonetheless probably be useful information for some
other tool that wants to perform a secondary analysis on this. Which is, I mean, sort of
what we're looking for.
We want to build a heap analysis that produces useful information that can be
post-processed by other tools to do what they want to do. We didn't have any particular
application in mind; we just wanted a robust base toolkit that others could use and build
on.
Now, the other thing that's sort of interesting here, and it takes a little looking, we have
another output format that makes it a little more obvious. But I'll just point it out here.
So we see this is nonnull. Okay, that's good, P is nonnull. Okay. The tree is nonnull.
That's nice to know.
We have this parameter back there, XPIC, right, and we have the information about it right
here. Well, we know that it's the unique owner of whatever it points to. We know it's not
modified, but you notice it hasn't inferred that it's not null. This parameter could actually
be null.
So okay. So we can investigate that a little more. And we see, all right, right here,
right -- oh, now it's bigger. Okay. Right here we're actually passing it directly to this
oldSubindex method without testing whether or not it's null. And if we go to that definition,
we see, sure enough, it's the IC parameter here. Right here IC is promptly dereferenced.
So either our analysis has made a mistake or there's some bug potentially lurking in this
program.
And it's pretty easy to go back. We can nicely in Visual Studio say show me all the
places this is used, and I've cheated before, so I happen to know where this comes from.
So this is the place that's of interest. There are a lot of places where this load tree is
called, but this is the first place that load tree gets started with. The other calls are
recursive calls.
And here's that parameter, that XQIC. And it's initialized here by this call to intcoord.
And if we go to that definition, sure enough, okay, we can return null.
And if we go ahead and extract the pre and post conditions for this, tools, contracts -- let's
see. Oh. Sorry. I guess I have not done that just yet. Let me do it with the other one.
Tools. Okay. So I guess we haven't finished all my contract work, but it says the return
might be null, right? Question mark null. And we see that that's definitely the case.
Here we return unknown value.
So, you know, it looks like there might be a bug here. It actually isn't one in practice,
because this null is only returned if this condition fails, and the implicit precondition
everywhere this is called is that the condition never fails.
So this is sort of this weird software impedance mismatch: when somebody wrote this they
were thinking of one thing that implicitly depended on something else, and that carries
through into the future. And this was ported over from C, so people didn't use exceptions;
they just returned flags, when really maybe this should have been an exception. Or, if you
were really concerned about checking this here, it should have been a precondition.
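A minimal sketch of the pattern just described, with hypothetical, simplified names loosely following the demo: a helper returns null as a C-style error flag, and a caller passes the result along and dereferences it, relying on an undocumented assumption that the error case never occurs.

```csharp
public class MathVector
{
    public double X, Y, Z;
}

public class SpaceTree
{
    const double Bound = 1.0;

    // C-style error signaling: returns null instead of throwing when the
    // position falls outside the box.
    public MathVector IntCoord(MathVector pos)
    {
        if (pos.X < 0 || pos.X >= Bound) return null;
        return new MathVector { X = pos.X, Y = pos.Y, Z = pos.Z };
    }

    public void LoadTree(MathVector pos)
    {
        MathVector ic = IntCoord(pos);
        // Implicit, undocumented precondition at every call site: pos is in
        // bounds, so ic is "never" null. Nothing states or checks that.
        OldSubindex(ic);
    }

    void OldSubindex(MathVector ic)
    {
        double x = ic.X;    // promptly dereferenced: a latent NullReferenceException
    }
}
```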
And so I think this is an interesting example of how sort of mining some of these
specifications can really be useful in understanding what's happening in the program,
some of these implicit assumptions that got built into the design of the program but were
never documented and are completely unintuitive.
And, like I said, this bug never occurs because these conditions happen to hold
everywhere, but it's one of those things that, in my mind at least, is just waiting to
happen: when somebody makes a few changes and isn't quite aware of this, you're going to have
a null pointer bug and you're going to start introducing some problems. And so it would
be nice to sort of extract this information, help the programmer understand it and do some
more checking for it.
All right. So that -- I think that's my demo there, unless anybody else had any questions
on that.
Okay. So I'm going to just wrap up pretty quickly, then. Okay. So, in summary, well,
the model of the program heap, that was one of the key things we were working on.
Hopefully I convinced you that it can capture a lot of the information that you would like
to have if you're doing some sort of optimization, error detection, test case generation,
user directed refactoring or whatever IDE operation.
I think the experimental results show it's fairly amenable to static analysis. And
hopefully in some of the examples I gave you got a feel for it -- you know, okay, maybe you
didn't understand everything about it, but you can at least see how it would work and you'll
believe me when I say it's a fairly intuitive representation for the heap. At least much
more so than a lot of the logical formulas that you get.
You know, I always take a while to parse separation logic formulas. I couldn't imagine
average developers dealing with those very effectively. Even though they're much richer
and can express much more sophisticated things than this, they just require a lot more
effort to understand what they're saying.
The static analysis, well, okay, we can analyze 15,000 lines of code in a few minutes, 200
megabytes. That's not quite module size yet. We haven't analyzed anything larger just
because our Java front end was kind of brittle because we wrote it ourselves and we
wanted to start working on the .NET stuff rather than continuing to poke that.
And so we were able to do the static analysis efficiently, but also we were able to do it
in a way that's fairly precise and actually gets a lot of information. We don't get that
efficiency by throwing away a lot of the good detail.
The dynamic support, I sort of hinted at it. We have some neat debugger stuff, some
specification mining stuff. Other support is in progress, particularly with the test case
generation. So, you know, I'd love to talk to people about this. I think it's kind of
exciting.
And, you know, I really sort of alluded to some of this future work in the previous slide
and earlier, but the goal here is to take these core concepts that seem to work well and
finish building a robust tool that other people can use and play around with.
I think now it's to the point where if you're interested in using it for something, you can
run it on some nontrivial programs that we've already run it on. If you want to try and do
lots of other large stuff, you're getting into exciting territory. But at least it's at the
point where you could play around with a nontrivial program and see if it's producing
results that would be useful to you in doing what you want to do. You can sort of evaluate
it a little bit.
So we really want to focus on finishing this implementation so that it is robust and you
can run it on something that we haven't run it on and have better than 50/50 odds of
getting anything.
And then we're also trying to figure out how to export these results in a meaningful way.
You know, we can export them to Visual Studio here with a little bit of the code
contracts. We have sort of an API for querying the model. But it's not clear what the best
way is to export this information in a uniform way so that other tools can make use of it,
and whether we've gotten all the information that we'd like.
I think we've covered the range of information that people would need pretty well. But
we really need some more experience with actual applications to say that for sure.
And so that sort of ties into this we'd like to apply this to, you know, actual client
applications. We've sort of played around with some as we've been developing this
model to make sure we're on the right track, but we haven't really implemented a lot of
tools in a comprehensive way based on the results.
So I think that wraps up what I had to say. So I, you know -- questions or comments, I'd
be very interested in what everybody has to say.
>> So do you have a visualization [inaudible]?
>> Mark Marron: Yes.
>> Because you didn't list it, but I sort of expected --
>> Mark Marron: Yeah. So here we go. So we go back to this body. So you can see
loadTree. We can actually show the full model.
And so, well, okay, it's a little tough if you want to actually read the text in it. But so
here it shows that, you know -- it shows the pre and post states of that method. So it will
show you what's read and written.
So you can see, as I mentioned, we said, you know, this is never modified -- this is never
modified. This is actually arg zero, in the middle. Okay, yeah. Arg zero. So there. And
then, you know, you can see at the post state, right, that this part of the heap has just
been read. Even though there's a lot of modification over here, what was reachable from
arg zero -- this body -- has just been read.
Here's the part -- the return value -- that has been modified. You know, it says, oh, it's
fresh; it actually has a tree shape. The return value itself is a fresh object, freshly
allocated, and some of the other parts are fresh too. But it also has a reference to some
part of the heap that was passed in.
So there's more information -- I mean, there's a fair bit of interesting information here,
and it's not clear how you export that in a compact way to other tools or to the user,
other than just saying here's the graph and you can sort of look around
and see what you want to see. Yeah.
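A tiny hypothetical example of the kind of effect information being visualized: the return value is freshly allocated and modified, the heap reachable from the argument is only read, and yet the fresh object holds a reference back into the heap that was passed in. The names here are made up for illustration.

```csharp
public class BodyNode
{
    public double Mass;
}

public class Summary
{
    public BodyNode Source;     // reference back into the caller's heap
    public double CachedMass;   // freshly allocated state
}

public static class Effects
{
    // 'b' and everything reachable from it is only read; the returned object
    // is fresh but keeps a pointer into the passed-in heap.
    public static Summary Summarize(BodyNode b)
    {
        return new Summary { Source = b, CachedMass = b.Mass };
    }
}
```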
>> So what [inaudible] it provides me [inaudible] analysis that already mentioned?
>> Mark Marron: Yeah.
>> So how do you compare two, then?
>> Mark Marron: Well, I think -- okay. So there are two important distinctions from the DSA
analysis. One, we're doing the analysis in a context-sensitive manner, right, whereas they
do a local pass and then bottom-up and top-down passes. So they're not doing context
sensitivity as such. So that means that they're going to lose a lot of precision that way.
The second thing is that they have a fixed partition of the heap that's computed using a
points-to analysis, right, for the local flow graphs, whereas we have a dynamic partitioning
scheme that allows us to basically take a linked list and split it into two, and we'll
realize that there are now two conceptually different linked lists, whereas with their fixed
partitioning there's no way to break that apart necessarily.
So in general what you'll see is that you'll get much smaller graphs that seem to, at least
in my opinion and from looking at it, end up merging a lot of parts of the heap that really
are conceptually distinct in the program into one region. And so you just lose a lot of
precision that way.
>> So do you know what's the worst case [inaudible]?
>> Mark Marron: For mine?
>> Yeah.
>> Mark Marron: It's exponential in several ways. Right? I mean, we're doing context
sensitivity. And even though we have some nice heuristics, you could still come up with
a pathological program that would do that.
>> [inaudible] something like n-squared or something?
>> Mark Marron: Theirs is something like n-squared, yes. And, I mean, but the other
thing is I think this is why we're interested in focusing on sort of a module level. Right?
I mean, we can do 15,000 lines of code efficiently we know. I believe it will continue to
scale well. It seems to have pretty nice scaling. We have plenty of optimization stuff.
I feel pretty comfortable saying, you know, 30-, 40-, 50,000 lines of code is not a
problem. I mean, you can imagine modules bigger than that, I'm sure. I'm sure there are
plenty here that are bigger than that. But, you know, 50,000 lines of code I feel confident
that we can scale on this pretty nicely.
If you said I'd like to do completely context-sensitive analysis with the shape model of
half a million lines of code, you know, I don't think that's doable. Whereas the DSA
people have said we want to just take the whole program and we want to analyze it all in
one chunk. And so they're analyzing very, very large C++ programs.
So, I mean, having that sort of scalability -- you know, n-squared or n-cubed is getting
pretty rough for programs that big. But, you know, they really put a lot of time into making
sure it scales well on really big code.
>> Yeah. I guess it will be nice to see how much precision you gain, but it's hard to
compare because they're C and you're in Java and --
>> Mark Marron: Well, so mainly what I've looked at is some of these JOlden
benchmarks, right, because there's the -- the original suite was the Olden suite in C. And
then the JOlden suite is in Java. And so in looking at those, you'll see something like the
Barnes-Hut example. Basically they'll wad sort of all of this together.
>> Okay.
>> Mark Marron: And so you lose a lot of -- so you lose the ability to say we have a tree
of body objects. You say I have a DAG of cell node and body objects. So, okay, for
some applications you don't care, but I think that's a little too much loss of precision for a
lot of the things that I'd like to do.
>> Just final quick question: So have you seen anywhere in this paper -- it's [inaudible]
called is it the cycle [inaudible].
>> Mark Marron: Yes, yes, [inaudible] I like that paper a lot. That was sort of one of the
papers that I read and thought, hey, you know, it would be nice if we had this category of
shapes -- but the problem is they apply it to the whole heap.
And so if you have something bad down here, then it all dies. And I said, hey, this is the
neat thing about the graph, because we apply it individually. You'll notice in the analysis,
oh, man, we got this wrong: we said this is a cycle. We were overly conservative here. It
should be a tree.
But even though that one region is marked as a cycle, we still have nice overall structure;
we still know that these body objects are uniquely pointed to by some of these guys and
whatnot. So that's sort of the error isolation that I really like a lot.
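As a small, hypothetical illustration of that error-isolation point: shape is attached to individual regions of the graph rather than to the whole heap, so a conservatively classified cyclic structure in one corner does not destroy the tree and unique-ownership facts established elsewhere. The class names are made up for illustration.

```csharp
public class BodyObj
{
    public double Mass;
}

public class CellNode
{
    // Tree region: subtrees never share cells, and each BodyObj leaf is
    // uniquely pointed to by exactly one cell -- facts the analysis keeps.
    public CellNode[] Children = new CellNode[8];
    public BodyObj Leaf;
}

public class FreeEntry
{
    // A circular free list elsewhere in the program: this region may be
    // labeled "cycle" (perhaps over-conservatively), but that label stays
    // local and does not pollute the CellNode/BodyObj facts above.
    public FreeEntry Next;
}
```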
>> Francesco Logozzo: Okay. Thank you.
[applause]