>> Francesco Logozzo: So good morning, everyone. Happy New Year. Thank you for being here. So it's my honor and pleasure to introduce Mark Marron, who is a postdoc at IMDEA Software, and today is going to talk to you about the work he has done on modeling the heap in a practical way for the debugger or the static analysis. Thank you. >> Mark Marron: All right. Thank you for that introduction, and thanks, everybody, for showing up. As Francesco said, my interest is in developing tools and techniques for modeling the program heap. And just to sort of start out, I mean, the motivation for this: I think all of us are interested in techniques for optimizing programs, helping programmers understand their programs, finding bugs, testing, et cetera. And all of these things that we want to build require some understanding of the semantics of the program that we're talking about. And so as these two quotes sort of illustrate, in order to understand the semantics of a program, we need to understand both the algorithms and control flow structures as well as the data structures that these algorithms are manipulating. And one of the interesting things about these two quotes is that they come from a time when procedural programming was sort of the state of the art, so the algorithms were the fundamental organizing principle of the program that manipulated data structures. Now, as we've moved on and techniques like object-oriented programming have become more popular, data and the abstraction around objects have become the central organizing principle of program development, and so this has made the notion of data and the organization of data a more important concept in understanding the semantics of the program. Additionally, the popularity of garbage-collected languages makes it much easier to build rich, complex data structures on the heap, much more so than if you had to manage all the memory yourself.
So as time has gone on, understanding the program heap has become a more important and a more difficult task in understanding the overall semantics of these programs that we're interested in. Now, just to focus this a little more on the problems that I'm interested in and what I'm focusing on, I'm not particularly convinced that there's one sort of heap analysis that subsumes all others. I think in order to get the best results you're probably going to have to focus on specific classes of problems a little more. And what I'm interested in doing is supporting applications that we'd like to see in the development environment of the future: a development environment that helps programmers query the behavior of their program, understand what's going on if they're refactoring, bug fixing, adding new features; the compilation environment that would be in the system. We want to produce fast code, we want to understand what's going on in the program so we can do, you know, interesting, sophisticated transformations. We want to support automated test development, debuggers. We want to help the programmer really test all these data structures that they're building and understand what data structures they are building at runtime. And in particular this is where the practicality comes from. We're interested in -- since we're building these tools for a development environment, we're going to assume that most of the code that we're looking at and trying to figure out the semantics of is going to be sort of userland applications; that these programs are going to be perhaps fairly large, some parts of them might not be particularly well written. They're going to use lots of libraries, I/O, this sort of stuff.
And so we really wanted to come to the development of our heap analysis technique thinking about how we're going to deal with these in sort of a manageable way; that is, we have to ensure that we can analyze large programs fairly computationally efficiently. If the programmer is doing something, writing some code in a very strange way that's hard to understand, we're going to lose information, and we need to make sure that this doesn't destroy all the other information we have about the program. We need to be able to cope with these things. So just to get a little more precise, some of the client applications that we're interested in supporting and have looked at as we've developed this analysis technique: we have some optimization examples, thread-level parallelization or redundancy elimination; better memory management, garbage collection. Some program understanding tools, automatically extracting pre- and postconditions for methods or computing class invariants. From the testing standpoint, maybe automated test case generation for heap data structures, debugger visualization to help the programmer understand the runtime heap that's present when they're figuring out what a bug is. So in order to support all these things, we need a number of components across a range of parts of the program lifecycle. We need some sort of model that represents these properties that we're interested in, that we want to query or expose to the programmer or other tools. This model needs to be amenable to efficient static analysis. That's a key thing. And we'd like it to be understandable by nonexperts. We'd like something that we can show to the end user, that they can quickly understand, that they can give us feedback on -- whether or not it's precise. So we'd like to try to have something that we can not just automate but also involve the end user in if possible.
From the static analysis standpoint, as we mentioned, we want to analyze userland code, so this needs to be scalable, at least to the module level. We might be willing to do some other stuff for inter-module analysis, but we want to at least do, you know, whole-module analysis in a context-sensitive manner efficiently enough to run for the developer on their machines, right? I mean, ideally we'd like this to be able to run in the background interactively with the rest of the development environment. But they should at least be able to run this static analysis when they do their other regression testing. We also need of course some dynamic support. If we want to talk about debugger visualizations, we need debugger hooks, hooks for doing specification testing. And of course there are runtime needs for this heap structure generation. Now, most of the work that I've done to date has focused on the development of this program heap model in the static analysis. We're moving into some more of this work on dynamic support, but it's not so well developed yet. So I'm not going to really talk about this too much in the talk. But if you're interested in this, you can just imagine all the things that I talk about in the heap model applying to the dynamic support as well as the static analysis. And of course I'd be happy to answer any questions anybody has on that. So the first general topic I wanted to talk about is how we actually model the heap; that is, the properties we're going to extract and represent, how we're going to do it, and how it all fits together. Now, the first question is what properties we're actually interested in understanding. And these break down into two categories. One is sort of the obvious: if we have a heap, it has some structural properties, right?
And what we'd like to do is, rather than view the heap as some uncollected set of objects with pointers between them, we'd kind of like to try and break it up into a more hierarchical structure that hopefully mirrors what the programmer has in mind when he's been defining these data objects. So what we're going to do is we're going to initially try and take the heap and partition it into logically related regions in some sense. So maybe we'd like to say, rather than just having a bunch of objects, that this set of objects makes up some recursive data structure, this set of objects makes up another recursive data structure, this set of objects is all stored in a given array, some stuff like that. And that way we can keep unrelated sections of the heap separate, and we can more precisely model the connectivity between these things. Now, once we've broken everything down into these regions, we can then talk about connectivity both within the region -- so the classic sort of shape property: this set of objects makes up a recursive data structure and they're linked together like a [inaudible] tree or they're linked together like a singly linked list. And then we can also have the connectivity between regions. So starting from variable X I can reach part of the heap that's shaped like a binary tree, I can then reach part of the heap that's shaped like a linked list, I can then reach some other vector. Now, an important special case here is that if we have a region in the heap that represents an array and a region in the heap that represents all of the objects stored in that array, we'd like to know the special case: does this array point to a unique object in each index of the array, or is there maybe some sharing in this array? And similarly for recursive data structures. So we're going to handle this special case a little specifically since it seems to be quite important.
The second class of information that we want to model is what I refer to as intrinsic information. And so this is information that's about the objects themselves and might actually not be visible to the executing program, but nonetheless is very important to understanding the overall behavior of a program. So the first idea is this notion of object identity; that is, we might like at some point in the program, say, statement 1, to say, okay, X refers to a given object, and then we'd like to ask, you know, at statement SK, where has this object ended up, has it been moved into a different data structure, how have the data structures been arranged around it, is it even still alive. And we'd like to have a fairly compact naming scheme across program locations. And once we have this kind of information, then we can also start asking about a very useful thing: use-mod information on heap-based locations. So given a program statement we'd like to say, can you tell me all the locations that are read and written by this statement. And similarly, given a method call, we'd like to know all the locations that are read or written across the invocation of this method. So this is what's often called the method frame or footprint. Then we also have some information of course on allocation and escape sites. This is pretty standard. Field nullity is useful. And we also track some information on collection sizes and iterator positions within the collections, which can be interesting to a more limited set of applications, but still useful from time to time. So the technique that we're going to use to represent all of these properties is based on the classic storage shape graph approach. Now, in this approach we're going to take an actual graph model, and each of the nodes is going to represent a set of objects, or the regions that we get from our decomposition. And each edge represents a set of pointers between the regions. And so this is very nice.
As you can see, it has a very natural representation for a lot of the properties we're interested in. It's easy to visualize and understand, even by nonexpert developers. There's a natural map between the graph structure and the graph structure of the actual concrete heap. From a cognitive understanding point of view, it's a very navigable structure; it's not prone to a lot of radical rearrangements when the heap changes a little bit. It also is very efficient and compact for computation in terms of static analysis. So just a couple of technical issues that would be interesting to people who do a lot of static analysis and shape analysis: in the storage shape graph, disjointness properties are pretty natural and mostly free. So, for instance, if we have two disjoint sets of objects that are different in some way, they'll be represented by two distinct nodes. So this is implicitly encoding a lot of the disjointness information. So we get nice frame rules and that type of stuff for free. The second important point is that the majority of computation is local. Since the transitive closure of a lot of relationships is implicit in the graph structure, we don't need to update a lot of transitive relations explicitly to do each operation. We just change given edges in the graph, where they point to, and we get all the transitive closure implicitly, basically for free, for most operations. And the third thing, which I found is very, very important, is the graph structure does a very nice job of isolating imprecision that arises when we're analyzing a program. As we said, we're sort of expecting in these large programs that we're looking at that -- you know, the programmer might not write a particular algorithm in a very straightforward way. They might mangle data structures in just a weird way before they clean them up. And we're not going to expect our analysis to be able to figure out the clever thing the programmer's doing.
So we're going to make very crude approximations about what happened to some data structure. We might say we have no idea what happened here, this data structure could be a cycle. But we don't want the fact that we were unable to understand a specific section of code or what happened in a specific data structure to destroy all the useful information we have about the rest of the program heap. I'm sure this is an issue that's come up for pretty much anybody who's written a static analysis. You know, you have this beautiful expression language, you try and run it on a program and you're like, I should be able to discover all sorts of useful things, and it comes out and it says I couldn't figure anything out, and you find out there was one point where your transfer function was slightly imprecise and this just snowballed and wiped out all the useful information you had. And so this is a very important thing, since we pretty much expect that we're going to have these places where there's imprecision and we don't want this snowballing to occur. So I thought that was an interesting technical aside, but not critical. Okay. So the only downside with this storage shape graph is that there are some properties, such as use-mod information, where it's not at all clear how you would encode them in the connectivity structure of the graph, right? And there are some properties, even in the connectivity and structural information, that you can't encode as precisely as we'd like just based on the graph structure. So what we're going to do is annotate the nodes and edges with some additional instrumentation properties to precisely encode all the information we're interested in. Okay. So now I'm going to describe these properties and look at how we build this model in a little more detail. And the first issue that comes up is how do we group objects into nodes, how do we do this partitioning.
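To make the storage-shape-graph idea concrete, here is a minimal sketch in Java of the kind of model being described. All names here (ShapeNode, ShapeGraph, the tag fields) are my own invention for illustration, not the speaker's implementation: nodes abstract regions (sets of objects), edges abstract sets of pointers, and the instrumentation properties mentioned above hang off the nodes and edges rather than being encoded in the graph structure itself.

```java
import java.util.*;

// Hypothetical sketch of a storage shape graph.
class ShapeNode {
    final int id;                   // identity tag for the region
    String layout = "singleton";    // singleton, list, tree, or cycle
    int lastUse = -1, lastMod = -1; // statement indices for use-mod info
    ShapeNode(int id) { this.id = id; }
}

class ShapeEdge {
    final ShapeNode src, dst;
    final String field;             // which field/offset the pointers come from
    boolean mayInterfere = false;   // can two abstracted pointers reach a shared object?
    ShapeEdge(ShapeNode s, String field, ShapeNode d) {
        src = s; this.field = field; dst = d;
    }
}

class ShapeGraph {
    final List<ShapeNode> nodes = new ArrayList<>();
    final List<ShapeEdge> edges = new ArrayList<>();
    ShapeNode region(int id) { ShapeNode n = new ShapeNode(id); nodes.add(n); return n; }
    ShapeEdge connect(ShapeNode s, String f, ShapeNode d) {
        ShapeEdge e = new ShapeEdge(s, f, d); edges.add(e); return e;
    }
}
```

Note how disjointness is implicit, as the talk says: two regions are disjoint simply by being two distinct ShapeNode objects, with no extra facts to maintain.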
And this is a fundamental issue that comes up when you're using a storage shape graph approach. Clearly, if you have too many nodes, this graph is large, it's expensive to compute with. When you try and display this information to an end user, they see this huge graph and they have no idea where they are or what they're looking for. So too many nodes is not going to be good. If we have too few nodes, then what we're going to end up doing is partitioning the heap too coarsely. A classic version is that if we partition objects by their allocation site or type, then if we have a linked list, all linked lists end up in the same partition, and that's, you know, clearly not the best way to do things. So what we're going to do is have a dynamic partitioning scheme for the heap; that is, at each point in the program, we're going to look at the abstract heap, identify recursive data structures, and group together objects stored in the same equivalence class of memory locations -- same collection, array, or recursive data structure. And, like I said, this partitioning is dynamic, so if we have a method that takes two linked lists, which are represented by two nodes originally, and appends them together, we're going to say, okay, what were two distinct partitions are now one partition, and create that new partition. Or if we split a linked list, the original linked list was represented by one partition, and at the end we have two partitions -- two nodes which represent the two partitions of the linked list. So just as a little example here to show how this works, we have an expression evaluation program, let's say. And it's a nice object-oriented program, so it has a base expression class. And then we derive from it an add class, multiply class, negate class, constants and variables.
And so we have this nice expression tree here, pointed to by the variable exp. And just for fun we've decided to intern all the variables in some environment. So we have this environment array that points to all the variables, and every variable only occurs once, so it might be shared by certain expressions in the tree. So if we apply our partitioning rule to this to figure out what the logically related regions are, well, the first obvious one is the recursive data structure that's the tree. So this add expression has a left multiply expression, which has a negate expression and two multiplication expressions. The other equivalence class we have is everything that's stored in the same equivalent storage locations, so the variables that are interned in the symbol table are all stored in equivalent storage locations. So then when we apply our abstraction function here, we get our abstract storage shape graph. So we have the expression variable pointing to a node which represents some add, some multiply and some negate objects. There's some internal connectivity on the left, right, and, you know, whatever the negate field is, A. We have a region of the heap which represents a variable array pointing to a bunch of variable objects, and then these are pointed to again by this recursive expression structure. So this segues nicely into the first of these instrumentation properties I wanted to talk about, because when you look at this, it looks like this captured the fundamental structures of the heap pretty well. The unfortunate thing is that although we've grouped this set of objects into the right recursive data structure, we lost information, a lot of information, more than we'd like, on what that structure actually was. All we know is that there's some sort of internal connectivity. We have no idea if it's a tree or a cycle or anything else.
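For reference, the expression-evaluator example being described might look something like the following. This is a guessed reconstruction; the actual class and field names in the example may differ. The point is that interned Var objects can be shared between the environment array and the tree, which is exactly the sharing the analysis has to model:

```java
// Hypothetical reconstruction of the expression example: a recursive
// expression tree plus an environment array of interned variables.
abstract class Expr {}
class Add extends Expr { Expr left, right; Add(Expr l, Expr r) { left = l; right = r; } }
class Mult extends Expr { Expr left, right; Mult(Expr l, Expr r) { left = l; right = r; } }
class Neg extends Expr { Expr a; Neg(Expr a) { this.a = a; } }
class Const extends Expr { double val; Const(double v) { val = v; } }
class Var extends Expr { String name; Var(String n) { name = n; } }

class Env {
    // Interning: each variable object occurs once in this array but may
    // be referenced from several places in the expression tree.
    final Var[] vars;
    Env(Var... vs) { vars = vs; }
}
```

Here the tree nodes (Add, Mult, Neg) form one region, and the interned variables plus the environment array form another, with edges from the tree region into the variable region.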
So this is where the first of our instrumentation predicates comes in useful: the layout. And this is fairly common to a lot of heap analysis techniques, this notion of shape. We're going to associate with each node the layout, which is the most general way that you could traverse the set of objects that this node represents. And we have a couple of hard-coded classes that we use. Singleton: there's no internal connectivity. List: there might be a linear list or some simpler structure, so it might be a singleton, but we don't know. Tree: it can contain a tree. Or cycle, which is any structure. It's our top. So when we go ahead and annotate this model with this information, we now have this summary region which has add, multiply and negate objects; it has some internal connectivity, but we know that it's laid out like a tree. That is, there are no DAG or cycle structures in there. So that captures our concrete heap much better. I should say that I'm cruising over these at sort of a high level, so if anybody has any questions or wants any more detail, feel free to interrupt me. >> Why can you say that's a tree and not a cycle? And I didn't see a self-loop on that node [inaudible]. >> Mark Marron: So that's basically what this property is for. Right? Because as I -- let me go back two slides. So here we just have a self-loop, but there's no information in the actual structure of the graph to tell us what that loop means. Right? Like you say, it could be a cycle. Right? And so what we're going to do is associate this additional property -- >> I understand, but how did you know that there wasn't a cycle? >> Mark Marron: How did I know there wasn't a -- >> Could be a cycle and then -- >> Mark Marron: Oh. >> -- in that structure. >> Mark Marron: In the static -- when I'm performing the static analysis, you mean. >> Right.
>> Mark Marron: Well, so basically what we're doing is, as the analysis goes along, anytime it sees a new statement, it's going to create a new node to represent the single object there. Then at some point in time it says, you know, we need the analysis to terminate -- if I kept doing that I would build a graph of infinite size. So at a control flow join point, it says let's look at the graph and let's see if we're building something that looks like a recursive data structure. And if we are, we'll compact it down. And at that point in time we have a number of -- I mean, it's basically a case statement that says if you have this sort of connectivity and this sort of structure property, then you're building a linked list. And if you have this set of structure properties and connectivity, then you're building a tree. So it just aggregates as the program executes. Does that answer your question? Okay. Good. All right. So the next property that I want to talk about that's related to structure is kind of interesting. This is the notion of sharing. So, again, if we look at the storage shape graph, each edge abstracts some set of pointers. But there's no information in the structure to describe how this set of pointers in this edge shares or aliases or anything. All we know is we have a set of pointers. We don't know if the pointers that this edge abstracts maybe point to the same object, point to different objects, or point to different objects but within the same recursive data structure. And we'd really like to know that. For example, if we have this array of objects, we'd like to know if, you know, the pointers stored in this array refer to aliasing objects or not. Now, this sharing could occur between two references abstracted by the same edge or two pointers abstracted by different edges. So there are these two dual versions, interference and connectivity.
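The four layout classes just described form a small lattice, and the summarization step needs a join over it when merging regions. A sketch of one possible encoding (my own; the talk only names the four layouts and their ordering):

```java
// Hypothetical encoding of the layout lattice described in the talk:
// singleton ⊑ list ⊑ tree ⊑ cycle, with cycle as top.
// When two regions are merged, the summary takes the join
// (the more general layout wins).
enum Layout {
    SINGLETON, LIST, TREE, CYCLE;   // ordered from most to least precise

    static Layout join(Layout a, Layout b) {
        return a.ordinal() >= b.ordinal() ? a : b;
    }
}
```

So merging a node known to be a list with one known to be a tree yields a tree summary, and any merge involving an unknown structure collapses to cycle, the top element.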
Since they're basically the same, I'm just going to describe interference here, and you can imagine the same thing for pairs of edges in the connectivity property. So the question here is: we have an edge E. It represents some set of pointers. And we want to know if the edge is noninterfering; that is, every pair of references, R1 and R2, abstracted by the edge refers to disjoint data structures. That is, in the region the edge points to, there's no way to get to the same object from these two pointers. We're going to say the edge represents interfering pointers if there may exist a pair of references that refer into the same data structure; that is, they point into a data structure where they could get to the same object. So this is a slightly strange definition, so I think a quick example is the easiest way to understand it. In this figure, we have two concrete program states. We have the variable A referring to an array with three elements. In the first case, the first and the second index both alias the same data object. In the second case, all the indices have pointers to distinct data objects. Now, if we look at the standard storage shape graph model of this, all of these pointers, in either case, are going to be abstracted by some edge -- that is, the pointers stored in the data array pointing to data objects. And so there's no way with the structural information of the graph to distinguish whether this abstract state represents this concrete state or this concrete state. So what we're going to do is use these interference properties. So we're going to have, in this case, that the pointers may interfere, which allows either that there exists some pair that alias or that none of them alias. But the noninterfering property is going to explicitly exclude this, because it says for all pairs they point to disjoint objects.
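The two concrete states in the figure can be mocked up directly. For objects with no outgoing pointers, interference degenerates to plain aliasing between array slots; the following toy check is my own illustration of the concrete property, not part of the analysis itself:

```java
class Data { int payload; }

class InterferenceDemo {
    // Dynamic mirror of the abstract property: the references in arr
    // "interfere" if some pair can reach a common object. For flat
    // Data objects this is just an aliasing check over pairs.
    static boolean interferes(Data[] arr) {
        for (int i = 0; i < arr.length; i++)
            for (int j = i + 1; j < arr.length; j++)
                if (arr[i] != null && arr[i] == arr[j]) return true;
        return false;
    }
}
```

The first array below corresponds to the interfering concrete state (two slots alias one object), the second to the noninterfering one; the abstract edge property must be conservative enough to cover both unless it can prove noninterference.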
And so you can think of this as being useful if you wanted to perhaps parallelize a traversal of this array. Interfering says it's not safe to do this; there might be a concrete state where there are some loop-carried dependencies. Noninterfering says it's safe. Or if you wanted to, you know, somehow refactor some processing of the array or how data elements are stored in the array -- maybe you wanted to inline them -- if the array contains all noninterfering data elements, then you know each index sort of owns its corresponding data element. Okay. So that's the extent of the structural properties. And now I want to look at something a little more interesting, these intrinsic properties. So the first one is this data structure identity. As you saw in the storage shape graph, we have these nodes which represent partitions of the heap. And we can split these partitions, we can merge them back together, we can rearrange them. We can do lots of things. And really what we'd like to know is: if at sort of the start of the entry -- excuse me, the entry of a method, I have some given partition, and at the exit of the method I've moved these all around, I'd like some quick way to compare the initial partition with the final partition, or some partition at an intermediate point in the method. And a classic way to do this is based on access paths. Right? But I'm sure, as a lot of you have experienced or seen, these can quickly become very complicated to reason about, to propagate in a model and so on. And so what we'd like to do is -- we have this storage shape graph where each object, or each class of objects, is explicitly represented by a node. So we'd really like to just tag these nodes with some notion of identity, and then we can just read off the identities the node represents and compare those to some sort of canonical identity partitioning.
And, as I said, we fortunately have a nice canonical identity partitioning: the one at the start of the method. And I'm going to talk here about the use-mod, and then we'll see an example of how the two work together. So once we have this notion of identity, we can then add use-mod information quite easily: we're going to add a tag to each node that records the last statement at which each field in the node was used or modified. And then it's really easy at a statement level to just look at the use-mod tags for that statement and say, okay, we can read off all the identities that were used and modified at this statement. That gives us the use-mod sets of the heap locations for the statement. And, similarly, across the entire method we could just say, okay, initially in the method, we have an identity partition and nothing has been used or modified in this method. At the exit of the method we can read off the identity tags and the use-mod field for each of those, and so we can compute the footprint relative to the initial partition. And so that probably sounds a little confusing and it's a little weird, so here's a nice example. We have this function mangle. It takes a pair P, reads first.value, the value of the object in that pair's first field. So this is just a standard pair object with a first field and a second field. Then it swaps the first and the second. It nullifies the first and it returns the result value that we read. So we can see here that even in this really simple straight-line bit of code, where we rearrange a few things, it takes a little thought to keep track of what actually gets read and written, what gets returned, what gets freed and so on. But if we look at the storage shape graph -- so here's the storage shape graph at the entry to the method. And so we're going to say this is our canonical storage shape graph and our canonical partitioning. We have a pair object. It's one region of the heap. We've given it ID 1.
We have some data object that's stored in the first field. We've given it ID 2. And we have some data object that's stored in the second field. We've given it ID 3. Then at the exit of the method, we say, okay, well, we can see how this storage shape graph has been rearranged. All the regions have been rearranged. P still points to region 1. So it hasn't moved at all. We can see that region 3 no longer exists. So it's dead. It's been freed. And we can see that 2 is now pointed to by the second, and 2 was originally the first. So it's quite easy to see where things ended up without having to think through each step of this rearrangement code. The second thing that's quite easy to read off is the read-write use-mod information in the heap; we do a little color coding. Now, in the previous description, I said we do this use-mod in a field-sensitive way. For the figures, trying to display the individual fields that were used and modified becomes rather difficult to see, so we simply color a node green if any field has been used, and red if any field has been used and written. So we basically color this pair object red because the first and the second fields are both read and written, and we color this object green because we see that in the first line it was read. Okay. Oh, yes. >> Where did the diagram on the left come from? Is that from the analysis of just this function or for the whole program? >> Mark Marron: This is actually from the whole program. The model is designed to be fairly compact. So this is almost the complete semantics of this method. If you run the analysis on this with a program that does all sorts of crazy things, you'll also end up with a second heap graph state where the first and the second both refer to the same data object. But those two models basically will define the complete semantics for this method. All right.
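From the description, the mangle example presumably looks something like the following (a reconstruction; the exact class and field names are my guesses). Tracing it confirms the graph reading above: after the swap, the old second object (ID 3) sits in first and is then nulled out, so it dies, while the old first (ID 2) ends up reachable through second:

```java
class Datum { int value; Datum(int v) { value = v; } }

class Pair {
    Datum first, second;
    Pair(Datum f, Datum s) { first = f; second = s; }

    // Reconstruction of the "mangle" example: read through first,
    // swap the two fields, null out first, return the value read.
    static int mangle(Pair p) {
        int result = p.first.value;  // read a field of region ID 2 (use)
        Datum tmp = p.first;
        p.first = p.second;          // first now holds old second (ID 3)
        p.second = tmp;              // second now holds old first (ID 2)
        p.first = null;              // old second (ID 3) becomes unreachable
        return result;
    }
}
```

The identity tags make the net effect readable at a glance: ID 1 is read and written (red), ID 2 is only read (green), and ID 3 is dead at exit.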
Are there any questions on that? Or does that make a reasonable amount of sense? Okay. All right. So, then, we use the same technique to track field nullity and allocation site: we just add tags to the nodes. Collection sizes, we track whether they're empty, must be nonempty, or we don't know. And field nullity is a per-field tag. These are pretty straightforward, and the techniques used to model them don't differ too much from what we used for the heap use-mod information. So now that I've sort of humbled everybody with this barrage of properties and stuff and claimed that it all does something useful, I want to just take a minute to look at this case study of Barnes-Hut that hopefully puts it all in a slightly larger context, and you can see why we do this, and maybe it will be a little more intuitive. So this program, Barnes-Hut, is taken from the JOlden benchmark suite, and it's basically a high-performance computational kernel. It does an n-body simulation of the gravitational interaction of a bunch of planets or something in three dimensions. Now, the easiest way to do this is basically you have a bunch of time steps. At each time step you compute the gravitational interaction between each pair of bodies, you compute how this affects their acceleration, you update everybody's position based on this, and you go to the next iteration and keep doing this. The problem with that is if you compute the pairwise interaction of each of the bodies, you get an n-squared algorithm, which can get kind of expensive when you have a large, large number of bodies. So what this algorithm does is it takes advantage of the fact that gravity drops off rapidly as distance increases and that it's additive. So for faraway sets of bodies, it computes a point mass and just computes the interaction with that point mass for a body. So you get a much more efficient algorithm. I think it's N log N. Is that -- anybody disagree with me? I think so. Okay.
I'm not going to claim that. But it's much better than n-squared. So the main part of this computation is the loop that actually computes the interaction between pairs of bodies. And so I'm going to show the loop invariant heap state that we compute for that loop. Now, this is just showing the graph structure that we compute. I'll show some of the instrumentation predicates in a minute. So the heap has a B tree object, and this object, it just sort of encapsulates all the computation structure present in the program. And it has a root that points to the space decomposition tree. This tree is made up of cell objects which contain vectors which either have pointers to other cell objects, parts of the subtree, or contain pointers to the actual body objects that represent the planets that we're doing the n-body simulation on. We also have this field called body tab reverse, which is a vector of body objects that we're going to use to iterate over the set of body objects. So each body has a position, a velocity, an acceleration and a new acceleration field that it uses to store the new acceleration computed on each update loop. These positions are represented as mathematical vectors in three space, which are these math vector objects, and each one of them has a double array representing, basically, the values of the vector. Okay. So -- and I think this is interesting as well as it shows that this model can sort of nicely represent this hierarchical construction, right? Bodies are made up of sets of math vector objects which represent their positions and their velocities, you know, the overall computation structure is a -- you know, space decomposition tree of bodies that are also stored in the vector. So this is a nice -- I think this emphasizes the compositionality of this model pretty well. >> Can I -- >> Mark Marron: Yeah. >> The doubles, this is Java, right? >> Mark Marron: Yeah, this is Java. >> The doubles are actually boxed [inaudible]?
>> Mark Marron: I think double arrays, you actually get the doubles in the array. >> Oh, you do? >> Mark Marron: Yeah, I think in primitives. But if you were to do -- if you were to allocate like a vector of doubles, then they would be boxed, I believe. So the first thing that's kind of interesting about this is if you look at this figure, you see that there are an awful lot of math vectors relative to the rest of the objects in the program. And our analysis is able to extract some interesting properties. First off, whoever implemented this went and wrote the math vector in what I think is sort of the canonical, nice object-oriented way. They wrote a sort of robust encapsulated notion of what a mathematical vector is. So it has a static final parameter that you can set for what dimensionality you want the vector to be. It has a bunch of loops that iterate over the length of the -- whatever vector you're doing. And it's sort of nice from a software engineering standpoint. But, as you can see, these math vectors take up a lot of space on the heap. And when you're doing these operations on them, you have a lot of these loops with fixed loop bounds, and in particular the analysis knows that each of these arrays gets allocated of size three, so we can actually unroll those loops entirely and turn it into straight line code. Furthermore, as I mentioned previously, we have these interference/noninterference properties. The analysis has determined that the set of pointers represented by this edge are all noninterfering. So that means that there are a bunch of body -- or math vector objects in this region. And each of them refers to a unique and distinct double array in that region. So they own, in the ownership-type sense, their respective double arrays. So that basically means that we could sort of inline this entire double array into the math vector object as well as unrolling the loops.
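As a concrete illustration of the optimization just described, here is a minimal sketch: a reconstructed MathVector in the encapsulated style (the real JOlden class differs in detail), and the shape it takes after the analysis proves the backing array is always length three and uniquely owned:

```java
// Sketch of the transformation: the analysis knows every backing
// array has length NDIM == 3 and is uniquely owned by its vector,
// so the generic loop can be fully unrolled and the array inlined.
// Class names are reconstructions, not the benchmark source.
public class VectorOpt {
    static final int NDIM = 3;

    // Original style: encapsulated vector with a heap-allocated array.
    static class MathVector {
        double[] data = new double[NDIM];
        void add(MathVector v) {
            for (int i = 0; i < NDIM; i++) data[i] += v.data[i];
        }
    }

    // After the optimization: the array is inlined as scalar fields
    // (safe because each vector owns its array) and the loop unrolled
    // into straight-line code.
    static class FlatVector {
        double x, y, z;
        void add(FlatVector v) { x += v.x; y += v.y; z += v.z; }
    }

    public static void main(String[] args) {
        FlatVector a = new FlatVector(), b = new FlatVector();
        a.x = 1; b.x = 2;
        a.add(b);
        System.out.println(a.x); // 3.0
    }
}
```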
And so by doing this we get about a 23 percent serial speedup and drastically reduce the memory use of the program by about 37 percent. So another sort of application that we played around with this information is looking at parallelizing sort of the main computation loop in this program. This loop takes about I think like 90 percent of the time of the program's run. So what it's going to do is it's going to take that body table reverse, which is a vector of all the bodies that we're doing the computation on, it's going to iterate over each of them, grab the body out of there and call this hack gravity method. And this hack gravity method is what computes the gravitational interaction with all the other bodies that we're working on. And so it takes the space decomposition tree, it takes the body we're interested in and walks around and computes the new acceleration for the body given by B. And so if we look at the use-mod information of this, we see that although it's really accessing the whole heap and all different parts of it, it reads pretty much from everything, but it only writes to this one double array that's the new acceleration field. And in particular it writes to the new acceleration field of the body that B refers to. And since we know that this vector here has all the objects -- excuse me -- this edge here contains all noninterfering pointers, we know that this -- each body object appears once in that vector so that each body object's new acceleration field is written only once in that loop when it's referred to by B. So we know that since all these locations are read and never written, there are no read-write dependencies. And since we know each of these new acceleration fields is written on exactly one loop iteration, we know there are no write-write dependencies. So we can just -- yeah. Go ahead. >> So the write-write depends like -- I can see the read-write. I'm not sure. 
I mean, it could be that in a different iteration you read that thing that you're writing from the -- from some other vector from the edge going to the left [inaudible] maybe that's not the case. >> Mark Marron: Well, because -- well, okay. So we know that this is the only place where there could be a read-write dependence. All this is just read-only. And so we know this is -- there are no read-write dependencies here because this thing is actually only accessed through B. So if we looked at -- >> It's not clear from this graph, right? >> Mark Marron: No, no. You need a little bit more information. You need to actually look in the hack gravity method. There's one line that says basically B dot new acceleration dot data equals blah, right? >> So you know there's no read [inaudible]. >> Mark Marron: Well, you know that there's a read of this, but only through B. So you do have to do a little more work than just look at the graph. You know, this -- I mean, this is under the assumption that you have a pretty clever compiler. Right? I mean, you have to put some effort into this. On this one, on the first example, I think it's something that you could do fully in a compiler. With this sort of thread-level parallelization, I mean, this benchmark is specifically picked from high-performance kernels designed as a challenge problem for thread-level parallelization. And so this is sort of -- I mean, this is sort of a friendly example. So with this parallelization, I think this falls more into the case where you wouldn't want to have your compiler say yes, just go through and thread-level parallelize everything you can and get back to me when you're done. It's the kind of thing that you would want to have some user feedback. So you might prefer to say -- tell the user, look, I looked at this loop and I know from profiling you spend an awful lot of time here, and I looked at this loop, and you know what, there's only one data dependence right here.
And I can't be sure that this data dependence does or doesn't exist. But if you want to speed your program up, and you can actually assure me that this data dependence doesn't exist or maybe just put a little lock in here, then I can parallelize this. So this is sort of the thing I think that makes more sense to have some level of user interaction on. And so, I mean, that's sort of the concern of having a model that -- you know, okay, so this developer's probably dedicated, he cares about his program, he wants to make it run fast, but he doesn't have a lot of training in formal logic and he wants a tool that he can sit down, spend an afternoon with, get comfortable with and after a couple weeks be pretty proficient with. So we want something that we can interact with the user on. Yeah. Okay. Any other questions on that? Okay. Then I'll brag about my speedup of a factor of three on the quad-core machine. We're fast. All right. So that's sort of -- it's a general description of the model, and now I'm going to talk briefly about some of the work we did on actually doing static analysis on interesting programs with this model. Now, some of the real challenges that come up when trying to do -- analyzing the program heap statically are, one, the computational costs. In order to do shape analysis with any degree of precision, we need to do it in a flow-sensitive manner. That means we're going to have to use some disjunctive representations of the program. So we're going to have to use sets of these graphs. We're not just going to have one graph at each program point. And we're going to do the analysis in a context-sensitive manner. So that means we're going to sort of start at main, we're going to start analyzing. When we see a call, we're going to analyze that call with the abstract state we have at that program point. If we see another call to that method with a different abstract state, we're going to reanalyze the body and whatnot. 
And you can see that the combination of these things, if you're not careful, can quickly lead to sort of an explosion in the number of models you're pushing around in the local analysis, and, you know, analyzing a given method body many, many, many times, and the analysis just will run unfathomably slow. It would be useless. The other issue that we encountered that was kind of interesting was if you look at the literature, a lot of the focus is placed on analyzing recursive data structures, doing strong updates and this sort of stuff. When we want to talk about analyzing userland code, a big problem is that they're going to use lots of standard libraries. They're going to use collections, files, IO, you know, graphical user interfaces. And so we need to deal with these. But in particular collections are pretty important since the user is sort of organizing data and changing a lot of the heap structure and properties through collections. They're doing -- they're unioning two sets, they're computing the intersection of two sets, they're copying things from sets to lists. So we need to make sure that we can handle collections in a fairly reasonable way. So on the first problem, fortunately there's been a fair bit of nice work on this notion of partially disjunctive and flow sensitivity. And particularly I like this paper, Partially Disjunctive Domains, from SAS '04. And this really helps out with the costs a lot. And basically the idea in this paper is rather than keeping a set of let's say graphs where if two graphs are different in any way then we keep them separate, we have some similarity function. And if they're maybe not equal but they're similar, then we join them together, we union them. And so we sort of cut down on the set size we have to push around. We also spend a good chunk of time, well, over the entirety of this project, but particularly in the last six months on dealing with some of the disjunctive issues. 
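The merging idea from that partially disjunctive domains work can be sketched generically: keep a set of abstract states, but join any two that a similarity test considers close enough, so the set stays small. Everything here is illustrative; toy values stand in for real storage shape graphs:

```java
import java.util.*;
import java.util.function.*;

// Sketch of a partially disjunctive join: instead of keeping every
// distinct abstract state separately, fold each new state into the
// first "similar" existing one, otherwise keep it as a new disjunct.
public class PartialJoin {
    static <T> List<T> reduce(List<T> items,
                              BiPredicate<T, T> similar,
                              BinaryOperator<T> join) {
        List<T> out = new ArrayList<>();
        for (T item : items) {
            boolean merged = false;
            for (int i = 0; i < out.size(); i++) {
                if (similar.test(out.get(i), item)) {
                    // Union the similar pair into one disjunct.
                    out.set(i, join.apply(out.get(i), item));
                    merged = true;
                    break;
                }
            }
            if (!merged) out.add(item); // genuinely different: keep it
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy stand-in: "graphs" are integers, similar if within 2,
        // join takes the max (an over-approximating union).
        List<Integer> states = List.of(1, 2, 9, 10, 3);
        System.out.println(reduce(states,
                (a, b) -> Math.abs(a - b) <= 2, Math::max));
    }
}
```

Five states collapse to two disjuncts here; with real graphs the similarity test compares graph structure and the join is the abstract union, but the cost-control mechanism is the same.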
So particularly calls and returns, when you pass in a model set and you get back a model set, you can have multiple [inaudible] growth in the number of models you're pushing around. And context sensitivity as in how do we keep from reanalyzing a method body many, many times and building a huge memo table of method analysis input and output pairs. And I'm not going to go into detail on this. I mean, I'm really happy to talk about it for people who are interested, but it's just a little technical, so I don't want to spend a whole lot of time on it. But some of the key ideas are that at each method call we ensure that we only pass in one model and get out one model. So that is, we don't get sort of exponential growth by analyzing a method body. All the method body control flow is pretty much captured within the callee and isn't exposed back to the caller. Particularly when we're analyzing object-oriented code, handling virtual method calls is really important. We're going to have large object hierarchies with lots of overrides. In one of the benchmarks I talk about later, we have a class hierarchy with 17 classes that all override a base class in several methods that are part of a recursive call. So if we were to handle virtual calls naively, even if we only return one heap graph from each of those calls, if we have to analyze 17 possible targets, we get our factor of 17 growth just by analyzing each call. So we have to handle that efficiently as well. In general, we do a very good job with the number of contexts using some heuristics on determining when a context is new and unique and when it's just sort of redundant. So we end up with only 2.2 contexts per method in the end. We analyze it a few more times than that, but it's still a very small number. And, critically, even though we're using some heuristics to manage this and potentially losing information, in the benchmarks we've looked at the results are still near optimal with what we can get.
And I'll explain exactly what this means a little bit later. So with libraries, well, one nice thing is that most libraries aren't interesting from the perspective of the user space code. This is the nice thing about saying I'm going to analyze user space code. If I create a file stream, if I call a new file stream, I really don't care about the internal representation of the file stream or how complex a heap structure it has. I just know I get a file stream object back. It has some API. I make some calls on it. Maybe I pass it strings, maybe I get strings back from it. But whatever happens in the internal structure, I don't care. And this is -- seems to hold for a lot of library code. So we can just, say, have a special semantic operation that says give me some uninterpreted region of the heap. And you don't know anything about what happens in there and you don't care. And so most of the libraries that we analyze actually say either return an uninterpreted section of the heap or I did something to the uninterpreted section of the heap. It's very simple. Collections are a little harder. Obviously we need to analyze them more precisely because they're not entirely opaque. The programmer can see objects he puts in, he wants to see objects he gets out. He might have references to the objects he still puts in. So we can't just say there's an -- you know, you've put something in an opaque-and-otherwise-invisible-to-you section of the heap. Also, the semantics are a little more I guess visible from a heap perspective, right? If you add something to a collection, you expect it to be there. If you, you know -- union 2 collections, you expect the results to sort of have some sharing relations to each other. So we need to be able to model these things. Now, the obvious way to do this or the simplest way to do this I should say would just be to say well, we're going to analyze the bodies of these collection libraries. 
But I'm sure, as a lot of you know, these libraries are specialized, highly optimized; someone designed these to be performant over a wide range of usage scenarios. It's a fair amount of code. And so it's very expensive and difficult to model correctly and efficiently. Also, these are standard libraries, so they really shouldn't change that often, hopefully. And so doing this analysis every time we're analyzing some userland 1500-line-of-code program just doesn't make a lot of sense. So, again, we introduce some simple, fairly high-level semantics for the common collection operations. So if you're going to add, remove, union, find in lists and sets, look up values in dictionaries, we have specialized semantics that also can take into account the specialized semantics of the collection library. So they are precise, efficient, and, you know, require some effort for us to implement because we have to implement these special transfer functions in the analysis, but not too bad as far as we've seen. So -- sorry. Yeah. >> How does that interact when a user [inaudible] overrides? So, for example, if [inaudible] is derived from dictionary or from this list or that list, then we no longer know whether to attribute the special semantics to it. >> Mark Marron: Right. That's been -- that's been a really annoying issue. While I understand why programmers love it, it drives me up the wall. So what we do right now is we say, okay, the user can pretty much override equals and compareTo -- those seem to be the things that are important. We're -- we don't have as complete an implementation as we'd like right now. So what we -- the claim we make is that the user overrides of equals have to be pure. They can't write impure overrides of equals. And if they do that, then we can turn the equals method into basically an access path of locations that are read, because it's pure. So only read-only.
Now, when we do -- let's say we want to do a find in a collection that has a user override of the equals method, what we do is we say, okay, well, we know we're going to cheat a little bit on the semantics. Find, it might return -- I forget if it -- does it return a bool or does it return the value? All right. Let's say it returns a bool. We say we don't know if you found it or not. So it's an overapproximation of what happens. But then we take this read set that we computed from the equals method and we apply it to every object that's stored in that collection. So the use-mod information is a safe approximation. And the resulting semantics are a safe if rather conservative approximation. We could look at being more clever and figuring out how to do the equals operation, and maybe have a special stub for the case where the user overrode the equals operation, and use the built-in semantics for the special transfer function if they didn't. It's not clear to us yet if that's actually worth the additional precision and where it would be useful. But, yeah, that's a definite issue. >> So that's one problem. I was actually thinking of a different one. If I derive a dictionary myself from the dictionary class or whatever collection classes you have in Java, right, then when I derive the dictionary, do you associate it with the semantics of the original dictionary that you've hard coded, or do you now analyze the implementation? >> Mark Marron: Okay, well -- okay. So for right now, just for simplicity's sake, we assume that no one's overriding these. But if you were to do that, the way I see that working out is you've either provided an implementation of the add method in your override, in which case we'll analyze that directly. I mean, we do the correct method resolution. So we see the add, we'll say, okay, we're adding to a user-derived dictionary, go analyze that method. Right?
Whereas, if you didn't override it, then you would say, okay, the implementation is in the base class, go to the base class and use the built-in semantics there. >> But if it's a virtual call, you might pick either one. >> Mark Marron: It might be either one. And that's one of the issues that we have to do is how do we resolve those. And then we'll get different results for each of those, and how do we sort of get those back in a way that's meaningful into the call site. I'd be happy to talk about that more, but, yeah, it -- yeah. There are a lot of sort of technical annoyances with that when you sort of start dealing with something real world. It's much nicer to say I have perfect collections and the user never overrides equals and it's good. But yeah. So okay. So this is a little bit on the implementation here. Our initial work, we were really concerned about performance, and so we spent a fair bit of effort implementing the analysis in C++. We wrote our own front end for Java so that we could do as much as possible in the compiler and offload work there. And we built a pretty nice prototype of that. And I know the timing results on the report on the next slide are from that implementation. We're currently working on an implementation for the CLI bytecode, in particular bytecodes generated by C#. We've got a pretty good working implementation of that. I'll show a demo in a little bit. It handles most of what gets generated by the C# compiler. Currently some function pointers, pass-by reference and exceptions aren't handled. We know how to do these in theory, but they're just sort of -- we want to get the core working first before we add more complexity. And I wanted to cry when I saw function pointers appear in my nice language that didn't have function pointers. So, you know, that required some rethinking of how we do stuff. We support most of the most sort of key portions of the system collections and IO library. 
So if you want to interact with the console, you want to read files, you want to use lists or tree sets and dictionaries, that's all there. Again, we in theory know how to support graphical libraries and threads, at least in a relatively simple manner, but we haven't implemented that as -- as I said, it adds more complexity. So really our goal here is we want to hit an analysis that's scalable to module-level-sized code bases and then think about using other techniques for inter-module analysis and how to compose the results. So for our results here we have benchmarks from a couple different places. These first four, TSP, EM3D, Voronoi, and there's Barnes-Hut, that was our example, are from the JOlden suite, or our version of it, rewritten from Java in C#. We have DB and Raytrace, which came from [inaudible] 98. And then we have these two programs, Expression and Interpreter, which I sort of wrote as challenge problems to stress test the shape analysis as well as the inter-procedural analysis. I think Interpreter's particularly interesting. It's basically an interpreter for a simplified object-oriented Java-like language. So it has an XML parser that reads in an [inaudible] representation, builds an abstract syntax tree, builds internal symbol tables, static variable tables, et cetera. It then runs an interpreter over this program, so it manages the runtime call stack. It has some internal model of the heap. So we have a lot of classes. We have a lot of different types of heap structures, everything from a sort of well-behaved abstract syntax tree that has a very nice structure to, you know, the Interpreter's internal model of the heap, which obviously we can never interpret precisely, so it's this very ambiguous object that the analysis has to deal with efficiently and make sure that the ambiguity here doesn't creep in and destroy other useful information that we extracted.
So this is I think a very challenging program from both a shape analysis and abstract interpretation standpoint. But, anyway, these programs range from pretty small, you know, 910 lines of code -- and this is normalized lines of code, so one expression per line; we don't have a bunch of nested expressions on any given line -- up to about 15,000 lines of code for the Interpreter program. So I think the Interpreter is right about 10,000 source lines of code before we normalize it to one expression per line. The analysis time is really fast for these smaller benchmarks, less than a second. You can see that even for Interpreter we're sitting at just under two minutes. So this is something that's quite, quite fast for this sort of shape analysis and definitely doable on a developer machine. Even more importantly, I think, is the analysis memory, which, you know -- if it takes twice as long, well, then you get twice as long for your coffee break, which is nice. I don't think anybody complains about that. But really if you start burning lots of memory, if you get over 4 gigs of memory, that's not something that you can really talk about running on a developer machine anymore. It's something that you have to have dedicated hardware for. And we really don't want that. But our Interpreter benchmark sits at 122 megabytes. And it seems to scale, you know, not too badly with the size of the program. These are all very small, less than 30 megabytes. Most of it is just infrastructure setup code. So it's quite efficient memory-wise. Now, we have these percentage numbers for region, shape and sharing. And I'm going to take a minute to explain what we mean by this.
So in most of the work on shape analysis, when you see people attempt to evaluate whether or not their heap analysis is good or is better than something else, they'll pick some application domain, they'll run their analysis, then they'll run their target application and they'll say, okay, we were able to find three more null pointer bugs or we were able to improve the performance of our optimization by 10 percent or whatever. And, I mean, this is of course useful and it's nice to have some sort of -- that your analysis actually does something concrete in practice, but it adds a dependency on the particular application you picked, right? You know, perhaps you could have found lots more null pointer bugs, but there were only three in the program, or you found no null pointer bugs but it didn't matter because there were no null pointer bugs left for you to find. And so we really wanted to come up with something that's a little less dependent on the particular client application and the benchmarks you're using. And so what we did, as I mentioned before, we've been doing a fair bit of work on a runtime support. And one of the things we've done is developed a debugger visualization and specification mining tool. So what we can do is we'll take a program and at any point in time we can take a snapshot of the concrete heap that's running. So we have this object graph that's an image of the runtime heap. We can then apply our abstraction function and we'll get one of these abstract storage shape graphs. And we can of course take multiple snapshots and accumulate them as the program executes. And these abstract storage shape graphs are now going to be an underapproximation of the true semantics of a given method or methods in the program. We can then take the storage shape graph computed by our static analysis, which we know is an upper approximation of the true abstract semantics of the -- or the true semantics of the program, and we can compare the two. 
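The comparison methodology just described can be sketched as follows, with sets of abstract edges standing in for storage shape graphs (illustrative names, not the actual tool's code): abstract each concrete snapshot, union the results into an under-approximation, and check it against the static over-approximation for soundness and, when the two coincide, optimality:

```java
import java.util.*;

// Sketch of the evaluation scheme: abstract each concrete heap
// snapshot, accumulate the results into an under-approximation of
// the true semantics, then compare against the statically computed
// over-approximation. Abstract edge labels stand in for shape graphs.
public class PrecisionCheck {
    // Union of the abstractions of all observed concrete states.
    static Set<String> underApprox(List<Set<String>> snapshots) {
        Set<String> u = new HashSet<>();
        for (Set<String> s : snapshots) u.addAll(s);
        return u;
    }

    // Soundness requires under-approx to be contained in the static
    // result; the static result is "optimal" (for these properties)
    // exactly when the two are equal.
    static boolean sound(Set<String> under, Set<String> statik) {
        return statik.containsAll(under);
    }
    static boolean optimal(Set<String> under, Set<String> statik) {
        return under.equals(statik);
    }

    public static void main(String[] args) {
        Set<String> statik = Set.of("p->r1", "r1->r2");
        List<Set<String>> runs =
                List.of(Set.of("p->r1"), Set.of("r1->r2"));
        Set<String> under = underApprox(runs);
        System.out.println(sound(under, statik) + " " + optimal(under, statik));
    }
}
```

In the real tool the containment and equality checks are a graph isomorphism test over the storage shape graphs, but the upper-bound-versus-lower-bound logic is the same.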
We can basically look for a graph isomorphism, right, and equality between them. And if they're equal, then we know that, at least with respect to the properties that we can encode in our model, we have gotten the most precise semantics for that method possible. And so that's what this region number is: is there an isomorphism between the lower approximation and our static upper approximation. So basically all the time we get at least structure-wise as precise results as possible. The shape is -- under this structural isomorphism, we look at nodes that map to each other and whether we assign the correct shape property to each node. And you can see that in general we do a pretty good job. Not quite as good as the region, but we're pretty good. And the sharing is, equivalently, for pairs of edges that are mapped under this isomorphism between the upper-approximation and lower-approximation graphs, whether we get the interference/noninterference property correct. And in general you can see we do pretty well. The one outlier is DB, where we can't model the shell sort that's implemented on an array correctly. And so we get a set of potentially -- the upper approximation says the pointers might interfere when in fact they never do. And this array shows up quite frequently. So it drives the precision down. But in general we do a pretty good job. All right. So now is time for an exciting demo. And so we'll see sort of a little bit about this actually working -- yeah. >> So one comparison that I've seen which is really useful for like compiler guys, and DSA people do it, they compare alias results from their shape analysis with some classical, you know, alias analysis like [inaudible]. >> Mark Marron: Yeah. I mean, that's sort of -- you know, that sort of gets back to the issue of you're comparing an upper bound with another upper bound. So you say I've reduced it below somebody else's upper bound, right?
And the question is, well, if their upper bound was already as good as you can get and you can't reduce it any more, does that mean your analysis is bad? Or, you know, not necessarily. Or if you reduce their upper bound significantly, well, you still don't know how close you are to sort of optimal. Maybe your analysis isn't so good either; you're just better than something that was really bad before. So in our approach, we like it because it compares to the best you could obtain. Right? So since we're taking a specific run of the program, we know that you can actually get a result like this. I mean, I can -- you know, that makes sense and we've sort of done that thing in the past, but we happen to like this a little more. Just personal preference. So here is my favorite program, Barnes-Hut, again. And this is in Visual Studio. And so what we're going to do is I'm just going to fire up, run the analysis on this here, and I have it pop up the debug window so you can see something happen while the analysis is running. And what it's going to do is, I start it running here, it's just going to print out as it goes down each method it encounters. And so you'll see it sort of move to the right as it goes down the call graph and then back up as it gets toward the end and as it encounters various calls. And so basically the analysis is going to start at main, it's going to run until it sees a method, it's going to go in and analyze that method body and come back out and -- oh, it's done already. So I guess I needed to talk faster. Okay. But basically it started at main, it just analyzed each method as it saw it. It's not doing anything clever -- I mean, our concern here was really developing an analysis that fundamentally was scalable. So we don't do anything clever like preprocessing and saying, oh, this method is pure.
We can actually -- we don't have to analyze it directly, we can actually compute a summary for it beforehand, or doing abductive analysis on methods that don't take any heap parameters or anything. So I think there's definitely a lot of room for performance improvement by doing some good engineering. We were just focused on having something that was fundamentally scalable to begin with. But, anyway, it's completed analyzing the program. And so something that I thought was kind of interesting here, from sort of an extracting pre and post conditions and program understanding point of view, we have this method load tree. And this occurs when we're actually building the space decomposition tree for the given body. And so what we can do is we can say, okay, would you please include the contracts that you computed as likely contracts or pre and post conditions for this method. And so you can see that we've encoded this more or less in code contract language, and we can see that it's discovered a number of nice things. So this is not equal to null, which is always good and always true. Also requires P is not equal to null. The tree that's passed in is not null, so that's a little more useful. It has some other interesting stuff. For instance, it says that this is the unique owner of the object it points to, so nobody else -- either in this list of arguments, static variables or other call frames -- refers to this. So it's basically inferred some sort of ownership-type information. It's figured out that there's some path where you can go from this to get to the variable that's pointed to by P, so there's some reachability relation. And it's also determined that this, the object pointed -- or the entire section of the heap that's reachable from this, is not modified over the execution of this method, as a post condition. So you can see that it's able to extract some interesting information, and similar stuff for some of these other properties.
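Rendered as plain Java checks, the flavor of these inferred contracts looks roughly like this. A sketch only: loadTree and its classes are reconstructed from the talk, not from the benchmark source, and properties like unique ownership and not-mod have no direct runtime check, so they appear only as comments:

```java
import java.util.Objects;

// Sketch of the inferred contracts for loadTree, rendered as plain
// runtime checks. Class and parameter names are reconstructions.
public class Contracts {
    static class Tree {}

    static class Body {
        // Inferred preconditions: this != null (implicit in Java),
        // p != null, tree != null.
        // Inferred but not runtime-checkable here:
        //   - this uniquely owns the region it points to
        //   - p is reachable from this
        //   - Ensures: the heap reachable from this is not modified
        void loadTree(Body p, Tree tree) {
            Objects.requireNonNull(p, "p must be non-null");
            Objects.requireNonNull(tree, "tree must be non-null");
            // ... body elided ...
        }
    }

    public static void main(String[] args) {
        // Satisfying the contracts: no exception is thrown.
        new Body().loadTree(new Body(), new Tree());
        System.out.println("contracts hold");
    }
}
```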
Now, some of these, such as not-mod, don't actually have any sort of semantics that can be checked at runtime, but it's nonetheless probably useful information for some other tool that wants to perform a secondary analysis on this. Which is, I mean, sort of what we're looking for. We want to build a set of -- a heap analysis that can produce some useful information, but that can produce useful information that can be post-processed by other tools to do what they want to do. We didn't have any particular application in mind, we just wanted a robust base toolkit that others could use and build on. Now, the other thing that's sort of interesting here, and it takes a little looking, we have another output format that makes it a little more obvious. But I'll just point it out here. So we see this is nonnull. Okay, that's good, P is nonnull. Okay. The tree is nonnull. That's nice to know. We have this parameter back there, XPIC, right, and we have the information about it right here. Well, we know that it's the unique owner of whatever it points to. We know it's not modified, but you notice it hasn't inferred that it's not null. This parameter could actually be null. So okay. So we can investigate that a little more. And we see, all right, right here, right -- oh, now it's bigger. Okay. Right here we're actually passing it directly to this old subindex method without testing whether or not it's null. And if we go to that definition, we see, sure enough, it's the IC parameter here. Right here IC is promptly de-referenced. So either our analysis has made a mistake or there's some bug potentially lurking in this program. And it's pretty easy to go back. We can nicely in Visual Studio say show me all the places this is used, and I've cheated before, so I happen to know where this comes from. So this is the place that's of interest. There are a lot of places where this load tree is called, but this is the first place that load tree gets started with.
The other calls are recursive calls. And here's that parameter, that XQIC. And it's initialized here at this innit cord [phonetic]. And if we go to that definition, sure enough, okay, we can return null. And if we go ahead and extract the pre and post conditions for this, tools, contracts -- let's see. Oh. Sorry. I guess I have not done that just yet. Let me do it with the other one. Tools. Okay. So I guess I haven't finished all my contract work, but it says the return might be null, right? Question mark null. And we see that that's definitely the case. Here we return null. So, you know, it looks like there might be a bug here. It actually isn't in practice, because this null is only returned if this condition fails, and sort of the implicit precondition everywhere this is called is that this is actually not true. So this is this weird sort of software impedance mismatch where, when somebody wrote this, they were thinking one thing that implicitly depended on something else, and then that carried through in the future. And this was ported over from C, so people didn't use exceptions; they just returned flags, when really maybe this should have been an exception. If you were really concerned about checking this here, it should have been a precondition. And so I think this is an interesting example of how sort of mining some of these specifications can really be useful in understanding what's happening in the program, some of these implicit assumptions that got built into the design of the program but were never documented and are completely unintuitive. And, like I said, this bug never occurs because these conditions happen to hold everywhere, but it's one of those things that just, in my mind at least, is just waiting. When somebody makes a few changes and isn't quite aware of this, you're going to have a null pointer bug and you're going to start introducing some problems.
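The latent-null pattern being described boils down to something like the following sketch. The names (initCoords, subIndex, the scale parameter) are hypothetical illustrations of the shape of the code, not the actual JOlden source: an initialization routine returns null as a C-style error flag, and a downstream method dereferences the result unconditionally, relying on an undocumented precondition.

```java
// Sketch of the latent-null pattern: initCoords signals failure by
// returning null instead of throwing, and subIndex dereferences its
// argument without a check. Names are hypothetical illustrations.
class Coords {
    // Returns null as an error flag when the condition fails --
    // the C-style idiom carried over in the port.
    static int[] initCoords(double scale) {
        if (scale <= 0.0) {
            return null;  // callers implicitly assume this never happens
        }
        return new int[] { (int) scale, (int) scale };
    }

    // Promptly dereferences ic: a NullPointerException is waiting here
    // if the implicit precondition (scale > 0) is ever violated.
    static int subIndex(int[] ic) {
        return ic[0];
    }
}
```

Every call site happens to satisfy the implicit precondition today, so the bug never fires; mining the likely contract "result may be null" is what surfaces the mismatch before some later refactoring breaks the unstated assumption.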
And so it would be nice to sort of extract this information, help the programmer understand it, and do some more checking for it. All right. So that -- I think that's my demo there, unless anybody else had any questions on that. Okay. So I'm going to just wrap up pretty quickly, then. Okay. So, in summary, well, the model of the program heap, that was one of the key things we were working on. Hopefully I convinced you that it can capture a lot of the information that you would like to have if you're doing some sort of optimization, error detection, test case generation, user-directed refactoring, or whatever IDE operation. I think the experimental results show it's fairly amenable to static analysis. And hopefully in some of the examples I gave you got a feel for it -- you know, okay, maybe you didn't understand everything about it, but you can at least see how it would work and you'll believe me when I say it's a fairly intuitive representation for the heap. At least much more so than a lot of the logical formulas that you get. You know, it always takes me a while to parse separation logic formulas. I couldn't imagine average developers dealing with those very effectively. Even though they're much richer and can express much more sophisticated things than this, they just require a lot more effort to understand what they're saying. The static analysis, well, okay, we can analyze 15,000 lines of code in a few minutes, in 200 megabytes. That's not quite module size yet. We haven't analyzed anything larger just because our Java front end was kind of brittle, because we wrote it ourselves, and we wanted to start working on the .NET stuff rather than continuing to poke at that. And so we were able to do the static analysis efficiently, but also we were able to do it in a way that's fairly precise and actually gets a lot of information. We don't get efficiency by throwing away a lot of the good detail. The dynamic support, I sort of hinted at it.
We have some neat debugger stuff, some specification mining stuff. Other support is in progress, particularly with the test case generation. So, you know, I'd love to talk to people about this. I think it's kind of exciting. And, you know, really I sort of alluded to some of this future work in the previous slide and earlier, but the goal here is to take these core concepts that seem to work well and build a robust tool that other people can use and play around with. I think now it's to the point where, if you're interested in using it for something, you can run it on some nontrivial programs that we've already run it on. If you want to try and do lots of other large stuff, you're getting into exciting territory. But at least it's at the point where you could play around with a nontrivial program and see if it's producing results that would be useful to you in doing what you want to do. You can sort of evaluate it a little bit. So we really want to focus on finishing this implementation so that it is robust and you can run it on something that we haven't run it on and have better than 50/50 odds of getting anything. And then we're also trying to figure out how to export these results in a meaningful way. You know, we can export them to Visual Studio here with a little bit of the code contracts. We have sort of an API for querying the model. But it's not clear what's the best way to export this information in a uniform way that other tools can make use of, and whether we've gotten all the information that we'd like. I think we've covered the range of information that people would need pretty well. But we really need some more experience with actual applications to say that for sure. And so that sort of ties into this: we'd like to apply this to, you know, actual client applications.
We've sort of played around with some as we've been developing this model to make sure we're on the right track, but we haven't really implemented a lot of tools in a comprehensive way based on the results. So I think that wraps up what I had to say. So I, you know -- questions or comments, I'd be very interested in what everybody had to say. >> So do you have a visualization [inaudible]? >> Mark Marron: Yes. >> Because you didn't list it, but I sort of expected -- >> Mark Marron: Yeah. So here we go. So we go back to this body. So you can see load tree. We can actually show the full model. And so, well, okay, it's a little tough if you want to actually read the text in it. But so here it shows that, you know -- it shows the pre and post states of that method. So it will show you what's read and written. So you can see, as I mentioned, we said, you know, this is never modified -- this is never modified. This is actually arg -- well, arg zero. In the middle. Okay, yeah. Arg zero. So there. And then, you know, you can see at the post state, right, that this part of the heap has been read. Even though there's a lot of modification over here, what was reachable from arg zero, this body, has just been read. Here's the part that's -- the return value has been modified. You know, it says, oh, it's fresh, it actually has a tree shape. It, you know, has a reference back to -- you know, the return value itself is a fresh object, freshly allocated. And some of the other parts are fresh. But it also has a reference to some part of the heap that was passed in. So there's more information -- I mean, there's a fair bit of interesting information here, and it's not clear how you, you know, export that in a compact way to other tools or to the user, other than just saying here's the graph and you can sort of look around and see what you want to see. Yeah. >> So what [inaudible] it provides me [inaudible] analysis that already mentioned? >> Mark Marron: Yeah.
>> So how do you compare the two, then? >> Mark Marron: Well, I think -- okay. So there are two important distinctions with the DSA analysis. One, we're doing the analysis in a context-sensitive manner, right, whereas they do a local analysis and then bottom-up, top-down passes. So they're not doing context sensitivity as such. So that means that they're going to lose a lot of precision that way. The second thing that happens is they have a fixed partition of the heap that's computed by using a points-to analysis, right, for the local flow graphs, whereas we have a dynamic partitioning scheme that allows us to basically say, take a linked list and split it into two, and we'll realize that now there are two conceptually different linked lists, whereas with their fixed partitioning, there's no way to break that apart necessarily. So in general what you'll see is that you'll get much smaller graphs that seem to, at least in my opinion and from looking at it, end up merging a lot of parts of the heap that really are conceptually distinct in the program into one region. And so you just lose a lot of precision that way. >> So do you know what's the worst case [inaudible]? >> Mark Marron: For mine? >> Yeah. >> Mark Marron: It's exponential in several ways. Right? I mean, we're doing context sensitivity. And even though we have some nice heuristics, you could still come up with a pathological program that would do that. >> [inaudible] something like n-squared or something? >> Mark Marron: Theirs is something like n-squared, yes. But the other thing is, I think this is why we're interested in focusing on sort of a module level. Right? I mean, we can do 15,000 lines of code efficiently, we know. I believe it will continue to scale well. It seems to have pretty nice scaling. We have plenty of optimization stuff. I feel pretty comfortable saying, you know, 30-, 40-, 50,000 lines of code is not a problem. I mean, you can imagine modules bigger than that, I'm sure.
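The linked-list splitting mentioned in the answer above can be illustrated with a small sketch (my own example, not code from either analysis). After split() runs, the original list and the returned suffix reach disjoint sets of Node objects; a dynamic partitioning scheme can refine the single list region into two, while a fixed points-to partition keeps every Node in one region and never separates the lists.

```java
// Illustration of why dynamic heap partitioning matters: after split(),
// the original list and the returned suffix share no nodes, so an
// analysis that can refine regions may report two distinct lists.
class Node {
    int val;
    Node next;
    Node(int v, Node n) { val = v; next = n; }
}

class Lists {
    // Detach the list after the k-th node and return the suffix.
    static Node split(Node head, int k) {
        Node cur = head;
        for (int i = 1; i < k && cur != null; i++) {
            cur = cur.next;
        }
        if (cur == null) return null;
        Node rest = cur.next;
        cur.next = null;  // head and rest are now disjoint lists
        return rest;
    }
}
```

With a region per allocation site, every Node lands in the same region no matter how the program later separates the lists; splitting regions along with the program's pointer updates is what preserves the "two distinct lists" fact.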
I'm sure there are plenty here that are bigger than that. But, you know, 50,000 lines of code I feel confident that we can scale to pretty nicely. If you said I'd like to do completely context-sensitive analysis with the shape model on half a million lines of code, you know, I don't think that's doable. Whereas the DSA people have said, we want to just take the whole program and we want to analyze it all in one chunk. And so they're analyzing very, very large C++ programs. So, I mean, you know, n-squared or n-cubed is getting pretty rough for programs that big. But, you know, they really put a lot of time into making sure it scales well on really big code. >> Yeah. I guess it would be nice to see how much precision you gain, but it's hard to compare because they're in C and you're in Java and -- >> Mark Marron: Well, and so mainly what I've looked at is some of these like JOlden benchmarks, right, because the original suite was the Olden suite in C, and then the JOlden suite is in Java. And so in looking at those, you'll see something like the Barnes-Hut example. Basically they'll wad sort of all of this together. >> Okay. >> Mark Marron: And so you lose a lot of -- so you lose the ability to say we have a tree of body objects. You say I have a DAG of cell node and body objects. So, okay, for some applications you don't care, but I think that's a little too much loss of precision for a lot of the things that I'd like to do. >> Just a final quick question: So have you seen anywhere in this paper -- it's [inaudible] called is it the cycle [inaudible]. >> Mark Marron: Yes, yes, [inaudible] I like that paper a lot. That was sort of one of the papers that I read and I was like, hey, you know, if we had this nice sort of category of shapes -- but the problem is they apply it to the whole heap. And so if you have something bad down here, then it all dies.
And I said, hey, this is the neat thing about the graph, because if we apply it individually, you'll notice in the analysis, oh, man, we got this wrong, we said this is a cycle. We were overly conservative here. It should be a tree. But the fact that this region is a cycle, when we still have nice overall structure, we still know that these body objects are uniquely pointed to by some of these guys and whatnot. So that's sort of that error isolation that I really like a lot. >> Francesco Logozzo: Okay. Thank you. [applause]