>> Shaz Qadeer: Okay. Welcome, everybody. It's my pleasure to introduce Hongseok Yang to you. Hongseok is visiting us from Queen Mary University, where he's a professor. He has worked for many years on program semantics, program analysis, program verification, and logics that make it easy to specify and verify programs. He got his Ph.D. from UIUC in 1991, and then he spent five years in Korea doing a post-doc. Then he moved to Queen Mary University, and he has been there since. In May he's going to move to Oxford University. He'll be here today and tomorrow; if you're interested in meeting him, let me know. >> Hongseok Yang: Thank you very much. Actually, I got the degree in 2001, not 1991 -- a degree ten years ago. I was too young for -- okay. So what I'm going to talk about in the next 50 minutes, or less than 50 minutes, is what I have done with Oukseh Lee and Rasmus Petersen. Before starting the talk, I want to say a few words about the kind of research methodology that my friends and I employed in the past couple of years. The methodology was that, in some sense, we didn't focus on technology that applies to a really wide class of programs. Instead, we wanted to analyze only some small, restricted but interesting class of programs, with zero [inaudible]. What we believed was that if we focused on a certain class of programs and pushed the verification technology to its limit, so that it could actually verify some of the properties, not with zero [inaudible], that would help us come up with new ideas. So what I'm going to present now is, in a sense, the result that followed from that methodology. I will tell you the kind of programs we are interested in, the kind of properties we discovered for those programs, and the technology we developed to exploit those properties. Right. So the long-term goal of my research, the thing I want to do with my friends, is to develop an automatic verification tool that can verify a certain class of programs with zero [inaudible]. We are interested in real-world programs, especially real-world low-level code like operating systems -- some parts of the operating system Linux. We started this research about five years ago, my friends and I in London. Although it sounds a little odd for me to say we made progress, I think we are quite proud of what we have achieved, because five years ago what we dealt with was list reversal and so on, but now we have enough infrastructure and experience that we can analyze more realistic code. One of the examples, which I'm going to talk about during the talk, is the deadline IO scheduler from Linux. So I think we made interesting progress. But we are still quite far from achieving the real goal, which is developing technology that works for, for instance, Linux, and there are many challenges to overcome. One of the main challenges is highly shared data structures. That is what I'm going to talk about today. Another thing I want to tell you is that I'm a bit nervous today, and when I'm nervous my English gets screwed up, so when you have questions feel free to ask me. As for the purpose of this talk: I'm not giving a conference talk. Of course I'm selling my stuff a little bit, but mainly I hope you come to understand some of the things that I really enjoyed doing.
So if you have any questions, feel free to ask me. Okay. So, right, one of the challenges in developing an automatic verification tool is complicated data structures. This is one example of a complicated data structure, which you can find in the deadline IO scheduler. What the deadline IO scheduler does is: whenever you say, okay, I want to write one megabyte of a file to the disk, that request is chopped into a bunch of very small requests, and those small requests are first saved in a bunch of queues; from the queues the scheduler picks the next request to process, and that request is moved to the next queue. So the first queue is the one you see on the left-hand side -- these are the two queues in the deadline IO scheduler -- and when a request first arrives, it is stored in some area of this queue. It looks quite complicated, but it's just stored somewhere here. And when the IO scheduler thinks it's time to process a request, the request is moved from queue 1 to queue 2, and some time later it is really written to the disk. Now, one of the challenges in verifying such a program is that queue 2 looks quite okay -- it's a doubly linked list -- but queue 1 is obviously not so simple. It looks like there is some structure there, but it is a quite complicated structure. So the question is how we can develop verification technology that can verify a data structure like queue 1. When we first looked at it, we thought it was so complicated, but then we realized that queue 1 actually has a very nice structure: if you look at one set of fields, it's just a doubly linked list, and if you look at another set of fields, it's just a red-black tree. So it is two data structures implemented on top of one another, and they do that because it allows them to optimize certain operations, which I'm going to tell you about later. What I call an overlaid data structure is exactly that: multiple data structures implemented on top of one another. This kind of data structure is found frequently in systems code; when I looked at the source code of Linux, it was quite common, because the two data structures provide different kinds of indexing over the same data, so some operations can be done in a very optimized way. Right. But from a program analysis person's perspective, if I want to prove a nontrivial property about such a data structure and I really try to track all the correlations between the tree and the linked list, it becomes very difficult to come up with a scalable verification technology. So we looked at the source code, and we realized that although these multiple data structures are implemented on top of one another, they are very weakly correlated; the only correlation that matters is that these data structures talk about the exact same set of heap cells. So what we have done is develop verification technology that can exploit this weak correlation property. I will show you some operations in the deadline IO scheduler, and then show how the correlation property I just told you about -- the lists and trees talking about the same set of heap cells -- is maintained by the IO scheduler, and also how it can be exploited. I will start with one operation called insert. This insert is an operation that takes a request and puts it somewhere in the first queue, queue 1. What it does first is search the linked list and add the request to the doubly linked list; a rough C sketch of such an overlaid node follows below.
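As a minimal sketch, an overlaid node of this kind might be declared in C as follows. The field names here are illustrative assumptions, not taken from the talk or from the kernel sources:

```c
/* Minimal sketch of an overlaid node.  Field names are illustrative
 * assumptions, not the kernel's actual definitions: one struct
 * participates in a doubly linked list through one set of fields and
 * in a red-black tree through another, so the two data structures
 * are threaded through exactly the same heap cells. */
struct request {
    struct request *prev, *next;            /* list view: queue 1 or queue 2 */
    struct request *left, *right, *parent;  /* tree view: queue 1 only */
    long sector;                            /* key the tree is sorted by */
};
```

In the actual Linux sources the two views are typically embedded helper structs (a list head and an rb-node) inside the request structure, but the effect is the same: two different indexings over one set of nodes.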
So it is added like this. And after that, you search the tree and find the right spot to insert the node. It goes like this, and then the element is inserted. So at the end of the operation you can see that we still have our list and tree, overlaid one on top of the other, and the tree and the list talk about exactly the same set of heap cells. So that correlation property is maintained by this insert operation. The other, more interesting operation is removing a request. What it does is search for a request that satisfies a certain property, and move that request from the first queue to the second queue. This operation actually exploits the correlation property; let me explain how it works. First you find the request: you traverse the doubly linked list and find a node, like this one, and you eliminate the node from the doubly linked list. Okay, so delete this node from the doubly linked list, and then delete the node from the tree. The interesting thing about this deletion of a node from the tree is that the deletion happens, in a sense, in place, without traversing the entire tree. That's possible because once I find the request in the doubly linked list, I know that node is also in the tree, because of the correlation property I just told you about. So the way the operation works is that it follows the parent pointer, finds out which node is actually pointing to this request in the tree, and resets the appropriate child pointer -- in this case the right child -- to null. That effectively eliminates the node from the tree. So this is the place where the property that the tree and the list talk about the same set of heap cells is exploited: if that property did not hold, then following this parent pointer I might end up with a dangling pointer, and if I try to do anything with it, this is not going to be a memory-safe operation. Then it adds the element to the second queue, like this. So you can see that again, at the end of the day, we still maintain the correlation, just as I told you, and this tree deletion operation exploits the weak correlation property (a rough C sketch of the move operation appears below). So when my friends and I looked at this program, we developed a verification algorithm that can exploit this correlation property, so that the analyzer can prove the deadline IO scheduler correct, and sufficiently efficiently. When we first started to work on this application, a long time ago in 2009, my friend Rasmus Petersen and I thought this was an interesting program and we wanted to develop a verifier for such programs. But we didn't fully understand what was going on in the source code, so the tool we developed had both precision and performance problems: we didn't track enough correlation, but at the same time we tracked unnecessary correlation. We just ended up with something that didn't really work. Then we realized that we actually had to throw away lots of the correlations. So we had another version, which threw away lots of correlation. It worked on almost all the routines in the deadline IO scheduler, except one -- and that's the place where the weak correlation I just told you about is exploited. So that tool didn't prove the entire deadline IO scheduler. Yes? >>: So memory safety means that every pointer that's accessed is allocated and not -- >> Hongseok Yang: Memory safety means no dangling-pointer dereference and no null-pointer dereference. Also no memory leaks.
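To make the move operation concrete, here is a rough C sketch, reusing the hypothetical struct request from the earlier sketch. It makes simplifying assumptions stated in the comments; in particular, a real red-black tree deletion would also rebalance:

```c
/* Rough C sketch of the move operation described above, reusing the
 * hypothetical struct request from the earlier sketch.  Simplifying
 * assumptions: both queues are cyclic lists with header nodes, rq is
 * a leaf and not the tree root, and no rebalancing is shown. */
static void move_request(struct request *rq, struct request *q2)
{
    /* 1. Unlink rq from the queue-1 doubly linked list. */
    rq->prev->next = rq->next;
    rq->next->prev = rq->prev;

    /* 2. Unlink rq from the tree in place, via its parent pointer.
     * Following rq->parent is memory safe only because the list and
     * the tree cover exactly the same heap cells, so a node found in
     * the list is guaranteed also to be in the tree. */
    if (rq->parent->left == rq)
        rq->parent->left = NULL;   /* reset the proper child to null */
    else
        rq->parent->right = NULL;

    /* 3. Splice rq into queue 2, right after its header node. */
    rq->next = q2->next;
    rq->prev = q2;
    q2->next->prev = rq;
    q2->next = rq;
}
```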
But then what we really prove is a shape property: at each program point, what the expected data structure invariants are, and the verifier shows that these expected data structure invariants hold. I will show you a demo and you will see some pictures. >>: Absence of memory -- more than safety -- >> Hongseok Yang: Well. >>: Okay. >> Hongseok Yang: Yes. So then we gave up. And then my friend Oukseh Lee joined this year, because he had a sabbatical, and we approached the problem again. And we were able to prove the entire deadline IO scheduler. Modulo -- we do some modeling, so there are hacks in what we have done; without that cheating we couldn't have done it, because it was quite a hard problem. But anyhow, that's what we've done, and I will show you the demo -- >>: What's your -- how do you model the operational semantics of the program? >> Hongseok Yang: The operational -- you mean the operational semantics of C? >>: Yeah. >> Hongseok Yang: You can say our operational semantics is encoded inside the implementation. As for the memory model, we used the memory model in CIL. So that's, in a sense, not as precise as the real C semantics; what we're aiming for is conditional correctness: assuming this memory model doesn't generate any new issues, then we have correctness. >>: Are you going to talk about what the memory model is? >> Hongseok Yang: Not really. >>: Okay. How big is the IO scheduler? >> Hongseok Yang: I will show you now. So this is the IO scheduler -- it has a bunch of inlined code and some creation routines I wrote. It's reasonably big: about 4,000 lines of code. Of course, there are lots of comments and so on, but I think it's realistic. And this is my main routine. What it does is create the data structure -- the creation can fail, and then it returns -- and otherwise there's a really big nondeterministic while loop that keeps exercising the deadline IO scheduler. The deadline IO scheduler is like a library: the main creates the typical data structure, and the big while loop keeps exercising all the library routines. I commented out the main loop because it takes about 400 seconds, so I will just show you the analysis result after the creation, so you can see what the data structure looks like. So -- >>: Do you inline procedures? >> Hongseok Yang: No, we implemented RHS, yes. So now the analysis is finished, and it generates a bunch of pictures. Okay. These are all the global variables used inside the deadline IO scheduler. This picture shows how our analyzer actually works: our analysis tool tries to analyze the list part and the tree part as separately as possible, and this picture only shows what we get from the list part of the analysis. So this one says there's a doubly linked list with some header -- these are just artifacts of the drawing tool -- and there's another doubly linked list. Remember, in the deadline IO scheduler there are two doubly linked lists, one for queue 1 and the second for queue 2; these mirror the two data structures you saw in the slides. Now I'll show another one. This one is about the tree. Again, there are lots of global variables.
And the important thing is here: there's a tree, which is for the first data structure, and there is a true predicate, which essentially means that as far as the second data structure, queue 2, is concerned, we don't maintain any tree structure. So this true predicate covers queue 2 from the tree's perspective. This picture shows how our analysis tool actually works: it tries to analyze the list part and the tree part as separately as possible. The idea is that because these two data structures are weakly correlated, that really encourages us to have two very weakly correlated analyses. But they have to talk to each other a little bit, so the focus of the rest of the talk is how that communication actually happens. Before that, I'll just start the tool -- it will take about 400 seconds -- and then continue my talk; hopefully it finishes by the end of the talk. So the design principle: the idea is to analyze the tree and the list as separately as possible, and we think that by doing so we gain in performance. But this separation cannot be done completely, so the two analyses have to talk to one another a little bit. In terms of techniques, we found a quite interesting one: if we use ghost variables, and during the analysis we discover instructions on the ghost variables on the fly, that is one good way to make the two tools talk to each other. Traditionally, if you're familiar with program analysis, people used the so-called reduced product, or some other mechanism that is statically determined. But what we found is that using ghost variables is another good way to make the communication happen in a very cheap, effective manner. So now I'll explain what's going on, and to do that you need to understand a little bit about separation logic -- but it's okay, it's not very hard; you can understand what the separation logic formulas mean in terms of pictures. The first formula says X points to prev Y, next Z -- you see it at the top here. It says I have a cell X, whose prev field is Y and whose next field is Z, and it might contain some other fields; we don't really care about the other fields. For instance, in this case it has a field F, and we don't say what the value of F is. The second formula is a nonempty doubly-linked-list segment. It says I have a doubly linked list starting with Y and ending with Z, the prev value of Y is equal to A, like you see here, and the next value of Z is equal to B, just like the B you see at the top. The NE means it is the nonempty doubly-linked-list predicate. We also introduce a tree segment predicate, because the deadline IO scheduler actually uses trees. This predicate says I have a tree rooted at X, but it's a segment, so there is an edge going out from the tree to the outside. This Y names the cell in the tree from which that pointer leaves the tree, and V is the target of the pointer: so this is a tree rooted at X, but with a hole called Y, and from Y there's a pointer to the outside, which is V. And A is the parent value of the root X, yes. >>: And you only have a one-hole version of this tree predicate; you don't have versions with extra holes B and C and D? >> Hongseok Yang: No, that's right, we don't have trees with multiple holes. >>: Because the programs only violate the invariants at one point at a time.
>> Hongseok Yang: That's right. So for programs like, say, breadth-first search, if I want to verify those, this kind of predicate is not good enough. What happens is that the analyzer will diverge, or have to be stopped by me, because this predicate is not good enough to summarize the regularity that goes on in a breadth-first traversal. So this is very much targeted at certain tree operations where one hole is good enough. >>: But there is an obvious generalization, which is to a number of holes. >> Hongseok Yang: Yes, that's right. But if we do that -- for manual verification it's not really a problem, but for automatic verification it means I have to design an abstract domain for strings, because the number of holes can grow without bound. That's one direction we haven't pursued; there is some other work that would need to be done. So now the interesting part of separation logic is the separating conjunction -- perhaps you have heard about separation logic so many times that you already understand it. The separating conjunction says I can split the heap into two parts: one part satisfies the left-hand side of the formula, and the other part satisfies the right-hand side. The picture you see at the bottom actually satisfies the formula here, because I can split the heap here: X satisfies the formula X points to 0, Y, because its prev value is 0 and its next value is Y, and the other side satisfies this list segment predicate, because I have a doubly linked list segment from Y to Z that is not empty, and the prev and next fields are set correctly. Separation logic is just normal logic, so it has existential and universal quantification as well as conjunction and disjunction. Now, one idea we had was to use only a special kind of separation logic formula, one that is just good enough to capture the weak correlation property I told you about in the deadline IO scheduler. Don't worry about the gamma, beta, alpha at the moment. This represents the typical kind of formula used during the analysis: the formula is a conjunction at the top, with one conjunct and a second conjunct. The first part talks about tree-related properties only, and the second part talks about list-related properties. Now, if I do just that, we cannot express the correlation between the tree and list data structures, and that correlation is necessary to prove the memory safety of the program. >>: What is the correlation fact you want to express? >> Hongseok Yang: That the tree and the first queue -- the tree and the lists -- are talking about exactly the same set of heap cells. So what we did was augment, or extend, the separation logic formulas with, you can say, second-order variables. A second-order variable like gamma, beta, or alpha represents a set of locations. Using these, we can say that the first pair talk about exactly the same set of locations, the second pair talk about exactly the same set of locations, and the third the same. The formula also has primed variables, which are implicitly existentially quantified: I'm not going to write the existential quantification, but if you see a primed variable like E prime, just understand it as an existentially quantified variable. But notice that alpha, beta, gamma are not existentially quantified; they are ghost variables in the program, which will be manipulated by ghost instructions in the program. >>: Why are they appearing in the subscripts and so on? >> Hongseok Yang: This is just syntax.
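To make that syntax concrete, the annotated formulas being described might be rendered as follows. This is a hedged reconstruction of their general shape, with hypothetical predicate names, not the tool's exact notation; q1 and q2 stand for the two queue headers:

```latex
% Reconstruction of the general shape of the annotated formulas.
% Each spatial predicate carries a second-order subscript naming the
% exact set of heap cells it covers; * is separating conjunction.
\[
  \underbrace{\bigl(\mathrm{tree}_{\beta}(q_1) \,*\, \mathrm{true}_{\gamma}\bigr)}_{\text{tree view}}
  \;\wedge\;
  \underbrace{\bigl(\mathrm{dll}_{\beta}(q_1) \,*\, \mathrm{dll}_{\gamma}(q_2)\bigr)}_{\text{list view}}
\]
```

Because the subscript beta occurs in both conjuncts, the tree over queue 1 and the list over queue 1 are forced to cover exactly the same heap cells, while nothing constrains how the two orderings relate -- which is precisely the weak correlation.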
So we extended the separation logic formulas so that every predicate comes with one of these alpha, beta, gamma subscripts. And we don't put any extra constraints on alpha, beta, gamma; the subscripts are their only use. >>: So if you took them away, would it still work? >> Hongseok Yang: If we took them away and ran the analysis, the analysis wouldn't keep track of enough information, so we could not prove memory safety for the particular routine, the move routine. >>: So alpha, beta, gamma are program variables, right? You are going to add some ghosts and you're going to update them as you go along in the program, right? >> Hongseok Yang: That's right. That's right. >>: And this formula is referring to those variables also? >> Hongseok Yang: That's right. That's right. Ghost instructions that manipulate alpha, beta, gamma are going to be inserted by the analyzer; that's the way to make the tree -- >>: Can I think of beta as sort of representing a set of objects? >> Hongseok Yang: Yes, a set of addresses. >>: One set of objects. >> Hongseok Yang: A set of addresses. >>: And you keep adding and moving stuff between them. >> Hongseok Yang: Yes. So this set of addresses means that if you collect all the heap cells involved in this tree and make a set, it is exactly the set beta. >>: Sounds like you have three heaps: alpha, beta, gamma. >> Hongseok Yang: Yes, that's right. Now I'll show you an example and it will become clearer. The first part talks only about the tree. So you see there is a tree for queue 1: if I erase all the list links, what you see on the right-hand side is just a bunch of cells, summarized by the true predicate. The second part is about the list; it describes what the heap looks like if you focus on the list part only. Now, using alpha, beta, gamma, we can talk about the correlation between these two conjuncts. >>: Is there an intuition here, given the data structure, for where the alpha and beta are coming from? >> Hongseok Yang: Alpha, beta, gamma -- you're right, we designed it so that alpha, beta, gamma capture groups of structures: alpha captures all the cells involved in one object, essentially one high-level object like the trees and lists. During the analysis this predicate can be decomposed into multiple components, but if nothing really changes, we still want to group all the cells under the same second-order variable, like beta. So it provides one more level of hierarchy, which says which heap cells should be considered as a single unit. So beta is for the tree-and-list structure, and gamma is for the linked list you see on the right-hand side. As I said, this expresses a loose correlation, because any ordering of the doubly linked list satisfies the formula. What this formula really captures is that the list and the tree are talking about exactly the same set of heap cells. On the other hand, if I add a node to the tree without updating the link fields, it doesn't satisfy the formula, because the predicate says this tree and this list should talk about exactly the same heap cells. >>: I want to follow up on the question about representing different subheaps. The formula you've got written there has the separating conjunction on the inside and the regular conjunction on the outside.
>> Hongseok Yang: Yes, yes. >>: If you did it the other way around -- >> Hongseok Yang: In some sense what you're saying is exactly right. The original motivation was: suppose we nest separating conjunction and normal conjunction in an arbitrary manner; then many of these properties can be expressed. So exactly as you said, put the conjunction inside and use separating conjunction outside -- that could be used to express it. But then the problem becomes how to design the rest of the analysis: the transfer functions, the abstraction mechanisms, the join operations. If you allow arbitrarily complicated formulas, we gain something in expressiveness, but on the other hand we pay a price in designing the right abstract transfer functions. So this particular representation is designed so that many of the existing transfer functions can be reused without changing very much. Okay. So, right, the abstract domain is really a disjunction of these guys: we have a disjunction of tree predicates, a conjunction outside, and a disjunction of the list predicates. In a sense what we are doing is a kind of product of two abstract domains, two powerset abstract domains. The first powerset talks about tree-related properties using disjunctions, the second powerset talks about list-related properties using other disjunctions, and the two powersets talk to each other via the ghost variables beta, gamma, and alpha. Okay. So that is the basic form of the analysis; we set up our abstract domain in this way. Now the main question becomes: if you just run the list and tree analyses completely separately, with nothing said between them, that's not good enough. We have to make these two components talk to each other without paying too much of a price. I will show how that works by going through this particular routine, the move routine. Ideally, we would like to run the list analysis in one go and the tree analysis in another go. But the problem is that when you do the tree analysis, at this point we have no information about where RQ is, so we cannot prove memory safety. So what we do is have our analyzer go through two phases. The first phase runs something like a preanalysis -- you can understand it as a meta analysis -- which tries to figure out what is going to happen when I run the tree analyzer and the list analyzer separately. The meta analysis then says that the tree analyzer and the list analyzer have to talk to each other at certain points, and it inserts instructions that say: at this point these two have to communicate; at this point we have to maintain this [inaudible] in a certain way. Then, after the insertion of these ghost instructions by the meta analysis, you run the main analysis. The main analysis can also add other instructions: in this case, if I run the analysis with the predicates I showed you on the previous slides, the analyzer also inserts an instruction of this form, which says we have to unify certain ghost variables. We try to push these ghost instructions into the preanalysis as much as possible, because that makes our analyzer more predictable, but there are some cases where it's not possible. This unification happens during the main analysis because the information about where to insert the unification is only available during the real analysis. >>: Is that a union or a unification?
>> Hongseok Yang: Semantically, take the set gamma and the set delta, take their union, put it into gamma, and set delta to the empty set. It's a set union, and delta is set to the empty set. One way to implement that is by unification. These are the instructions that are added, and they are the ones that make the two analyzers talk to each other. >>: So how many [inaudible]? >> Hongseok Yang: I will tell you later. So suppose they have been added -- the last one, yes, I will show you how it works; the other two I will tell you about later. Suppose it works. Then let's see how the analyzer makes the list and tree analyses talk to one another. So this is the same move routine. It first finds some request in queue 1, like this, and then it deletes the request from the doubly linked list. Now I will show you what the analyzer produces after the first two instructions. The analyzer will replace the second star-conjunct in the list part by what you see at the bottom. What it says is: I have a list, queue 1, and then this guy says I have a doubly linked segment from C to C prime, with prev and next all pointing to queue 1. So this conjunct describes the doubly linked list you see, consisting of the three nodes, and the points-to predicate on the right-hand side expresses the request node that has been taken out of the doubly linked list. Now, the next instruction is move-information. What it says is: move information from the second part of the analysis, the list part, to the first part, the tree part -- and in particular we only move the information with respect to this RQ. So what it does is: the analyzer first figures out that RQ is in the region beta, because it's annotated with beta, which means the set beta contains RQ, and it moves that information to the tree part. It says, okay, RQ should be somewhere in the beta partition. Then we can replace this tree predicate by -- well, instantiate the tree predicate so that the RQ cell is explicit inside the predicate. I didn't really show everything, but it says: I have a tree, with the RQ node explicit, and there is something that comes before the RQ node and something that comes below the RQ node. If you look: there is a tree, RQ is somewhere in it, there's something before the RQ and something after the RQ, and this predicate describes exactly that fact. >>: So is beta an upper bound on a set of elements? >> Hongseok Yang: Very good point. When you split with the star, the notion of star should also split the set beta: just as the star says I can split the heap into two parts, the beta here and the beta here talk about two disjoint subsets of locations. So it's a bit more sophisticated, but I didn't quite explain everything. So now, if I delete from the tree, this node is gone from the tree, and that operation can be run symbolically with respect to the description of the tree that we have. So that will replace -- >>: What you just explained isn't the standard semantics of separation logic, right? >> Hongseok Yang: That's right. It's different.
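One way to write down the modified star, reconstructed from the explanation above (a sketch of the idea, not the paper's official definition): a state now carries the current values of the set variables, and star splits both the heap and the sets disjointly.

```latex
% Sketch of the modified separating conjunction: the state carries the
% heap h together with the current value of a set variable such as
% beta, and star splits both components disjointly (a reconstruction,
% not the official definition).
\[
  (h,\beta) \models P_1 * P_2 \;\iff\;
  \exists\, h_1, h_2, \beta_1, \beta_2.\;
  h = h_1 \uplus h_2,\;\; \beta = \beta_1 \uplus \beta_2,\;\;
  (h_1,\beta_1) \models P_1,\;\; (h_2,\beta_2) \models P_2
\]
```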
My view of the semantics of separation logic is that, essentially, whenever you have a partial commutative monoid, you can always define the notion of star based on that partial commutative monoid. What I'm exploiting here is that I just extended the partial commutative monoid with something extra, namely the values of beta, alpha, and gamma. Once I have this PCM structure, I can always extend the logical formulas with those predicates. So in some sense -- >>: What you're saying is that there's no one separation logic? >> Hongseok Yang: Exactly, yes. Personally, my take on separation logic is that it's not a single fixed logic; it's more a semantic principle that can be exposed in a certain way. Whenever there's a notion of separation, expressed by a partial commutative monoid structure, I can always come up with a separation logic very easily. >>: So the syntax to express formulas is not a big deal; the way you derive all your intuition is based on the models? >> Hongseok Yang: That's right. Exactly. The syntax is kind of semi-automatic: once you have the model, there is a good recipe to produce the syntax. >>: Then how do you reason about these formulas? >> Hongseok Yang: These are all instances of the logic of bunched implications, so the proof rules carry over; I can reuse [inaudible]. >>: So whenever the semantic models are based on these partial commutative monoids, the proof rules of this logic of bunched implications will do the job? >> Hongseok Yang: Yes. It's not complete -- it gives just the basic implications; for specific cases we have to come up with the other implication facts. But my experience is that because these formulas are so simple and the variety is so limited, the kinds of implications we have to consider are not very many. For the particular implementation we are working on right now, we don't do anything special for this beta: most of the existing theorem-proving techniques can be carried over into these extensions, and we don't do anything special for alpha and beta. Okay. So let's move on. The next one is what we call transfer. What it does is transfer this RQ from partition beta to partition delta. The reason it does this -- if you look at what's going to happen in the future, you can see why -- is that we want to move this RQ from the tree partition to the list, so we do an ownership transfer of the cell RQ from the tree to the list. This intermediate step -- assigning RQ to delta, moving RQ from beta to delta -- serves as a kind of preparation step for the ownership transfer. So initially we have beta, then we split beta like this, and we name this RQ part delta. Now, at the end, we have the list insert operation. The list insert operation works like this: it adds the node RQ to the second linked list, queue 2. Symbolically, after the list insert operation we have this representation, which says I have a doubly linked list starting from queue 2, going to E prime, then pointing to RQ, and from RQ pointing back to queue 2 -- so I have a cyclic doubly linked list. It's a symbolic way to represent what happens in the actual operation. Now, when the analyzer sees a formula of this form, it figures out that it can abstract it: this is essentially a cyclic linked list.
I can express this list with a single list predicate. But then we realize that we cannot do it directly, because the partitions delta and gamma are different. Or, of course, you can do it, but if I do, I have to make sure that the tree and list analyzers have a consistent view of gamma and delta. So what the analyzer does first is unify the delta region and the gamma region, like this. At the end of this unification -- this merge -- we have one region, called gamma, which includes both the request and this list segment predicate. There we can apply the abstraction, and the abstraction summarizes the whole thing into a single list predicate. The effect of this instruction will also be applied to the tree part: the subscript changes from delta to gamma, and it's going to be merged with the true predicate. So this is the result we get at the end of the analysis of this routine. It says that we have a queue 1 which is a tree, we also have queue 1 and queue 2 as lists, and the tree and the lists are talking about exactly the same set of heap cells. >>: So I have a high-level, meta-level question about your approach. >> Hongseok Yang: Yes. >>: When you set out trying to do this verification, you spent some time designing the logic itself, right? >> Hongseok Yang: Yes. >>: Then you spent some more time designing the abstract interpretation business, right -- the transfer functions and, what is it, join and widen and all that stuff, right? >> Hongseok Yang: Yes. Yes. >>: Now, how much time would you say you spent on the first part, and how much on the second part? >> Hongseok Yang: Actually, the real story is that for this particular project we think about the problem and then -- after having some agreement about -- I don't know, sorry. The real story is that we spent most of the time on the implementation. Once we fixed the abstract domain, like the one I showed in the first few slides, then for the other operations we could think about how they should be implemented. So the story is: we said, okay, we need this many operations, and we just implemented them first; the theory came later. I know this is not a good answer, but I cannot really measure it -- designing the logic itself was really easy, so that part was not a big deal. It's more the design of the algorithms for abstract interpretation. >>: What I don't understand is: when you designed the logic -- this program is, what, maybe a few thousand lines, right? >> Hongseok Yang: Yes. >>: Not a super large program. You could have then decided to write all the loop invariants, if you have a theorem prover, an incomplete theorem prover, for these kinds of theorems based on the proof rules [inaudible]. >> Hongseok Yang: I actually did hand proofs before, during my Ph.D. If you ask me whether I would rather do a hand proof or design such an analyzer, I prefer the second option, because hand proof is still extremely painful. >>: But it's not a hand proof, right? Because you only supply the loop invariants; everything else would be taken care of by the prover, right? >> Hongseok Yang: No, there are many of the procedures -- >>: Huh? >>: During the Ph.D.
the prover did not exist. >> Hongseok Yang: It's a bit more complicated than that, because now I have to give pre- and postconditions for each procedure. And giving pre- and postconditions for each procedure requires [inaudible], [inaudible], which means you actually have to add logical variables. If you don't add logical variables, you cannot relate certain things in the pre- and postconditions. And identifying how many logical variables I should add is much more complicated. So personally, if somebody, say my boss, asked me to do a manual proof with the infrastructure I have, versus giving me the other option -- okay, develop a verifier -- I think developing a verifier would be a lot easier for me. >>: I still don't understand. No matter what you do, at the end of the day, even if you use program analysis and abstract interpretation, it's computing, at the end of the day, an inductive proof, right? >> Hongseok Yang: Uh-huh. >>: So how do you check -- there must be some step that checks that the proof is valid, right? >> Hongseok Yang: Yes, yes. >>: Who is doing that checking? >> Hongseok Yang: I mean, the abstract interpreter uses sound implications, right? Whenever you do the over-approximations -- suppose all the steps of the abstract interpreter are correct; then it always produces an over-approximation at each step, and you only need to check syntactically whether the fixed point has been reached or not. The abstract interpreter, in a sense, puts much less burden on the theorem prover, because much of the hard reasoning is built into the development of the transfer functions, and the check that a fixed point has been reached can be reduced to syntactic checking. Okay. So personally, by developing these tools, I found that interacting in this way -- instead of providing the actual loop invariants, providing some of these hints that help a little bit -- is much more enjoyable and much easier. That's just my personal opinion. And another thing is that there are some other classes of programs which I believe follow the same patterns, so the technology could be applied there. All right. So now I sum up the abstract operators. The abstract domain is, you can say, a power domain of tree and list formulas. And these domains have three operations: moving information from the list to the tree (or from the tree to the list) with respect to a cell X; transferring the partition of cell X to a partition alpha; and a union operation, or you can say a unification operation. These are the three operations used to make the analyses talk to one another; they are sketched below. Now, I just want to point out one thing. The operation you see at the bottom is mainly discovered based on where we should apply abstraction. Abstraction means summarizing a data structure into, say, a single list segment predicate; that was the driving force for inferring where we should put the third instruction. But for the other two, we use a preanalysis to figure out where we should insert them, and I will show you how that works. To figure out the other two, we in essence do a meta analysis: an analysis, in the style of data-flow analysis, that tries to figure out what is going to happen if I run the tree and list analyses.
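As a hedged reconstruction (names and exact semantics are illustrative, pieced together from the talk, not the tool's actual syntax), the three ghost operations might be written as set-variable assignments:

```latex
% Reconstruction of the three ghost operations (illustrative names):
\[
\begin{array}{ll}
  \mathrm{move\_info}(x,\ \mathrm{list}\to\mathrm{tree}): &
    \text{copy the fact } x \in \beta \text{ from the list conjunct to the tree conjunct}\\[2pt]
  \mathrm{transfer}(x,\ \beta\to\delta): &
    \beta := \beta \setminus \{x\}, \qquad \delta := \delta \cup \{x\}\\[2pt]
  \mathrm{unify}(\gamma,\delta): &
    \gamma := \gamma \cup \delta, \qquad \delta := \emptyset
\end{array}
\]
```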
For the transfer instructions, what we do is this: the meta analysis figures out that, if we run the tree analysis and the list analysis, RQ will be expressed not inside an inductive definition but explicitly as a points-to predicate. That's where we insert the RQ transfer instructions: wherever we see "RQ points to ..." in the instructions, we temporarily put it into the region delta; later it might go back to the original partition or move to the next partition. That is what's done by the analyzer. The second question is where we should insert the move-information instruction. To find that out, we do a forward analysis and a backward analysis. The forward analysis figures out whether the tree part or the list part knows about RQ: at this program point, it's likely that the list analyzer knows where RQ is actually located, but the analysis cannot conclude whether the tree part knows about RQ or not. On the other hand, the backward analysis says that at this point, in order to proceed, the tree part actually has to know about this RQ. So we do the forward analysis to figure out which part of the analysis knows about the cell RQ, and the backward analysis to figure out which part needs information about RQ, and that's the point where we insert the move-information instruction. For the deadline IO scheduler, this is formulated using a data-flow, fixed-point meta analysis, and it inserts four or five move-information instructions into the program. Okay. So this is the end of my talk, and -- yes? >>: I had one more question, a meta-level question. When you designed the abstract domain for a program, in that process do you feel like you got some more insight into why the program is correct? >> Hongseok Yang: Yes, that's actually true, yes. For this deadline IO scheduler, I really didn't know about the weak correlation before. So, yes. >>: I see. But I thought the weak correlation insight came when you were trying to design the logic, not when you were trying to design the abstract domain and the transfer functions and so on. >> Hongseok Yang: I don't really make that distinction, because for me designing the logic is so easy. Designing the logic really means coming up with [inaudible], and identifying which partial commutative monoids are necessary is closely tied to how the analyzer is going to work. All the rest of the syntax, because I'm so used to it, just follows automatically. So the first take-home message is that there are many of these overlaid data structures, and I expect that maybe not all of them, but quite a few of them, have only this weak correlation. So when you are faced with exactly the same problem, think about whether the structure has only this weak correlation or not; and if you are happy using separation logic, try to use these conjunctions, star, and second-order variables. The second take-home message, which I quite like: suppose your boss asks you to combine two very expensive analyses -- disjunctive analyses are very expensive -- and to put two disjunctive analyses together. That's actually what we are doing: one disjunctive analysis for the tree, another disjunctive analysis for the list. If we implement the so-called reduction operator by expanding all these conjuncts -- five disjuncts here, ten disjuncts there -- you can generate 50 cases.
Right? If you take those 50 cases and do operations on them, it's very likely to perform very, very poorly, because it grows -- not quite exponentially, but I found it quite challenging. So the other challenge is to keep this case splitting implicit. One way to do it is to use ghost variables: add ghost variables, and make the analysis insert these ghost instructions so that the two analyzers have a consistent view of the ghost variables. So that's the end of my talk. I will just go back to the tool and -- it says it's verified. Nothing fancy, but the tool was able to verify that the code is okay. Okay. Thank you very much. [applause]. >>: So can you rephrase the use of ghost variables and the transfer functions as a reduced product? >> Hongseok Yang: Yes, but usually the reduced product is explained in terms of a fixed state space -- once I fix a set of variables, that's usually the way to explain the reduced product. But if I exaggerate a little, it's like: once you have existential quantification, you can exploit the existential quantification to say a lot more, if you're allowed to extend the state space -- >>: So you have -- so there's a good sub [inaudible] >> Hongseok Yang: Yes. >>: Basically [inaudible] >> Hongseok Yang: Yes. >>: But the presentation implied that you insert functions into the transfer functions by [inaudible] space. >> Hongseok Yang: Right. But then also on the fly, if it's necessary, we can insert the ghost instructions. You're right; the principle is that I want to put analyses together. One possibility is to extend the state space by adding ghost variables, then during the analysis figure out how to update these ghost variables, and the inserted ghost assignments play the role of communication between the two analyzers. >>: Does it help [inaudible] or does it somehow -- you look close at the two [inaudible] >> Hongseok Yang: So the preanalysis mostly works in terms of reachability. It's a heuristic. For a tree, you say: okay, if I have some root X, I want to know all the things that are reachable from X via these fields of the tree. The whole analyzer, as I described, is parameterized by sets of fields: the initial configuration says these fields should be considered as one conjunct, and this other set of fields will be tracked by another conjunct. And based on those sets of fields, the preanalysis uses a kind of reachability reasoning to figure out which conjunct probably knows what's going on where -- so it ties things to the trees. >>: This is kind of -- I guess maybe anybody might feel free to answer this, because I have little experience in that area. But if you had posed this problem to me, my first reaction would actually have been to do dynamic analysis, where I start exploring -- I just start creating models of the possible data structures. It seems like there is a small state space to search; I can look for new shapes, but -- >>: The answer to your question here is that the approach is nonnegotiable. He has decided what the approach is. You can't change it. >>: Yeah, that's fine, I understand the approach is fixed, but I'm kind of curious. >> Hongseok Yang: One possibility -- I think dynamic analysis can play a big role in shape analysis as well. For instance, in figuring out what data structures are actually used.
And then suppose I have -- for this kind of code I don't have a good dynamic profiler, but suppose I have one; then I can gather some of the heap shapes and see that certain abstractions may be applicable here, and certain abstractions are not -- because if applying the abstraction gives me back exactly the same heap, then it is useless. Actually, [inaudible] did similar case studies for Java. What they did is, for certain kinds of client applications, they collected heaps from dynamic analysis, applied the abstractions, and checked whether a certain abstraction was good for them or not. So I'm sure there's lots of room for dynamic analysis; I just don't really know what to do at the moment. >>: What I was looking for was: this is a 4,000-line program. I was hoping you'd say something along the lines of, okay, if we start doing this on a Windows kernel, which is not 4,000 lines, now this dynamic analysis will get lost -- static analysis versus dynamic analysis, that sort of thing. >> Hongseok Yang: Another possibility: I prepared a main routine, so I can run this entire code and collect some data. It's not fully dynamic -- I've already prepared the main, because this is a library. So I run the routines and gather some features about, okay, what kind of correlation is important and so on; then that could be used. But that's a direction I haven't explored. >>: So this one takes -- [inaudible] >> Hongseok Yang: Yes. >>: So the invariants that the abstract interpreter is computing are very, very large? >> Hongseok Yang: Yes, in some cases the preconditions, yes -- up to 3,000 preconditions. Those are the rotate routines, and their preconditions number 3,000 -- that's of course because the analyzer makes some rough guesses. I can actually show some statistics here. >>: This is pretty amazing. >>: So do you prove that just the driver of this particular [inaudible] is safe, or is this packaged up [inaudible]? >> Hongseok Yang: Yes -- I'm sorry, what did you ask? >>: At the end of your C program you uncommented this while loop [inaudible] verified that the while loop -- >> Hongseok Yang: From the while loop we can reach everywhere -- I'm hoping it reaches almost everywhere in the source code, because this is a library: it exports some methods, and the loop just exercises all these methods. And -- >>: [inaudible] >>: Well, it's a property that's maintained at every step of the code. >> Hongseok Yang: At each program point the invariant is not necessarily what you see on the slides; it could be anything. You see, some routines have 6 to 100 preconditions, and some, like the red-black tree insert, 150 -- well, okay, maybe this is not very useful. In some cases a routine has about 400 preconditions, and so on. Doing this by hand would be quite challenging; of course, humans can optimize, but it would still be quite a challenge. And I think analyzing trees is actually much more difficult than analyzing linked lists -- maybe it's the same in Windows, but the Linux trees are much, much more challenging. Okay. Thank you very much.