>> Shaz Qadeer: Okay. Welcome, everybody. It's my pleasure to introduce Hongseok Yang to
you. Hongseok is visiting us from Queen Mary University where he's a professor. He has
worked for many years on program semantics, program analysis, program verification and logics
that make it easy to specify and verify programs.
He got his Ph.D. from UIUC in 1991 and then he spent five years in Korea doing a post-doc.
Then he moved to Queen Mary University, and he has been there since then. In May he's going to
move to Oxford University. He'll be here today and tomorrow, so if you're interested in meeting
him, let me know.
>> Hongseok Yang: Thank you very much. Actually I got the degree in 2001, not 1991 -- I got
the degree 10 years ago. I was too young for -- okay. So what I'm going to talk about in the next 50
minutes, or less than 50 minutes, is what I have done with Oukseh Lee and Rasmus Petersen.
Before starting the talk, I want to say a few words about the kind of research methodology that
my friends and I employed in the past couple of years.
The methodology we had was, in some sense, that we didn't focus on technology which applies
to a really wide class of programs. Instead, we wanted to analyze only some small, restrictive but
interesting class of programs with zero [inaudible]. What we believed was that if we focused on a
certain class of programs and pushed the verification technology up to the limit, so that it can
actually verify some of the properties with zero [inaudible], that would help us to come up
with new ideas.
So what I'm going to present now is in a sense the result that followed from that methodology. I
will tell you the kind of program we are interested in, the kind of property we discovered for that
program, and the technology we developed to exploit the property.
Right. So the long-term goal of my research, and the thing that I want to do with my friends, is
to develop an automatic verification tool that can verify a certain class of programs with zero
[inaudible]. We are interested in real-world programs, especially real-world low-level code like
parts of an operating system such as Linux.
We started this research about five years ago, my friends in London and I.
Although it sounds a little bit odd for me to say we made progress, I think we are quite proud of
what we have achieved, because five years ago what we dealt with was list reversal
and so on, but now we have enough infrastructure and experience that we can apply it to
analyze more realistic code.
One of the examples which I'm going to talk about during the talk is the deadline IO scheduler
from Linux.
So I think we made interesting progress. But we are still quite far from achieving the real goal,
which is developing technology that works for, for instance, all of Linux. And there are many
challenges to overcome.
One of the main challenges is highly shared data structures. That is what I'm going to talk
about today. Another thing which I want to tell you is that I'm a bit nervous today,
and when I'm nervous my English gets screwed up, so when you have questions feel
free to ask me. This is not a conference talk -- of course it's a little bit about selling
my stuff, but I hope you come to understand some of the things that I really
enjoyed doing.
So if you have any questions, feel free to ask me. Okay. So, right, one of the challenges in
developing an automatic verification tool is complicated data structures.
This is one example of a complicated data structure, which you can find in the deadline IO
scheduler. What the deadline IO scheduler does is this: whenever you say, okay, I want to
write one megabyte of a file to the disk, that request is chopped into a bunch of very
small requests, and those small requests are first saved in a bunch of queues. From the
queues the scheduler picks the next request to process, and that request is moved to the next
queue. On the left-hand side you see the two queues in the deadline
IO scheduler. When a request first arrives, it is stored somewhere in the first queue -- the
structure looks quite complicated, but the request is just stored somewhere here. When the IO
scheduler thinks it's time to process a request, the request is moved from queue 1 to queue 2,
and sometime later it is really written to the disk.
Now, one of the challenges in verifying such a program is that queue 2 looks quite okay -- it's a
doubly linked list -- but queue 1 obviously does not. It looks like there is some structure there, but
it is quite a complicated structure.
So the question is how we can develop verification technology which can verify a data structure
like queue 1. Our first thought when we looked at it was that this is so complicated, but we
realized that queue 1 actually has a very nice structure: if you look at some of the fields, it is just
a doubly linked list, and if you look at another set of fields, it is just a red-black tree. So it is two
data structures implemented on top of one another, and they do it because that allows certain
operations to be optimized, which I'm going to tell you about later.
What I call an overlaid data structure is one where multiple data structures are implemented
on top of one another. This kind of data structure is found frequently in systems code. When I
looked at the source code of Linux, it was quite common, because the two data structures provide
different kinds of indexing over the same data, so some operations can be done in a very
optimized way.
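For concreteness, a cell of such an overlaid structure might look like the following C sketch. The field names are hypothetical, but they mirror how the Linux request structure embeds both a list link and a tree link in the same heap cell:

    /* One heap cell participating in two overlaid structures: the
       prev/next fields thread it onto a doubly linked list, while the
       parent/left/right fields place it in a search tree. Both views
       index the very same set of cells. */
    struct request {
        struct request *prev, *next;            /* doubly linked list links */
        struct request *parent, *left, *right;  /* tree links */
        long sector;                            /* payload, e.g. target disk sector */
    };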
But from a program analysis person's perspective, I want to find a weaker property of such a data
structure, because if I try to track all the correlations between the tree and the linked list, it
becomes very difficult to come up with a scalable verification technology. So we looked at the
source code, and we realized that although these multiple data structures are implemented on
top of one another, they are very weakly correlated. The correlation that matters is just that these
data structures talk about exactly the same set of heap cells.
So what we have done is develop verification technology which can exploit this weak correlation
property. I will show you some operations in the deadline IO scheduler, and then show you how
the correlation property that I just told you about -- the list and tree talking about the same set of
heap cells -- is maintained by the IO scheduler and how it can be exploited.
I will start with one operation called insert. This insert is an operation which takes a request and
puts it somewhere in the first queue, queue 1. What it does first is search the linked list and add
the request to the doubly linked list -- so it is added like this. After that, it searches the tree, finds
the right spot to insert the node -- it goes like this -- and then inserts the element. At the end of
the operation you can see that we still have our list and tree, overlaid on top of one another, and
the two data structures are talking about exactly the same set of heap cells. So that correlation
property is maintained by the insert operation.
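A minimal self-contained sketch of such an insert, over the hypothetical struct request above. Here head is assumed to be a sentinel cell of a cyclic doubly linked list and root the tree root; a real red-black insert would also rebalance, which is omitted:

    /* Add rq to both overlaid views of queue 1. */
    void overlay_insert(struct request *head, struct request **root,
                        struct request *rq)
    {
        /* step 1: the list view -- splice rq in right after the sentinel */
        rq->prev = head;
        rq->next = head->next;
        head->next->prev = rq;
        head->next = rq;

        /* step 2: the tree view -- descend from the root, attach by sector */
        struct request *parent = NULL, **slot = root;
        while (*slot != NULL) {
            parent = *slot;
            slot = (rq->sector < parent->sector) ? &parent->left
                                                 : &parent->right;
        }
        rq->parent = parent;
        rq->left = rq->right = NULL;
        *slot = rq;
        /* correlation restored: list and tree again cover the same cells */
    }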
The other, more interesting operation is the move request. What it does is search for a request
which satisfies a certain property, and move that request from the first queue to the second
queue.
This operation actually exploits the correlation property; let me explain how it works. First you
find the request: you traverse the doubly linked list and find a node, like this one, and you
eliminate the node from the doubly linked list.
Okay. Yes, we delete the node from the doubly linked list, and then we delete the node from the
tree. But one interesting thing about this deletion of the node from the tree is that it happens, in a
sense, in place, without traversing the entire tree. That's possible because once I find the request
in the doubly linked list, we know that node is also in the tree, because of the correlation property
that I just told you about. The operation works by following the parent pointer, finding out which
node is actually pointing to this request in the tree, and resetting the proper child pointer -- in this
case the right child -- to 0. That effectively eliminates the node from the tree.
This is a place where the property that tree and list talk about the same set of heap cells is
exploited, because if that property did not hold, then following this parent pointer I might end up
at a dangling pointer, and if I try to do anything with it, that is not going to be a memory-safe
operation.
Then it adds the node to the second queue, like this. So you can see that at the end of the day
we still maintain the correlation, just like I told you, and this tree deletion operation actually
exploits the weak correlation property.
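The in-place tree deletion he describes might look like the following sketch, simplified to the case where the node has no children; a real red-black erase also handles interior nodes and rebalances. The code is memory safe only because the weak correlation guarantees that a request found via the list is also a node of the tree:

    /* Unlink rq from the tree without traversing from the root. */
    void tree_unlink(struct request *rq)
    {
        struct request *p = rq->parent;  /* follow the parent pointer */
        if (p == NULL)
            return;                      /* rq was the root; handled elsewhere */
        if (p->left == rq)               /* which child slot points at rq? */
            p->left = NULL;              /* reset that child to 0, cutting rq out */
        else
            p->right = NULL;
    }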
So when my friends and I looked at this program, we developed a verification algorithm which
can exploit this correlation property, so that the analyzer can prove the deadline IO scheduler
sufficiently efficiently. When we first started to work on this, a long time ago in 2009, my friend
Rasmus Petersen and I thought this is an interesting program and we wanted to develop a
verifier for such programs. But we didn't fully understand what was going on in the source code,
so the tool we developed had both precision and performance problems.
We didn't track enough correlation, but at the same time we tracked unnecessary correlation, so
we ended up with something which doesn't really work. Then we realized we had to throw away
lots of correlations, so we built another version which throws away lots of correlation. It worked
on almost all routines in the deadline IO scheduler except one, and that's the place where the
weak correlation that I just told you about is exploited.
So this version doesn't really prove the entire deadline IO scheduler. Yes?
>>: So memory safety means that every pointer that's accessed is allocated and not --
>> Hongseok Yang: Memory safety means no dangling pointer dereference and no null pointer
dereference, and also no memory leak. But what we really prove is a shape property: at each
program point, what the expected data structure invariants are, and the verifier shows that these
are indeed the expected data structure invariants.
So I will show you a demo and you will see some pictures.
>>: Absence of memory leaks is more than safety --
>> Hongseok Yang: Well.
>>: Okay.
>> Hongseok Yang: Yes. So then we gave up. And then my friend Oukseh Lee joined this year,
because he had a sabbatical, and we approached the problem again. And we were able to
actually prove the entire deadline IO scheduler -- modulo some modeling, so there are hacks in
what we have done; without that cheating we couldn't do it, because it is quite a hard problem.
But anyhow, that's what we've done, and I will show you the demo --
>>: What's your -- how do you model the operational semantics of the program?
>> Hongseok Yang: So the operational -- you mean operational semantics of C?
>>: Yeah.
>> Hongseok Yang: We had a particular -- you can say our operational semantics is encoded
inside the implementation. For the memory model, we used the memory model in CIL. So that's
in a sense not as precise as the real C semantics; what we're aiming for is conditional
correctness: assuming this memory model doesn't generate any new issues, then we have
correctness.
>>: Are you going to talk about what the memory model is?
>> Hongseok Yang: Not really.
>>: Okay. How big is the IO scheduler?
>> Hongseok Yang: I will show you now. So this is the IO scheduler -- it has a bunch of
inlined code and some of my creation routines, so it is not very readable. But it is reasonably
big -- it's about 4,000 lines of code. Of course there are lots of comments and so on, but I think
it's realistic. And then this is my main routine. What it does is create the data structure -- the
creation can fail, and then it returns -- and otherwise there's a really big nondeterministic while
loop that keeps exercising the deadline IO scheduler. The deadline IO scheduler is like a library,
so the main routine generates a typical data structure and the big while loop keeps exercising all
the library routines.
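The shape of that harness is roughly the following sketch. Every name here is a placeholder rather than the scheduler's real symbol, and nondet() stands for the analyzer's nondeterministic choice:

    struct deadline_data;
    struct request;
    extern struct deadline_data *deadline_init(void);
    extern void deadline_insert(struct deadline_data *dd, struct request *rq);
    extern void deadline_dispatch(struct deadline_data *dd);
    extern struct request *make_request(void);
    extern int nondet(void);         /* analyzer's nondeterministic choice */

    int main(void)
    {
        struct deadline_data *dd = deadline_init();
        if (dd == NULL)
            return 1;                /* creation can fail; then just return */
        while (nondet()) {           /* the big nondeterministic while loop */
            if (nondet())
                deadline_insert(dd, make_request());
            else
                deadline_dispatch(dd);
        }
        return 0;
    }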
I commented out the main part because it takes about 400 seconds. I will just show you the
result of the analysis after the creation, so you can see how the data structure looks.
>>: Do you inline procedures?
>> Hongseok Yang: No, we implemented RHS, yes. So now the analysis is finished, and it
generates a bunch of pictures. Okay. These are all the global variables used inside the deadline
IO scheduler.
This picture shows how our analyzer actually works. Our analysis tool tries to analyze the list
part and the tree part as separately as possible, and this picture only shows what we get from
the part of the analysis about the lists.
This one says there's a doubly linked list which has some header -- these are some errors from
the drawing tool -- and there's another doubly linked list. Remember, in the deadline IO
scheduler there are two doubly linked lists, one for queue 1 and the second for queue 2. These
two mirror the two data structures which you saw in the slides.
Now I'll show another one. This is about the tree. Again, there are lots of global variables.
The important thing is here: there's a tree, which is for the first data structure, and there is a
true predicate, which means essentially that as far as the second data structure, queue 2, is
concerned, we don't maintain any tree structure. This true predicate covers queue 2 from the
tree's perspective. So this picture shows how our analysis tool actually works: it tries to analyze
the list part and the tree part as separately as possible. The idea is that because these two data
structures are weakly correlated, we are encouraged to run two very weakly correlated analyses.
But they have to talk to each other a little bit, and the focus of the rest of the talk is how that
communication actually happens. Before that, I'll just run the tool. It will take about 400 seconds,
and then I will just continue my talk.
Hopefully it finishes by the end of the talk. So the design principle, the idea, is to analyze the
tree and list as separately as possible, and what we think is that by doing so we can gain in
performance. But this separation cannot be done in a complete manner.
They have to talk to one another a little bit. In terms of techniques, we found a quite interesting
one: if we use ghost variables, and during the analysis we discover instructions on the ghost
variables on the fly, that is one good way to make the two tools talk to each other.
Traditionally, if you're familiar with program analysis, people used the so-called reduced product
or some other mechanism which is statically determined. But what we found is that using ghost
variables is another good way to make the communication happen in a very cheap, effective
manner.
So now I'll explain what's going on, and to do that you need to understand a little bit about
separation logic -- but it's okay, it's not very hard. You can understand what the separation logic
formulas mean in terms of pictures.
The first formula says X points to prev Y, next Z -- you see it at the top here. It says I have a cell
X whose prev field is Y and whose next field is Z, and it might contain some other fields. We
don't really care about the other fields; for instance, in this case it has a field F, and we don't say
what the value of F is. The second formula is a nonempty doubly linked list segment. It says I
have a doubly linked list starting with Y and ending with Z, the prev value of Y is equal to A, like
you see here, and the next value of Z is equal to B, just like the B you see on the top.
The NE means it is the nonempty doubly linked list predicate. We also introduce a tree segment
predicate, because the deadline IO scheduler actually uses trees. This predicate says I have a
tree rooted at X, but it's a segment, so there is an edge going out from the tree to the outside.
The Y is the cell in the tree from which this pointer goes to the outside, and V is the target of the
pointer. So this is a tree rooted at X, but there is a hole, called Y, and from Y there's a pointer to
the outside, which is V. And A is the parent value of the cell X, yes.
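In notation, the three predicates just described are roughly the following. This is reconstructed from the spoken description, so the exact syntax on the slides may differ:

    x \mapsto (prev: y, next: z)  % cell x with prev field y and next field z;
                                  % any other fields are unconstrained
    dll_NE(y, z, a, b)            % nonempty doubly linked segment from y to z,
                                  % with prev(y) = a and next(z) = b
    treeseg(x, a, y, v)           % tree segment rooted at x with parent value a
                                  % and one hole y whose outgoing pointer is v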
>>: And you only have a four-arity version of this tree; you don't have extensions for B and C and
D?
>> Hongseok Yang: No, that's right, we don't have multiple trees with multiple holes.
>>: Because the programs only violate the invariants at one point at a time.
>> Hongseok Yang: That's right. For programs like, say, breadth-first search that I want to verify,
this kind of predicate is not good enough. What happens is that the analyzer diverges, or has to
be stopped by me, because the predicate is not good enough to summarize the regularity which
is going on in the breadth-first traversal. So this is very targeted at certain tree operations where
one hole is good enough.
>>: But there is an obvious generalization, which is any number of holes.
>> Hongseok Yang: Yes, that's right. Manual verification is not really a problem, but doing
automatic verification means I have to design an abstract domain for sequences of holes,
because the number of holes can grow arbitrarily. That is one direction which we haven't
pursued; there is already some other work there. Now, the interesting part of separation logic is
the separating conjunction -- perhaps you have heard of separation logic so many times that you
already understand it. The separating conjunction says I can split the heap into two parts: one
part satisfies the left-hand side of the formula, and the other part satisfies the right-hand side.
The picture you see on the bottom actually satisfies the formula here, because I can split the
heap here: X satisfies the formula 'X points to 0, Y' because its prev value is 0 and its next value
is Y, and the other side satisfies the list segment predicate, because I have a doubly linked list
segment from Y to Z which is not empty and whose prev and next fields are set correctly.
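As a small worked instance of that split, in the reconstructed notation from above, the displayed heap satisfies

    x \mapsto (prev: 0, next: y) \ast dll_NE(y, z, x, b')

because the heap divides into the single cell x and a disjoint nonempty segment from y to z whose prev value points back to x; the segment's outgoing next value b' is left existential here, since the talk does not pin it down.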
Separation logic is just normal logic, so it has existential and universal quantification as well as
conjunction and disjunction. Now, one idea we had was to use only a special kind of separation
logic formula which is good enough to capture the weak correlation property that I just told you
about in the deadline IO scheduler.
Don't worry about the gamma, beta, alpha at the moment. This represents the typical kind of
formula which is used during the analysis. The formula is a conjunction at the top level with two
conjuncts: the first conjunct talks about tree-related properties only, and the second talks about
list-related properties. Now, with only this, we cannot express the correlation between the tree
and list data structures, and that correlation is necessary to prove the memory safety of the
program.
>>: What is the correlation fact you want to express?
>> Hongseok Yang: That the tree and the list are talking about exactly the same set of heap
cells. So what we did was augment or extend the separation logic formulas with, you can say,
second-order variables. A second-order variable like gamma, beta or alpha represents a set of
locations. Using these, we can say that the first pair of predicates talk about exactly the same set
of locations, the second pair talk about exactly the same set of locations, and the third is the
same.
The formula also has primed variables, which are implicitly existentially quantified. I'm not going
to write the existential quantification, but if you see a primed variable, just understand it as an
existentially quantified variable. Notice, though, that alpha, beta, gamma are not existentially
quantified: they are ghost variables in the program, and they will be manipulated by ghost
instructions in the program.
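One plausible rendering of the formula on the slide, using the reconstructed notation from above; primed variables are the implicitly existentially quantified ones, and the Greek subscripts are the ghost set variables:

    ( treeseg_beta(q1, ...) \ast true_gamma )
        \wedge
    ( dll_beta(q1, ...) \ast dll_gamma(q2, ...) )

The left conjunct describes the heap as trees, the right one as lists; annotating a tree predicate and a list predicate with the same beta asserts that the two views cover exactly the same set beta of heap cells.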
>>: Why are they appearing in the subscript and so on?
>> Hongseok Yang: This is just syntax. We extended the separation logic formulas so that every
predicate comes with one of these alpha, beta, gamma subscripts. We don't have any extra
constraints on alpha, beta, gamma; the subscripts are their only uses.
>>: So if you took them away would they still work?
>> Hongseok Yang: If we took them away and ran the analysis, the analysis wouldn't keep track
of enough information, so we could not prove memory safety of one particular routine, the move
routine.
>>: So alpha beta gamma are program variables, right? You are going to add some ghost
assignments and you're going to update them as you go along in the program, right?
>> Hongseok Yang: That's right. That's right.
>>: Now this formula is referring to those variables also?
>> Hongseok Yang: That's right. That's right. The instructions which manipulate alpha, beta,
gamma are going to be inserted by the analyzer, so that's a way to make the tree --
>>: Can I think of beta as sort of representing a set of objects?
>> Hongseok Yang: Yes, set of addresses.
>>: One set of objects.
>> Hongseok Yang: Set of addresses.
>>: Keep adding, moving stuff from them.
>> Hongseok Yang: Yes. It means that the set of addresses of all the heap cells involved in this
tree -- collect them and make a set -- is exactly the same as the set beta.
>>: Sounds like you have three heaps, alpha beta gamma.
>> Hongseok Yang: Yes, that's right. Now I'll show you an example, and it will become clearer.
The first part talks only about the tree. You see that there is a tree -- the cells of queue 1 form a
tree -- and if I erase all the link fields, what you see on the right-hand side is just a bunch of cells,
which is summarized by the true predicate. The second part is about the list, and it describes
what the heap looks like if you focus on the list part only.
Now, using alpha, beta and gamma, we can talk about the correlation between these two
conjuncts.
>>: Is there an intuition here? Given the data structure, where are the alpha and beta coming
from?
>> Hongseok Yang: You're right, we designed it so that alpha, beta, gamma capture groups of
structures: alpha captures all the cells involved in one object, essentially one high-level object
like a tree or a list. During the analysis a predicate can be decomposed into multiple
components, but if nothing really changes, we still want to group all those things under the same
second-order variable like beta. So it provides one more level of hierarchy, which says which
heap cells should be considered as a single unit. Here beta is for the tree-and-list structure and
gamma is for the linked list which you see on the right-hand side.
As I said, this expresses only a loose correlation, because it says nothing about how the
ordering of the doubly linked list relates to the tree. What the formula really captures is that the
list and tree are talking about exactly the same set of heap cells. On the other hand, if I add
some node to the tree without updating the link fields, the formula is not satisfied, because the
predicate says this tree and this list should talk about exactly the same heap cells.
>>: I want to follow up on the question about representing different subheaps. The formula
you've got written there has the separating conjunction on the inside and the regular conjunction
on the outside.
>> Hongseok Yang: Yes, yes.
>>: If you did it the other way around --
>> Hongseok Yang: In some sense what you're saying is exactly right. The original observation
was that if we nest separating conjunction and normal conjunction in an arbitrary manner, then
many of these properties can be expressed. So exactly as you said: put the conjunction inside
and use the separating conjunction outside. That could be used to express it.
But then the problem becomes how you design the rest of the analysis: the transfer functions,
the abstraction mechanisms, the join operations.
If you allow arbitrarily complicated formulas you gain expressiveness, but on the other hand you
pay the price of designing the right abstract transfer functions.
So this particular representation is designed so that many of the existing transfer functions can
be reused without changing very much. Okay. So, right, the abstract domain is really a
disjunction of these: we have a disjunction of tree predicates, a conjunction outside, and a
disjunction of list predicates.
In a sense what we are doing is a Cartesian product of two power set abstract domains. The first
power set talks about tree-related properties using disjunctions, the second power set talks
about list-related properties using another set of disjunctions, and the two power sets talk to
each other via the ghost variables beta, gamma and alpha.
Okay, so that was the basic form of the analysis: we set up our abstract domain in this way.
Now the main question becomes this: if you just run the list analyzer and the tree analyzer
separately, they have nothing to say to each other, and that's not good enough. We have to
make the two components talk to each other without paying too much of a price. I will show how
that works by going through this particular routine, the move routine.
Ideally we would like to have the list analysis in one conjunct and the tree analysis in the other.
But the problem is that when you do the tree analysis at this point, we have no information about
where RQ is, so we cannot prove memory safety. So our analyzer goes through two phases.
The first phase runs something like a preanalysis -- you can understand it as a meta analysis --
which tries to figure out what's going to happen when I run the tree analyzer and the list analyzer
separately. The meta analysis then says that the tree analyzer and the list analyzer have to talk
to each other at certain points, and it inserts instructions which say: at this point these two have
to communicate; at this point we have to maintain this [inaudible] in a certain way. Then, after
the insertion of some of the ghost instructions, you run the main analysis. And the main analysis
can also add other instructions.
In this case, if I run the analysis with the predicates that I showed you in the previous slides, the
analyzer also inserts instructions of this form, which say we have to unify certain ghost variables.
We try to push these ghost variable instructions into the preanalysis as much as possible,
because that makes our analyzer more predictable.
But there are some cases where that's not possible. This unification is inserted during the main
analysis because the information about where to insert it is only available during the real
analysis.
>>: That's a union or unification?
>> Hongseok Yang: Semantically, take the variable gamma and the variable delta, take the
union of them, put the result into gamma, and set delta to the empty set. It's a set union:
gamma := gamma ∪ delta, delta := ∅.
One way to implement that is by unification. These are the instructions which are added, and
they are the ones which make the two analyzers talk to each other.
>>: So how many [inaudible]?
>> Hongseok Yang: I will tell you later. So suppose these were added -- the last one, yeah, I
will show you how it works, but the other two I will tell you about later. Suppose it works. Then
let's see how the analyzer actually makes the list and tree analyses talk to one another.
Now, this is the same move routine. It first finds some request in queue 1, like this, and then it
deletes the request from the doubly linked list. I will show you what the analyzer produces after
the first two instructions. The analyzer will replace the second star-conjunct in the list part by
what you see on the bottom.
It says I have the list of queue 1, and this part says I have a doubly linked segment from C to C
prime whose prev and next pointers point to queue 1. This conjunct describes the doubly linked
list you see, consisting of the three nodes, and the points-to predicate on the right-hand side
expresses the request node which is taken out of the doubly linked list.
The next instruction is move-information. What it says is: move information from the list part of
the analysis to the tree part, and in particular move only the information with respect to this RQ.
What happens is that the analyzer first figures out that RQ is in the region beta, because it's
annotated with beta, which means the set beta contains RQ, and it moves that information to the
tree part. So it says, okay, RQ should be somewhere in the beta partition.
Then we can replace the tree predicate -- well, instantiate the tree predicate so that the RQ cell
is explicit inside the predicate. I didn't show everything, but it says I have a tree segment with a
hole at the RQ node, then the RQ node explicitly, and then something which comes below the
RQ node. If you have a tree and RQ is somewhere in it, there's something above RQ and
something below RQ, and this predicate describes exactly that fact.
>>: So is the beta the upper bound on a set of elements?
>> Hongseok Yang: Very good point. When you apply the star, the notion of star should also
split beta. The notion of star says I can split the heap into two parts; in the same way it should
split beta into two disjoint subsets, so that the beta here and the beta here talk about those two
sets of locations. So it is a bit more sophisticated, but I didn't explain everything.
Now, if I run the tree deletion, this node is gone from the tree, and that operation can be run
symbolically with respect to the description of the tree that we have. So that will replace --
>>: What you just explained isn't the standard semantics of separation logic, right?
>> Hongseok Yang: That's right, it's different. My view of the semantics of separation logic is
this: whenever you have a partial commutative monoid, you can always define the notion of star
based on that monoid. What I'm exploiting here is that I just extended the partial commutative
monoid with something extra, which is the values of beta, alpha and gamma. Once I have this
PCM structure, I can always extend the logical formulas with those predicates. So in some
sense --
>>: What you're saying is that there's no one separation logic?
>> Hongseok Yang: Exactly, yes. Personally, my take on separation logic is that it is not first of
all a logic; it's more of a semantic principle which can be exposed in a certain way. Whenever
there's a notion of separation, expressed by a partial commutative monoid structure, I can
always come up with a separation logic very easily.
>>: The syntax to express formulas is not a big deal; the way you derive all your intuition is
based on the models?
>> Hongseok Yang: That's right, exactly. The syntax is kind of semi-automatic: once you have
the model, there is a good recipe to produce the syntax.
>>: Then how do you reason about these formulas?
>> Hongseok Yang: These are all instances of something called the logic of bunched
implications, so the proof rules carry over. So I can reuse [inaudible].
>>: So whenever the semantic models are based on these partial commutative monoids, the
proof rules of this logic of bunched implications will do the job?
>> Hongseok Yang: Yes. It's not complete, but it gives the basic implications; for specific cases
we have to come up with the other implication facts. But my experience is that, because these
formulas are so simple and the variation is so little, the kinds of implications we have to consider
are not very many. For the particular implementation we are working on right now, we don't
really do anything special for this beta. Most of the existing theorem proving techniques can be
carried over to these extensions, and we don't do anything special for alpha and beta.
Okay, let's move on. The next instruction is what we call a transfer. What it does is transfer this
RQ from partition beta to partition delta. The reason, if you look at what's going to happen later,
is that we want to move this RQ from the tree to the list, so we do an ownership transfer of the
cell RQ from the tree to the list.
This intermediate step -- moving RQ from beta to delta -- serves as a kind of preparation step for
that ownership transfer.
Initially we have beta; then we split beta like this, and we name the RQ part delta. At the end we
have the list insert operation, which works like this: it adds the node RQ to the second linked list,
queue 2. Symbolically, after the list insert operation we have this representation, which says I
have a doubly linked list starting from queue 2 and going to E prime, then it points to RQ, and
from RQ it points to queue 2 again, so I have a cyclic doubly linked list. It's a symbolic way to
represent what happens in the actual operation.
Now, when the analyzer sees a formula of this form, it figures out that it can abstract this: this is
essentially a cyclic linked list, so I can express it with one predicate, the nonempty list predicate.
But then we realize that we cannot do it, because the partitions delta and gamma are different.
Or, of course, you can do it, but then you have to make sure that the tree and list analyzers have
a consistent view of gamma and delta.
So what it does first is unify the delta region and the gamma region, like this. At the end of this
unification -- this merge -- we have one region called gamma, which includes both the request
cell and the list segment predicate.
Then we can apply the abstraction, and the abstraction summarizes the whole thing into a single
list predicate.
This instruction will also be run on the tree part, so there the annotation is renamed from delta to
gamma, and it's going to be merged with the true predicate. That's the result we get at the end
of the analysis of this routine.
So this is the result that we get at the end of the analysis. It says that we have a queue 1 which
is a tree, we also have the queue 1 and queue 2 lists, and the tree and lists are talking about
exactly the same set of heap cells.
>>: So I have a high-level, meta-level question about your approach.
>> Hongseok Yang: Yes.
>>: So when you set out trying to do this verification, you spent some time designing the logic
itself, right?
>> Hongseok Yang: Yes.
>>: Then you spent some more time designing the abstract interpretation business, right -- the
transfer functions and, what is it, join and widen and all that stuff, right?
>> Hongseok Yang: Yes. Yes.
>>: Now, how much time would you say you spent on the first part, and how much on the
second part?
>> Hongseok Yang: Actually, the real story is -- I don't know. Sorry. The real story is that we
spent most of the time on the implementation.
Once we fixed the abstract domain, like the one that I showed in the first few slides, then for the
other operations we could just think about how they should be implemented.
So the story is: we said, okay, we need these operations, and we just implemented them first.
The theory came later. I know this is not a good answer, but I cannot really measure how much
time I spent on each part. Designing the logic itself was really easy, so that part was not a big
deal; it's more the design of the algorithms for the abstract interpretation.
>>: What I don't understand is this: when you designed the logic -- this program is, what, maybe
a few thousand lines, right?
>> Hongseok Yang: Yes.
>>: Not a super large program. You could have then decided to write all the loop invariants, if
you have a theorem prover, even an incomplete theorem prover, for these kinds of theorems
based on the proof rules [inaudible]
>> Hongseok Yang: I actually did hand proofs before, during my Ph.D. If you ask me whether I
want to do a hand proof or design such an analyzer, I prefer the second option, because hand
proofs are still extremely painful.
>>: But it's not a hand proof, right? Because you only supply the loop invariants; everything else
would be taken care of by the prover, right?
>> Hongseok Yang: No, there are many procedures --
>>: Huh?
>>: During the Ph.D. the prover did not exist.
>> Hongseok Yang: It's a bit more complicated than that, because now I have to give pre- and
postconditions for each procedure. And giving a pre- and postcondition for each procedure
requires [inaudible], which means you actually have to add logical variables.
If you don't add logical variables, you cannot relate the pre- and postconditions, and identifying
how many logical variables I should add is much more complicated.
So personally, if somebody, say my boss, asked me to do a manual proof with the infrastructure
I have, or gave me the other option -- okay, develop a verifier -- I think developing the verifier
would be a lot easier for me.
>>: I still don't understand. No matter what you do, even if you use program analysis and
abstract interpretation, it's computing, at the end of the day, an inductive proof, right?
>> Hongseok Yang: Uh-huh.
>>: So how do you check -- there must be some step that checks that the proof is valid, right?
>> Hongseok Yang: Yes, yes.
>>: Who is doing that checking?
>> Hongseok Yang: I mean, the abstract interpreter uses sound implications, right? Suppose all
the steps of the abstract interpreter are correct; then it always produces an overapproximation at
each step, and you only need to check syntactically whether I have reached the fixed point or
not.
An abstract interpreter in a sense puts much less burden on the theorem prover, because much
of the hard reasoning is absorbed into the development of the transfer functions, and the check
that a fixed point has been reached can be reduced to syntactic checking.
Okay. So personally, by developing these tools, I find that instead of providing the actual proofs
of loop invariants, providing some of these hints and helping a little bit is much more enjoyable
and much easier.
But that's just my personal opinion. Another thing is that there are other classes of programs
which I believe follow the same patterns, and the technology could then be applied there.
All right. Now let me sum up the abstract operators. The abstract domain is, you can say, a
power domain of tree parts and list parts, and it has three operations: moving information from
the list to the tree, or from the tree to the list, with respect to a cell X; transferring cell X to the
partition alpha; and a union operation, or you can say a unification operation. These are the
three operations I use to make the analyses talk to one another.
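Putting the three operations together, the instrumented move routine might look like this C-style sketch. The ghost comments mark where the analyzer inserts its instructions, and all helper names are hypothetical:

    void move_request(struct request *q1_head, struct request **q1_root,
                      struct request *q2_head)
    {
        struct request *rq = find_request(q1_head); /* found via the list view */
        list_unlink(rq);                            /* delete from queue 1's list */
        /* ghost: move_info(rq, list -> tree) -- tell the tree analysis that
           rq lies in region beta, so its tree predicate can expose rq */
        tree_unlink(rq);                            /* in-place tree deletion */
        /* ghost: transfer(rq, beta -> delta) -- carve rq out of the
           tree-and-list region, preparing the ownership transfer */
        list_add_tail(q2_head, rq);                 /* insert into queue 2's list */
        /* ghost: unify(delta, gamma) -- gamma := gamma U delta, delta := empty,
           so both analyses agree before the list is re-abstracted */
    }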
Now, I just want to point out one thing. The placement of the operation you see on the bottom is
mainly discovered based on where we should apply abstraction. Abstraction here means things
like summarizing a data structure into a single list segment predicate, and that was the driving
force for inferring where we should put the third instruction.
For the other two, we actually use the preanalysis to figure out where we should insert them,
and I will show you how that works.
To figure out the other two, we in essence do a meta analysis: a data flow analysis which tries to
figure out what's going to happen if I run the tree analysis.
For the transfer instruction, what we do is this: the meta analysis figures out where, if we run the
tree analysis and list analysis, RQ will be expressed not inside an inductive definition but
explicitly as a points-to predicate. That's where we insert the RQ transfer instructions.
Wherever we see 'RQ points to blah' in the instructions, we temporarily put RQ into the region
delta; it might go back to the original partition or move to the next partition.
This is what's done by the analyzer. The second question is where we should insert the
move-information instruction.
To find that out, we do a forward analysis and a backward analysis. The forward analysis figures
out whether the tree part, or the list part, knows about RQ. So the forward analysis figures out
that at this program point it's likely that the tree analyzer will know where RQ is actually
allocated; or it may conclude that the list analyzer knows where RQ is located but the tree does
not. On the other hand, the backward analysis says that at this point, to proceed, the tree part
actually has to know about this RQ.
So we do the forward analysis to figure out which part of the analysis knows about the cell RQ,
and the backward analysis to figure out which part needs information about RQ, and that's the
point where we actually insert the move-information instruction.
For the deadline IO scheduler, this is formulated as a data flow fixed point meta analysis, and it
inserts four or five move-information instructions into the program.
Okay. So this is the end of my talk, and yes?
>>: I had one more question, a meta-level question. When you designed the abstract domain for
a program, in that process did you feel like you got some more insight into why the program is
correct?
>> Hongseok Yang: Yes, that's actually true, yes. For this deadline IO scheduler, I really didn't
know about the weak correlation before. So, yes.
>>: I see. But I thought that insight about the weak correlation came when you were trying to
design the logic, not when you were trying to design the abstract domain and the transfer
functions and so on.
>> Hongseok Yang: I don't really make that distinction, because for me designing the logic is so
easy. Designing the logic really means coming up with [inaudible], and identifying which partial
commutative monoids are necessary is closely tied to how the analyzer is going to work.
All the rest of the syntax, because I'm so used to it, just follows automatically.
So the first take-home message is that there are many of these overlaid data structures out
there, and I expect that quite a few of them -- maybe not all -- use only this weak correlation. So
when you are faced with exactly the same problem, think about whether the structure has only
this weak correlation or not, and if you are happy using separation logic, try to use these
conjunctions, the star, and second-order variables. The second take-home message, which I
quite like: suppose your boss asks you to combine very expensive analyses -- disjunctive
analyses are very expensive -- and to put two disjunctive analyses together. That's actually what
we are doing: one disjunctive analysis for the tree, another disjunctive analysis for the list.
If we implement the so-called reduction operator by expanding all these conjuncts -- with five
disjuncts here and ten disjuncts there, you generate 50 cases, right? -- and you take those 50
cases and do some operations on them, very likely it's going to perform very, very poorly,
because it grows multiplicatively. Not exponentially, but I found it quite challenging.
On the other hand, the challenge is to keep this case splitting implicit. One way to do it is using
ghost variables: add ghost variables, and make the analysis insert ghost instructions so that the
two analyzers have a consistent view of the ghost variables.
So that's the end of my talk. I will just go back to the tool -- and it says it's verified. Nothing
fancy, but the tool was able to verify that the code is okay.
Okay. Thank you very much.
[applause].
>>: So can you rephrase the use of ghost variables and the transfer functions as a reduced
product?
>> Hongseok Yang: Yes, but usually the reduced product is explained in terms of a fixed state
space -- once I have fixed the set of variables, that's usually the way to explain the reduced
product.
But if I exaggerate a little bit, it's like once you have existential quantification, you can exploit it to
say a lot more, if you're allowed to extend the state space --
>>: So you have -- so there's a good sub [inaudible]
>> Hongseok Yang: Yes.
>>: Basically [inaudible]
>> Hongseok Yang: Yes.
>>: But the presentation implied that you insert these instructions in the transfer functions by
[inaudible] space --
>> Hongseok Yang: Right. But then also on the fly, if it's necessary, we can insert the ghost
instructions. You're right, the principle is: when I want to put analyses together, one possibility is
to extend the state space by adding ghost variables, then during the analysis figure out how to
update these ghost variables, and the inserted ghost assignments will play the role of
communication between the two analyzers.
>>: Does it help [inaudible] or does it somehow -- you look close at the two [inaudible]
>> Hongseok Yang: The preanalysis mostly tries to get things in terms of reachability; it's a
heuristic. For a tree, you say, okay, if I have some root X, I want to know all the things that are
reachable from X via these fields of the tree.
The whole analyzer, as I described, is parameterized by sets of fields. The initial configuration
says these fields should be considered as one conjunct, and this other set of fields will be
tracked by another conjunct. Based on those sets of fields, it uses this kind of reachability
reasoning to figure out which part probably knows what's going on there, and ties it to the trees.
>>: This is, I guess -- maybe anybody might feel free to answer this, because I have little
experience in that area. But if you had posed this problem to me, my first reaction would actually
have been to do dynamic analysis, where I start exploring -- I just start creating models of
essentially the possible pointer structures. It seems like I'd have a small state space to search: I
can look for new shapes, but --
>>: The answer to your question here is that the approach is nonnegotiable. You have decided
on what the approach is. You can't change it.
>>: Yeah, I think that's great, the extent of the approach and your results, but I'm kind of
curious --
>> Hongseok Yang: One possibility -- I think dynamic analysis can play a big role in shape
analysis as well, for example in figuring out what data structures are actually used. For this kind
of code I don't have a good dynamic profiler, but suppose I had one; then I could gather some of
the heap shapes and see that certain abstractions may be applicable here and certain
abstractions are not applicable -- because if applying the abstraction gives me back exactly the
same heap, then it is useless. Actually, the [inaudible] did similar case studies for Java. For
certain kinds of clients, particular applications, they collected heaps from dynamic analysis,
applied the abstractions, and checked whether a certain abstraction is good for them or not.
So I'm sure there's lots of room for dynamic analysis. I just don't really know what to do at the
moment.
>>: What I was looking for is this: this is for a 4,000-line program. I was hoping you'd say
something along the lines of, okay, if we start doing this on a Windows kernel, which is not 4,000
lines, then dynamic analysis will get lost but static analysis won't -- something like that.
>> Hongseok Yang: Another possibility: I have prepared a main routine, so I can run this entire
code and collect some data. It's not fully dynamic -- because this is a library, I've already
prepared the mains. I could run the routines and guess some features, like what kind of
correlation is important and so on, and that could be used.
But it's a direction which I haven't explored.
>>: So this one takes -- [inaudible]
>> Hongseok Yang: Yes.
>>: So the invariants that the abstract interpreter up there is computing -- they're very, very large?
>> Hongseok Yang: Yes, in some cases the preconditions, yes -- up to 3,000 preconditions.
Those are the rotate routines, and their preconditions number 3,000; that's of course because
the analyzer makes some rough guesses. I can actually show some statistics here.
>>: This is pretty amazing.
>>: So do you prove just that the driver of this particular [inaudible] is safe, or is this packaged
up [inaudible]?
>> Hongseok Yang: Yes. I mean, I'm sorry, what did you ask?
>>: At the end of your C program you uncommented this while loop [inaudible] verified that the
while loop --
>> Hongseok Yang: From the while loop we can reach everywhere. I'm hoping it reaches almost
everywhere in the source code, because this is like a library: it exports some methods, and the
loop just exercises all these methods. And --
>>: [inaudible]
>>: Well, there's a property that's maintained at every step of the code. So --
>> Hongseok Yang: At this program point the invariant is not necessarily what you see on the
slides; it could be anything. You see, some routine has 600 preconditions, and some, like this
insert routine, 150 -- well, okay, I think this is not very useful.
In some cases it has about 400 preconditions and so on.
So doing it by hand could be quite challenging. Of course humans can optimize, but it's still quite
a challenge.
And I think analyzing trees is actually much more difficult than analyzing linked lists -- maybe it's
the same in Windows, but the Linux trees are much, much more challenging. Okay.
Thank you very much.