>> Jonathan Protzenko: Hi everyone. We have the pleasure of having Francois Pottier visiting us today from INRIA. Francois's interests include ML type inference, among other things, and also program proofs, and that is the topic of today's talk. He is going to tell us about a verified implementation of a Union-find algorithm, both in terms of functional correctness and algorithmic complexity.

>> Francois Pottier: Thank you. And thanks for inviting me to visit. I am very happy to be here. This is joint work with Arthur Chargueraud, who is also a researcher at INRIA. As Jonathan just said, this is a kind of case study in proving a small piece of OCAML code correct, both in terms of functional correctness and in terms of time complexity. We picked Union-find as an example because it is interesting to us: it's very useful and has many applications in compilers and theorem provers, because it's at the heart of first-order unification in particular. And it's also interesting because it has a very tricky complexity analysis, which has been simplified over the years beginning with Tarjan's analysis in '75, and so we wanted to see if we could port this analysis, make a machine-checked version of it.

I will begin by recalling briefly what Union-find is and how it works. It's a data structure for maintaining equivalence classes under unions. I'm guessing most of you have seen it before. It's also called a disjoint set forest, where every element either is a root, like this one, or has a link to some other element; this one, for instance, has got a link to a root. Each of these trees forms an equivalence class whose representative element is the root. The OCAML API which you can offer for this algorithm is this: there is a type of elements and three basic operations — to create a new element; to find the representative element of an arbitrary element, where find will just follow the links all the way down to the root; and union, which takes typically two roots as arguments, or two arbitrary elements. If you give it two roots, for instance, it will install a link from one to the other. There is also an operation, not shown here, for testing whether two elements are in the same equivalence class.

The code for this thing is very simple, and that's one of the reasons we chose this example. It fits on a slide, and I'm not going to read everything, but a few things are worth noticing. Every element is mutable: it's a reference to some content, and the content is either a link to some other element or, if we have a root, an integer rank, which is used to do balancing. When we union two roots, the rank helps us decide which way the pointer should be installed, so the balancing is visible here, where we compare the ranks. Another thing worth noticing is the find operation. It's written as a recursive function which just follows the links all the way to the root. And the interesting line is this one here, which installs shortcuts — that's path compression — where we modify the forest as we walk it, so as to make the next find operation more efficient. It's the combination of balancing and path compression which gives the very good efficiency, where find and link and union have almost constant time complexity, at least if you reason in terms of amortized complexity. In the worst case it's logarithmic.
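[The slide with the code is not reproduced in this transcript. The implementation being described is the classic union-by-rank plus path-compression code; a minimal OCaml sketch along those lines — a reconstruction, not a copy of the slide — looks like this:]

```ocaml
(* Each element is a mutable reference: either a link to another
   element, or a root carrying an integer rank used for balancing. *)
type elem = content ref
and content = Link of elem | Root of int

let make () : elem = ref (Root 0)

(* Follow links to the root, compressing the path along the way. *)
let rec find (x : elem) : elem =
  match !x with
  | Root _ -> x
  | Link y ->
      let z = find y in
      x := Link z;          (* path compression: install a shortcut *)
      z

(* Link two roots; the ranks decide which way the pointer goes. *)
let link (x : elem) (y : elem) : elem =
  if x == y then x
  else
    match !x, !y with
    | Root rx, Root ry ->
        if rx < ry then (x := Link y; y)
        else if rx > ry then (y := Link x; x)
        else (y := Link x; x := Root (rx + 1); x)
    | _ -> assert false     (* link expects two roots *)

let union (x : elem) (y : elem) : elem =
  link (find x) (find y)
```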
The complexity analysis was first done by Tarjan in 1975, and at the time the paper published by Tarjan was quite complicated, but it's been simplified over the years by Tarjan himself and by others, including Kozen. You can find a very simple proof today, either in the textbook I have cited here or in some course notes by Tarjan which are available online. The paper proof is about three pages and it's very easy to follow, at least if you are happy with just checking that it's correct — it's quite difficult to understand why it makes any sense and how they came up with it, but if you just follow the steps you can confirm that everything is correct fairly easily. The main result is that the amortized complexity of union and find is O of alpha of n, which means almost constant in practice, because alpha is the inverse of this function A, which grows very, very quickly.

The things I want to talk about today are some details of the specification of Union-find, and how the specification talks about what it does and how much time it needs at the same time. One point I want to emphasize is that we are not just making a proof of the algorithm based on some pseudocode; we are actually looking at the OCAML code which I just showed, and we prove that this particular implementation is correct. And this is modular, in the sense that we establish a rather simple specification which a client could use if we had one. For instance, a unification algorithm could be proved correct based on this specification. I should say that if you want to ask me questions during the talk, you shouldn't hesitate.

The way we do this is we use CFML, which is a tool developed by Arthur. It's embedded inside Coq and it can be used to reason about OCAML code. It is at the same time a library and a tool. The tool takes the OCAML code and turns it into a Coq formula, a Coq description of the behavior of the OCAML code. And then there is a bunch of Coq lemmas and Coq tactics which you can use to actually reason step by step about the code. Essentially, it's a Hoare logic — a separation logic — where you have preconditions and postconditions and you prove that the code satisfies them.

I should say briefly that there is a lot of related work on proving programs correct, and I'm not going to mention all of it here. Much of it is concerned only with functional correctness and does not look at complexity at all. If you look at the previous work that does include complexity concerns, often it is concerned only with the mathematical level and not with the code itself. Also, often it is concerned with inferring the complexity completely automatically, whereas here we are not trying to infer anything at all; we are just trying to do a manual proof. This little corner of the literature where we fit has been relatively little explored, I think. Most people have tried to fit in one of the other niches, but we are almost the only people, I believe, to have looked at manual proofs of complexity. Of course, it's manual, so it's a bit painful, but it can be very expressive. For example, proving this alpha of n complexity is something which no automated tool could hope to do, because the proof requires a lot of human ingenuity and there's no way you could find it automatically, at least with today's tools.

>>: Did you say the original program is translated into this formula?

>> Francois Pottier: Yes.
The OCAML code which I showed here is translated by the CFML tool into what Arthur calls a characteristic formula, and that's a kind of description of the behavior of this code.

>>: From his thesis?

>> Francois Pottier: Yes, from his thesis. Then we establish a Hoare triple about this code, so a pre- and postcondition pair.

>>: In the translation itself, that you just trust?

>> Francois Pottier: Yes. You just trust that the tool is correct. This has been proven on paper, but of course, there could be a bug in the paper proof or a bug in the implementation of the tool, so you have to trust this part. There was another question.

>>: [indiscernible] verifying [indiscernible] complexity. It was Nils Anders Danielsson doing [indiscernible] doing verified lazy data structures.

>> Francois Pottier: Yes. That was the paper at POPL '08 by Nils Anders Danielsson, and indeed, he would fit in the same niche as us, I think. He didn't try to automate, but he had quite a lot of expressiveness because he could use Agda as a theorem prover to do interesting proofs of complexity. Indeed, he focused on the use of [indiscernible] and lazy evaluation.

First, I'll show the specification which we proved, so that's what we show to clients of the Union-find module. Then I'll say some words about the framework which we use, separation logic extended with time credits. And if there is time, I'll say a bit more about the proof itself and the invariants which it uses internally.

For the specification: since what we use is essentially a Hoare logic, a separation logic, we equip every operation with a precondition and a postcondition, and that's what we show to the clients. The way we write these things is maybe a tiny bit unusual, but that's because we have to fit them in Coq syntax. This is the way we write the specification for find, for instance. This line here is the precondition and the last line is the postcondition, where R stands for the result returned by find. Roughly, the precondition says you have to have a well-formed Union-find data structure before you can call find. This UF here is an abstract predicate which means: there exists a well-formed Union-find data structure and I am the unique owner of it. Since this is a separation logic, we follow the usual discipline of separation logic which says that when you want to mutate a data structure you have to be the unique owner of it. That's what UF N D R means. The parameters N, D, R are explained down here on the slide. D is the set of all elements, so the set of their memory addresses, if you will; that's the domain of the Union-find data structure. N is a bound on the number of elements. I'll speak about it a little bit later. You have to fix it at the beginning. It's used only in the complexity analysis; it doesn't really exist at runtime or play any role at runtime. And R is the logical-level function which maps every element to its representative element. It doesn't really exist at runtime either.

>>: In the definition of the Union-find structure, usually one has a content of type alpha attached to each class.

>> Francois Pottier: Yes, that's a good remark. Indeed, we were a bit out of time when we rushed to publish a paper about this, so we had to leave out the alphas. But you're right that usually you would attach a datum of some type alpha to every root, and then the type elem would become a type alpha elem, and find would return maybe an alpha.
Maybe you would have another operation besides find which would give you the datum attached to an equivalence class. It wouldn't be much more difficult — maybe one more day of work to adapt this — and we plan to do it. Here, indeed, because we don't have these data elements, what find does is just return the representative element, and that's what this specification says here. It says this thing will return an element R such that R is the representative of X, and the Union-find data structure will still be well formed after that; you will still be the unique owner of it. And then the interesting aspect here is this dollar alpha of N plus two. Dollar is our notation for time credits, so it's a permission to spend one unit of time, basically. And here what we are telling the client is: you need to give us alpha of N plus two credits when you call find, and that is the advertised cost, so to speak — the amortized cost. As we'll see, maybe the real cost will be less than that or more than that, but that's our business. We take care of that internally. The client doesn't have to know; all they have to know is that they have to give us this amount of credits.

>>: Are you saving credits from one invocation to another?

>> Francois Pottier: Yes. I'll say a little bit more about that later on.

>>: The things on the right of the star are pure?

>> Francois Pottier: These are not really pure, because these credits are affine, and so when you say dollar n credits it means that I own these credits, I am the unique owner of them, and maybe I will consume them and then they will be lost forever. So the first star is really a star of two impure assertions. Indeed, this one between brackets — those square brackets are the syntax for pure assertions — this equation is something that does not represent ownership of anything. It's just an equation.

>>: So you could have used just a normal conjunction?

>> Francois Pottier: Yes. Here we could use a normal conjunction, but I don't think we have it. We never use it, I think. We just use star everywhere.

The other operations have analogous specifications, so I don't think we need to read them in detail. But again, you will see a precondition and a postcondition, and again, you see the number of credits that must be passed in. This one happens to require three alpha of N plus six credits, and that's a bit more concrete than we would like. Of course we would like to say it's big O of alpha of N, and we have begun working with big O's, but it's not quite ready yet, so at the moment I'm just showing these concrete specifications. Maybe one would also like to publish a spec which says something about both amortized complexity and worst-case complexity. Here the worst-case complexity is O of log n and the spec doesn't say it. If we wanted it to say it, we would need maybe another kind of time credits which cannot be stored from one invocation to the next. Then we could say we require O of log n non-storable credits, and that would be a way of saying that the worst-case cost is O of log n.

This is the function make, which creates a new element, so this one just has constant time complexity. And there is this precondition here, this condition which says the cardinality of the domain should never exceed N. This is kind of unpleasant, but it's imposed on us by the way the textbook proof, which we follow, is written.
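[The Coq slides are not reproduced in the transcript. Modulo notation, the three specifications being described can be paraphrased as triples roughly like the following — a hedged sketch: the exact Coq statements, the side conditions such as membership of x in D, the updated map R' in union's postcondition, and the constant credit count for make are approximated here:]

```latex
% find: costs alpha(N)+2 credits, returns the representative R x.
\{\; \mathsf{UF}\,N\,D\,R \,\ast\, \$(\alpha(N)+2) \;\}\;\;
  \mathtt{find}\;x \;\;
\{\; \lambda r.\; \mathsf{UF}\,N\,D\,R \,\ast\, [\,r = R\,x\,] \;\}

% union: costs 3 alpha(N)+6 credits; R' is the updated representative map.
\{\; \mathsf{UF}\,N\,D\,R \,\ast\, \$(3\alpha(N)+6) \;\}\;\;
  \mathtt{union}\;x\;y \;\;
\{\; \lambda z.\; \exists R'.\; \mathsf{UF}\,N\,D\,R' \;\}

% make: constant cost; requires that the domain has not yet reached N.
\{\; \mathsf{UF}\,N\,D\,R \,\ast\, [\,|D| < N\,] \;\}\;\;
  \mathtt{make}\;() \;\;
\{\; \lambda x.\; \mathsf{UF}\,N\,(D \cup \{x\})\,R \;\}
```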
This textbook proof has a parameter N which needs to be fixed at the very beginning, because it plays a role in the definition of the potential function. This is where N is fixed, and this is also where the UF predicate is created out of nothing. When you want to begin using the algorithm, initially you have nothing, but you can say: I'd like to begin working with an empty Union-find data structure. This does nothing at runtime, actually, because you don't need to do anything to initialize this structure. It's just a theorem that we have proven which says: if you have nothing, then you can turn that into an empty Union-find data structure with an empty domain. And at this point you have to choose N, and it's going to be fixed for the remainder of the use of this data structure. Having to fix N like this is not very convenient, and there is actually a recent paper, published last year, where they are able to avoid this, and we hope to adapt their proof in the future.

>>: [indiscernible] arbitrarily large with no bad consequences?

>> Francois Pottier: You mean here, here or there…

>>: Yeah, here.

>> Francois Pottier: Yes, here you can choose any N you like, and it's only used in the complexity analysis. It doesn't have any use at runtime, so you could choose…

>>: [indiscernible]

>> Francois Pottier: No, no. N isn't used at runtime at all.

>>: [indiscernible]

>> Francois Pottier: Yes, it's used in the complexity analysis, so, indeed, it appears in these preconditions, but it doesn't influence the actual runtime. It's only used in the analysis. And so you could use, I don't know, something like the age of the universe for N, and it would give you alpha of N equals five, and basically in practice you would be happy.

>>: Can that be captured in the spec? Saying N is a ghost?

>> Francois Pottier: It is ghost, if you will, just like D and R. These things appear only as parameters to the abstract predicate; they don't appear in the code itself, so that's the way we manipulate ghost state.

>>: But, as you mentioned, somebody else did it without the N? Could you have just incremented N whenever the cardinality of the domain reached N?

>> Francois Pottier: In the proof we follow, it's hard to change N somewhere in the middle, because you have defined a potential function which depends on N. You have carefully arranged just the right number of time credits everywhere, on every element of your data structure, and if suddenly you changed N then you might have to request new time credits and put them in your data structure. So it's not easy to see how to do this, but actually, the paper I mentioned does it. It requires some subtle changes to the proof. Again, it doesn't look very hard once these guys give you the right approach, but finding it is very difficult, I think.

I'll say a few more words now about the logic that we use. This slide is supposed to recall some definitions of separation logic. The assertions that we use are implicitly parameterized over the heap, so "heap to prop" means they are propositions which refer to the heap — or heaplet: it's typically a partial heap, only the part of the heap that you own. I don't need you to read all of this. These are the definitions of the usual connectives, and in particular, separating conjunction says the heap should be splittable into two compatible subheaps such that H1 holds of one of them and H2 holds of the other. These are basically the definitions that you find in every paper on separation logic.
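[The slide is not reproduced here, but the definitions being recalled are the standard ones; roughly, as assertions over heaplets h:]

```latex
% Pure assertion: holds of the empty heaplet when P holds.
[\,P\,]\;h \;:=\; h = \emptyset \;\wedge\; P

% Separating conjunction: the heaplet splits into two disjoint parts.
(H_1 \ast H_2)\;h \;:=\; \exists h_1\,h_2.\;
    h = h_1 \uplus h_2 \;\wedge\; H_1\,h_1 \;\wedge\; H_2\,h_2
```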
And the basic CFML framework uses these definitions. What we want to do now is extend it with time credits, and somehow we have to explain what dollar n means. What we would like is to have dollar n as a new connective, a new kind of assertion in separation logic, so we would like it to have type heap to prop as well. The properties that we need it to satisfy are very few — only these two, basically. Zero credits should be the same as true, as nothing. And we would like to have this equation here, which says that if you have n plus n prime credits, you can split them and use the two parts for different purposes: maybe pass n credits to a function which you are calling and keep the rest for yourself later on, or maybe store some of them in the heap for later use and keep the rest to use now, or something like that. Those are basically the only properties that we need. We also need somehow to tie the meaning of these time credits to the actual computation mechanism: we need to say that whenever you perform one step of computation, you need to consume one credit.

First, I'll say how we define dollar n. You can think of the heap as containing a special memory location where a certain number of credits is stored. To say that, we are changing the type of heap: it's not just going to be a map from locations to values, but we add to it this special ghost memory location which contains an integer, the number of credits that are still in existence. Then dollar n just means that this ghost memory location contains the value n at the moment. Again, this can be a partial count, in the same way that the heap can be a partial heap which represents only the part you own. The n here typically represents only the number of credits that I own; maybe some other people own more, and this is reflected in the definition of separating conjunction, where you see we have a plus here. This basically means different subheaps can hold different numbers of credits.

>>: Are there algorithms where it would be useful for the number of credits to be a rational number?

>> Francois Pottier: Maybe it could be a rational number, or even a real number. At the moment we've been working with integers, and this doesn't prevent you, in some proofs, from using rationals or reals to reason about how many credits you have, but we thought it would be simpler to make it an integer in this model. It could be real. For the moment we haven't done many case studies yet, only two or three besides this one, and working with integers has always been sufficient.

To connect these time credits with actual computation steps, what we say is very simple: every function call consumes one credit, and if there are any loops in the program, we just view them as recursive functions. We only need this rule: every function call costs one. Also, we provide no way of creating new credits, so you have to think that at the beginning of the program's execution a certain number of credits is given by God, so to speak, and you can only consume them; you can never create new ones. Intuitively, it should be easy to see that the total number of function calls ever performed by your program is bounded by the initial number of credits that you provide, which means that this initial number of credits is a good measure of the complexity.
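[In symbols, the two axioms and the resulting bound are roughly as follows — a paraphrase, not the paper's exact statement:]

```latex
\$\,0 \;=\; [\,]
\qquad\qquad
\$\,(n + n') \;=\; \$\,n \,\ast\, \$\,n'

% Soundness, informally: if execution starts with $n and every
% function call consumes $1, then the program performs at most n calls.
\#\{\text{function calls performed}\} \;\le\; n
```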
This particular inequality we proved on paper — there is a precise statement, which I will just skip — and we can prove it on paper in the same way that we prove the correctness of the CFML logic.

>>: What you are counting is the number of calls to your API functions, such as link?

>> Francois Pottier: Every function, basically.

>>: Every single function?

>> Francois Pottier: Yes. Including auxiliary functions that you may define inside your module.

>>: What about all of this [indiscernible] function?

>> Francois Pottier: You can consider it as a primitive operation and say it costs zero, or you can consider it as a function and say it costs one. In the end it doesn't make any difference — well, it makes a difference in the concrete count of credits, but asymptotically it doesn't make any difference.

Intuitively, one might think that it would be sufficient to say every recursive function call costs one credit, but since we are in a higher-order language we don't really know ahead of time which functions are going to be recursive or which functions are going to call each other, so it's simpler to just say every function call costs one. The way this is done in CFML, conceptually, is that CFML with time credits will just insert a call to pay at the beginning of every function, and then everything proceeds as in the old CFML without time credits. You only have to consider that this pseudo-function pay has dollar one as its precondition, and that forces you, when reasoning, to pay one credit every time you come across a call to pay. That's how it's implemented, I think.

>>: [indiscernible] you can't say: if I have infinitely many [indiscernible] operations, I will eventually amortize. You have to save in advance to pay for later?

>> Francois Pottier: Yes. If you want to amortize, you have to save in advance. I guess this means we do potential-based proofs of amortized complexity, and some paper proofs don't fit in that framework because they want to consider the whole sequence of operations at once. For instance, I know of a paper that reasons about Union-find in a divide-and-conquer manner: it reasons about whole sequences of operations, and that kind of proof we would not be able to port directly.

Another thing you may have been wondering about is whether it is okay to just count function calls, and in what sense that is a meaningful measure. There are two assumptions we are making here. One of them is that if you have an OCAML function which doesn't contain any loops, then the compiler will translate it to some machine code that doesn't contain any loops either, which means that the cost of invoking this function is basically a constant. It can, of course, itself make other function calls, but those will be counted. Basically there is a constant factor here, which is the size of the largest function body in your code, or something like that. The other assumption is that every machine instruction executes in constant time, for some constant. Some people may throw their arms in the air and say this is crazy — of course, some machine instructions take much longer than others — but at our level of abstraction we will just say it is a constant. These hypotheses we cannot really prove, because that would require a detailed model of the compiler and of the machine. If you accept them, then you get that the number of credits spent really is, up to a constant factor, a good measure of the time required to execute the program.
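[To illustrate the pay discipline just described: conceptually, reasoning proceeds as if every function body began with a call to pay, a ghost pseudo-function whose precondition is one credit. A hedged sketch of what the instrumented find might look like, reusing the elem type from the earlier sketch — this instrumentation is conceptual and is not actually executed:]

```ocaml
(* Hypothetical ghost operation, for illustration only: in the logic,
   pay has precondition $1; at runtime it would do nothing. *)
let pay () = ()

let rec find (x : elem) : elem =
  pay ();                  (* each function call consumes one credit *)
  match !x with
  | Root _ -> x
  | Link y ->
      let z = find y in    (* the recursive call pays again *)
      x := Link z;
      z
```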
I mentioned this already in passing, but these time credits are ordinary separation logic assertions, so you can pass the ownership of a certain number of credits from a caller to a callee, or back, and you can also store them in the heap. We will see that in the Union-find example: when we look at the definition of the UF predicate, we'll see that it says several things, and among others it says: I own a certain number of time credits. Those are the credits that have been saved ahead of time, to be able to justify that the amortized cost of every operation is correct.

This leads me to the last part of the talk, where I'm going to say a few more words about how we define UF internally, in particular what it looks like. UF is a conjunction of four things. The definition looks like this, and I'll explain in turn what these four things are. I don't think I have to go too deep into details; I just want to say roughly what this is about.

The first part of the invariant is what you would find in a mathematical proof. If you were reasoning about the pseudocode and not the actual code, you would need to say things like: the disjoint set forest is balanced in some way, and the ranks which are stored on the nodes satisfy certain properties. That's the kind of thing you have to write here. This F, which hasn't appeared yet, is the graph. It's a relation between nodes, and it tells you when there is an edge from one node to another. This particular line here, for instance, says: if there is an edge from x to y, then the rank of x is strictly less than the rank of y. So the rank grows strictly as you follow paths in the graph. That's one example of the properties that you need to track, and it needs to be part of the invariant. And K is the function that maps every element to its rank, and it's also ghost state, in a way. If you recall the OCAML code, in the actual executable code we only keep track of the rank of every root; for the internal nodes we don't need to keep track of the rank. Whereas in the proof we need to keep track of the rank of every node, and that's not a problem — it's permitted by this approach.

The second part of the invariant is the ownership of the part of memory where the nodes exist. Since this is a separation logic, we want to say that we are the unique owner of this part of memory. CFML gives us a predicate called group ref, which represents a region, a group of references in the heap, and it's parameterized by a finite map M which maps every memory location to its content. In the invariant we are going to have an assertion of this form, group ref M.

Then we need to connect the first two things that we just saw. On the one hand we have the mathematical reasoning, and on the other hand we have this group of references, and we need to say what the connection between them is; that's what this third part is about. We have to say that the domain of the map M, which exists in memory, is exactly D, the D that appears in the mathematical invariant. We also have to connect the contents, so we say that for every x in D, M of x is a link if there is an edge in the graph, and M of x is a root if x is a root in the graph. And we also say that if it is a root, then the rank that exists at runtime is the rank that we keep track of in the mathematical part of the proof. So this is the usual way of connecting what happens at runtime with what happens in your reasoning: it's a representation predicate. And the last part of the invariant is the potential.
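[Putting the four parts together, the shape of the definition — paraphrased; the Coq statement in the development differs in notation — is roughly:]

```latex
\mathsf{UF}\;N\;D\;R \;:=\;
  \exists F\,K\,M.\;
       [\,\mathit{Inv}\;N\;D\;F\;K\;R\,]   % 1. mathematical invariant
  \ast\; \mathsf{GroupRef}\;M              % 2. ownership of the nodes
  \ast\; [\,\mathit{Mem}\;D\;F\;K\;M\,]    % 3. runtime/model connection
  \ast\; \$\,\Phi(D,F,K,N)                 % 4. saved potential
```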
Here we follow the textbook proof of the amortized time complexity. It's the potential-based proof, so there is this potential function phi which tells us how many time credits we should have at hand at every point in time. It's a function of many things — D, F, K, and so on. In Coq we have to define this function phi, and as part of the invariant we have to say: we hold phi time credits. To sum up, that's how UF is defined: it contains the mathematical invariant, the ownership of the references, the predicate that connects the first two aspects, and the appropriate number of time credits. Once we have given this definition, we only have to check that every operation satisfies its specification. Typically, every operation has UF as a precondition and UF as a postcondition, so we have to prove that this invariant is maintained.

The definition of phi is very tricky. I'm not going to read this in detail. The question here is: how did they find this thing? I think it's quite hard to understand the historical process by which they came to it, but some intuition is given by this paper here, where they have an approach which is not based on a potential function; it's based on looking at the whole sequence of operations and using a divide-and-conquer argument to find a recurrence equation that describes the overall cost. This paper is illuminating to some extent, but it's not directly amenable to our style of reasoning, so we have to follow the paper proof based on the potential function. This unfortunately doesn't seem to carry any interesting intuition, at least for us. Maybe for Tarjan it does.

>>: So there is no general way — or someone would have found it by now — of converting a proof about a sequence of operations up to [indiscernible]?

>>: [indiscernible] a constant number of credits at the beginning?

>> Francois Pottier: The problem is we are in this Hoare logic approach, where we have to give a pre- and a postcondition to every operation, and this forces you into a very local view where you reason about one operation at a time. I don't know if it's possible to remain in this framework and at the same time be able to reason about sequences of operations. Maybe it's possible; I'm not sure. That said, if you want to reason at the level of pseudocode instead of the actual executable code, then I guess you have more flexibility and you can reason about sequences of operations. There was a paper this year by Tobias Nipkow at ITP where he did some proofs of amortized complexity at the level of pseudocode, and I think he allowed himself to reason about sequences of operations.

Coming back to these things: basically, we translate them directly into Coq. It isn't very readable, but it says the same thing, essentially. For convenience, we don't limit ourselves to the constructive subset of Coq; we allow ourselves to use the epsilon operator, for instance. It's used here because Tarjan says: let p F x be the parent of x, if x has a parent. This seems innocuous, but you cannot translate it into Coq without using the epsilon operator — or maybe the iota operator, the unique choice operator — because if x doesn't have a parent, then this expression doesn't make sense. So you must allow yourself to say: let p F x be this expression if it is defined, and if it's not defined, then p F x is anything; it could be any node.

>>: What if you just said p F x is defined as being the parent of x, or, like, x itself?
>> Francois Pottier: Yes, maybe you could give it a default value. But then you would need this "if", which is another thing that we use which is also nonconstructive. In Coq you cannot say "if" on a proposition; you can say "if" on a Boolean, which is not the same thing. If you want to say "if x has a parent then this, else that", you need an "if" that takes a prop as an argument, and this requires, essentially, excluded middle or something like that. So that's another feature which we use: it's this capital If which takes a prop, and that's also not part of the constructive subset.

The last thing I'll say is this: you don't want to look at the proof in detail — it's hairy — but what's interesting is the way that what you have to prove, which is this inequality, arises naturally just because you have to prove that the UF invariant is preserved. Initially, as part of UF, you get a certain amount of credits, phi. You also get the credits which are brought to you by the client. For instance, if we are verifying find, the precondition says the client should bring alpha of N plus two credits, so you get these credits here. Out of all of that, you will have to consume some, because you are making some function calls — that's the actual cost — and then you have to prove that you are able to reestablish phi, the invariant, in the new, modified state; that's the phi prime term which is here. In other words, phi plus the advertised cost must cover the actual cost plus phi prime. That's the inequality which is proven in the textbooks, and here it just arises out of the methodology which we follow. That's about as much detail as I wanted to show about the proof, but I can show more details offline if someone is interested.

In conclusion, we are happy with the results, because the textbook proofs were relatively easy to port, and we think it's nice to be able to write a Hoare triple which says something about correctness and complexity at the same time. We also think it's good to be able to reason about the actual code rather than some abstract pseudocode. The whole thing is relatively short: it's about 3000 non-comment lines of proof for the mathematical arguments, and then about 400 lines for the specs and the equivalent of the VC-gen process — producing the proof obligations, which F star or Dafny, for instance, would do automatically. In this case you have to do it step by step using Coq tactics, but it's usually quite concise, so it doesn't take a lot of space. You can find the proofs online.

In the future, some things we would like to do: establish a more precise bound which doesn't force you to fix a capital N ahead of time. In practice, of course, alpha of little n and alpha of big N are both constants for any reasonable human use, but it would be convenient for the client not to have to fix capital N. Also, we would like to introduce this big O notation; Nikhil mentioned this. This is one case study which will be the starting point for a larger project that is just about to begin. We have got some funding to try to develop a verified OCAML library containing some basic data structures and algorithms — typically collections of various kinds and graph algorithms, the kinds of algorithms which are useful when you implement compilers or theorem provers or things like that. CFML would be one tool which we can use to develop this library; other possible tools would be Why3 or Coq itself or maybe F star. So that's something we are about to begin, and that's it. Thank you.
[applause]

>>: How long did it take?

>> Francois Pottier: How long did it take? It took me about two weeks to do just the mathematical proof of complexity, without any regard to the actual OCAML code. Then Arthur came and did the connection between the mathematical proof and the code, and it took him maybe one week, maybe even less. I'm not sure exactly.

>>: Arthur is no normal person, though.

>> Francois Pottier: Of course, Arthur is the author of CFML and he is also a very smart guy. Someone who doesn't know CFML would need more time to get acquainted with it. At the moment, CFML is not very well documented, but as part of the project which I mentioned, we intend to produce more documentation on how to use the tool. Yes?

>>: In the case where you have these particular constants, like three alpha of N plus six and so on [indiscernible], if you tweak the code a little bit in a way that introduces more function calls, it's going to change the constant.

>> Francois Pottier: Yes. That's the trouble with concrete costs: it's not modular, because some internal change in the code becomes visible to the outside, because the concrete number of credits changes. That's the main motivation for introducing the big O notation.

>>: That allows you to hide the internal…

>> Francois Pottier: Yes. Once you say it's O of alpha of N, then you are free to do as many function calls as you like internally, as long as you preserve the asymptotic complexity.

>>: [indiscernible] pass those functions to a higher-order function, it's going to get called by the higher-order function.

>> Francois Pottier: Yeah. So when you write down a specification for a fold function, for instance, which does a loop, you could equip it with several different specifications, which will be useful in different scenarios. The most precise specification would say: the cost of fold F is the sum of the costs of all the individual invocations of F. That would be the most precise specification, but not very easy to use. You could also derive from it another specification which says: if F is constant time, for instance, then the cost of fold F is just O of n, where n is the number of elements. And you could have other specifications in between, like: whatever the cost of F, if you have a bound on it, then the cost of the loop is n times this bound. So you can prove several specs for the same function, and the client is then free to choose whichever one is most convenient. That's the way we plan to approach working with higher-order functions. Yes?

>>: I can see, after seeing the talk, why you are tempted to go to the O notation, but when you first introduced the three alpha plus six, I thought: that's going to be very useful. It would be useful to see the constants if they were meaningful, if they were actually talking about x86 instructions or something [indiscernible]. It would be useful to lift that, to expose that, if [indiscernible].

>> Francois Pottier: If it were actually an estimate of the worst-case execution time, then maybe it would be useful.

>>: I'm just saying you may not want to just throw everything away and make it O, because ultimately, maybe you want to have a model machine.

>> Francois Pottier: If we wanted to do that, I guess we would need to try to have more automation in the process of gathering these little constants, because at the moment it's tolerable to do it by hand because we seldom pay.
We only pay at function calls, but if we had to pay for every single instruction, then we would have to compute the sums of all of these costs and we would have to automate it. Otherwise it would be intolerable.

>>: You said that you had this division of labor where you spent a couple of weeks doing the mathematical proof. I'm wondering how that division of labor worked out, because it seems like you need to reason about the structure of the code, with the specific function calls that are being made, to come up with these constants. How were you able to do it independently of the code, initially?

>> Francois Pottier: It was fairly easy, because abstractly you have an idea of what Union-find is. Let me go back to the slide where I showed this invariant. You can describe the algorithm abstractly in terms of these things: the graph F, which describes the edges in memory — it's just a relation; it has type element to element to prop — and similarly this function K, which is just a function from every element to its rank, and so on. So you can prove many properties of these things abstractly. I can define union and find at this level; I can say what the effect of union and find on F, K, and R is, abstractly, and then I can prove many lemmas about what these operations do and how much time, so to speak, they take. That's what I did, without even looking at the code. You can do all of this math independently, and then it's very easy to connect it with the actual OCAML code using CFML. Basically, CFML extracts proof obligations, and every obligation is proven by one of the lemmas which you have established at the mathematical level ahead of time.

>>: Including the amount of time it takes, at this level?

>> Francois Pottier: Yes. You have to artificially produce something that allows you to reason about the time. For instance, to reason about the cost of find, I have defined an inductive predicate which means: we are doing n steps of path compression. Once I have this predicate, I can give a bound, and I can prove that at most this many steps of path compression are needed. That's where the hard part of the proof goes, and then it's very easy to connect that to what the actual code does. All right. Thank you.

[applause]