>> Madan Musuvathi: Hi everybody. Thanks for coming. I'm Madan Musuvathi from the Research in Software Engineering group, and it's my pleasure to introduce Susmit Sarkar, a researcher from the University of Cambridge. He's interested in mathematically, rigorously characterizing different aspects of the real world, and he's been thinking about shared memory models for quite some time now. He's going to talk about both the hardware memory models of Power and ARM and how they relate to the design of the recent C and C++ memory model.

>> Susmit Sarkar: Thanks, and hello again. So as Madan said, I've been interested for the past several years in looking at shared memory concurrency and seeing what is going on there. So, shared memory concurrency: it's great. We've been thinking about it for a long time. We've been writing concurrent algorithms, and reasoning about those algorithms, for a long time. Unfortunately, most of that work has been done under the assumption that the way to think about it is that you have a single shared memory, and all threads have, at all times, a consistent view of this memory. This is technically called sequential consistency, and it at least has the nice property that you can reason about your programs by reasoning about interleavings. Of course, as many of you well know, if you're programming on modern multiprocessors, or even with modern compilers, this is not true. What you get instead is something different: something called relaxed memory models. These relaxed memory models are, well, stranger than sequential consistency, but they're also very different depending on the platform you run on. So of course the programming languages we design have to deal with relaxed memory as well. Very recently, just late last year, C and C++ for the first time have concurrency as a defined part of the language in the language standards. These standards were basically a lot of clever people thinking about axioms for what concurrency should be, and they had a problem, because this was going to be implemented on real hardware, right: on x86, on ARM, on Power. And the problem is that all these hardware models are, well, different from the one C and C++ have, but also very different from each other. You get different flavors of relaxed memory depending on what you're running on. So there is a real question, and this was a question in the concurrency committee's mind as well: can we even implement this model that we have come up with on modern hardware? What will it take to answer that question? First of all, you have to map the constructs C and C++ now have down to the machine level, into assembly code or something like that. Furthermore, your compiler now has to understand this: it has to understand which optimizations are still legal to do, and perhaps which optimizations are now good to do, things like fence elimination and fence insertion. In this talk I'll focus on this question, but only on a part of it. I'll focus on a particular variety of modern hardware, Power, which is very similar to ARM, so in many places you can think of what I'm saying as applying equally well to ARM. Furthermore, I'll just talk about mapping the constructs to assembly. I'll be happy to talk about optimizations with you if you want, but in this talk I'll just be talking about the mapping.
As you will see, even in this restricted problem space, there are quite a number of challenges. So what do I mean by this mapping? In explaining that, I'll also explain what C and C++ now have, for those who are not familiar. First of all, consider the normal stores and loads that you're used to writing in your programs. These are mapped, as you might imagine, to assembly-language stores and loads; nothing very special going on there. Next, C and C++ (by the way, I'll use those terms interchangeably here; for the purposes of the concurrency model they're just the same) have different kinds of what are called atomic stores and loads, and these come in a variety of flavors, called sequentially consistent, relaxed, release, and what not. These are mapped to, well, first of all the underlying stores and loads, but you'll also notice stuff around them: various kinds of barriers that, for example, Power gives you; compare-and-branch sequences; all kinds of stuff. Next, programmers can impose order if they want to by using fences, or barriers, and again C and C++ give you various different flavors of fences. These are mapped to various kinds of barriers as well. Finally, there are slightly higher-level constructs that C11 also gives you, things like compare-and-swap, and these are mapped to a rather longer, but still not too long, sequence of assembly: there are special instructions, larx and stcx, which I'll talk about; there are little loops going on; maybe there are barriers somewhere; that kind of stuff.

So that's a mapping. Is that mapping correct? In other words, does it give you the semantics that C11 says it should? It turns out the mapping I just showed you does not, because in one particular place you needed a different kind of barrier. So, okay, is this next mapping correct? I'll not leave you in suspense: this time the answer is yes, this one does preserve the semantics. But as a compiler writer you might think of asking: is that the only correct mapping? Can we do something else? And the answer, as it turns out, is no, it's not the only one: you can have alternative mappings where, for example, you put barriers in different places, making some operations less expensive and some other operations more expensive. So you might think that compilers are free to choose whichever mapping they want; but if they want to be interoperable, then at least at the boundaries they must agree on the mapping, otherwise the barriers will be in the wrong places. So here we are: I have given you three different mappings, one which I said was wrong, and two which I said were correct. Why should you have any confidence in me? The only way to answer this kind of question is to turn to formal methods, and we have proved a theorem that says: for any sane compiler that follows the mapping I showed, take any C program at all; then the compilation of this program has no more behaviors on Power than it is allowed in the C11 concurrency model. So this is a standard kind of compiler soundness result. And in doing this proof, it was easy to show that one version of the mapping, the one I showed you, which had been proposed by various people before, was in fact incorrect, and that various other proposed mappings were correct, and equally good.

>>: Can you explain what it means for the compiler to be sane?

>> Susmit Sarkar: Right.

>>: Non-optimizing -

>> Susmit Sarkar: Briefly, what it means is that it doesn't move your stores and loads around; I'll talk about that in more detail later.
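For reference, here is the rough shape of the mapping under discussion, as it was published around that time. This is a hedged sketch: the exact barrier placement is the delicate part, and hwsync, lwsync, and isync are the Power barrier mnemonics shown in comments.

    #include <atomic>

    std::atomic<int> x;

    void mapping_examples() {
        int r;
        // Stores (the Power sequence each compiles to, per the
        // published mapping, is shown in the comment):
        x.store(1, std::memory_order_relaxed);  //         st
        x.store(1, std::memory_order_release);  // lwsync; st
        x.store(1, std::memory_order_seq_cst);  // hwsync; st

        // Loads:
        r = x.load(std::memory_order_relaxed);  //         ld
        r = x.load(std::memory_order_acquire);  //         ld; cmp; bc; isync
        r = x.load(std::memory_order_seq_cst);  // hwsync; ld; cmp; bc; isync
        (void)r;
    }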
Okay. So that's the theorem we proved. Proving theorems is good; we like doing that. But this kind of theorem also has real-world implications. First of all, it builds confidence in the C and C++ models. As I said, this was just a model invented by people, so you build confidence in it by working out the intuitions for why it is correct to implement it in this kind of way. Of course, it also has relevance to compiler implementations in the real world; for example, GCC used to get this wrong, and now they don't. It's also, as I said, a path to reasoning about ARM: since, in terms of concurrency, ARM has a really similar model to Power, you can reason about ARM implementations as well. So the plan for the rest of the talk is this: I'll introduce a few examples of relaxed memory behavior, showing the kinds of things we have to deal with when doing this kind of proof. I'll then spend most of the talk on the Power model that we devised, taking into account all the various kinds of things that happen on Power and ARM architectures. I'll talk about synchronizing operations like compare-and-swap, and then wrap up by talking about the correctness proof for C11.

So, relaxed memory behavior. Perhaps many of you have seen this before; quick show of hands, how many have seen message passing? Okay, half. So here's our simple shared-memory concurrent program. There are two threads, thread zero and thread one, operating on two shared memory locations, d for data and f for flag. And this program, message passing, which you might have seen or heard of as producer-consumer, is a rather simple and widely used idiom in shared-memory concurrency. What's going on is that the writer thread, or producer, is writing something to the data structure, d, and then setting a flag saying it's done. The reader, or consumer, thread waits until it sees that flag, and then accesses the data structure. So the question here is: is the reader ever allowed to see an old value of the data? If you're thinking of this in interleaving terms, then the answer is clearly no: you can never see an old value out here, because we waited for the flag. So what happens if you take that program as written and just feed it to your C compiler? What happens, as of C11, as of late 2011, is that C says that program has undefined semantics. It has no semantics at all. Why not? Because, of course, there are races on both the data and the flag variables here. So what if you really wanted to program at this low level, you wanted to do this kind of stuff? Then what C11 lets you do is mark the variables on which races are okay. It does this by calling them atomic variables and marking the loads and stores. Here I'm using C++ syntax; there's [inaudible] syntax also introduced. And these are marked as an atomic store to data and flag, and an atomic load from flag and data. You're also allowed to give various parameters; the one here is the relaxed memory-order parameter. So what happens with that? What happens is, first of all, that C11 says the program has defined semantics. Furthermore, it is possible for the program to read an old value of the data, right there. So why did C11 allow that kind of thing? It did that to allow for various hardware, and indeed compiler, optimizations.
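In C++11 syntax, the relaxed-atomic version of the example looks roughly like this. A sketch; the variable names d and f and the function names are illustrative, following the talk's example.

    #include <atomic>

    std::atomic<int> d(0), f(0);

    void writer() {                                // Thread 0: producer
        d.store(1, std::memory_order_relaxed);
        f.store(1, std::memory_order_relaxed);
    }

    int reader() {                                 // Thread 1: consumer
        while (f.load(std::memory_order_relaxed) == 0) {}
        // Defined behavior now, but C11 still allows this to return
        // the old value 0.
        return d.load(std::memory_order_relaxed);
    }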
So modern hardware, like Power or ARM, would look at those two stores and say: well, those are to two different locations, so that's no reason for me to keep them in order; let's just send them out to other threads out of order. And then, of course, you can see them out of order out here. A different kind of thing that ARM does is to allow speculative loads. That is, it says: well, that load is to a different location; let's speculate it, let's do it early, even before that loop finishes. And again, of course, you can see an old value. I'll point out that on x86-like machines, TSO, you don't get this from the hardware by itself. But of course your compiler can do stores and loads out of order as well, if it's an optimizing compiler; in fact, many of them do. So what if you really wanted to program message passing, producer-consumer, in C11? Then you'd have to give different parameters to these atomic stores and loads. What you have to do is make that store a release kind of store, and that load an acquire kind of load. If you do both of those things, then C11 says that because the acquire load reads from a release store, there is enough synchronization in the program, and therefore in C that load is never allowed to read an old value. Okay? So what is an implementation supposed to do? It must forbid any compiler optimizations, such as reordering these stores or loads, that it might have done otherwise. Furthermore, when implementing this on hardware, it must take steps, inserting barrier instructions of various kinds, to make sure this never occurs on the real hardware. So the recommended way to implement message passing, translating the program I had in C into assembly, looks something like this. What's going on here is that there's a barrier, which in Power terms is an lwsync, lightweight sync, and what it does, briefly, is take the stores before it and after it and keep them in order as they go across to other threads. So it forbids that mechanism of reordering those stores. Out here, meanwhile, there's a different kind of barrier, an isync, which gets inserted. What the isync does, taking into account the loop before it, is make sure that succeeding loads can no longer be speculated up above it. If you do both of those things, then you will not see this behavior on ARM or on Power, and we have tested this out; you in fact never do. Okay. Questions?

>>: Do you know if isync actually slows down the branch speculation?

>> Susmit Sarkar: It does slow down the speculation. Its job in life is basically to stop that speculation, the [inaudible] speculation that modern hardware does have.

>>: What's SC?

>> Susmit Sarkar: SC is sequentially consistent, the old-style interleaving we think of.

>>: What is seq_cst?

>> Susmit Sarkar: seq_cst is one more of these memory orders that you're allowed to have in C++. What it says, briefly, is that if you annotate all your stores and loads as seq_cst, you'll get back sequentially consistent behavior.

>>: How did the isync get out of the loop?

>> Susmit Sarkar: I did a bit of optimization there, a very tiny bit. You're right, in fact: it should have been inside the loop. I just hoisted it out; this is a rather easy analysis to do. So that's message passing, or producer-consumer.

>>: Does ARM also have an analogue of isync?

>> Susmit Sarkar: Yes, it does; it's called the Instruction Synchronization Barrier, ISB.
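To recap, here is the release/acquire version, with the Power translation just described shown in comments. Again a hedged sketch with illustrative names; the barrier placement follows the recommended mapping.

    #include <atomic>

    std::atomic<int> d(0), f(0);

    void writer() {                                  // Thread 0
        d.store(1, std::memory_order_relaxed);
        // Release store; on Power: lwsync; st
        f.store(1, std::memory_order_release);
    }

    int reader() {                                   // Thread 1
        // Acquire load; on Power: ld; cmp; bc; isync
        while (f.load(std::memory_order_acquire) == 0) {}
        // Now guaranteed to see the new value 1.
        return d.load(std::memory_order_relaxed);
    }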
>>: What about Alpha? Does it present any further challenges?

>> Susmit Sarkar: You'd have to look at the Alpha barriers, but Alpha has similar barriers that do similar kinds of things: stop speculation, stop reordering, and so on. And of course, if you are doing this on different hardware, you will have to look carefully at the memory model there. For example, Itanium would have something different again. In general there are lots of questions you might ask, for programs that are not just message passing. There are particular questions people ask, like: is it safe to remove that barrier there? There are more general kinds of questions, such as the one I'm talking about: can we even implement this concurrency model we've come up with on realistic hardware? There are more semantic kinds of questions: is it possible to guarantee to the programmer that he'll always get sequentially consistent behavior if, say, he locks all his accesses, or puts barriers around all his accesses, or something like that? And then there are compiler kinds of questions: is it legal to do these kinds of optimizations?

So where do you turn if you want to answer these questions? One place you might turn is the manuals. All of these guys, ARM, Power, x86, C++, come with manuals, which are really big chunky beasts. And they are, well, scary because they're chunky, but they're more scary when you open the pages. They're written in this kind of standardese, which is quite vague and imprecise. In fact, as we found out in previous work, sometimes they're just plain wrong; they're flat-out lying to you. So it's no surprise that even the people who write these say things like: this is horribly incomprehensible, nobody can ever parse or reason with these things. What else might you think of doing? One other thing is to test the implementations. Of course, they're just sitting there, right? That laptop of mine has a multi-core x86. The phones many of you are carrying around, the iPads, have multi-core ARMs. So you can run these programs and see what they do. We did this: we take small tests and then run them lots and lots of times, many different iterations, with randomization, trying to explore what happens. And we found various kinds of cases, various rather rare behaviors. In fact, sometimes we found cases that were bugs, real honest-to-God bugs, in deployed hardware. So testing is good, but of course you have to interpret the results of the tests as well. And here you have to talk with the people designing these beasts, as well as the people using them: people designing from the hardware side, and people programming on top of them. They typically know quite a bit about their own designs, or algorithms, or what have you, but here we are trying to think of the general case, so we have to focus on what the programmer-observable behavior is. Realistically, of course, you have to do this over and over again, in an iterative process. You read the manuals; you formalize what you think they are saying; you test it out; you see the results and discuss the consequences. And then you go back, tweak the model, and do the iterative process again. In this process, we found that we were not just discovering what the programmer model is; in fact, we were inventing it. Nobody really knew what was going on. I also found, in doing this kind of work, that it's critical to have machine assistance, and we used all kinds of things.

We used theorem provers, we used model checkers, we used SAT solvers, to help us automate this process and keep track of what our assumptions were and what they implied. Based on this kind of work, we devised a precise model of Power and ARM, and it looks something like this. There's a model of the thread subsystem, and this models the various kinds of behavior that realistic cores, or threads, do, abstracting away from the details and dealing with things like core-level speculation. Next, there is an interconnect between all these threads, which we call the storage subsystem, and this abstracts away from all the different kinds of interconnects, the cache protocols and what have you, that join up all these threads. Formally speaking, these are just abstract machines, or labelled transition systems, which are a very operational way of thinking: [inaudible] synchronizing with each other on different messages. And they are rather abstract, abstracting away from all the microarchitectural details. To give you an example of what I'm talking about, here's a sketch of a microarchitecture. In modern hardware there are various levels of buffering going on, and it is the case on architectures like ARM that you can have levels of this buffering shared by different threads. So these two threads are in some sense close neighbors, and they are far away from these guys. And this shows up in real examples: you can write programs which detect that a write done by that thread is seen by that one, which is close by, but not yet seen by other ones which are far away. And perhaps there are even more layers of hierarchy, so there are levels of closeness that you might have. So that's a fine microarchitectural kind of explanation. But you really don't want to write programs which depend on two threads being close together and two other threads being far away. What we did was something much more abstract: we equipped each thread in our system with its own private memory, with peer-to-peer links between all of them. In doing this, we abstracted away totally from the particular topology we have at the moment.

To give you a sense of what else we need, here's another example, quite similar to the message passing example we had. This is meant to show you the kinds of things that happen when you program on more than two threads. Here what I've done is taken message passing and split up the writing thread: the guy who writes the data is split off into its own thread, and this middle, intermediate thread waits for that data and then sets the flag. The reading thread, meanwhile, has the load of the flag and the load of the data, written in a particular funny way. The particular funny way is that, dynamically, it's always going to read from d, because that XOR is always going to return zero. But syntactically, it looks like the address you're going to read from depends on that load. What all that does is gum up the speculation of loads, and this is something you can do, and people do take advantage of, on machines like ARM. So here we can omit the isync barrier we had before, because of that dependency; there's a sketch of the whole example below.
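A hedged reconstruction of that three-thread example, in C-flavored pseudocode within comments. This is a hardware-level idiom only: as comes up later in the discussion, a real C compiler would simplify r1 ^ r1 to zero, so the dependency trick makes sense only at the assembly level.

    // Thread 0:  d = 1;                       // write the data
    //
    // Thread 1:  while (load d == 0) {}       // wait for the data
    //            sync;                        // barrier: must be cumulative
    //            f = 1;                       // then set the flag
    //
    // Thread 2:  r1 = load f;
    //            r2 = load *(&d + (r1 ^ r1)); // address depends on r1; the
    //                                         // dependency blocks load
    //                                         // speculation, so no isync
    //                                         // is needed here
    //
    // Question: can Thread 2 see r1 == 1 but r2 == 0?
    // With a cumulative barrier on Thread 1, the answer is no.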
>>: Sorry, reading from [inaudible]?

>> Susmit Sarkar: d and f are still global variables.

>>: And so what you want to do is d plus -

>> Susmit Sarkar: So we're going to read from -

>>: A dependency on the load of f.

>> Susmit Sarkar: Exactly.

>>: So do the architects actually, do they allow, do they agree with this?

>> Susmit Sarkar: [inaudible].

>>: Because this does prevent some future optimization.

>> Susmit Sarkar: Absolutely. Absolutely. So they are tying one hand behind their backs in doing this, but they do guarantee it. And they guarantee it because programmers want to make reads cheap. Recall that in the previous example we had the isync, which is more expensive than doing this kind of thing, so this is a programming idiom people care about. So anyway, the question we ask is again the same: can we ever read an old value of the data? And to prevent that, you really need a property of the barrier there which is called cumulativity: not only does the barrier keep the same thread's stores before it in order with its stores after it, it also keeps any store that the thread read from before the barrier in order with respect to stores after that barrier. This is the kind of property you really need when you go from two to three and more threads. Okay. It's called cumulativity.

So, just to give you a flavor of the kind of model that we have, here's a description of one rule. You don't have to read this very carefully; I just brought it up to show you that we are talking in abstract terms, just in terms of the concepts I've been discussing here. A write can propagate to another thread under some conditions, and the conditions are stated fairly abstractly: there's a sanity check that it's not already been propagated, there's the cumulativity condition that all relevant barriers have been propagated first, and there are some other conditions to do with coherence. It's not too scary to look at. And if you look at the formal mathematical definition, it's not too scary either; it's a fairly direct transliteration of the prose explanation I just gave. This is propagating a write w to a thread tid', and again there are some sanity conditions: it's not already been propagated; there's the cumulativity condition, about barriers before it being propagated first; and the coherence condition. It's not too big.

To talk about another feature of these machines, this time the core-level speculation, I'll introduce this example. It might look a bit scary, but I'll walk you through it slowly; there's also a sketch of it below. Think of it as just message passing. In fact, on the writing thread it's the exact same thing: there's the write to d, the barrier, and the write to f. Over here, on the reading thread, we have the load of the flag, and the load from a location which is going to turn out, dynamically, to always be d. So it's just message passing. In between, however, we're doing some funny stuff. The funny stuff is: we are looking at that flag, and if we ever read anything other than one, we go off somewhere else. But if we do read one, we write something out to some temporary location, read it back, and then do a load that seems to depend on the value we just loaded. Okay. So what does all that complicated rigmarole give you? Here's a chain of reasoning. That load seems to depend on that load up there, and therefore it cannot be speculated before that load. Okay. Now, that load is of course going to take its value from that store, and therefore it cannot be done any earlier than that store. If you read the manuals, meanwhile, they promise you that stores on Power and ARM are not speculated past branches.

So you'd think that would mean the store is not going to be performed before the branch gets resolved, and thereby not before you do that read. Okay. Now, that read, if it's ever to read one, has to take its value from that write there. And because of that barrier, by that time the store of the data must have come across to this thread as well. So you'd think that, given all those chains of reasoning I quickly sketched out, you'd never be able to read an old value of the data. And again, we ran this program, and we discovered that in fact, in some cases, you do. So what's going on here? Where in that chain of reasoning was the flaw that allowed this to happen? The flaw is this: the manuals are correct in saying that the store is not going to be speculated as far as other threads are concerned; but, and this is the part they leave out, as far as the same thread is concerned, it can perfectly well be forwarded. What's going on in the real hardware is that the branch prediction mechanism is saying: I'm not going to take that branch, so we're eventually going to get to that store, and therefore that load will eventually get its value from that store. Clear the dependency, do that load, all good. Some time later, along come both of those stores; you read the flag; you see that in fact you did read one, and therefore your speculation was justified. Things are good. This turned out to be quite a surprise to both the designers and the programmers: speculation is now in fact visible to the programmer. So we have to explicitly model this kind of speculation in our model.

>>: Are they willing to fix the hardware?

>> Susmit Sarkar: No, no, no. No.

>>: Did you ask them? Because you said you discussed this with the designers as well, and -

>> Susmit Sarkar: They expected this, yes. One said, you know, deal with it, basically.

>>: It doesn't look like a big deal to me.

>> Susmit Sarkar: So this is the kind of argument that they make.

>>: It doesn't look so different to me from store forwarding all along, store-to-load forwarding.

>> Susmit Sarkar: Sure. In that kind of programming world view, maybe you really shouldn't have been depending on that complicated chain I talked about; you should have had barriers. And this is the kind of argument hardware designers do make, and in fact why they say they're not willing to fix the hardware, or the model. I'm just pointing out that, contrary to expectations, you really have to think about speculation if you're going to explore all the possible behaviors of all kinds of programs. So anyway, our model has to deal with speculation, and it does that in just the kind of way you'd expect if you were modeling speculation: you have instructions which are in flight and will later be committed, and they can be in flight even past unresolved branches, so those arrows are program order.
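For reference, a hedged reconstruction of that speculation example, in the same C-flavored pseudocode as before; the temporary location t is illustrative.

    // Thread 0:  d = 1;
    //            lwsync;                  // barrier keeping d before f
    //            f = 1;
    //
    // Thread 1:  r1 = load f;
    //            if (r1 == 1) {
    //                t  = 1;              // store to a temporary location
    //                r2 = load t;         // satisfied early: the store t = 1
    //                                     // is forwarded within this thread
    //                                     // while the branch is speculative
    //                r3 = load *(&d + (r2 ^ r2));   // "dependent" load of d
    //            }
    //
    // Observed on real hardware: r1 == 1 and yet r3 == 0, because the
    // store-to-load forwarding happens before the branch is resolved,
    // defeating the dependency chain the reasoning relied on.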
>>: I have a question about dealing with speculation. In some sense there are two levels of having to deal with speculation. One level is having to deal with the fact that events can be temporally reordered before they're confirmed, which I think we have been used to for a long time now. The other one is having to deal with the fact that misspeculations are somehow visible. That one seems much worse. So which one are you talking about here?

>> Susmit Sarkar: I'm talking about both of those.

>>: You have in fact -

>> Susmit Sarkar: The example I had before was one where the speculation was justified, but there are also examples where the speculation was unjustified.

>>: Every time, you have something about the misspeculated execution that [inaudible] the final execution?

>> Susmit Sarkar: It influences the values that you read on the same thread, yes.

>>: Can you give an example?

>> Susmit Sarkar: Not on these slides.

>>: So if the hardware does some branch misspeculation, and it turns out to be wrong, the instructions executed on the wrong branch can somehow influence the non-speculated ones? That was your question?

>>: Yes, that was the question.

>> Susmit Sarkar: You can see it in various kinds of ways. When I get to talking about CASes: you can make your CASes fail whether or not they were supposed to.

>>: CASes can fail even if you don't have branch speculation, right? The spec [inaudible] says even if you have a lone thread that's ever operating -

>> Susmit Sarkar: Mostly it does not, sure.

>>: The spec says -

>> Susmit Sarkar: So anyway, we have to deal with speculation in the model. In terms of the size of the model, then: we have an explanation of about three pages of prose, at the level I showed you. And in math, it's about 2,500 lines; 2,500 lines of a new language that we devised, called Lem. Lem is a front-end language: you can extract proof definitions from it, for HOL4 and Coq, and you can also extract executable code, that is, OCaml. So I wrote a harness on top of this automatically extracted code, and that gave me a tool that explores the model, exhaustively checking it or interactively checking it. It turns out you can compile the OCaml code to JavaScript and thereby run it in your browser, and run it on your phone, having personally tested this. So here I am running that program; this is our model running, just in JavaScript, in the browser. We call this tool PPCMEM. What you can do here is write assembly code yourself if you want; we also have a library of tests. So I take, for example, a plain old message passing example. Here is real assembly code that does message passing. I've taken off all the barriers, so there are just the plain stores and loads here; stw is Power assembly speak for a store. There are stores to two locations on this thread, and loads from those locations on that thread. And you can ask the question: is it possible to read one first and zero next? Right? You can use the non-interactive mode, which will exhaustively check our model and see whether it's possible. I don't recommend doing this in JavaScript, because it's slow, but you can do it in the command-line version of the tool we have. But you can also do interactive checking. If you do interactive checking, you can step through what the model says. So here's the state of the system: there's a state for the storage subsystem, and states for each of the threads. As you can see, at this point in time nothing much has happened; there are just the initialization writes that have been seen, and all these instructions are waiting around. Okay? All the transitions the model allows now are clickable. So, for example, I can commit that instruction there. And once I do that, all the preconditions for committing that store are satisfied, so I can now commit the store, thereby making the write of one to y visible.

I'll also point out that I did this before I committed that store up there, so I really did commit out of order. I can, for example, read again out of order on that thread, maybe even commit that, why not, then perhaps propagate this write here to that thread. Now you see I can only read one, where before I was able to read zero. You can do various other steps; you get the idea. We have a library with a variety of different tests, as I said, including the speculation example I had, and various others. You can go in and, for example, write in a barrier and see what happens. So it's really fun to play with, and it lets you explore our model and the consequences thereof.

So how do we go about validating the model that we had? First of all, by the process I just described we can extract executable code and have this exhaustive checker: take tiny litmus tests and see all the different behaviors allowed by the model. Then you can take the same tests and run them on real hardware. We did, on various generations of Power, and we built up histograms of which behaviors were seen and not seen on different hardware. Here's a small sample of those results, and I'll show you how to read them. First of all, consider the behaviors that the model says are forbidden; in other words, the model guarantees they will never be seen. For the model to be sound, it's important that we never actually see those on real hardware. So we tested this quite a number of times; that's about 10^11 runs there. This is, of course, empirical testing, so maybe something changes on the 10^12th run, but we did put in a reasonable amount of effort. Next, there are tests where the model says some behavior is allowed. Most of the time what we see is that it really does occur on real hardware: sometimes really often, sometimes not quite so often. Sometimes, for some varieties of tests, you see them on some generations of the hardware and not on others. This points out a key fact: the specifications you're building, these models, have to be loose models, because they have to cover all the different variants of the architecture you might have, and in fact future implementations as well. And this brings me to the last kind of tests, where the model says they're allowed but we've never seen them on real hardware. We took all of these kinds of tests and discussed them quite carefully with the designers, and in all of these cases the answer was: yes, particular features of the microarchitectures they have so far implemented ensure that this is not seen on current hardware, but they want to leave open the possibility that some future generation of the hardware might do these kinds of things. So I'll briefly move on now to talking about synchronizing operations, because they are fun and they're used by real-world programmers.

>>: I have a question about this model. Is the model understandable to smart programmers? That is an evaluation -

>> Susmit Sarkar: I'd like to think so. It's basically at the scale that I showed you: writes propagating to threads. And to help people understand, we have this tool that lets them explore the consequences of the programs they have.

>>: Do you envision that you could write verification tools? Say I wrote a piece of code and I believe that this code meets some specification, it's implementing a linked list, for instance. What is required to prove that, for a programmer to convince himself that it's right?
>> Susmit Sarkar: That's a good question. We have in fact looked at rather simple programs, basically linked lists, and tried to reason informally on top of this model, and tested out the informal reasoning by running it through the kind of tests we have. So this is the kind of thing we've been doing. We also want to package this up, and this is future work, into reasoning principles which let you reason at a higher level.

>>: This model is operational. Do you have an understanding of what the matching axiomatic model is, or do you just know some -

>> Susmit Sarkar: We haven't invested much effort into that kind of thing, because, well, any axiomatic model that precisely captured the operational model would be equivalent to it, and would carry over more or less much of the operational understanding. But on a different front, the C++ model is an axiomatic model, and the implementation can be proved sound against it; that's what we did. We can have axiomatic approximations, if you're looking for a precise axiomatic -

>>: The C++ model is reasonable [inaudible].

>> Susmit Sarkar: Okay. So, briefly moving on to talk about synchronization operations. What do I mean by that? Things like compare-and-swap, atomic addition, atomic subtraction, that kind of thing. If you're programming on a RISC-like architecture, what you typically get is a pair of instructions, sometimes called load-reserve and store-conditional, or load-linked and store-conditional, and these let you implement all these kinds of synchronization operations. So here's an implementation of the atomic addition primitive; there's also a sketch of it below. Out here there's a load-reserve, which on Power is called a larx, or load-linked. And out here is a store-conditional, which is called a stcx. So, informally, what's going on? The larx is a load. Then you can do various other stuff; for atomic addition, you want to add things. And then you do the store-conditional. Among other things, the store-conditional is a store, but a particular kind of store: a store that can succeed, and thereby do the store, but can also fail, and thereby not do the store. The machine tells you whether it succeeded or not in a flag, and the typical way to use it is to loop back and try again if you failed. So that sequence is supposed to give you an atomic addition. What does that mean? What's supposed to happen, informally, is that the stcx can succeed only if no other thread wrote to the location in question, b, since the last larx. Okay. If you're thinking about relaxed memory behavior, at this point you should be jumping at my throat: what is this "since" you speak of, what does that mean? Maybe "since" in machine time? It turns out that's the wrong answer; it's neither necessary nor sufficient. To understand what's going on, you have to think about what the microarchitecture is really doing. Informally speaking, what the microarchitecture does is this: if the thread in question did not lose ownership of the cache line between the larx and the stcx, then it knows no other thread could have written to it in between, and therefore the load and the store were atomic. So that's a fine microarchitectural explanation. But of course the programmer doesn't want to reason about "do I own the cache line, have I lost ownership of the cache line". You want to think at a more abstract level, and to do that, you have to think about what that cache protocol is buying you.
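Going back to the atomic-addition sequence just described, here is a hedged C++ analogue. compare_exchange_weak is allowed to fail spuriously, just like a stcx, so it sits in the same kind of retry loop that a compiler emits with lwarx/stwcx. on Power; the function name is illustrative.

    #include <atomic>

    // A sketch: atomic addition built from a retry loop, the same shape
    // a Power compiler produces with load-reserve/store-conditional:
    //   loop: lwarx  r, 0, addr    ; load-reserve
    //         add    r, r, inc
    //         stwcx. r, 0, addr    ; store-conditional: may succeed or fail
    //         bne-   loop          ; if it failed, try again
    int atomic_add(std::atomic<int>& b, int inc) {
        int old = b.load(std::memory_order_relaxed);
        // On failure, 'old' is reloaded with the current value, so the
        // loop simply retries with fresh data.
        while (!b.compare_exchange_weak(old, old + inc,
                                        std::memory_order_relaxed)) {
        }
        return old;  // the value just before our addition took effect
    }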
What the cache protocol is buying you, in terms of transferring ownership from one thread to another, is that it builds up a chain of ownership for the location in play. It builds up, in other words, an order relating the stores to that location, which we call coherence, and which every thread can agree on. Once you have that abstract notion in hand, it's easier to state what atomic means. What it means is that a stcx is allowed to succeed only if its store can be placed in the coherence order right next to the write that the larx read from; so those two are atomic. Furthermore, of course, this coherence order is an abstract order that builds up over time, so it must also be the case that you're never allowed to later violate the atomicity of those two writes staying together. So that's the key concept we need, and now we can give it a name: we call this a write reaching its coherence point. In our model, a store reaches its coherence point, operationally, when the abstract coherence order we have built up has become linear before this write, and furthermore is never going to become different again.

>>: Is this how the model works: you have these relations that you add to, and you never remove any entries, I guess?

>> Susmit Sarkar: Exactly so; all our rules have preconditions ensuring that they never add bad edges. Okay. So, of course, you also have to deal with the interactions with the rest of the system; in other words, what are the interactions with the normal kinds of stores and loads and barriers we had before. And it's really easy to get this wrong: there was a rather recent kernel bug where they got confused about what the ordering properties were with respect to normal loads and stores. But once you have this kind of formal model, you can again prove things about it. One simple result you can get is this: if you replace all your accesses, in other words all your stores and loads, by these atomic kinds of accesses, then you regain sequentially consistent behavior. So this is an alternative to locking all your stores and loads, or to putting barriers between every pair of stores and loads.

All right. To wrap up, then, I'll talk about the proof of correctness of the implementation of C11. I don't have time here to talk about the whole of the C11 model; it's rather complex. I'll just treat it at a rather high level. This is the program we saw before, and have seen multiple times in this talk: the release/acquire version of message passing. Now, C11 is what's called an axiomatic model. What that means is that it reasons not about the program making steps, but, after the program has executed and created all its stores and loads, about whether that execution was allowed or not. So it takes the program and converts all of the stores and loads into events, and then it defines various relations on those events, and furthermore various axioms about those relations. It has various kinds of relations. There's a relation called sequenced-before, which is sort of like program order, and there are various other relations. The key one here is something called happens-before, which is defined in a very particular way in C11. And right here on this example, because that store was a release and that load an acquire, there is a happens-before edge there, and there, and there. There are now consistency conditions on this execution.
And, for example, one kind of consistency condition is that the read is allowed to read from that store, but, because of that happens-before edge, it's not allowed to read from something further back, that initialization write. I'll also point out that the semantics is only defined for race-free programs, where races are defined very particularly in terms of the happens-before relation. So there they are: a bunch of axioms, fairly complex axioms. But in fact there is some intuition behind those axioms, and this we discovered by doing this kind of proof. The base case of that happens-before relation is the synchronization between a release kind of store and an acquire kind of load, and we see this clearly reflected in things you get on the hardware, the properties of the barriers we had. Next, this release/acquire synchronization has to be transitive: if you do a release/acquire here, and on another location another release/acquire, these all chain up together. And this corresponds, again in a fairly direct way, to a property of the hardware: the cumulativity property that we talked about. There are various other kinds of things. There are particular features of C11 that were carefully designed to take dependencies into account, and these come up on Power when you're doing the kind of dependency reasoning I was doing. There are special rules for compare-and-swap, and these correspond in a fairly reasonable way to reasoning about when it is possible for writes, or store-conditionals, to reach their coherence points.

So then, how do we prove this theorem? The broad view: recall that we're talking about all possible C programs. In fact, C11 does not give any semantics at all to racy programs, so we need only consider data-race-free programs; other programs are allowed to do whatever they want. For any such program, then, we look at any compiler. Not just any compiler: any sane compiler, as I said. What does that mean? It means that it preserves memory accesses, does not optimize them away or reorder them, and furthermore uses the mapping table we had. If we take any such compiler, then we look at the target of the compilation. We have a model for Power, and therefore we can find all the behaviors that the Power program has. This model, recall, is an operational kind of model, so the behavior you get is the set of traces the model is allowed to produce. You look at each of those traces and try to build up executions as allowed by C11. You do this by building up the key relations you need, happens-before and so on, out of the trace. In each case you'll find that the axioms C11 depends on correspond to features of the machine, so you have to look closely at what the machine is doing. There's also a subtlety here: of course, on the machine level there's no concept of a race; programs just do something. So if it looks like there is in fact a racy kind of execution in terms of C11, you have to actually construct that race in C11 and thereby get a contradiction with the data-race-freedom precondition. So this can be done, and we did it; it's a proof. And what did we learn in doing this proof? We learned various kinds of things. We learned, for example, that release/acquire reasoning is, well, used by a lot of programmers, but also corresponds directly to what the hardware does, so it in fact transfers directly between the software and hardware levels.
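As an illustration of that transitive chaining of release/acquire synchronization, here is a hedged C++ sketch across two locations; on Power this is exactly where the cumulativity of the barriers is needed. The names d, f1, f2 and the thread functions are illustrative.

    #include <atomic>

    std::atomic<int> d(0), f1(0), f2(0);

    void thread0() {
        d.store(1, std::memory_order_relaxed);
        f1.store(1, std::memory_order_release);
    }

    void thread1() {
        while (f1.load(std::memory_order_acquire) == 0) {}
        f2.store(1, std::memory_order_release);
    }

    void thread2() {
        while (f2.load(std::memory_order_acquire) == 0) {}
        // The two release/acquire synchronizations chain up transitively,
        // so this load is guaranteed to see d == 1.
        int r = d.load(std::memory_order_relaxed);
        (void)r;
    }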
If you can think of your programs just in terms of release/acquire, maybe that's a good thing, too. We also learned facts about the hardware. We learned that if a certain hardware optimization were done, and by the way, this optimization seems quite natural to do, then in fact C11 would be unimplementable; unimplementable, that is, without putting in barriers absolutely everywhere and thereby defeating the point of having different kinds of stores and loads. Fortunately, current hardware does not do this, and now we have a strong argument to put to designers about why they should never do it. We also -

>>: The definition of unimplementable is that we need fences for [inaudible], is that the definition?

>> Susmit Sarkar: Unimplementable efficiently, yes. Yeah. And we also learned various ways, as I said, of regaining sequentially consistent behavior.

>>: Can we go -- so here I can think of this as the last phase of the compiler, right, in the sense that the compiler takes the program and compiles it to, let's say, a machine-independent IR, and then it has a back-end phase that takes this IR and translates it -

>> Susmit Sarkar: Sure.

>>: So the reason I ask is: can we actually have a two-phase compiler, where you allow all DRF-preserving optimizations [inaudible] to get to the IR, and then you use this translation?

>> Susmit Sarkar: Yes, absolutely.

>>: Then will the theorem hold? Can you prove that?

>> Susmit Sarkar: Sure. You can do any kind of optimizations which stay within DRF up there.

>>: The problem is, if your choice is to stay within DRF, you're not allowed to do some things that you otherwise could by dropping down earlier. For example, in the target model there's no problem with data races on Power; if you introduce a data race in the Power program because it runs faster, you're okay doing that.

>> Susmit Sarkar: No, no -- that you're not able to do if, for example, you want to make the program -

>>: That's not what I'm asking. I agree with you. But I'm asking about real compilers, actually -- are they sane? They're definitely -

>> Susmit Sarkar: Optimizing.

>>: Optimizing.

>> Susmit Sarkar: Absolutely.

>>: They do reorder memory operations.

>> Susmit Sarkar: As long as they stay within this. Another way to ask the question: you're trying to design an intermediate representation that has the C11 model.

>>: Exactly.

>> Susmit Sarkar: I think that's a perfectly -

>>: Cast all transformations so the front end -- so the [inaudible] is doing a source-to-source translation, from C++ to C++.

>>: Are you reasonable, flexible, do you have to stay within -- probably if you have to -

>>: My question was more: can I take this theorem and conclude that GCC is correct?

>>: GCC does not stay within C, C++.

>> Susmit Sarkar: So -

>>: My version of GCC that is DRF-compliant.

>> Susmit Sarkar: Yes. So then you are sort of -- in fact, they're trying to build up to that kind of version.

>>: Is it stronger than the first one, or the same, in the sense that you could come up with an optimizing compiler by putting restrictions on the compiler?

>> Susmit Sarkar: Absolutely, if you want to do that kind of thing. But in the first instance we're just going from source to target. You're allowed to do anything, as Sebastian is saying: optimizations that stay within that fragment. Future work is: what if you take optimizations that go outside that fragment but still in some way preserve source-level properties?

>>: One thing the compiler is definitely going to do: if you do the XOR trick, reading a value and XORing it with itself, it's going to simplify that to zero, definitely. You have to make sure that that's not -

>> Susmit Sarkar: I'm glad you brought that up. So it's perfectly safe for it to do that for non-atomics, what you used to think of as normal stores and loads. But it's not safe in C11 to do that for atomics, or for volatiles, say.

>>: Okay. So for regular stores and loads, the theorem should be strong enough to show that [inaudible] the compiler -

>> Susmit Sarkar: You'd have a bit of work, not very hard work, but sure.

>>: Okay. Any help from the DRF property?

>> Susmit Sarkar: Yes, essentially. You are doing source-level optimizations that still stay within DRF: because you're optimizing away non-atomic stores and loads, if there was no race to begin with, you're not introducing new races. Right?

>>: I think in some sense [inaudible] question: can you get the most important optimizations done that way, and then [inaudible].

>> Susmit Sarkar: All right. So here we are. We have been reasoning about mainstream concurrent programs, at the very lowest level, doing this on real hardware like Power and ARM, and showing how high-level language primitives can be compiled. We have a theorem, the correct-compilation result, and this clearly has, as I said, relevance to real-world compilers, and it also builds confidence in these models. What about the future? Well, one thing about this proof is that it really boils down what it is that we are depending on from the hardware, and that lets us think about new kinds of hardware that maybe relax some of it. Also, of course, this is a path to reasoning about low-level programs: building up reasoning principles from the assembly level up to high-level languages. Maybe C and C++ is not a high enough level language, but it sure beats reasoning in terms of assembly. Thank you for your attention. There are more details there; in particular, that's the URL for the tool, which I really encourage you to play with. Thank you.

[applause]

>> Susmit Sarkar: I don't know if you have -

>> Madan Musuvathi: Any questions?

>>: It looks like I have two options. Say I want to write a low-level, lock-free, highly optimized piece of code: should I try to use the C++ memory model to do that, or should I use your model to do that?

>> Susmit Sarkar: My personal feeling is that you should use the C11 model, because then you can in fact port it to various different kinds of hardware.

>>: Okay.

>> Susmit Sarkar: The proof I just described lets you do that reasoning at the C11 level while still getting all the properties that Power gives you. Of course, if you care just about Power, or just about ARM, then you can use my models directly to reason at that level as well.

>> Madan Musuvathi: Thank you.

>> Susmit Sarkar: Thanks.