>> Madan Musuvathi: Hi everybody. Thanks for coming. I'm Madan
Musuvathi from the Research in Software Engineering group, and it's my
pleasure to introduce Susmit Sarkar, a researcher from the University
of Cambridge. He's interested in mathematically, rigorously
characterizing different aspects of the real world, and he's been
thinking about shared memory models for quite some time now. He's
going to talk about both the hardware memory models of Power and ARM
and how they led to the design of the recent C and C++ memory model.
>> Susmit Sarkar: Thanks, and hello again. So as Madan said, I've been
interested for the past several years in looking at shared memory
concurrency and seeing what is going on there.
So shared memory concurrency: it's great. We've been thinking about it
for a long time. We've been writing concurrent algorithms, and
reasoning about those algorithms, for a long time.
Unfortunately, most of this work has started from the assumption that
the way to think about it is that you have a single shared memory, and
all threads have at all times a consistent view of this memory. This
is technically called sequential consistency, and it at least has the
nice property that you can reason about programs in terms of
interleavings.
Of course, as many of you well know, if you're programming on modern
multiprocessors, or even with modern compilers, this is not true. What
you get instead is something different: something called relaxed
memory models.
And these relaxed memory models are, well, stranger than sequential
consistency, but they're also very different depending on the platform
you run on. So of course the programming languages we design have to
deal with relaxed memory as well. And very recently, just late last
year, C and C++ have for the first time concurrency as a defined part
of the language, in the language standards.
These language standards were basically a lot of clever people
thinking about axioms for what concurrency should be, and they had a
problem, because they were going to implement this on real hardware,
right, on x86, on ARM, on Power.
And the problem is that all these different hardware models are, well,
different from the one C and C++ have, but also very different from
each other. So you get different flavors of relaxed memory depending
on what you're running on.
So there is a real question, and this was a question in the
concurrency committee's mind as well: can we even implement this model
that we have come up with on modern hardware?
What will it take to answer that question? First of all, what you
have to do is map the constructs C and C++ now have down to the
machine level, into assembly code or something like that.
Furthermore, your compiler has to understand this now. It has to
understand which optimizations are still legal to do, and perhaps
which optimizations are now good to do, things like fence elimination
and fence insertion.
So in this talk I'll focus on this question, but only on a part of it.
I'll focus on a particular variety of modern hardware, Power, which is
very similar to ARM; in many places you can think of what I'm saying
as applying equally well to ARM. And furthermore, I'll just talk about
mapping the constructs to assembly.
I'll be happy to talk about optimizations with you if you want, but in
this talk I'll just be talking about the mapping. As you will see,
even in this restricted problem space, there are quite a number of
challenges.
So what do I mean by this mapping? I'm trying to explain that, and
I'll also explain what C and C++ now have, for those who are not
familiar. First of all, consider the normal stores and loads that
you're used to writing in your programs.
These are going to be mapped, as you might imagine, to assembly
language stores and loads. Nothing very special going on here. Next,
C and C++, by the way, I'll use those terms interchangeably here; for
the purposes of the concurrency model they're just the same.
C and C++ then have different kinds of what are called atomic stores
and loads. And they come in a variety of flavors: they're called
sequentially consistent, relaxed, release, and what not. These are
mapped now to, well, first of all, just the underlying stores and
loads.
But you'll also notice stuff around them. There are various kinds of
barriers that, for example, Power gives you. There are
compare-and-branch sequences. There's all kinds of stuff. Next,
programmers want to impose order, if they want to, by using fences or
barriers. And again, C and C++ give you various different flavors of
fences. These are mapped to various kinds of barriers as well.
Finally, there are slightly higher-level constructs that C11 also
gives you, things like compare-and-swap. And these are mapped to a
rather longer, but still not too long, sequence of assembly. There are
special instructions, lwarx and stwcx. (load-reserve and
store-conditional), which I'll talk about. There are complicated loops
going on. Maybe there are barriers somewhere, that kind of stuff.
So that's a mapping. Is that mapping correct? In other words, does it
give you the semantics that C11 says it will? And it turns out that
the mapping I gave you was not, because in one particular place you
needed a different kind of barrier. So, okay, is this mapping correct?
I'll not leave you in suspense: this time the answer is yes, this one
does preserve the semantics. But as a compiler writer, you might think
of asking: is that the only correct mapping? Can we do something else?
And the answer, as it turns out, is no. You could have alternative
mappings where, for example, you put barriers in different places: you
make some operations less expensive and some other operations more
expensive.
So you might think that compilers are free to choose whichever mapping
they want. But if they want to be interoperable, then at least at the
boundaries they must agree on the mapping, otherwise the barriers will
be in the wrong places.
So here we are. I have given you three different mappings: one which
I said was wrong, and two which I said were correct. Why should you
have any confidence in me?
The only way to answer this kind of question is to turn to formal
methods, and we have proved this theorem: for any sane compiler you
can think of that matches the mapping I showed, take any C program at
all; then the compilation of this program has no more behavior on
Power than it has in the C11 concurrency model. So this is a standard
kind of compiler soundness proof.
In doing this proof, it was easy to show that one version of the
mapping, the one I showed you, which was proposed by various people
before, was in fact incorrect. And various other mappings that people
proposed were correct, and equally good.
>>: So can you explain what it means for the compiler to be sane?
>> Susmit Sarkar: Right.
>>: Non-optimizing --
>> Susmit Sarkar: Briefly, what it means is that it doesn't move
around your stores and loads. But I'll talk about that in detail as
well. Okay. So that's the theorem we proved. And proving theorems is
good; we like doing that. But this kind of theorem also has good
real-world implications. First of all, it builds confidence in the C
and C++ models. As I said, this was just a model invented by people,
so you build confidence in it by thinking through the intuitions for
why it is correct to implement it in this kind of way. Of course, it
also has relevance to compiler implementations in the real world. For
example, GCC used to get it wrong, and now they don't.
It's also, as I said, a path to reasoning about ARM, because in terms
of concurrency ARM has a really similar model to Power. So you can
reason about ARM implementations as well.
So the plan for the rest of the talk is this: I'll introduce a few
examples of relaxed memory behavior, showing the kinds of things that
we have to deal with when doing this kind of proof. I'll then spend
most of the talk on the Power model that we devised, taking into
account all the various kinds of things that happen on Power and ARM
architectures.
I'll talk about synchronizing operations like compare-and-swap, and
then wrap up by talking about the proof for C11. So, relaxed memory
behavior, then. Perhaps many of you have seen this before; quick show
of hands, how many have? The message passing example. Okay, half. So
here's our
simple shared memory concurrent program. There are two threads, thread
zero and thread one, operating on two shared memory locations: D for
data and F for flag. And the program, message passing, which you might
have seen or heard of as producer-consumer, is a really rather simple
and widely used idiom in shared memory concurrency. What's going on is
that the writer thread, the producer, is writing something to the data
structure, D, and then setting a flag saying it's done.
The reader, or consumer, thread waits until it sees that flag and then
accesses the data structure.
So the question here is: is the reader ever allowed to see an old
value of data? And if you're thinking of this in interleaving terms,
then the answer is clearly no. You can never see an old value out
here, because we're waiting for the flag.
So what happens if you take that program as written and just feed it
to your C compiler? What happens, as of C11, as of late 2011, is that
C says that program has undefined semantics. It has no semantics at
all. And why not? Because, of course, there are races on both the data
and the flag variables here. So what if you really wanted to program
at this low level, you wanted to do this kind of stuff? Then what C11
lets you do is mark the variables on which the races are okay. It does
this by calling them atomic variables and marking the loads and
stores. Here I'm using C++ syntax; there's C syntax also introduced.
And these are marked as an atomic store to data and flag, and an
atomic load from flag and data.
You're also allowed to give various parameters. So here's one of them,
which is, for those of you who know, the relaxed memory order
parameter. So anyway, what happens with that? What happens is, first
of all, C11 says that program has defined semantics. Furthermore, it
is possible for that program to read an old value of data right there.
So why did C11 allow that kind of thing? It did that to allow for
various hardware and, indeed, compiler optimizations.
So modern hardware like Power or ARM would look at those two stores
and see, well, those are to two different locations; that's no reason
for me to keep them in order. Let's just make them visible to other
threads out of order. And then, of course, you can see them out of
order out here.
A different kind of thing that ARM does is that it allows speculative
loads. That is, it says, well, that load is to a different location;
let's speculate it, let's do it early, even before that loop finishes.
And, again, of course you can see an old value.
I'll point out that on x86-like machines, with TSO, you don't get this
by itself. But, of course, your compiler can do stores and loads out
of order as well, if it's an optimizing compiler. In fact, many of
them do.
So what if you really wanted to program message passing,
producer-consumer, in C11? Then you'd have to give different
parameters to these atomic stores and loads. What you have to do is
make that store a release kind of store, and that load an acquire kind
of load.
If you do both of those things, then C11 says that because that
acquire load reads from a release store, there's enough
synchronization in the program, and therefore in C that load is never
allowed to see an old value.
Okay? So what is an implementation supposed to do? It must forbid any
compiler optimizations, such as reordering these stores or loads, that
it might have done otherwise. Furthermore, when it's implementing this
on hardware, it must take steps, inserting barrier instructions of
various kinds, to make sure that this never occurs on the real
hardware.
So the recommended way to implement message passing, translating the
program I had in C into assembly, looks something like this.
What's going on here is that there's a barrier, which in Power terms
is a lightweight sync (lwsync), and what it does, briefly, is to take
stores before it and after it and keep them in order as they go across
to other threads. So it forbids that first reordering of the two
stores. Out here, meanwhile, there's a different kind of barrier, an
isync, which is supposed to be inserted. And what that isync does,
taking into account the loop before it, is make sure that succeeding
loads can no longer be speculated up above it.
So if you do both of those things, then you will not see this on ARM
or on Power. We have tested this, and in fact you never do.
Okay.
Questions?
>>: Do you know if isync actually slows down branch speculation?
>> Susmit Sarkar: It does slow down the speculation. So its job in
life is to basically stop that speculation, the branch speculation
that modern hardware does have.
>>: What's SC?
>> Susmit Sarkar: SC is sequentially consistent, the old-style
interleaving semantics we think of.
>>: What is seq_cst?
>> Susmit Sarkar: seq_cst is one more of these memory orders that
you're allowed to have in C++. And what it says, briefly, is that if
you annotate all your stores and loads as seq_cst, you'll get back
sequentially consistent behavior.
>>: How did the isync get out of the loop?
>> Susmit Sarkar: I did a bit of optimization there, a very tiny bit.
You're right, in fact: it should have been inside the loop. I just
hoisted it out. This is a rather easy analysis to do.
So that's message passing, or producer-consumer.
>>: Does ARM also have an equivalent to isync?
>> Susmit Sarkar: Yes, it does. It's called the instruction
synchronization barrier, ISB.
>>: If Alpha were relevant, would it present any further challenges?
>> Susmit Sarkar: You'd have to look at the Alpha barriers, but Alpha
has similar barriers that do similar kinds of things: stop
speculation, stop reordering, and so on.
And, of course, if you are doing this on different hardware, you will
have to look carefully at the memory model there. So, for example,
Itanium would have something different again. In general, there are
lots of questions you might ask beyond programs that are just message
passing. There are particular questions people ask: is it safe to
remove that barrier there?
There are more general kinds of questions, such as the one I'm talking
about: can we even implement this concurrency model we've come up with
on realistic hardware? There are more semantic kinds of questions: is
it possible to guarantee to the programmer that he'll always get
sequentially consistent behavior if he locks all his accesses, or
puts barriers on all his accesses, or something like that?
And then there are compiler kinds of questions: is it legal to do
these kinds of optimizations? So where do you turn if you want to
answer these questions?
One of the places you might turn to is the manuals. All of these guys,
ARM, Power, x86, C++, come with manuals, which are really big, chunky
beasts. And they are, well, scary because they're chunky, but they're
more scary when you open the pages. They're written in this kind of
standardese, which is quite vague and imprecise. In fact, as we found
out in previous work, sometimes they're just plain wrong. They're flat
out lying to you.
So it's no surprise, then, that even the people who write these say
things like: this is horribly incomprehensible; nobody can ever parse
or reason with these things.
What else might you think of doing? One other thing is to test the
implementations. Of course, they're just sitting there, right? That
laptop of mine has a multi-core x86. The phones that many of you are
carrying around, the iPads, have multi-core ARMs. So you can run these
programs and see what they are doing. So we did this. We take small
tests and then we run them lots and lots of times, many different
iterations with randomization, trying to explore what happens.
And we found various kinds of cases, various rather rare behaviors. In
fact, sometimes we found cases that were bugs, real honest-to-God
bugs, in deployed hardware.
So testing is good. But of course you have to interpret the results of
the tests as well. And here you have to talk with the people who are
designing these beasts, as well as the people who are using them:
people designing from the hardware side and people programming against
them.
And they typically know quite a bit about their own designs, or
algorithms, or what have you. But here we are trying to think of the
general case, so we have to focus on what the programmer-observable
behavior is.
So realistically, of course, you have to do this over and over again
in an iterative process. You read the manuals. You formalize what you
think they are saying. You test it out. You see the results, and you
discuss the consequences. And then you go back, tweak the model, and
do this iterative process again. In this process, we found that we
were not just discovering what the programmer model is; in fact, we
were inventing it. Nobody really knew what was going on. I also found
out, in doing this kind of work, that it's critical to have machine
assistance. We used all kinds of things: we used theorem provers, we
used model checkers, we used SAT solvers to help us automate this
process, to keep track of what our assumptions were and what they
implied. So based on this kind of work, what we did was devise a
precise model of Power and ARM. And it looks something like this.
There's a model of the thread subsystem, and this models the various
kinds of behavior that realistic cores, or threads, exhibit,
abstracting away from the details but dealing with the things that
depend only on, for example, core-level speculation.
Next, there is an interconnect between all these threads, which we
call the storage subsystem. And this abstracts away from all the
different kinds of interconnects, the cache protocols, and what have
you that join up all these threads.
Formally speaking, they are just abstract machines, or labeled
transition systems, which is a very operational way of thinking about
it, and they synchronize with each other on different messages.
And they are rather abstract, abstracting away from all the
microarchitectural details. To give you an example of what I'm talking
about, here's a handy explanation of the microarchitecture. In modern
hardware there are various levels of buffering going on. And it is the
case on architectures like ARM that you can have levels of this
buffering shared by different threads. So these two threads are in
some sense close neighbors, and they are far away from these guys.
And this shows up in real examples. You can write programs which
detect that a write done by that thread is seen by the one which is
close by, but not yet seen by other ones which are far away.
And perhaps there are even more layers of hierarchy, so there are
levels of closeness that you might have. So that's a fine
microarchitectural kind of explanation. But you really don't want to
write programs which depend on two threads being close together and
two other threads being far away.
What we did was something much more abstract. We equipped each thread
in our system with its own private memory, and then we have
peer-to-peer linkages between all of them. In doing this, we have
abstracted away totally from the particular topology that we have at
the moment.
To give you a sense of what else we need, here's another example. It's
quite similar to the message passing example that we had, and it's
meant to show you the kinds of things that happen when you program on
more than two threads. So here what I've done is take message passing
and split up the writing thread. The thread that writes the data is
split off into its own thread, and this middle, intermediate thread
waits for that data and then sets the flag.
And the reading thread, meanwhile, does just that load of flag and
load of data, but in a particular funny way. The particular funny way
is that dynamically it's always going to read from data, because that
XOR is always going to return zero. But syntactically it looks like
the address you're going to read from depends on that load.
What all that does is gum up the speculation of loads, and this is
something you can do, and that people take advantage of, on machines
like ARM.
So here we can omit the barrier, the isync we had before, because of
that dependency. So anyway --
>>: Sorry, reading from [inaudible]?
>> Susmit Sarkar: D and F are still global variables.
>>: And so what do you want to do, D plus --
>> Susmit Sarkar: So we're going to read from --
>>: A dependency of the load on F?
>> Susmit Sarkar: Exactly.
>>: So do the architects actually, do they allow -- do they agree with
this?
>> Susmit Sarkar: [inaudible].
>>: Because this does prevent some future optimization.
>> Susmit Sarkar: Absolutely. Absolutely. So they are tying one hand
behind their backs in doing this. But they do guarantee it, and they
guarantee it because programmers want to make reads cheap. Recall that
in the previous example the isync was more expensive than doing this
kind of thing, so this is a kind of programming idiom people rely on.
So anyway, the question we ask is again the same: can we ever read an
old value of data? And to prevent that, you really need a property of
the barrier there, which is called cumulativity: not only does the
barrier keep the same thread's stores before it in order with its
stores after it, but it keeps any store that the thread might have
read before the barrier in order with respect to stores after that
barrier.
So this is the kind of property that we really need when you go from
two to three and more threads. Okay. It's called cumulativity.
So just to give you a flavor of the kind of model that we have, here's
a description of one rule. You don't have to read it very carefully; I
just brought it up to show you that we are talking in terms of
abstract concepts, just the concepts that I've been talking about
here. So a write can propagate to another thread under some
conditions, and the conditions are stated fairly abstractly: there is
some kind of sanity check, that it has not already been propagated;
there is the cumulativity condition, that all relevant barriers have
been propagated; and there are some other conditions for it to go
ahead and be seen. It's not too scary to look at. And if you look at
the formal mathematical definition, it's not too scary either: it's
just a fairly direct transliteration of the explanation in prose I had
before. This is propagating a write W to a thread TID', and again
there are some sanity conditions, that it's not already been
propagated; there's the cumulativity condition there, of barriers
before being propagated first; and the coherence condition. It's not
too big.
So, to talk about another feature of these machines, this time
core-level speculation, let me introduce this example. It might look a
bit scary, but I'll walk you through it slowly. Think of it just as
message passing.
In fact, on the writing thread it's the exact same thing: there's the
write to D, the barrier, and the write to F. Over here, on the reading
side, we have the load of the flag there, and the load from a location
which is going to turn out dynamically to always be D. So it's just
message passing. In between, however, we're doing some funny stuff.
The funny stuff is that we are looking at that flag and checking
whether we read anything other than one; if so, we branch off
somewhere. But if we do read one, we write something out to some
temporary location, read it back, and then do something that seems to
depend on the value we loaded just there. Okay. So what does all that
complicated rigmarole give you? Here's a chain of reasoning. That
second load seems to depend on the load up there, and therefore it
cannot be speculated before that load.
Okay. Now, that load, of course, is going to take its value from that
store, and therefore it cannot be done any earlier than that store.
If you read the manuals, meanwhile, they promise you that stores on
Power and ARM are not speculated past branches. So you'd think that
would mean that that store there is not going to be done before you
get back a value, before the branch gets resolved, and thereby before
you do that read. Okay. Now, that read, if it's ever to read one, has
to take its value from that write there. And because of that barrier,
by that time that store must have come across to this thread as well.
So you'd think that if you depended on all those chains of reasoning
that I quickly sketched out, you'd never be able to read an old value
of data.
And then we ran this program, and we discovered that in fact, in some
cases, you do. So what's going on here?
Where in that chain of reasoning was the flaw that allowed this to
happen? The flaw was this: the manuals are correct in saying that that
store is not going to be speculated as far as other threads are
concerned. That's the part that was missed out.
As far as the same thread is concerned, it can perfectly well forward
it. So what's going on in the real hardware is that the branch
prediction mechanism is saying: I'm not going to take that branch, so
we're eventually going to get to that store; and therefore that load
will eventually get its value from that store. Clear the dependency,
do that load, all good. Some time later, along come both of those
stores; you read that; you see that in fact you did read one, and
therefore your speculation was justified. Things are good.
This turned out to be quite a surprise to both the designers and the
programmers, in that now speculation is in fact visible to the
programmer.
So we have to explicitly model this kind of speculation in our model.
>>: Are they willing to fix the hardware?
>> Susmit Sarkar: No, no, no. No.
>>: Did you ask them? Because you said it surprised the designers as
well, and --
>> Susmit Sarkar: They accepted this, yes. One said, you know: deal
with it, basically.
>>: Doesn't look like a big deal to me.
>> Susmit Sarkar: So this is the kind of argument that they make.
>>: This is not -- it doesn't look so different to me from store
forwarding, store-to-load forwarding.
>> Susmit Sarkar: Sure. In that kind of programming world view, maybe
you really shouldn't have been depending on all that complicated chain
I talked about; you should have had barriers. And this is the kind of
argument that hardware designers make, and in fact why they're saying
they're not willing to fix the hardware, or the model.
So I'm just pointing out that, contrary to expectations, you really
have to think about speculation if you're going to explore all the
possible behaviors of all kinds of programs.
So anyway, our model has to deal with speculation, and it does that in
just the kind of way you'd think of if you were in fact modeling
speculation: you have instructions which are in flight, and later
they're going to be committed. They're going to be in flight even past
branches; those arrows are program order.
>>: I have a question about dealing with speculation. In some sense
there are two levels of having to deal with speculation. One level is
having to deal with the fact that events could be temporally somehow
reordered before they're confirmed, which I think we have been used to
for a long time now. The other one is that you have to deal with
misspeculations that are somehow visible. That one seems much worse.
So which one are you talking about here?
>> Susmit Sarkar: I'm talking about both of those.
>>: You have in fact --
>> Susmit Sarkar: The example I had before, for instance, was one
where the speculation was justified. But there are also examples where
the speculation was unjustified.
>>: So every time, is there something about the misspeculated
execution that influences the final execution?
>> Susmit Sarkar: It influences the values that you read on the same
thread, yes.
>>: Can you give an example?
>> Susmit Sarkar: Not on these slides.
>>: So if the hardware does some branch misspeculation, and it turns
out to be wrong, can the instructions executed on the wrong branch
somehow influence the non-speculated execution? Was that your
question?
>>: Yes, that was the question.
>> Susmit Sarkar: You can see it in various kinds of ways. I'll get to
talking about CASes; you can make your CASes fail whether or not they
were supposed to.
>>: CASes can fail even if you don't have branch speculation, right?
The spec says they can fail spuriously even if there's no other thread
operating --
>> Susmit Sarkar: Mostly they do not, sure.
>>: The spec says --
>> Susmit Sarkar: So anyway, we have to deal with speculation in the
model. So, in terms of the size of the model: we have an explanation
of about three pages of prose, at the level I showed you. And in math,
it's about 2,500 lines; 2,500 lines of a new language that we devised,
which we call Lem. What Lem is, is a front-end language: you can
extract proof definitions from it; you can extract HOL and Coq code
out of it.
You can also extract executable code, that is, OCaml. So I wrote a
harness above this automatically extracted code, and what that gave me
was a tool that explores the model, exhaustively checking it or
interactively checking it. It turns out you can compile the code to
JavaScript and thereby run it in your browser, and run it on your
phone; I have personally tested this out.
So here I am running that program. This is our model running, in
JavaScript, in the browser. We call this tool PPCMEM. And what you can
do here is write assembly code yourself if you want; we also have a
library of tests. So I take, for example, a plain old message passing
example. Here is real assembly code that does message passing. I've
taken off all the barriers, so there are just the plain stores and
loads here.
So that stw is Power assembly speak for a store. There are stores to
the two locations on this thread, and loads from the two locations on
that thread.
And you can ask the question of whether it's possible for you to read
one first and zero next, right? So you can take the non-interactive
mode; what that will do is exhaustively check our model and see
whether it's possible. I don't recommend doing this in JavaScript,
because it's slow, but you can do it in the command-line version of
the tool we have. You can also do interactive checking.
If you do interactive checking, you can step through what the model
says. So here's the state of the system. There's a state for the
storage subsystem, and states for each of the threads. As you can see,
at this point in time nothing much has happened. There are just the
initialization writes that have been seen, and all these instructions
here are waiting around.
Okay? All the transitions that are enabled in the model now are
clickable. So, for example, I can commit that instruction there. And
once I do that, all the preconditions for committing that store hold,
so I can now commit that store, thereby making the write of one to Y
visible.
I'll also point out that I did this well before I did the commit of
that store there. So I really did commit out of order.
I can, for example, read again out of order on that thread, maybe even
commit that, why not, and then perhaps propagate this write there to
that thread.
Now you see I can only read one, where before I used to be able to
read zero. We can do various other steps; you get the idea.
We have a library of a variety of different tests, as I said,
including the speculation example I had, and various others. You can
go in and, for example, write in a barrier and see what happens.
So it's really fun to play with. And it lets you explore our model and
the consequences thereof.
So how do we go about validating the model that we had? First of all,
we can, by the process I just described, extract executable code and
have this exhaustive checker: take tiny litmus tests and then see all
the different behaviors allowed by the model. You can take the same
tests and run them on real hardware. We did, on various generations of
Power, and we built up histograms of what behavior was seen and not
seen on different hardware. Here's a tiny sample of those results.
And I'll show you how to read those results.
First of all, consider the behaviors that the model says are
forbidden. In other words, the model guarantees that they will never
be seen. It's important for the model to be sound: that we never
actually see those on real hardware.
So we tested this quite a number of times. So that's about 10 to the
11 times. This is, of course, empirical testing. So maybe things would
change on the 10-to-the-12th run, but we did put in a reasonable amount
of effort.
Next there are tests where the model says some behavior is allowed.
Most of the time what we see is that it really does occur on real
hardware. Sometimes really often. Sometimes not quite so often.
Sometimes for some varieties of tests you see them on some generations
of the hardware and not on others. This points out a key fact: that
the specifications that you're building up, these models, have to be
loose models, because they have to cover all the different variants of
the architecture you might have.
In fact, future implementations as well. And this brings me to the
last kind of tests, where in fact the model says
they're allowed but we've never seen them on real hardware.
We took all of these kinds of tests and discussed them quite carefully
with the designers. And in all of these cases what they said is: yes,
particular features of the microarchitecture they have so far
implemented ensure that this is not seen on current hardware, but they
want to leave open the possibility that some future generations of the
hardware might do these kinds of things.
So I'll briefly move on now to talking about synchronizing operations,
because they are fun and they're used by several real world
programmers.
>>: I have a question about this model. So is the model understandable
to smart programmers? That is an evaluation --
>> Susmit Sarkar: I'd like to think so. So basically it's at the scale
that I showed you: writes propagating to threads. And to help them
understand, we have this tool that lets them explore the consequences
of programs they have.
>>: Do you envision that you could write, like, a verification tool,
like you can say if I wrote a piece of code and I believe that this
code has some specification, it's implementing a linked list, for
instance? What is it required to prove, for a programmer to convince
himself that it's right?
>> Susmit Sarkar: That's a good question. We have in fact looked at
rather simple programs, basically linked lists, and tried to reason
informally on top of this model and tested our informal reasoning by
running it through the kind of tests that we have. So this is the kind
of thing we've been doing. We also want to package this up, and this
is sort of future work, into reasoning principles
which let you reason at a higher level.
>>: So this model is operational. In that sense, do you have an
understanding of what the matching axiomatic model is, or do you just
know some --
>> Susmit Sarkar: We haven't invested much effort into that kind of
thing, because, well, any axiomatic thing that would precisely capture
the operational model would be equivalent -- would have more or less
much of the operational understanding within it. But on a different
kind of front, the C++ model is an axiomatic model and can be
proved sound. That's what we did. We can have axiomatic
approximations, if you're looking for a precise axiomatic --
>>: The C++ model is reasonable [inaudible].
>> Susmit Sarkar: Okay. So briefly moving on to talk about
synchronization operations. What do I mean by that? Things like
compare-and-swap, atomic addition, atomic
subtraction, that kind of thing.
So if you're programming on a RISC-like architecture, what you
typically get are pairs of instructions, sometimes called load-reserve
and store-conditional, or load-linked and store-conditional, and
what these are, they let you implement all these kinds of
synchronization operations.
So here's an implementation of the atomic addition primitive. Out
here there's a load-reserve, which on Power is called a lwarx, or
load-linked. And out here is a store-conditional, which is called a
stwcx. So informally, what's going on? What's going on is the lwarx
is a load. And you can do various other stuff; for atomic
addition, say, you want to add something. And then you do a
store-conditional. Among other things, the store-conditional is a
store. It's a particular kind of store. It's a store that can succeed,
and thereby do the store, but can fail, and thereby not do the store.
The machine tells you whether it succeeded or not in a flag, and the
typical way to use this is to loop back and try again if you failed.
So what's supposed to happen so that that sequence gives you an atomic
addition? What does that mean? What's supposed to happen, informally,
is that the machine says the stwcx can succeed only if no other thread
wrote to the location in question since the last lwarx.
Okay. So if you're thinking relaxed memory behavior, at this point
you should be jumping at my throat: what is this "since" you speak of?
What does that mean? Maybe "since" in machine time? Turns out that's
the wrong answer. That's neither necessary nor sufficient. And to
understand what's going on you have to think of what the
microarchitecture is really doing.
So informally speaking, what the microarchitecture is doing is: if
the thread in question did not lose ownership of the cache line
between the lwarx and the stwcx, then it knows no other thread could
have written to it in between, and therefore the load and the store
were atomic.
So that's a fine microarchitecture explanation. But, of course, the
programmer doesn't want to reason about "do I own the cache line, have
I lost ownership of the cache line". You want to think at a more
abstract level. And to think at a more abstract level you have to
think about what it is that that cache protocol is buying you.
What that cache protocol is buying you, in terms of transferring
ownership from one thread to the other, is building up a chain of
ownership for the location in play.
It's building up, in other words, an order relating stores to that
location; we call this coherence. And every
different thread can agree on it. Once you have that abstract notion in
your hands, now it's easier to state what atomic means. What it means
is that a stwcx is allowed to succeed only if it can become
coherence-next to the write it read from. So those two are atomic.
Furthermore, of course, this coherence order is an abstract order, so
it builds up over time. It must be the case that you're never allowed
to violate that atomicity, those two writes staying together, ever
again.
So that's the key concept we need, and now we can give it a name. We
call this the write reaching its coherence point. We define this in our
model by saying a store reaches its coherence point, operationally,
when this abstract coherence order we built up has become linear before
this write and furthermore is never going to become different again.
>>: Is this the level each model is at -- you have these relations that
they add to? And you never remove any entries, I guess?
>> Susmit Sarkar: All your rules have preconditions so that they never
add bad edges. Exactly so. Okay. So, of course, you also have to
deal with what the interactions are with the rest of the system, in
other words, what are the interactions with the normal kinds of
stores and loads and barriers you had before.
And it's really easy to get this wrong. There was a rather recent
kernel bug where they got confused about what the ordering properties
were with respect to normal loads and stores.
But once you have this kind of formal model, then again you can prove
stuff about it. One kind of simple result that you can get is: you
replace all your accesses, in other words all your stores and loads,
by atomic kinds of accesses, and then you regain sequentially
consistent behavior. So this is an alternative to locking around all
your stores and loads, or maybe putting barriers between every pair of
stores and loads.
All right. To wrap up, then, I'll talk about the proof of correctness
of implementation for C11. I don't have time here to talk about the
whole of the C11 model; it's rather complex. I'll just handle it at a
rather high level. This is the program we saw before and have seen
multiple times in this talk: the release/acquire version of message
passing.
So C11 is what's called an axiomatic model. What that means is
that it reasons not about the program making steps, but, after the
program has executed and created all its stores and loads, about
whether that execution was allowed or not. So it takes the program and
it converts all of those stores and loads into events, and then it
defines various relations on those events, and
furthermore various axioms about those relations. It has various kinds
of relations. There's a relation called sequenced-before, which
is sort of like program order.
And there are various other kinds of relations. The key one here is
something called happens-before, which is defined in a very particular
way in C11.
And right here on this example, what it does is, because that was a
release and that an acquire, there's a happens-before edge there,
there, and there. There are now consistency conditions on this
execution. For example, one kind of consistency condition is that this
read is allowed to read from that store, but, because of that
happens-before [inaudible], not allowed to read from something further
back, that initialization write.
I'll also point out that the semantics is only defined for race-free
programs, and race-free is defined very particularly in terms of the
happens-before relation.
So there they are. There's a bunch of axioms. Fairly complex axioms.
But in fact there is some intuition behind those axioms, and this we
discovered by doing this kind of proof. So the base case of that
happens-before relation that I had was the synchronization between a
release kind of store and an acquire kind of load. And we see this
clearly reflect itself in things that you get on the hardware,
properties of the barriers that we had.
Next, this kind of release/acquire synchronization has to be
transitive. If you do release/acquire here, and on another location you
do release/acquire there, all these chain up together, and
this corresponds in a fairly direct way, again, to a property of the
hardware: the cumulativity property that we talked about.
There are various other kinds of things. There are particular features
of C11 that were carefully designed to take dependencies into account,
and this corresponds to the Power model doing the kind of dependency
reasoning that I was showing. There are special rules for
compare-and-swap, and these correspond in a fairly reasonable way to
reasoning about when it is possible for writes or store-conditionals to
reach their coherence point.
So then, how do we prove this theorem? The broad view is: recall that
we're talking about all possible C programs. In fact, C11 does not
give any semantics at all to racy programs. So we need only consider
data-race-free programs, because other programs are allowed to do
whatever it is they want.
So for any program, then, we look at any compiler. Not just any
compiler: any sane compiler, like I said. What does that mean? What it
means is it preserves memory accesses. It does not optimize them away
or reorder them. Furthermore, it uses the mapping table that we had.
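For reference, the published C/C++11-to-Power mapping referred to here is along these lines (reproduced from memory, so treat it as a sketch rather than the authoritative table):

```
C/C++11 operation    Power instruction sequence
-----------------    --------------------------
load relaxed         ld
load acquire         ld; cmp; bc; isync
load seq_cst         hwsync; ld; cmp; bc; isync
store relaxed        st
store release        lwsync; st
store seq_cst        hwsync; st
```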
So if we take any such compiler at all, then we look at the target of
that compiler. We have a model for Power, and therefore we can find
all the behavior that that Power program has. This model, recall, is
an operational kind of model, so the behavior you have is the set of
traces the model is allowed to have. You look at each of those traces
and you try to build up executions as allowed by C11.
You do this by building up the key relations that you need,
happens-before and so on, from the base. In each case you'll find that
the axioms that C11 depends on, they depend on features of the machine.
So you have to look closely at what the machine is doing.
There's also a subtlety in here: of course, at the machine level there
is no concept of a race; programs just do something. So if it looks
like there is in fact a racy kind of execution in terms of C11,
you have to actually construct that race in C11 and thereby get a
contradiction with the data-race-freedom precondition.
So this can be done. And we did this; it's a proof. And what did we
learn in doing this proof? We learned various kinds of things. We
learned, for example, that release/acquire reasoning is, well, used by
a lot of programmers, but also corresponds directly to what hardware
does. So it in fact transfers directly between the software and
hardware levels. If you can think of your programs just in terms of
release/acquire, maybe that's a good thing, too.
We also learned facts about the hardware. We learned that if a
certain hardware optimization were done, and, by the way, this
optimization is seemingly quite natural to do, then in fact C11 would
be unimplementable. Unimplementable, that is, without putting in
barriers absolutely everywhere and thereby defeating the point of
having different kinds of stores and loads.
Fortunately, current hardware does not do this. And now we have a
strong argument to put forward to designers about why they should
never do it.
We also --
>>: The definition of [inaudible] implementable: if we need fences for
[inaudible], is that the definition of [inaudible]?
>> Susmit Sarkar: Unimplementable efficiently, yes.
Yeah, and we also learned various ways, as I said, of regaining
sequentially consistent behavior.
>>: Can we go -- so here I can think of this as the last phase of the
compiler, right? In the sense that the compiler takes the program and
compiles it to, let's say, a machine-independent IR. Then it's going to
have a back-end phase that's going to take this IR and translate it --
>> Susmit Sarkar: Sure.
>>: So I could think of the program -- so the reason I ask is, can we
actually have a two-phase compiler where you allow all DRF-preserving
optimizations [inaudible] to get to the IR? And then you use this
translation.
>> Susmit Sarkar: Yes, absolutely.
>>: Then will the theorem hold? Can you prove that?
>> Susmit Sarkar: Sure. You can do any kind of optimizations which
stay within DRF up there.
>>: What the problem is, if your choice is you stay in the DRF, you're
not allowed to do some things that otherwise you would be able to do if
you drop down earlier. For example, in the target model,
there's no problem with data races in the Power. If, for example, you
want to introduce a data race in the Power program because it makes the
program run faster, you're okay doing that.
>> Susmit Sarkar: No, no --
>>: That's not what I'm asking. I agree with you. But I'm asking: real
compilers, actually -- are they sane, but are they definitely --
>> Susmit Sarkar: Optimizing.
>>: Optimizing.
>> Susmit Sarkar: Absolutely.
>>: They do reorder memory operations.
>> Susmit Sarkar: As long as they stay within this. Another way to
ask the question: you're trying to design an intermediate
representation that has the C11 model.
>>: Exactly.
>> Susmit Sarkar:
I think that's a perfectly --
>>: Cast all transformations so the front end -- so the [inaudible] is
doing this source-to-source translation, from C to C++.
>>: Are you reasonable, flexible, do you have to stay within --
probably if you have to --
>>: My question was more: can I take this theorem and conclude
GCC's correct.
>>: GCC does not stay within C, C++.
>> Susmit Sarkar: So --
>>: My version of GCC that is DRF compliant.
>> Susmit Sarkar: Yes. So then you are sort of -- in fact, they're
trying to build up to that kind of version.
>>: Is it stronger than the first one, or the same, in the sense that
you could come up with an automatic compiler, put restrictions on the
compiler?
>> Susmit Sarkar: Absolutely, if you want to do that kind of thing.
But in the first instance we're just going from source to target.
You're allowed to do anything, as Sebastian is saying, optimizations
that stay within that fragment. Sort of future work is: what if you
take optimizations that go outside that fragment but still in some way
preserve source-level properties?
>>: One thing the compiler is definitely going to do: if you do the
XOR trick, read it and XOR it with itself, it's going to treat it as
zero, definitely. You have to make sure that that's not --
>> Susmit Sarkar: I'm glad you brought that up. So it's perfectly
safe to do that for non-atomics, or what you used to think of as normal
stores and loads. But it's not safe in C11 to do that for atomics or
volatiles, say.
>>: Okay. So for regular stores and loads the theorem should be strong
enough to show that [inaudible] the compiler --
>> Susmit Sarkar: It needs a bit of work, not very hard work, but sure.
>>: Okay. Any help from the DRF property?
>> Susmit Sarkar: Yes, essentially. You are doing sort of source-level
optimizations that still stay within DRF. Because you're
optimizing away nonatomic stores and loads, if there was no race to
begin with, you're not introducing new races. Right?
>>: I think in some sense the [inaudible] question is: can you get the
most important optimizations done, last thing, and then [inaudible].
>> Susmit Sarkar: All right. So here we are. We have been
reasoning about mainstream concurrent programs, at the very lowest
level doing this on real hardware like Power and ARM, and trying to
show how high-level language primitives can be compiled. So we have a
theorem, which is the correct-compilation result, and this clearly has,
as I said, relevance to real-world compilers and also builds confidence
in these models.
What about the future? Well, one thing about this proof is that it
really boils down what it is that we are depending on from the
hardware. And this lets us design new kinds of hardware that maybe
relax some of these properties. Also, of course, this is a path to
reasoning about low-level programs: building up reasoning principles
from the assembly level up to a high-level language. Maybe C and C++ is
not a high enough level language, but it sure beats reasoning
in terms of assembly.
Thank you for your attention. There are more details there; in
particular, that's the URL for the tool, which I really encourage you
to play with. Thank you. [applause]
I don't know if you have --
>> Madan Musuvathi: Any questions?
>> Susmit Sarkar: Questions after.
>>: So it looks like I have two options. Say I want to write a
low-level, lock-free, highly optimized piece of code. Should I try to
use the C++ memory model to do that, or should I use your model to do
that?
>> Susmit Sarkar: So my personal feeling is you should use the C11
model, because then in fact you can port it to various different kinds
of hardware.
>>: Okay.
>> Susmit Sarkar: So the proof that I just did was to let you do that
reasoning at the C11 level while still getting all the properties that
Power gives you. Of course, if you care just about Power, or, for
example, if you care just about ARM, then you can use my models
directly to reason at that level as well.
>> Madan Musuvathi: Thank you.
>> Susmit Sarkar: Thanks.