>> Jim Larus: It's my pleasure today to introduce someone who, looking around
the room, really doesn't need an introduction. Hans Boehm has been one of the
leaders in the programming languages, compilers, and memory systems community
for as long as I can remember. And he knows many of you quite well. We're
fortunate to have him today, giving a talk on some of his more recent work with
memory models and C++.
>> Hans Boehm: Thank you, Jim. Before I go on here, I should say that what
I'm describing here is mostly joint work with lots of other people, some of
whose names are up here. In particular, I'm going to be advocating an approach
to memory models which was really pioneered by Sarita Adve a long time ago and
was in some less formal, less precise form actually incorporated into other
programming languages even before that.
I'm going to be talking about threads and shared variables. So occasionally,
the question comes up why that's actually the right subject to talk about. It
does seem to be controversial these days whether we should be communicating --
whether we should be writing parallel programs by actually sharing memory or
not.
So if you look for the phrase "threads are evil" in Google, for example, last
time I tried, you get about 32,000 hits. And that number is increasing rapidly;
every time I try it, it's more.
On the other hand, threads are clearly a convenient way initially to express
processing of multiple event streams or to keep GUIs active while you're
doing some processing in the background. And they're also at the moment, aside
from the HPC community, the dominant way to really take advantage of multiple
cores in a single machine. So whether you like them or not, they seem to be
pervasive.
So what I want to talk about is the basic C++11 and C11 approach to threads
and shared variables that was incorporated into the new standards for both C
and C++. Those rely heavily on a notion of atomic objects that's somewhat
different from what programming languages have traditionally had. So I'll
spend a little bit more time talking about those.
And then I'll spend actually a lot of the talk talking about data races in
particular, which play a core role in the memory model, and trying to convince
you that those are just a bad idea and you should never write code that has
data races in it.
I'll also talk a little bit about the implementation consequences of the C++11
and C11 memory model.
So the state of the world before we started, and this is now quite a while ago,
this is around sort of 2004, 2005, was that for at least people programming in
C++ and C, the story was that they were basically programming in a single
threaded programming language. The programming language said essentially
nothing about threads, though lots of people were actually using threads. They
were using threads in the form of an add-on thread library -- either Windows
threads, which around here was probably mostly the case, or sometimes Posix
threads.
Unfortunately, if you looked at those specifications, they were either sort of
intentionally or unintentionally unclear about what even some very basic
multithreaded programs meant. So if we look at about as basic a multithreaded
program as you can think of -- you have one thread that assigns 1 to X and one
thread that assigns 1 to Y -- you would really like to know that by the time
you're done, both X and Y have a value of 1.
If you look at the Posix specification, that was actually left intentionally
unclear, motivated by some hardware and compiler implementation considerations.
Basically, what people were afraid of is that an assignment to a variable here,
which was small, might, in fact, read and rewrite adjacent memory -- which, in
fact, in some implementations it did -- so it could, in fact, overwrite an
adjacent struct field. And based on my reading of Posix, it was also allowed
to read and overwrite an adjacent unrelated variable, even if these were not
struct fields. So there was actually very little you could say about what this
program meant.
And there were much more interesting examples which actually caused serious
problems occasionally. Not that frequently. On the other hand, when they did
arise, it was really difficult to figure out what was going on. Probably the
worst consequence in my opinion of this situation was that it's really very
difficult to teach people about multithreaded programming, which is difficult
enough without being able to give them a consistent story of how things
actually work.
So if this doesn't work all the time, then it's very difficult to teach people
how this does work at all. So what we started with for the C and C++ memory
model, which is sort of what everybody starts with, and what we'll start with
here, is the model that everybody thinks they want from a programming language,
which is sequential consistency.
So usually, when people think about threads, they think about threads executing
as though the actions of the individual threads were just interleaved. So if
we have thread 1 doing this and thread 2 doing that, then they might be
executed as though these statements were sort of shuffled up -- well,
interleaved in this way so that the actions performed by each thread are
performed in order. And whenever you look at a shared variable -- and here I'll
use X, Y, and Z to denote shared variables, and the things starting with R,
which you can think of as registers, to denote local variables -- the value
that I see is the last value assigned to it in that interleaving. So that's
what everybody thinks of as sequential consistency. That's what people usually,
initially at least, believe threads should behave like.
In addition to just sort of interleaved execution of threads, we sometimes need
to control that, as I think everybody in this audience knows, so in reality,
there are many cases in which we don't want to allow arbitrary interleaving of
thread actions. We want to control those. For example, if we want to
increment a variable, which really is going to be implemented under the covers
as loading the variable, incrementing it, and storing back the result, and we
want to do that concurrently in two threads, typically we want to prevent the
execution of those two threads from being interleaved in the way that I've
shown here, where I read X in one thread, read X in the other thread, increment
the temporary value in both, and write back the same value in both threads,
because that ends up resulting in X only getting incremented by one instead of
by two, as I had intended.
So the way we do that is we introduce some sort of mutex or lock mechanism that
prevents this kind of interleaving -- that basically prevents one of those
threads from executing while the other one is busy accessing X.
So we make sure that we have some notion of locks so that only one thread can
hold a lock at a time. We can define that very precisely if we want to,
in this interleaving-based semantics, by saying that we only allow
interleavings in which one of these lock actions here can be scheduled only
while no other thread is holding the lock on that mutex.
So with that restriction then, rather than allowing all interleavings between
these two threads, we only allow these two possible interleavings, because we
can't schedule the second lock operation until the first unlock completes. So
the only acceptable schedules are those two.
And that's all easy to make precise. However, it has two problems.
Actually, before I get there, sorry, I should say how this is actually written
in C++11. So in C++11, we now have a mutex concept. We can lock it and
we can unlock it when we're done or, alternatively, since this is C++, you
would like a more C++-like syntax, where you declare a variable whose
constructor acquires the mutex here -- whose constructor executes M dot lock,
and whose destructor executes M dot unlock -- so that if you leave this block
via an exception, the lock still gets released and the right things happen.
For those of you who are used to Java synchronized blocks, this sort of gives
you the same effect essentially. Or something like that.
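[A minimal sketch of the two styles just described; the variable names are illustrative, not from the slides:]

    #include <mutex>

    std::mutex m;
    int shared_count = 0;  // ordinary data, protected by m

    void increment_explicit() {
        m.lock();
        ++shared_count;
        m.unlock();          // must be reached on every path out of the region
    }

    void increment_raii() {
        std::lock_guard<std::mutex> guard(m);  // constructor executes m.lock()
        ++shared_count;
    }                        // destructor executes m.unlock(), even on an exception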
So that gives us a very nice model based on sequential consistency, but it
actually has a couple of problems. The one that everybody, I think, realizes
is that there are implementation issues with sequential consistency. It gets
in the way of some optimizations that we would really like to be able to
perform.
So here, this is a sort of standard example that people use in this context,
which is based on Dekker's mutual exclusion algorithm. The idea here is that I
assign 1 to X in one thread and 1 to Y in the other, and then in each thread, I
read the other variable next. If I view this execution as being purely
interleaved, it has to be the case that either the assignment to X or the
assignment to Y goes first. One of those has to execute first, meaning the
other thread, whichever thread gets scheduled second, has to see a value of 1.
So if this one gets scheduled first, that read of Y into R1 has to see a value
of 1 for Y, rather than the initial value of zero, which I will assume I have
here everywhere.
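[A minimal sketch of the pattern being described, assuming plain int variables as on the slide:]

    // Both initially zero.
    int x = 0, y = 0;
    int r1, r2;

    void thread1() { x = 1; r1 = y; }   // store to x, then load of y
    void thread2() { y = 1; r2 = x; }   // store to y, then load of x

    // Under sequential consistency at least one of the loads must see a 1,
    // so r1 == 0 && r2 == 0 is impossible.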
So basically, one of these loads has to see the store in the other thread. On
the other hand, what happens in real life is that there are a number of
optimizations that counteract that, and that generally prevent this from being
true. In particular, both the compiler and the hardware are likely to want to
break this. What compilers will typically observe is that if this appears in
some bigger context, if I'm loading Y and I use Y further down, I can give the
hardware more time for the load to complete if I schedule this load of Y
earlier, so the compiler is likely to want to hoist this load of Y above the
store to X.
And to the compiler, especially a compiler that sort of started out as a
sequential compiler, it looks like these two statements don't
interfere. They're independent. They don't touch the same variables. So it
looks like they can be perfectly safely interchanged. So as a result, that
compiler transformation makes sense.
Even if your compiler doesn't interchange these, it turns out any sort of
reasonable modern hardware that you're likely to run this on, when you assign 1
to X, is really not going to put 1 into the memory or even the cache at that
point. It's going to put 1 into a store buffer someplace, scheduling it to
eventually make it to the cache. It's not going to wait on the -- it's not
going to wait on that store to reach the cache, because that would slow down
the hardware.
So as a result of that, when you assign 1 to X here, that assignment isn't
visible to the other thread for a while, because the store buffer is only
visible locally. So that has the same effect as if these two
statements were interchanged by the hardware, which again makes it possible that
we end up in this scenario where I read the initial value, the preassignment
value of zero, for both X and Y because, in fact, the two loads can, in the
simple case, both really occur before the stores.
That's actually not the only reason, as it turns out, that I want to do this.
It turns out, as a programming model, I'll argue sequential consistency isn't
all it's cracked up to be either. So in particular, the really nasty property
that sequential consistency has is it's dependent on the access granularity to
memory.
So if I have a hypothetical machine which can only access memory a byte at a
time, for this example, and I store 300 to X in one thread and 100 to X in the
other thread, each of those assignments to X is, in fact, going to get
decomposed into assignments to the high byte and the low byte, and they can get
interleaved in this fashion, and I can end up with a final value of X of 356 in
this case, which is not a very intuitive or expected outcome and not something
that people really want to program to.
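[For concreteness: 300 is 0x012C and 100 is 0x0064, so if the interleaving leaves X with the high byte from the store of 300 and the low byte from the store of 100, the result is 0x0164, which is 256 + 100 = 356 -- a value neither thread ever wrote.]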
Maybe even more important, if I sort of move this up a level, I think really,
when anybody writes multithreaded code, they already assume that we, in fact,
have more than sequential consistency. We don't really reason about
programming, program interleaving at the level of individual assignments or
store instructions.
So, for example, if I have two arrays which are distinct, array 1 and array 2,
and I call the sort function on array 1 and array 2 in two different threads,
I'm not going to explicitly reason about the different interleavings of those
two sort operations, because I know somehow that those are independent, they
don't interfere. It all doesn't matter.
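[A minimal sketch of that situation; the array names and use of std::thread are illustrative, not from the slide:]

    #include <algorithm>
    #include <thread>
    #include <vector>

    void sort_both(std::vector<int>& a1, std::vector<int>& a2) {
        // The two sorts touch disjoint data, so we reason about each call as an
        // indivisible unit rather than about interleavings of its loads and stores.
        std::thread t1([&] { std::sort(a1.begin(), a1.end()); });
        std::thread t2([&] { std::sort(a2.begin(), a2.end()); });
        t1.join();
        t2.join();
    }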
So what I really want to argue is that these things don't interfere and,
therefore, I don't have to look inside them. I don't have to worry about what
actually happens. And the actual C++ and C memory model relies on, sort of
leverages, exactly that. This is the so-called data-race-free model, which I'll
describe next.
So the real memory model for both C11 and C++11 relies heavily on this
definition of a data race, which is more or less standard, but I'll reproduce it
here anyway. So we say that two memory accesses conflict if, basically, their
order matters, which basically means that they have to access the same memory
location and they can't both be reads. That's a fairly standard definition.
So in the preceding example, the store to X from the assignment X equals 1 and
the load from X conflicted. We then say that two memory accesses participate in
a data race if they conflict and they can occur simultaneously, meaning they're
performed by different threads and there's nothing to prevent them from --
there's nothing to enforce an order between them.
A program is data race free, and really we mean on a particular input when we
say that, if no sequentially consistent execution results in a data race. So
the notion of sequential consistency is still significant. So it's used in the
definition of the data race here for example.
>>: With respect to your previous point, a hypothetical byte machine, this is
at the programming language level, not at the machine level? Sequential
consistency?
>> Hans Boehm: Right. The problem is that it actually matters what level you're
talking about. So enforcing sequential consistency at the programming language
level is actually a hard thing to define, because usually the programming
language doesn't actually define how many accesses are involved in a particular
assignment.
At the machine level, it makes more sense, but you don't want to program at the
machine level.
>>: So how is that -- like you have to access the same scalar object, so that's
where that would come in, right?
>> Hans Boehm: That's really where that issue comes in, right. So the
definition of -- the way this is actually presented in the standard is that they
have to access the same memory location, and a memory location is defined as a
scalar object. And then there's this footnote here for C and C++ that a
contiguous sequence of bit fields counts as a single memory location. So
updating any bit field in a sequence is viewed as potentially updating,
affecting all of them. I'll have some more examples dealing with this later.
>>: But, I mean, this deals with the problem you had before in the sense that
when X gets 300 on a byte machine, you would consider the whole sequence as one
scalar object?
>> Hans Boehm: In this case, in that case, yeah, the whole thing is one scalar
object, and the whole thing is a memory location. Yeah, but so far we
haven't -- we've only -- here, we've only defined a data race. So let me go on
for a little bit.
So then the guarantee, what we actually get out of the C and C++ memory models,
is that we get sequential consistency again, but we get it only for
data-race-free programs. For anything that contains a data race, all bets are
off. Normally, this is what's referred to as catch-fire semantics, which is the
same sort of semantics we assign to an out-of-bounds subscript or an
out-of-bounds array assignment or something like that.
Anything can happen, and we'll see examples of that. So the program is
responsible for preventing data races, and you get to do that by either using
locks or -- and I'll talk about this in a few slides -- using atomic operations,
which are new to C++11 and C11. And I'm cheating a little bit here and I'll talk
a little bit more about that later. In C++11, there are actually ways to relax
the sequential consistency guarantee for data race free programs in order to
get more performance. Whether or not that makes sense sort of depends on the
context.
So if we look back at Dekker's example from before, what happens here is that
if X and Y are just declared as, say, integer variables, then this program just
has a data race, because X equals 1 conflicts with R2 equals X, and those two
can be executed simultaneously. So basically, all bets are off. This
program in C++11 has undefined semantics. Anything can happen.
One of the nice sort of core advantages of programming in this data-race-free
model is that we no longer have to really worry about instruction interleaving.
I already hinted at that before. By not having to worry about instruction
interleaving, we also allow a whole bunch of optimizations that we're used to
having performed, which now become legal.
So to illustrate this, basically, the property that
we're promising here is that any region of code which involves no
synchronization acts as though it's atomic. We don't have to worry about what
happens in the middle of any code region that has no synchronization, so long
as we know the program is data race free. And the sort of hand-waving
argument here is: consider some block of code that has no synchronization,
say it assigns 1 to A and 1 to B. Let's assume that I could somehow notice --
I could somehow observe a state in the middle of this that would demonstrate
that this isn't executed atomically. I can only do that by having an observer
thread that looks at A and B and determines it's in the middle. This observer
thread unavoidably has to have a data race with the thread that assigns 1 to A
and 1 to B -- in fact, on both A and B.
So things look atomic because any program that could tell the difference, that
could prove it's really not atomic, is outlawed. So that's sort of the quick
introduction to the memory model. Let me say something about these atomic
objects now which I've hinted at before. So the basic problem we had with this
model before, so in a less well defined state, this is actually very similar to
what was required by Pthreads before and even [indiscernible] 83 way back when.
So we're not sort of changing the approach here fundamentally in some sense,
but if we look back at something like the Pthreads model, basically what went
wrong there is that the model outlawed data races. On the other hand, as I
think many people in this audience know, the problem was that in order to avoid
data races, you needed to use locks, basically, or mutexes. Those were viewed
as heavyweight and expensive. So what people did is they cheated. They wrote
programs that had data races in them anyway. So if you look at real Pthreads
programs, many people have demonstrated this in various research papers. Real
Pthreads programs typically contain data races.
So what we really wanted to do is not give people that excuse for writing
programs with data races and really make it possible to write correct
code.
So the solution is that we introduce this notion of atomic objects. These are
objects that sort of superficially behave like ordinary data. On the other
hand, they do allow concurrent access. So you can access these things
concurrently without introducing data races. The actual data race definition
excludes these operations. It only applies to ordinary memory operations, not
atomic operations.
By default, unless you tell us otherwise, these preserve the simple
sequentially consistent behavior. They also act indivisibly, so they also
address the granularity issue. But they do give you sequentially consistent
behavior.
These actually, turns out, have a huge impact on the memory model specification
and if you actually try to read this memory model specification in the C++
standard, which I don't think I'd recommend, you'll find that it's mostly
dominated by describing the behavior of these atomics. Everything else sort of
falls out as an easy special case.
So just by way of illustration, these are easy to use, at least in the simple
case. If I want to increment X, and I don't want to use a mutex in order to
protect X, I can do that by declaring X to be atomic. Instead of saying int X,
I say atomic of int X. This is C++ syntax. C syntax is a little different.
And by doing this here, it turns out I get an overloaded version of
the increment operator that actually does an atomic increment -- I guess what
around here would be called an interlocked increment -- implicitly. So that
works.
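[A minimal sketch of what was just described, using the default, sequentially consistent form:]

    #include <atomic>

    std::atomic<int> x(0);   // instead of: int x = 0;

    void count_event() {
        ++x;   // overloaded operator++ performs an atomic (interlocked) increment
    }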
If we go back to Dekker's example and we actually want it to work correctly,
where by work correctly I mean that if X and Y are initially zero, then R1 and
R2 can't both see the value of zero, I can make it work correctly by simply
declaring X and Y to be atomic in this world.
The requirement here -- the catch -- is, of course, the bottom line. The
compiler and hardware have to do whatever it takes in order to actually make
that work correctly and make sure that R1 equals R2 equals zero can't happen.
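[The fixed version, as a minimal sketch:]

    #include <atomic>

    std::atomic<int> x(0), y(0);   // atomic now; both initially zero
    int r1, r2;

    void thread1() { x = 1; r1 = y; }   // sequentially consistent by default
    void thread2() { y = 1; r2 = x; }

    // With atomics, the outcome r1 == 0 && r2 == 0 can no longer happen.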
I should say that this is not -- this actually fits in fairly well with a whole
bunch of other different languages that have vaguely similar constructs, which
I should mention here, because some of them are more similar than others. So
as I said, C++11 has atomic, and it also has types like atomic underscore int
for some special cases, basically for C compatibility.
C has these atomic underscore types, which it was originally meant to have, and
late in the standardization process, the C committee introduced this version,
which is very similar to that but has different syntax, just to keep you on
your toes. Java has volatiles, or Java dot [indiscernible] dot atomic, which
actually has very similar semantics to this.
On the other hand, there are various other languages that have constructs that
are profoundly different, but related in the sense that they have a similar
goal. So C-Sharp volatiles are different in that, at least last I heard, they
don't quite give you the sequential consistency guarantee. They give you a
weaker guarantee. OpenMP 3.1 atomics are a lot weaker still. They also give you
operations that are indivisible and exempt from data races, but with
very weak ordering guarantees.
And officially completely unrelated, but unfortunately with confusing naming,
are C and C++ volatiles, which really do something else but, unfortunately, in
the meantime are often used as a hack to get similar semantics to the
atomics here.
>>: So if I'm remembering correctly, Java volatiles have semantics which
affect memory accesses to non-volatile storage locations, like a store can be
used as a signaling mechanism to say these happened. Is that the case?
>> Hans Boehm: That's the case here. That's sort of implicit in the statement
that they preserve sequentially consistent semantics so long as there are no
data races on non-atomic variables.
>>: Okay.
>> Hans Boehm: Without that guarantee, you don't have that property, because
anything that uses an atomic variable for signaling might use the atomic
variable to prevent a race on something else but would not have
sequentially consistent semantics.
>>: So what you showed in the previous slide is that the plus plus operator is
a special case, or --

>> Hans Boehm: That's actually a place where C++ deviates from some of these
others, and that's sort of an artifact of the language, because the way atomics
are defined, atomics are a template class so they have a bunch of overloaded
operators anyway. And the natural thing to do in that context is to just
define plus plus to be an atomic increment.
>>: So there are a few predefined --
>> Hans Boehm: There are some predefined operations that, in fact, are atomic.
But the only ones -- there's also a compare and swap, called compare exchange.
But --

>>: Those that are not ready, you have to do it yourself to make sure it works
the way you want.

>> Hans Boehm: Right. All of these operators can be implemented with compare
exchange, more or less. You know more about the more or less than most of us.
Okay. So I'll say a little bit about sort of a loophole that was introduced
into the language here in order to address performance concerns. As I said,
and we sort of hinted at this, the atomic operations have fairly strong
properties, and the compiler has to do a fair amount of work in order to
preserve those properties. That can be fairly expensive.
It's actually getting less expensive on modern hardware than it used to be, but
it's still fairly expensive.
So in order to prevent that from getting in the way in those cases where people
would otherwise be tempted to write data races instead, there's actually
another facility that's part of the atomics library in C++ which is that
programmers are allowed to explicitly specify weaker memory ordering than
sequential consistency, even in the absence of data races.
This, as it turns out, greatly complicates the specification. It also greatly
complicates the programmer's job. The programmer actually has to understand a
very complicated specification in order to use this correctly. There's
ongoing work on how to isolate that complexity in a library; that's a
non-trivial exercise.
So the bad news is that using this facility actually is much more complicated
and much more bug-prone than just sticking to the default sequentially
consistent semantics, probably more so than most programmers appreciate, which
makes this more dangerous. On the other hand, it sometimes is significantly
faster on current hardware, unfortunately.
>>: So the support for this atomic object has to be implemented in the
hardware somehow? When you're enforcing the hardware companies --

>> Hans Boehm: Well, what actually happens these days is that atomics are
typically implemented with ordinary load and store instructions for small
objects. They're implemented with ordinary load and store instructions and
memory fences. But that's hardware specific. So on some hardware,
there are primitives other than memory fences that enforce ordering.
Itanium and ARMv8 have other primitives to do that. But the default
mechanism these days is memory fences.
>>: Is there any control in the compiler to get rid of the breaches of
consistency, so you have a gold standard of correctness?
>> Hans Boehm: You mean a compiler flag to basically ignore all of these things?
That's not something the standard can really address. I mean, it's something
that a compiler implementer could reasonably do. The standard doesn't address
compiler flags.
>>: Instead of [indiscernible], for the hardware to make strong atomics
[indiscernible] so that we don't have to [inaudible]?
>> Hans Boehm: That seems to be going on. So I think it's public that ARMv8
has hardware primitives that model these very closely. ARMv8 hardware doesn't
exist yet. But clearly, we're moving in the right direction.
>>: Is there any understanding of -- suppose everything for me is
[indiscernible] and I just program like that. How slow [inaudible]?
>> Hans Boehm: I'll actually say a little bit about that at the end if I have
time. It will, on existing hardware, probably be quite slow. On X-86
hardware, it turns out store instructions end up being quite a bit slower. Load
instructions are essentially unaffected.
So basically, for these low-level atomics, we still have the rule that pairs of
atomic operations can never form a data race. That's unchanged. But these
atomic operations can specify explicit memory ordering. And the way that works
is normally, when I load an atomic value, typically I would just write
it this way. I would just mention the atomic in an expression. It turns out
under the covers, this is an implicit conversion from the atomic type to the
underlying type.
That's really equivalent to writing it this way. But if I write it this way, I
can give it an explicit argument that specifies the ordering. So if I write
memory order sequentially consistent, that's a very verbose way of saying
exactly that.
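[A minimal sketch of the three spellings just mentioned:]

    #include <atomic>

    std::atomic<int> x(0);

    int read_x() {
        int a = x;                                  // implicit conversion: a load
        int b = x.load();                           // the same thing, written out
        int c = x.load(std::memory_order_seq_cst);  // verbose form of the default
        return a + b + c;
    }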
So for the ordering types we have here -- there's actually one more that I
haven't mentioned -- the main one is what's known as acquire-release
ordering. I won't say very much about this, but if you specify memory
order release on a store operation, that basically means that when you perform
a load operation that sees the value that was stored by the store operation, and
the load operation uses memory order acquire, then you guarantee that after
that load operation, you can see all memory operations that were performed
before the release operation.
So this is the ordering that you want if you use a flag to communicate from one
thread to the other. It's sort of the minimum ordering you need there.
Memory order sequentially consistent enforces additional ordering so that
things like Dekker's example work. Memory order relaxed basically doesn't
ensure any ordering, almost. It actually does ensure what's called cache
coherence, which is that operations on a single variable appear to occur in a
single total order.
So if we wanted to increment a counter in a way that we don't actually need to
read the value of the counter until the end of the program, until all the
threads are finished, then there's no reason to enforce ordering when we
increment the counter here. So we can implement the increment
operation by X dot fetch add and then specify memory order relaxed as the
ordering type. And on some hardware, that will make a difference. On X-86,
that won't make any difference.
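[A minimal sketch of that relaxed counter; the counter name is illustrative:]

    #include <atomic>

    std::atomic<long> event_count(0);

    void record_event() {
        // No ordering needed: the total is only read after all threads have joined.
        event_count.fetch_add(1, std::memory_order_relaxed);
    }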
But when you do that, you have to be really careful, because there are usage
models for which this doesn't work. So, for example, if you're using this to
increment a pointer into a buffer that another thread is reading, that's the
wrong way to do it. It will not work.
So let me try to clarify some of this by going through a bunch of data race
examples. And hopefully give you some insight as to how these things actually
work. So data races are crucial here if we want to really show that a C++ or C
program is correct. So rather than, in the old model, well, basically, in
order to show that anything is correct in this model, we first have to
demonstrate that there are no data races.
In proving there are no data races, we get
to assume sequential consistency, because data races are defined with respect
to sequential consistency. Once there are no data races, we can prove
correctness of the program more traditionally, assuming both sequential
consistency and, since there are no data races, that
synchronization-free code regions are atomic, indivisible, making that proof
easier.
It turns out, as we'll see later, there's actually sort of a third proof
obligation here, which we'll run into in one of the examples. But as a
simple example, if we use the simple flag communication case here, I
set some variable to 42, then I tell another thread that I'm done
initializing X. The other thread waits for the done flag and then reads X.
If I declare done here to be just a Boolean flag, this is incorrect. There's a
data race on done. As many people have found out the hard way, this is one of
the few uses of data races that fails fairly repeatedly in practice.
What typically happens is that the compiler notices that done is loop invariant
and moves it out of the loop and checks it exactly once. So that will fail.
If I want to fix this, what I have to do is I have to declare -- rather than
declaring done to be a Boolean, I have to declare it to be an atomic of bool.
And this is precisely the model for which acquire-release memory ordering was
designed, so we could probably get away with using memory order release here and
acquire on the other side.
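[A minimal sketch of that flag pattern, shown with explicit release/acquire ordering:]

    #include <atomic>

    int x;                          // ordinary data
    std::atomic<bool> done(false);  // instead of: bool done = false;

    void producer() {
        x = 42;
        done.store(true, std::memory_order_release);   // publish x
    }

    void consumer() {
        while (!done.load(std::memory_order_acquire))  // wait for the flag
            ;                                          // (spin, as on the slide)
        // The store to x is now guaranteed to be visible: x == 42 here.
    }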
So here's actually a confusing example which, as it turned out, causes a lot of
controversy typically. When people ask whether or not this has a data race,
this is an example that [indiscernible] is actually fond of pointing out as a
counter example to all sorts of interesting things.
So we have a program here. Again, we assume as in all of the examples that X
and Y are initially zero. I then have one thread that checks X. If it's
non-zero, sets Y to 1, and the other thread sort of does the converse of that.
The question is, does this have a data race. Many people look at that and say
well, they touch the same variable. So yes. But actually, it does not have a
data race.
The way to convince yourself of that is that a data race is defined with respect
to sequentially consistent execution. There's no sequentially consistent
execution of this in which either assignment is executed. So, therefore, this
is effectively a program without any assignments, so there's no way there can
possibly be a data race.
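[The example as a minimal sketch:]

    int x = 0, y = 0;   // no sequentially consistent execution ever writes either

    void thread1() { if (x != 0) y = 1; }
    void thread2() { if (y != 0) x = 1; }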
So this gets back to the sort of original motivating example. If we have a
structure with two character fields, and we assign 1 to the A field in one
thread and 1 to the B field in the other thread, does that have a data race in
C++11? This does not have a data race. Basically, the A field and the B
field are separate memory locations. They're separate scalar objects. They
have nothing to do with each other.
So an assignment to one does not conflict with the other one, and this has to
work correctly. Under Posix rules, this was actually intentionally
implementation defined. If you try to do this on an Alpha, you may well get the
wrong answer. But yeah.
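[A minimal sketch of the structure in question:]

    struct S { char a; char b; } s;

    void thread1() { s.a = 1; }  // a and b are distinct memory locations,
    void thread2() { s.b = 1; }  // so this is race-free and must work in C++11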
>>: Assuming that the compiler [indiscernible] padding in these kind of data
structures?
>> Hans Boehm: Actually, fortunately, it turns out the only major architecture
that really had trouble with this was Alpha, which is no longer a very
interesting architecture. And Alpha only had trouble with it until 1995, as it
turned out. So even if Alpha were still around, this wouldn't be a problem.
The constraint on the hardware really is that you need byte store
instructions in order to implement this. And everything other than pre-'95
Alpha basically has byte store instructions these days.
>>:
The rules, if A and B were bit fields -- do those rules reference bytes?
>> Hans Boehm: No, actually, that's the next example. So if I try to do
what is sort of logically the same thing, now the answer is different. This has
a data race. The difference here is that A and B are part of the same contiguous
sequence of bit fields. Technically, zero-length bit fields play a
special role here, but they already did before this.
But that aside, these are part of the same memory location, so therefore this
and that both assign to the same memory location, in the
terminology of the standard, and there is a data race. So this is not allowed.
The mixed case is interesting in that, by the rules of the standard, this
structure here is two memory locations. The A field is a memory location, and
the sequence of bit fields containing only B is the other memory location. So
this does not have a data race. So this should be okay. On the other hand, it
turns out that if you try that with pretty much every 2011-and-earlier compiler
for X-86 or something, this will give you the wrong answer.
The standard way to implement the assignment to B there was to read the
whole enclosing word, replace the bits, and write the whole enclosing word back.
And the standard basically made that illegal, so compilers have to change to
deal with that. It's not particularly expensive, but it's a change. The code
sequence to implement the assignment to B there changed.
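[A minimal sketch of roughly the two shapes being discussed; the exact field widths are illustrative, not from the slides:]

    struct Bits  { int a : 4; int b : 4; } bits;    // one memory location
    struct Mixed { char a;    int  b : 4; } mixed;  // two memory locations

    void t1() { bits.a = 1; }   // data race with t2: a and b share a contiguous
    void t2() { bits.b = 1; }   // sequence of bit fields

    void t3() { mixed.a = 1; }  // no data race with t4 under the C++11 rules, even
    void t4() { mixed.b = 1; }  // though older compilers implemented the update of
                                // b by rewriting the whole word containing a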
So far, we've been talking mostly about scalar objects. I have one slide, a
couple of slides here on library issues. So what happens if, rather than
performing operations on a scalar object, I perform an operation on sort of a
library container?
So in the case of C++11 here, let's say I have a list of ints and I
simultaneously execute a push front and a pop front operation on that list. So
that also, it's not clear whether that -- it's not immediately clear whether
that has a data race, because it depends on whether I access the same scalar
object at the same time.
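[The example as a minimal sketch:]

    #include <list>

    std::list<int> x;

    void thread1() { x.push_front(1); }  // two concurrent updates of the same
    void thread2() { x.pop_front(); }    // list object (assume it is non-empty)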
But the crucial convention here -- the crucial convention for library
writers -- is that libraries shall only introduce a data race at the scalar
object level if there is sort of logically a data race at the library object
level. So if I have two writes to the same list object at the same time, as in
this case, the library is allowed to introduce a data race. And that usage is
not allowed, so this has a data race.
If I perform two updates to different library objects at the same time, that's
not allowed to introduce a data race. So if I have a user implemented
allocator underneath this that's shared across all instances of lists, it has
to make sure that it does enough locking so that operations on different lists
can proceed in parallel without interfering with each other.
But it doesn't have to do locking in order to make this sort of access actually
safe. And that's the default convention for libraries in C++. It's kind of
interesting, because when you look at the literature here, people
distinguish between thread safe and thread unsafe. Really, the default
convention, I claim the convention you usually want, is actually somewhere in
between. It's precisely this convention: that the notion of data race at the
library level reflects what it would be at the scalar object level.
>>: Does this mean that every library object has to be completely
self-contained? You couldn't have a library object that exposes some internal
part as a separate object, because then you're no longer going to be able to
follow [indiscernible] data race.
>> Hans Boehm: This is a default rule. So you can certainly have libraries
that are exceptions to this. So there are libraries that are going to export
stronger properties and are designed for concurrent access. They're more
analogous to atomic scalars than --

>>: For stronger, I understand.
>> Hans Boehm: For weaker, you're going to have to document that kind of
information if you're going to use it.
>>: It's not a hard rule.
>> Hans Boehm: It's not a hard rule. It's one that's followed by the standard
library, except when it specifies otherwise. Yeah.
>>: Isn't the implication of [indiscernible] sufficient? It's okay to have a
race on object like [indiscernible].

>> Hans Boehm: That's okay, right. This is probably stronger than it should be
here, you're right.

>>: Okay.

>> Hans Boehm: Yeah.

>>: Perform concurrent reads on objects in the library?

>> Hans Boehm: Two concurrent reads on objects have to work without
introducing a data race. So if you're implementing a splay tree, beware. Or
specify that that is not thread safe and the client has to beware.
So here's a somewhat weird case. What happens if I have an infinite
loop, and then I assign 1 to X, while assigning 2 to X in the other thread? Is
that a data race? Actually, in Java, it's not. And in C++, it
turns out the answer is also that it's not a data race, and it's very hard to
define this to be a data race. On the other hand, it turns out that there are
various reasons why you would really like compilers to be able to interchange
code across that infinite loop.
So for that reason, and for other weird historical reasons, it actually turns
out that in C++11, the rule is that infinite loops like this, that have no I/O
effects and no synchronization effects, themselves invoke undefined behavior.
So this does not technically have a data race, but it has the same semantics as
if it had one. But that's because the infinite loop itself is a bug.
So here's one. I'm not sure that it's particularly relevant to this audience
here. The important point here is that what I'm doing is I'm setting a
variable X, and then I'm using a condition variable. I'm notifying a
condition variable when I'm done setting X. And the other thread sort of
checks if X is equal to zero; if I happen to execute -- this is all done inside
a critical section here -- if I happen to execute before this critical section
executed, then I wait for this critical section to set X equal to 42. And the
question is, does this have a data race? Can I safely access X after that? Do
I know that X has been initialized?
The answer is yes, it does have a data race. The reason it has a data race is
because in C++, like almost every other language, condition variable waits
can wake up spuriously. So the fact that you executed a condition variable
wait tells you absolutely nothing about the state of the computation. So
this --

>>: So if you put it in a loop --
>> Hans Boehm: If you put it in a loop, then this does not have a data race
and it's okay. So I mention this primarily because I regularly see research
paper submissions about how to do flow analysis, how to analyze programs with
condition variable waits, and the answer is: you don't.
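[A minimal sketch of the corrected pattern, with the wait in a loop:]

    #include <condition_variable>
    #include <mutex>

    std::mutex m;
    std::condition_variable cv;
    int x = 0;

    void setter() {
        { std::lock_guard<std::mutex> g(m); x = 42; }
        cv.notify_all();
    }

    void waiter() {
        std::unique_lock<std::mutex> lock(m);
        while (x == 0)        // the loop is essential: waits can wake spuriously
            cv.wait(lock);
        // x == 42 here, and there is no data race: x is always accessed under m.
    }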
There's actually a similar, somewhat less expected situation with try lock.
This is a really weird program which uses locks backwards. It uses mutexes
backwards. So what I do is I set X to 42, and when I'm done, I lock the
mutex. The other thread waits for the mutex to be locked, and then the question
is, can it conclude that X is now 42?
The answer is no, and there's sort of a difference here between the official
explanation and the real reason for it. The official explanation in the
standard is that try lock can spuriously fail. So just the fact that try
lock failed to acquire the lock doesn't mean the lock was actually unavailable.
The real explanation is that you don't really want to implement
try lock that way. On the other hand, in order to make this work correctly,
you would have to prevent reordering of those two operations, and it turns out
that's expensive on a bunch of hardware and useless for real code. It's only
useful for code that you really don't want people to write.
Double-checked locking -- this is, for many of you, a well-known example. This
used to be an advocated idiom for programming with threads. The issue here is I
want to initialize X on demand before I access it, but I want to do this in a
way that I don't have to acquire a lock every time I access it.
So I could obviously do this correctly if I just used the code here without
the conditional at the beginning -- if I just acquired the lock protecting
X and then checked whether it has been initialized; if not, initialize it and
so on.
But it was recommended that you check at the beginning, before you acquire the
mutex, and then if it's not initialized, you acquire the mutex so that only
one thread can initialize it. And the answer is, as many of us know, I think,
at this point, that this is still incorrect. The problem is that the assignment
to the initialized flag races with the initialized access outside the critical
section. And, in fact, there's no real guarantee that this will work. In
particular, the compiler can interchange these and break the code.
This one is sort of, I don't know how -- yeah?
>>:
So you advocate [inaudible].
>> Hans Boehm: Yeah, I should have mentioned the way out of that is to make
the init flag an atomic bool. Good point, yeah.
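[A minimal sketch of double-checked locking done that way; the names are illustrative:]

    #include <atomic>
    #include <mutex>

    std::mutex m;
    std::atomic<bool> initialized(false);
    int x;  // the lazily initialized data

    void lazy_init() {
        if (!initialized) {                    // fast path: no lock acquired
            std::lock_guard<std::mutex> g(m);
            if (!initialized) {                // recheck under the lock
                x = 42;                        // do the initialization
                initialized = true;            // atomic store, seq_cst by default
            }
        }
    }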
So yeah, I was told that it's okay to run over here. Hopefully, that's --
yeah. So here's another example, which is really of interest mostly to C++
programmers. It's much more C++ specific than the rest. It's sort of a trick
question: do these things race? So what I'm doing here is, while in a critical
section protected by a mutex M, I push something onto the front of a list, and
then I have an infinite loop which goes around and acquires the mutex
occasionally and checks whether the list is empty, and does something with the
entry on the list if it's not empty.
Does this have a race, a data race? The answer is yes, but maybe not the one
you expected. These two don't actually race, because the accesses to X here are
all inside the critical section protected by M. The problem with
having an infinite loop that's providing this service here and looking at the
list regularly is that at some point, when the program shuts down, X
is going to be destroyed. The destructor for X is going to run, and this guy
is still going to be running because, after all, it's an infinite loop. So you
end up introducing a data race here between X's destructor and the infinite
loop.
So in this model, basically, having threads that run forever until the process
shuts down is really not acceptable. It turns out C++11 provides some notion
of detached threads which, to a first approximation, you just shouldn't use.
Let me quickly say something about the implementation consequences. I think
we've already gone through a lot of this. So the main implementation
consequence is implementations may not visibly introduce memory references that
weren't there in the source.
So one example of that was reading and rewriting an adjacent structure field
when you're assigning to one field of a structure. So there's lots of
implementations that actually do this. I'll give you another one really
quickly.
Other than that, basically, this model restricts reordering of memory
operations around synchronization operations substantially, and the compiler
has to be careful, and the synchronization operations have to include memory
fences and so on, to ensure that.
On the other hand, within a region of code that contains no synchronization
operations, the compiler is free to reorder basically at will, because those
regions look atomic thanks to the data-race-free property.
>>: Is that true for both weak and strong atomics?
>> Hans Boehm: That's true for both weak and strong atomics. But the weak
atomics count as synchronization operations in terms of determining the
synchronization free regions.
So, hardware requirements: we already said that we need byte stores. Also --
and this is sort of a longer talk by itself -- the hardware has to be able to
enforce sequential consistency. If I end up writing all of my code using
atomics, I have to be able to implement that so it looks sequentially
consistent. And it turns out that introducing fences between every pair of
instructions often doesn't work.
So, for example, on Itanium, that's not sufficient. But those architectures
generally have other mechanisms for enforcing that.
>>: Is it easy to say why it's not sufficient?
>> Hans Boehm: The basic problem -- yeah, do you know about the independent
reads and independent writes example? Let me talk to you about it afterwards.
I think that's a fairly long discussion.
So, compiler requirements: we don't get to introduce memory references. We
don't get to introduce data races that weren't already there. And part of that
is that struct fields and bit fields need very careful treatment. It turns out
there are also other cases, though, where compilers naturally want to introduce
stores that weren't there originally.
So this was really sort of the example that motivated my looking at this in the
beginning. We have a loop here which, every once in a while, checks: am I
multithreaded? If I'm multithreaded, I acquire a lock. So I should have said
something dot lock here; this is an old slide.
So if I'm multithreaded, acquire a lock. And again, at the end, if I'm
multithreaded, acquire -- sorry, release -- the lock. In between, I use some
global variable G.
So the problem is that with fairly traditional compiler optimizations, as you
sometimes find in textbooks, you can optimize this in the following way,
especially if you have profile feedback information that tells you that usually
this program is single threaded, so you know that typically these lock
and unlock operations are not executed. What the compiler can do is promote
the global variable to a register. As far as the compiler is concerned, lock and
unlock are function calls that it knows nothing about.
So what it's going to do around these lock and unlock function calls is: I
load the global G into a register; around these function calls that I know
nothing about, I store the register back into the global and reload it after
the function call; do the same thing down here; and at the end I take the
register value and assign it back to the global.
For sequential code, where lock and unlock are just function calls that I know
nothing about, this is a perfectly good optimization, and the code runs faster
if, in fact, MT is usually false.
If I look at this as a multithreaded program, and I understand that this is
checking whether we're multithreaded and that these are lock and unlock
operations, the output here is complete gobbledy-gook, right? G was accessed
only inside the critical section; now it's accessed repeatedly
outside the critical section. So basically, don't do that.
On the other hand, compilers did do that. They actually do it fairly
frequently in this next case, where it's sort of easy to understand why they
would do it. If I have a loop that counts the number of positive elements in a
list, and say count is a global, it's tempting to promote count to a register:
say, register equals count, increment the register in the loop, and then
store it back at the end.
Again, this is potentially introducing a store: if there are no positive
elements in this list, I've just introduced a store to count where there
wasn't one, and I've introduced a data race. So don't do that either, and
compiler writers like that less.
>>: So in the previous example, the implication is a compiler has to be
conservative, right, if you have a function call -- you could implement lock
from an underlying atomic.
>> Hans Boehm: Right.
>>: And a function call could be a locking operation or not a locking
operation.
>> Hans Boehm: Right.
>>: So the compilers now have to be conservative with respect to whether calls
might do synchronization operations?
>> Hans Boehm: In a sense. I mean, in general, compilers are not justified in
introducing stores to variables where there was no store in the source. That's
really the rule. So you have to be really careful about this sort of
speculative register promotion. You don't get to --

>>: But if you have a CSE involving memory -- never mind.
>> Hans Boehm: I mean, actually, it's not too painful. The latest version of
GCC actually does this correctly. I don't know about Visual Studio.
>>: So one way to fix it is do it at the end, when you do the write, if the
register has changed.
>> Hans Boehm: Right. And that scheme in general works, and that's what GCC
ended up doing. There are cleverer schemes as well, which turned out not to
work as well. But that one certainly seems to work. If you do this, but
also, in addition, set a flag whenever you actually increment the register --
so whenever you would have assigned to count, set a flag -- and then do this
store back only if the flag is set, that solves the problem.
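[A minimal sketch of that flagged store-back, for the count-the-positives loop; the names are illustrative:]

    extern int count;   // shared global

    void count_positives(const int* p, int n) {
        int reg = count;            // promoted copy of the global
        bool dirty = false;         // set only when count would have been written
        for (int i = 0; i < n; ++i) {
            if (p[i] > 0) { ++reg; dirty = true; }
        }
        if (dirty) count = reg;     // no store is introduced if nothing was counted
    }

    // The speculative load of count is still introduced; as discussed next, that
    // is acceptable when compiling to machine code, though not as a
    // source-to-source C++11 transformation.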
It solves the problem in part because of the next slide here. Sometimes
adding data races at the object code level is okay, as long as they're not
observable. If you're writing a C++11-to-C++11 transformation system,
you're never allowed to add data races, because that changes defined
semantics to undefined semantics.
But if you're compiling C++11 to some machine code, machine code does not have
undefined semantics for data races, in general. It might effectively, because
in some hypothetical ideal world, we might want to have the hardware check
for data races, but current hardware doesn't, unfortunately.
So in a case like this, when we're only reading the global, it's actually
acceptable to read it speculatively outside of the loop where we might not
otherwise have read it, because we can show that sort of based on the semantics
of the underlying hardware, this actually has no impact. The user can't see
this.
But as a C++11 to C++11 transformation, this transformation is not legal.
The news isn't all bad, actually. I don't want to go into details here. As a
result of clarity in the memory model, there are actually certain kinds of
analyses that we traditionally haven't done which we now know for sure actually
are correct.
So, for example, if we have this program here, which assigns 2 to X and then
executes this loop, which performs a critical section in here, but doesn't
assign to X inside the loop, only references X there, we actually now know that
X is constant. We can prove that based on the data-race-free assumption, in
spite of the fact that there's a critical section in here, so long as there's
only one critical section in here. This is joint work with a student at the
University of Washington and two of my colleagues at HP labs.
So I'll conclude here with some sort of explanation. I've tried to convince
you that data races are bad and the standard tells you basically, don't use
them. However -- yeah?
>>: If it needs a leading researcher and two grad students to figure out that
a [indiscernible] is free of data races, how are we as programmers to manage
our business?
>> Hans Boehm: Not that this is free of data races. The fact that X actually
is constant here. So this is something that the compiler will figure out. The
programmer doesn't need to figure this out. It's an optimization problem.
The other question you're asking is still a good one, though, which is how do
you know that your program is actually free of data races. And the answer is,
though I haven't put up a good example of this, I think we should actually rely
on data race checkers a lot more than we do.
Personally, I think the right place we should be headed, though it's going to be
really difficult to get there, is a state where the hardware
actually does data race checking. But we're going to have to accept some small
performance loss in order to do that, and we're going to need hardware support
in order to do it.
So what I wanted to convince you of is that, even if you don't believe the
language specification, there are actually all sorts of things that do go wrong
in practice, or could potentially go wrong in practice, if you program with data
races. The other way to look at it is that these are the kinds of
transformations you may see that actually motivated the catch-fire semantics
for data races.
So here's a simple example of where things can go wrong in very unexpected ways
as a result of putting a data race in your program. I'm checking: is X less
than 3? If X is less than 3, I perform a switch on the three possible values
of X. And now let's see what happens when the compiler translates this.
Let's say the compiler translates this relatively naively, with one
exception, which is that it transforms the switch into a branch table, which is
common. So it will use the value of X to index into a table of branch targets
and then branch to the right table entry.
The one clever thing that it decides to do is that it knows that X is less than
3 from up here, so it gets rid of the bounds check on the branch table index.
So now what happens, if there's a race and X changes in the middle here to,
let's say, 3, is that I access a branch table entry that's out of bounds and I
branch to nowhere. So that's one of the things that can happen.
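[A minimal sketch of the source pattern being described; the hazard is in the generated code, not visible here:]

    extern unsigned x;   // racy: another thread may change x at any time

    void dispatch() {
        if (x < 3) {
            switch (x) {              // the compiler may re-read x and index a
            case 0: /* ... */ break;  // three-entry branch table with no bounds
            case 1: /* ... */ break;  // check, since it "knows" x < 3 from the
            case 2: /* ... */ break;  // test above
            }
        }
    }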
There are a bunch of other failure modes for code containing data races that
we've already seen. The loop-invariant code moved out of a loop is one failure
that's actually somewhat reproducible. In the presence of data races, we
have to worry about byte operations and unaligned operations being used to
access variables, so we can have fractional updates.
For something like a done flag here, if I don't declare done as atomic at all,
what's going to happen in practice is that the compiler or, possibly, the
hardware may reorder these two operations so that, in fact, when I see done set
in the other thread here, I'm not guaranteed that data has actually been set to
42.
One interesting case here that people always bring up of data races that are
definitely benign and we should be able to use is the case of redundant writes.
So if I have two threads, both of which set X to 17, that should be even better
than having one of them set X to 17, right? So I should definitely know that X
is 17 at the end.
The answer is in this brave new world, not necessarily.
>>: This could make me unhappy as somebody who writes parallel -- because I
might want to do this for some reason.
>> Hans Boehm: But you can still do it using atomics, right?
>>: But do I know that I don't pay for extra fences or --
>> Hans Boehm: Well, I mean, if you're willing to live dangerously, but not
this dangerously, you can always say memory order relaxed, which basically --
>>: Yes, but then I cannot know what happens at all. If I want this allowed
but nothing else bad?
>> Hans Boehm: I mean, it's different -- you don't really see a performance
difference there, right. I mean, I think this is very hard to come up with
architectural cases where that actually makes a performance difference.
>>: So I could imagine that I have an array and computing [indiscernible] and
nobody has computed it and then I compute it in [indiscernible] and I can't
have stuff happening in the memory model in real machine.
>> Hans Boehm: The problem is in order to actually get that guarantee, and
it's very difficult to express that guarantee in a language, and it also ends
up inhibiting some transformations that probably, in the long run, we want for
a performance benefit that's really very nebulous, I think.
>>: If I did not [indiscernible] in this case, it seemed to illustrate
[indiscernible] as an algorithm.
>> Hans Boehm: Except there's no performance -- I think you'll be hard pressed
to come up with a case on mainstream hardware in which there's actually a
performance difference between memory order relaxed and the racy store.
>>: Yeah, yeah. I guess so what you're saying is use memory order relaxed and
it will essentially just drop down to the hardware in terms of guarantees that
I get.
>> Hans Boehm: It's not quite the hardware. You get cache coherence, which is
why [indiscernible] a little bit. But that's actually a useful guarantee.
People are very surprised when they don't get cache coherence and on some
hardware, you don't.
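For the redundant-write case being discussed, here is a sketch of what "just
say memory order relaxed" might look like; this is illustrative code, not from
the talk:

    #include <atomic>

    std::atomic<int> x{0};

    // Both threads store the same value. As relaxed atomic stores this is not
    // a data race, so "x == 17 afterwards" is guaranteed again, and on
    // mainstream hardware a relaxed store typically compiles to the same plain
    // store instruction, with no extra fence.
    void thread1() { x.store(17, std::memory_order_relaxed); }
    void thread2() { x.store(17, std::memory_order_relaxed); }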
>>: Okay. I guess I didn't understand what you actually --
>> Hans Boehm: Okay.
>>: [indiscernible] would be sufficient.
>>: So did you explain the problem here?
>> Hans Boehm: No, I am about to.
Sorry, and then I'm pretty much done here.
Sorry. So the problem here, the reason why assigning 17 to X twice isn't
necessarily better, is we've already seen a bunch of cases where
self-assignments introduced by the compiler caused problems. And the compiler
really wanted to introduce a self-assignment, but it introduces a race.
So, like in this structure case, if we look at the other field, what we're
doing is basically assigning the other field to itself behind the covers.
Normally, in this memory model, it's not okay to do that, because if we
introduce a self-assignment, as we saw earlier, we can hide an assignment in
the other thread. So if I just have the compiler introducing X equals X while
another thread is doing X equals 17, that's not okay, because the assignment to
X here can be hidden, right.
On the other hand, compilers sometimes want to do that. And the problem is
that in a -- it actually turns out in this memory model, if I see X equals 17,
it becomes legal to introduce this self-assignment. Because I know that
there's no data races, nothing is racing with this, so therefore if I put X
equals X after it, that's okay. So some of those dubious transformations that
I told you about before actually can be re-enabled in cases in which I have a
visible assignment to X without intervening synchronization.
So X equals 17, again, can be safely transformed to this.
>>:
That's if you assume no data races?
>> Hans Boehm: If I assume no data races, right. And the compiler will assume
no data races, right. So the problem I have now is if I have X equals 17 here,
it's actually legal to transform this to a self-assignment X equals X in thread
1, followed by X equals 17. And the same way, but in the opposite order here.
And now I can select an interleaving where as a result of the redundant
assignments of 17 to X, I actually see neither.
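Spelled out as a sketch (assuming x is a plain, non-atomic int, initially 0),
the transformation and the bad interleaving look like this:

    // Source program (x is a plain int, initially 0):
    //   Thread 1: x = 17;          Thread 2: x = 17;
    //
    // Under the no-data-race assumption, the compiler may legally rewrite each
    // thread's store with an adjacent self-assignment:
    //   Thread 1: x = x; x = 17;   Thread 2: x = 17; x = x;
    //
    // One interleaving of the transformed code (a self-assignment is really a
    // load followed by a store):
    //   T1 loads x         -> reads 0
    //   T2 stores 17       -> x == 17
    //   T1 stores 0 back   -> x == 0   (T1's self-assignment completes)
    //   T2 loads x         -> reads 0
    //   T1 stores 17       -> x == 17
    //   T2 stores 0 back   -> x == 0   (T2's self-assignment completes)
    //
    // Final value: x == 0, even though both threads "redundantly" stored 17.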
Actually, let me skip this, because I think we already motivated why
self-assignments can be added. Let me just conclude with this quick slide
here. So the other question is, if you want to introduce data races, what do
they actually buy you, and this is sort of more or less asking that question
here. This needs more explanation here.
What I'm doing is I'm running a parallel Sieve of Eratosthenes program, which
is sort of propagating different primes, eliminating multiples of different
primes by different threads. In each version, it's doing the elimination by
storing to a byte array each time.
The top line has each store to the byte array protected by a critical section,
by a mutex lock/unlock. The middle line uses a sequentially consistent atomic
operation. The bottom line uses a plain store operation, as you would get from
either a race or a memory order relaxed operation. It turns out those probably
would generate exactly the same code here, since this is run on x86.
And, in fact, memory order release would also generate the same bottom line
here. So what was interesting to me here, at least, is that if you look at the
single threaded performance, and I sort of cut off the top here so you can't
really see it, but, in fact, there's a huge overhead associated with either a
mutex or sequentially consistent atomic, because this is really store-heavy
code, and it turns out on x86 a sequentially consistent store comes with a [indiscernible] fence.
So the fence sort of dominates the running time at the single-threaded end.
As you scale up to higher thread counts, the difference on this machine, at
least, essentially disappears, which initially people might find surprising.
But in retrospect, actually, it makes sense, because I think what I'm doing
here is I'm scaling this up to sufficiently many threads that it's completely
memory bandwidth limited.
And it turns out that all of the synchronization overhead I'm incurring to sort
of protect the store with locks, or the additional memory fences, introduces
primarily much more local behavior, which doesn't interfere with the other
cores.
So in some sense, the point here is that by going through all this effort, what
I'm actually benefitting is primarily sort of performance at low core counts,
rather than at high core counts, at least based on this example.
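A rough sketch of the kind of benchmark being described -- a parallel sieve
where each worker crosses out multiples of a prime in a shared byte array, with
the three store flavors corresponding to the three curves. This is illustrative
code, not the actual benchmark:

    #include <atomic>
    #include <mutex>
    #include <vector>

    constexpr long N = 100000000;
    std::vector<std::atomic<char>> composite(N);  // one byte of "crossed out" state per number
    std::mutex m;

    // Top curve: every store protected by a mutex lock/unlock.
    void cross_out_locked(long p) {
        for (long i = 2 * p; i < N; i += p) {
            std::lock_guard<std::mutex> g(m);
            composite[i].store(1, std::memory_order_relaxed);
        }
    }

    // Middle curve: sequentially consistent atomic stores (the default).
    void cross_out_seq_cst(long p) {
        for (long i = 2 * p; i < N; i += p)
            composite[i].store(1);
    }

    // Bottom curve: relaxed stores, which on x86 compile to the same plain
    // store instruction a racy non-atomic version would use.
    void cross_out_relaxed(long p) {
        for (long i = 2 * p; i < N; i += p)
            composite[i].store(1, std::memory_order_relaxed);
    }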
>>:
Do you know why the blue dots are all over the place?
>> Hans Boehm: This is a highly non-deterministic example, because there are
no data races in the blue one, but there are definitely races as to who gets
which prime. So that's my explanation. I didn't get very detailed profiles.
But it's not too surprising.
So the summary, basically, because I'm way over time here, is: basically, don't
use data races. Data races are evil. Any questions? And this is my usual list
of references. The best description of the C++11 memory model is probably the
one by the Cambridge group here, which is also the most mathematically intense.
So depending on which variant you want, this is a lot more precise than what
the standard actually says.
>>: So in safe managed languages, like Java or .NET, the safe subset thereof,
anyway, we try hard to ensure that data races do not cause violations of type
safety. And I guess there's two motivations. One is limiting what your
program can do if it's erroneous. And the other is debuggability in the face
of data races. How valuable do you think those are, and how problematic is
giving them up in the catch-fire semantics, do you know?
>> Hans Boehm: That's an interesting question. As you probably know, we
have -- I didn't talk about Java here very much at all. So the state of the
Java memory model is sort of clear in the data race free case. Everything
works the same as in C++. There was an attempt to define what data races mean,
and I think that attempt was generally unsuccessful.
So as you said, but I'm not sure that motivation is just to produce -- to
preserve type safety. I mean, I think in the Java setting, the way people
generally perceive it is the security model allows you to run untrusted code
inside your address space. Once you say you're going to run untrusted code
inside your address space, I think you have to preserve type safety.
I think we could probably ensure that you preserve type safety in the presence
of data races just by mandating that explicitly without saying anything
terribly complicated. That itself is not sufficient for Java, because it turns
out I also need the sort of absence of out-of-thin-air results guarantee. I
need to be able to guarantee that somebody can't manufacture a perfectly
type-safe pointer to a password string that they weren't supposed to have
access to.
And ensuring that turns out to be incredibly difficult, but I think it's also
quite important unless we really give up completely on the Java approach to
security.
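The standard textbook illustration of an out-of-thin-air result, written here
as a hypothetical racy C++/Java-style sketch (x and y are plain shared
variables, initially 0):

    // Thread 1:            Thread 2:
    //   r1 = x;              r2 = y;
    //   y = r1;              x = r2;
    //
    // Each thread only copies a value it read, yet a sufficiently permissive
    // model would allow the outcome r1 == r2 == 42: the 42 appears "out of
    // thin air". The Java model must forbid this; if the values were
    // references, a thread could otherwise conjure up a perfectly type-safe
    // pointer (say, to a password string) that it was never given.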
>>: So one of the ways you could measure if the model is useful to programmers
is to see if programs written in this model are correct. And the path is clear
for the programs that stick to the strong stuff. It's not so clear for the
programs that really try to, you know, use more relaxed -- code with more
relaxed accesses. Do you know of any successful attempts to prove correctness
of such low-level code?
>> Hans Boehm: I don't know of any, really. And I completely agree that's a
concern. I mean, so long as you stick purely to the sequentially -- to the
sequentially consistent subset and data race free programs, it's fine. And I
think in many ways this simplifies matters a lot there, right, because of the
interleaving granularity issue.
>>: [indiscernible] for sequential consistency and ignore the rest of the
problem. This would allow you to actually do that.
>> Hans Boehm: Right. In some sense, it sort of gives you a solid footing for
what everybody has been doing anyway, which is limiting the number of
interleavings you need to consider to only interleavings of synchronization
free regions. And you can do that also for proving the absence of data races,
I believe. So yeah, for that we probably have a good story, but I completely
agree that for the general model, we don't really have a very good story.
>> Jim Larus:
Any other questions?
Let's thank Hans.