>> Jim Larus: It's my pleasure today to introduce someone who, looking around the room, really doesn't need an introduction. Hans Boehm has been one of the leaders in the programming languages, compilers, and memory systems community for as long as I can remember, and knows many of you quite well. We're fortunate to have him here today, giving a talk on some of his more recent work on memory models and C++.

>> Hans Boehm: Thank you, Jim. Before I go on here, I should say that what I'm describing is mostly joint work with lots of other people, some of whose names are up here. In particular, I'm going to be advocating an approach to memory models that was really pioneered by Sarita Adve a long time ago and was, in some less formal, less precise form, actually incorporated into other programming languages even before that. I'm going to be talking about threads and shared variables. Occasionally the question comes up why that's actually the right subject to talk about; it does seem to be controversial these days whether we should be writing parallel programs by actually sharing memory at all. If you look for the phrase "threads are evil" in Google, for example, the last time I tried you get about 32,000 hits, and that number goes up every time I try it. On the other hand, threads are clearly a convenient way to express processing of multiple event streams, or to keep GUIs responsive while you're doing some processing in the background. And at the moment, aside from the HPC community, they're also the dominant way to really take advantage of multiple cores in a single machine. So whether you like them or not, they seem to be pervasive.

So what I want to talk about is the basic C++11 and C11 approach to threads and shared variables, as it was incorporated into the new standards for both C and C++. Those rely heavily on a notion of atomic objects that's somewhat different from what programming languages have traditionally had, so I'll spend a little more time talking about those. And then I'll spend a lot of the talk on data races in particular, which play a core role in the memory model, trying to convince you that those are just a bad idea and that you should never write code that has data races in it. I'll also talk a little bit about the implementation consequences of the C++11 and C11 memory model.

So the state of the world before we started -- and this is now quite a while ago, around 2004, 2005 -- was that people programming in C and C++ were basically programming in a single-threaded programming language. The programming language said essentially nothing about threads, though lots of people were actually using threads, in the form of an add-on thread library: around here probably mostly Windows threads, elsewhere often Posix threads. Unfortunately, if you looked at those specifications, they were either intentionally or unintentionally unclear about what even some very basic multithreaded programs meant. So if we look at about as basic a multithreaded program as you can think of -- one thread assigns one to X and the other thread assigns one to Y -- you would really like to know that by the time you're done, both X and Y have a value of 1.
If you look at the Posix specification, that was actually left intentionally unclear, motivated by some hardware and compiler implementation considerations. Basically, what people were afraid of is that an assignment to a small variable here might, in fact, read and rewrite adjacent memory -- which, in some implementations, it did -- so it could overwrite an adjacent struct field. And based on my reading of Posix, it was also allowed to read and overwrite an adjacent unrelated variable, even if these were not struct fields. So there was actually very little you could say about what this program meant. And there were much more interesting examples which actually caused serious problems occasionally. Not that frequently, but when they did arise, it was really difficult to figure out what was going on. Probably the worst consequence of this situation, in my opinion, was that it's really very difficult to teach people about multithreaded programming, which is difficult enough without being able to give them a consistent story of how things actually work. If this doesn't work all the time, then it's very difficult to teach how any of this works at all.

So what we start with for the C and C++ memory model -- which is what everybody starts with, and what we'll start with here -- is the model that everybody thinks they want from a programming language, which is sequential consistency. Usually, when people think about threads, they think about threads executing as though the actions of the individual threads were just interleaved. So if we have thread 1 doing this and thread 2 doing that, they might be executed as though these statements were shuffled together -- interleaved in this way -- so that the actions performed by each thread are performed in order. And whenever you look at a shared variable -- here I'll use X, Y and Z to denote shared variables, and names starting with R, which you can think of as registers, to denote local variables -- the value that I see is the last value assigned to it in that interleaving. So that's what everybody thinks of as sequential consistency. That's what people usually, initially at least, believe threads should behave like.

In addition to just interleaved execution of threads, we sometimes need to control the interleaving, as I think everybody in this audience knows. In reality, there are many cases in which we don't want to allow arbitrary interleaving of thread actions. For example, if we want to increment a variable -- which really is going to be implemented under the covers as loading the variable and storing back the result -- and we want to do that concurrently in two threads, typically we want to prevent the execution of those two threads from being interleaved in the way that I've shown here, where I read X in one thread, read X in the other thread, increment the temporary value in both, and write back the same value in both threads. That ends up with X only getting incremented by one instead of by two, as I had intended. So the way we do that is we introduce some sort of mutex or lock mechanism that prevents this kind of interleaving -- that basically prevents one of those threads from executing while the other one is busy accessing X.
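(For illustration only -- a minimal sketch, not from the talk, of that mutex-protected increment, written with the C++11 std::mutex and std::lock_guard facilities that come up next; the names counter and worker are invented.)

    #include <mutex>
    #include <thread>

    int counter = 0;     // ordinary shared data
    std::mutex m;        // protects counter

    void worker() {
        for (int i = 0; i < 100000; ++i) {
            std::lock_guard<std::mutex> guard(m);  // lock released when guard is destroyed
            ++counter;   // the load-increment-store can no longer interleave with the other thread's
        }
    }

    int main() {
        std::thread t1(worker), t2(worker);
        t1.join(); t2.join();
        // counter == 200000 here; without the lock_guard the two
        // read-modify-write sequences could interleave and lose updates.
    }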
So we make sure that we have some notion of locks, so that only one thread can hold a lock at a time. We can define that very precisely if we want to, in this interleaving-based semantics, by saying that we only allow interleavings in which one of these lock actions can be scheduled only while no other thread is holding that lock on the mutex. With that restriction, rather than allowing all interleavings between these two threads, we only allow these two possible interleavings, because we can't schedule the second lock operation until the first unlock completes. So the only acceptable schedules are those two. And that's all easy to make precise. However, it has two problems.

Actually, before I get there, sorry, I should say how this is actually written in C++11. In C++11, we now have a mutex concept. We can lock it and we can unlock it when we're done. Or, alternatively, since this is C++, you would like a more C++-like syntax, where you declare a variable whose constructor acquires the mutex -- whose constructor executes M dot lock -- and whose destructor executes M dot unlock, so that if you leave this block via an exception, the lock still gets released and the right things happen. For those of you who are used to Java synchronized blocks, this gives you essentially the same effect, or something like that.

So that gives us a very nice model based on sequential consistency, but it actually has a couple of problems. The one that everybody, I think, realizes is that there are implementation issues with sequential consistency. It gets in the way of some optimizations that we would really like to be able to perform. Here is a fairly standard example that people use in this context, which is based on Dekker's mutual exclusion algorithm. The idea is that I assign 1 to X in one thread and 1 to Y in the other thread, and then each thread reads the other variable next. If I view this execution as purely interleaved, it has to be the case that either the assignment to X or the assignment to Y goes first. One of those has to execute first, meaning whichever thread gets scheduled second has to see a value of 1. So if this one gets scheduled first, the read of Y into R1 has to see a value of 1 for Y, rather than the initial value of zero, which I will assume everywhere here. So basically, one of these loads has to see the store in the other thread.

On the other hand, in real life there are a number of optimizations that counteract that and generally prevent this from being true. In particular, both the compiler and the hardware are likely to want to break this. What compilers will typically observe is that if this appears in some bigger context, and I'm loading Y and using Y further down, I can give the hardware more time for the load to complete if I schedule the load of Y earlier. So the compiler is likely to want to move this load up above the store to X. And to the compiler -- especially a compiler that started life as a sequential compiler -- it looks like these two statements don't interfere. They're independent; they don't touch the same variables. So it looks like they can be perfectly safely interchanged, and as a result, that compiler transformation makes sense.
Even if your compiler doesn't interchange these, it turns out that any reasonable modern hardware you're likely to run this on, when you assign 1 to X, is really not going to put 1 into memory or even the cache at that point. It's going to put 1 into a store buffer someplace, scheduling it to eventually make it to the cache. It's not going to wait for that store to reach the cache, because that would slow down the hardware. So as a result, when you assign 1 to X here, that assignment isn't visible to the other thread for a while, because the store buffer is only visible locally. That has the same effect as if these two statements were interchanged by the hardware, which again makes it possible that we end up in the scenario where I read the initial, pre-assignment value of zero for both X and Y, because in the simple case both loads can really occur before the stores.

That's actually not the only reason I want to do this. It turns out that, as a programming model, I'll argue sequential consistency isn't all it's cracked up to be either. In particular, the really nasty property that sequential consistency has is that it depends on the access granularity to memory. So if I have a hypothetical machine which can only access memory a byte at a time, and I store 300 to X in one thread and 100 to X in the other thread, each of those assignments to X is going to get decomposed into assignments to the high byte and the low byte, and those can get interleaved in this fashion, and I can end up with a final value of X of 356 in this case -- not a very intuitive or expected outcome, and not something that people really want to program to.

Maybe even more important, if I move this up a level, I think that when anybody writes multithreaded code, they already assume that we in fact have more than sequential consistency. We don't really reason about program interleaving at the level of individual assignments or store instructions. For example, if I have two distinct arrays, array 1 and array 2, and I call the sort function on array 1 and array 2 in two different threads, I'm not going to explicitly reason about the different interleavings of those two sort operations, because I know somehow that those are independent; they don't interfere; it all doesn't matter. What I really want to argue is that these things don't interfere and, therefore, I don't have to look inside them. I don't have to worry about what actually happens. The actual C and C++ memory model leverages that, and this is the so-called data-race-free model, which I'll describe next.

So the real memory model for both C11 and C++11 relies heavily on this definition of a data race, which is more or less standard, but I'll reproduce it here anyway. We say that two memory accesses conflict if, basically, their order matters -- which basically means that they have to access the same memory location and they can't both be reads. That's a fairly standard definition. So in the preceding example, the assignment X equals 1 -- the store to X -- and the load from X conflicted.
We then say that two memory accesses participate in a data race if they conflict and they can occur simultaneously, meaning they're performed by different threads and there's nothing to enforce an order between them. A program is data-race-free -- and really we mean on a particular input when we say that -- if no sequentially consistent execution results in a data race. So the notion of sequential consistency is still significant; it's used in the definition of a data race here, for example.

>>: With respect to your previous point, a hypothetical byte machine: this is at the programming language level, not at the machine level? Sequential consistency?

>> Hans Boehm: Right. The problem is that it actually matters what level you're talking about. Enforcing sequential consistency at the programming language level is actually a hard thing to define, because usually the programming language doesn't define how many accesses are involved in a particular assignment. At the machine level, it makes more sense, but you don't want to program at the machine level.

>>: So how is that -- like you have to access the same scalar object, so that's where that would come in, right?

>> Hans Boehm: That's really where that issue comes in, right. The way this is actually presented in the standard is that they have to access the same memory location, and a memory location is defined as a scalar object -- and then there's this footnote: for C and C++, a contiguous sequence of bit fields counts as a single memory location. So updating any bit field in a sequence is viewed as potentially updating, affecting, all of them. I'll have some more examples dealing with this later.

>>: But, I mean, this deals with the problem you had before, in the sense that X gets 300 on a byte machine: you would consider the whole sequence as one scalar object?

>> Hans Boehm: In that case, yeah, the whole thing is one scalar object, and the whole thing is a memory location. But so far we've only defined a data race, so let me go on for a little bit.

So then the guarantee -- what we actually get out of the C and C++ memory models -- is sequential consistency again, but we get it only for data-race-free programs. For anything that contains a data race, all bets are off. This is normally referred to as catch-fire semantics, which is the same sort of semantics we assign to an out-of-bounds array assignment or something like that. Anything can happen, and we'll see examples of that. So the program is responsible for preventing data races, and you get to do that either by using locks or -- and I'll talk about this in a few slides -- by using atomic operations, which are new to C++11 and C11. And I'm cheating a little bit here, and I'll talk more about this later: in C++11, there are actually ways to relax the sequential consistency guarantee for data-race-free programs in order to get more performance. Whether or not that makes sense depends on the context.

So if we look back at Dekker's example from before, what happens is that if X and Y are just declared as, say, integer variables, then this program simply has a data race, because X equals 1 conflicts with R2 equals X, and those two can be executed simultaneously. So basically, all bets are off. This program in C++11 has undefined semantics. Anything can happen.
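(For concreteness -- a minimal sketch, not from the talk, of the Dekker-style example written with plain ints; the thread and variable names are invented. In C++11 this has a data race, and hence undefined behavior.)

    #include <thread>
    #include <cstdio>

    int x = 0, y = 0;       // plain, non-atomic shared variables

    int main() {
        int r1 = 0, r2 = 0;
        std::thread t1([&] { x = 1; r1 = y; });   // thread 1: store x, then load y
        std::thread t2([&] { y = 1; r2 = x; });   // thread 2: store y, then load x
        t1.join(); t2.join();
        // Under sequential consistency, r1 == 0 && r2 == 0 would be impossible.
        // Because x and y are accessed racily, the program is undefined; in practice,
        // compiler or store-buffer reordering can easily produce r1 == r2 == 0.
        std::printf("r1=%d r2=%d\n", r1, r2);
    }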
One of the nice core advantages of programming in this data-race-free model is that we no longer really have to worry about instruction interleaving. I already hinted at that before. By not having to worry about instruction interleaving, we also allow a whole bunch of optimizations that we're used to having performed, which now become legal. To illustrate this: basically, the property we're promising is that any region of code which involves no synchronization acts as though it's atomic. We don't have to worry about what happens in the middle of any code region that has no synchronization, so long as we know the program is data-race-free. And the hand-waving argument is this: consider some block of code that has no synchronization -- say it assigns 1 to A and 1 to B. Assume that I could somehow observe a state in the middle of it that would demonstrate that it isn't executed atomically. I can only do that by having an observer thread that looks at A and B and determines that we're in the middle. That observer thread unavoidably has a data race with the thread that assigns 1 to A and 1 to B -- in fact, on both A and B. So things look atomic because any program that could tell the difference, that could prove it's really not atomic, is outlawed.

So that's the quick introduction to the memory model. Let me say something about these atomic objects now, which I've hinted at before. In a less well-defined form, this model is actually very similar to what was required by Pthreads before, and even [indiscernible] 83 way back when, so in some sense we're not changing the approach fundamentally. But if we look back at something like the Pthreads model, basically what went wrong there is that the model outlawed data races, while, as I think many people in this audience know, in order to avoid data races you needed to use locks, basically, or mutexes. Those were viewed as heavyweight and expensive. So what people did is they cheated. They wrote programs that had data races in them anyway. If you look at real Pthreads programs -- many people have demonstrated this in various research papers -- real Pthreads programs typically contain data races. So what we really wanted to do is not give people that excuse for writing programs with data races, and really make it possible to write correct code.

The solution is that we introduce this notion of atomic objects. These are objects that superficially behave like ordinary data; on the other hand, they do allow concurrent access. You can access these things concurrently without introducing data races: the actual data race definition excludes these operations. It only applies to ordinary memory operations, not atomic operations. By default, unless you tell us otherwise, these preserve the simple sequentially consistent behavior. They also act indivisibly, so they also address the granularity issue. But they do give you sequentially consistent behavior. These atomics, it turns out, have a huge impact on the memory model specification, and if you actually try to read the memory model specification in the C++ standard, which I don't think I'd recommend, you'll find that it's mostly dominated by describing the behavior of these atomics. Everything else falls out as an easy special case.
So just by way of illustration, these are easy to use, at least in the simple case. If I want to increment X, and I don't want to use a mutex in order to protect X, I can do that by declaring X to be atomic. Instead of saying int X, I say atomic of int X. This is C++ syntax; C syntax is a little different. And it turns out that by doing this I get an overloaded version of the increment operator that actually does an atomic increment -- I guess what around here would be called an interlocked increment -- implicitly. So that works.

If we go back to Dekker's example and we actually want it to work correctly -- where by work correctly I mean that if X and Y are initially zero, then R1 and R2 can't both see the value of zero -- I can make it work correctly by simply declaring X and Y to be atomic in this world. The catch here is, of course, the bottom line: the compiler and hardware have to do whatever it takes in order to actually make that work and ensure that R1 equals R2 equals zero can't happen.

I should say that this fits in fairly well with a whole bunch of other languages that have vaguely similar constructs, which I should mention here, because some of them are more similar than others. As I said, C++11 has atomic, and it also has types like atomic underscore int for some special cases, basically for C compatibility. C was originally meant to have just these atomic underscore types, and late in the standardization process the C committee introduced a version very similar to the C++ one, but with different syntax, just to keep you on your toes. Java has volatiles, or java dot [indiscernible] dot atomic, which actually have very similar semantics to this. On the other hand, there are various other languages that have constructs that are profoundly different, but related in the sense that they have a similar goal. C# volatiles are different in that, at least last I heard, they don't quite give you the sequential consistency guarantee; they give you a weaker guarantee. OpenMP 3.1 atomics are a lot weaker still, in that they give you very weak ordering: they also give you operations that are indivisible and exempt from data races, but with very weak ordering guarantees. And officially completely unrelated, but unfortunately with confusing naming, are C and C++ volatiles, which really do something else but unfortunately are, in the meantime, often used as a hack to get semantics similar to the atomics here.

>>: If I'm remembering correctly, Java volatiles have semantics which affect memory accesses to non-volatile locations -- like a volatile store can be used as a signaling mechanism to say these happened. Is that the case here?

>> Hans Boehm: That's the case here. That's implicit in the statement that they preserve sequentially consistent semantics so long as there are no data races on non-atomic variables.

>>: Okay.

>> Hans Boehm: Without that guarantee, you wouldn't have that property, because anything that uses an atomic variable for signaling might use the atomic variable to prevent a race on something else, but would not have sequentially consistent semantics.
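(And the corresponding sketch with the variables declared atomic -- again invented purely for illustration -- using the default sequentially consistent ordering.)

    #include <atomic>
    #include <thread>

    std::atomic<int> x(0), y(0);   // atomic shared variables: no data race
    std::atomic<int> counter(0);   // also illustrating the overloaded increment

    int main() {
        int r1 = 0, r2 = 0;
        std::thread t1([&] { x = 1; r1 = y; });   // seq_cst store, then seq_cst load
        std::thread t2([&] { y = 1; r2 = x; });
        t1.join(); t2.join();
        // Now at least one of r1, r2 must be 1; r1 == r2 == 0 is impossible.
        ++counter;   // operator++ on an atomic performs an atomic ("interlocked") increment
    }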
>>: So what you showed in the previous slide, the plus plus operator -- is that a special case, or --

>> Hans Boehm: That's actually a place where C++ deviates from some of these others, and that's sort of an artifact of the language, because of the way atomics are defined: atomics are a template class, so they have a bunch of overloaded operators anyway, and the natural thing to do in that context is to just define plus plus to be an atomic increment.

>>: So there are a few predefined --

>> Hans Boehm: There are some predefined operations that, in fact, are atomic. There's also a compare-and-swap, called compare exchange. But --

>>: Those that are not already there, you have to do yourself, to make sure it works the way you want.

>> Hans Boehm: Right. All of these operators can be implemented with compare exchange, more or less. You know more about the "more or less" than most of us.

Okay. So I'll say a little bit about a loophole that was introduced into the language here in order to address performance concerns. As I said, and we sort of hinted at this, the atomic operations have fairly strong properties, and the compiler has to do a fair amount of work in order to preserve those properties. That can be fairly expensive. It's actually getting less expensive on modern hardware than it used to be, but it's still fairly expensive.

So in order to prevent that from getting in the way in those cases where people would otherwise be tempted to write data races instead, there's another facility that's part of the atomics library in C++, which is that programmers are allowed to explicitly specify weaker memory ordering than sequential consistency, even in the absence of data races. This, as it turns out, greatly complicates the specification. It also greatly complicates the programmer's job: the programmer actually has to understand a very complicated specification in order to use this correctly. There's ongoing work on how to isolate that complexity in a library; that's a non-trivial exercise. So the bad news is that using this facility is much more complicated and much more bug-prone than just sticking to the default sequentially consistent semantics -- probably more so than most programmers appreciate, which makes it more dangerous. On the other hand, it sometimes is significantly faster on current hardware, unfortunately.

>>: So the support for these atomic objects has to be implemented in the hardware somehow? Are you enforcing that on the hardware companies --

>> Hans Boehm: Well, what actually happens these days is that atomics, for small objects, are typically implemented with ordinary load and store instructions plus memory fences. But that's hardware specific. On some hardware there are primitives other than memory fences that enforce ordering -- Itanium and ARMv8 have other primitives to do that -- but the default mechanism these days is memory fences.

>>: Is there any control in the compiler to get rid of the weaker consistency, so you have a gold standard of correctness?

>> Hans Boehm: You mean a compiler flag to basically ignore all of these things? That's not something the standard can really address. I mean, it's something that a compiler implementer could reasonably do. The standard doesn't address compiler flags.

>>: Instead of [indiscernible] for the hardware to make strong atomics [indiscernible] so that we don't have to [inaudible].
>> Hans Boehm: That seems to be going on. I think it's public that ARMv8 has hardware primitives that model these very closely. ARMv8 doesn't exist yet, but clearly we're moving in the right direction.

>>: Is there any understanding of -- suppose everything for me is [indiscernible] and I just program like that -- how slow [inaudible]?

>> Hans Boehm: I'll actually say a little bit about that at the end, if I have time. On existing hardware it will probably be quite slow. On x86 hardware, it turns out store instructions end up being quite a bit slower; load instructions are essentially unaffected.

So basically, for these low-level atomics, we still have the rule that pairs of atomic operations can never form a data race. That's unchanged. But these atomic operations can specify explicit memory ordering. The way that works is that normally, when I load an atomic value, I would typically just write it this way: I would just mention the atomic in an expression. Under the covers, this is an implicit conversion from the atomic type to the underlying type, and it's really equivalent to writing it as an explicit load. But if I write it that way, I can give it an explicit argument that specifies the ordering. So if I write memory order seq cst, that's a very verbose way of saying exactly the default.

As for the ordering types we have here -- there's actually one more that I haven't mentioned -- the main ones are these. There's what's known as acquire-release ordering. I won't say very much about this, but if you specify memory order release on a store operation, that basically means that when you perform a load operation that sees the value stored by that store, and the load operation uses memory order acquire, then you're guaranteed that after the load operation you can see all memory operations that were performed before the release operation. So this is the ordering you want if you use a flag to communicate from one thread to the other; it's the minimum ordering you need there. Memory order seq cst enforces additional ordering, so that things like Dekker's example work. Memory order relaxed basically ensures almost no ordering. It does ensure what's called cache coherence, which is that operations on a single variable appear to occur in a single total order.

So if we want to increment a counter in a way where we don't actually need to read the value of the counter until the end of the program, until all the threads are finished, then there's no reason to enforce ordering when we increment the counter. We can implement the increment operation by X dot fetch add and then specify memory order relaxed as the ordering type. On some hardware, that will make a difference; on x86, it won't make any difference. But when you do that, you have to be really careful, because there are usage models for which this doesn't work. For example, if you're using this to increment a pointer into a buffer that another thread is reading, that's the wrong way to do it. It will not work.
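(To make the ordering arguments concrete -- a minimal sketch, not from the talk, of the two patterns just described: a release/acquire hand-off flag, and a relaxed counter that is only read after all threads have been joined. All names are invented.)

    #include <atomic>
    #include <thread>
    #include <cassert>

    int data = 0;                        // ordinary data, published via the flag
    std::atomic<bool> ready(false);      // hand-off flag
    std::atomic<long> hits(0);           // statistics counter, read only at the end

    void producer() {
        data = 42;                                        // plain store
        ready.store(true, std::memory_order_release);     // publish: everything before it becomes visible
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) {  // acquire pairs with the release store
        }
        assert(data == 42);                               // guaranteed visible after the acquire load
    }

    void counting_work() {
        for (int i = 0; i < 1000; ++i)
            hits.fetch_add(1, std::memory_order_relaxed); // no ordering needed: value read only after join
    }

    int main() {
        std::thread a(producer), b(consumer), c(counting_work), d(counting_work);
        a.join(); b.join(); c.join(); d.join();
        assert(hits.load(std::memory_order_relaxed) == 2000);
    }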
So let me try to clarify some of this by going through a bunch of data race examples, and hopefully give you some insight into how these things actually work. Data races are crucial here: in order to show that anything is correct in this model, we first have to demonstrate that there are no data races. And in proving there are no data races, we get to assume sequential consistency, because data races are defined with respect to sequential consistency. Once there are no data races, we can prove correctness of the program more traditionally, assuming both sequential consistency and, since there are no data races, that synchronization-free code regions are atomic -- indivisible -- which makes that proof easier. It turns out, as we'll see later, that there's actually a third proof obligation here, which we'll run into in one of the examples.

So as a simple example, take the simple flag-communication case here. I set some variable X to 42; then I tell another thread that I'm done initializing X. The other thread waits for the done flag, and then reads X. If I declare done just to be a Boolean flag, this is incorrect: there's a data race on done. As many people have found out the hard way, this is one of the few uses of data races that fails fairly repeatably in practice. What typically happens is that the compiler notices that done is loop-invariant, moves it out of the loop, and checks it exactly once. So that will fail. If I want to fix this, rather than declaring done to be a Boolean, I have to declare it to be an atomic of bool. And this is precisely the pattern for which acquire-release memory ordering was designed, so we could probably get away with using memory order release here and acquire on the other side.

So here's actually a confusing example which, as it turns out, typically causes a lot of controversy when people ask whether or not it has a data race. This is an example that [indiscernible] is actually fond of pointing out as a counterexample to all sorts of interesting things. We have a program here -- again, we assume, as in all of the examples, that X and Y are initially zero. I have one thread that checks X and, if it's non-zero, sets Y to 1, and the other thread does the converse. The question is: does this have a data race? Many people look at it and say, well, they touch the same variables, so yes. But actually, it does not have a data race. The way to convince yourself of that is that a data race is defined with respect to sequentially consistent executions, and there's no sequentially consistent execution of this in which either assignment is executed. So this is effectively a program without any assignments, and there's no way it can possibly have a data race.

This gets back to the original motivating example. If we have a structure with two character fields, and we assign one to the A field in one thread and one to the B field in the other thread, does that have a data race in C++11? It does not. Basically, the A field and the B field are separate memory locations. They're separate scalar objects; they have nothing to do with each other. So an assignment to one does not conflict with an assignment to the other, and this has to work correctly. Under Posix rules, this was actually intentionally implementation-defined. If you try to do this on an Alpha, you may well get the wrong answer. But yeah.

>>: Assuming that the compiler [indiscernible] padding in these kinds of data structures?
So even if alpha was still around, this wouldn't be a problem. The constraint on the hardware really is that you need byte store and selections in order to implement this. And everything other than pre-'95 alpha basically have byte store and selections these days. >>: The rules on A and B were bit fields, do those rules reference bytes? >> Hans Boehm: No, actually, those are the next example. So if I try to do what sort of logically the same thing, now the answer is different. This has a data race. The difference here is A and B are part of the same contiguous sequence of bit fields. Technically, there's a zero length bit field play a special role here, but they already did before this. But that aside, these are part of the same memory location, so therefore, this and that both assigned to the memory location, the same memory location in the terminology of the standard and there is a data race. So this is not allowed. The mixed case is interesting in that by the rules of the standard, this structure here is two memory locations. The A field is a memory location, and the sequence of bit fields containing only B is the other memory location. So this does not have a data race. So this should be okay. On the other hand, it turns out if you try that sort of with every 2011 and earlier compiler for X-86 or something, this will give you the wrong answer. So the standard way to implement the assignment to be there is to read the whole incised world, replace the bits and write the whole incised field back. And the standard basically made that illegal so compilers have to change to deal with that. It's not particularly expensive, but it's a change. The code sequence to implement the assignment to B there changed. So far, we've been talking mostly about scalar objects. I have one slide, a couple of slides here on library issues. So what happens if I now, rather than performing operations on scalar object, I perform an operation on sort of a library container. 17 So in the case of C++11 here, let's say I have a list of ints and I simultaneously execute a push front and a pop front operation on that list. So that also, it's not clear whether that -- it's not immediately clear whether that has a data race, because it depends on whether I access the same scalar object at the same time. But the crucial convention here is that -- the crucial convention for library writers is that libraries shall only introduce a data race at the scalar object level if there is sort of logically a data race at the library object level. So if I have two rights to the same list object at the same time, as in this case, the library is allowed to introduce a data race. That's not allowed, so this has a data race. If I perform two updates to different library objects at the same time, that's not allowed to introduce a data race. So if I have a user implemented allocator underneath this that's shared across all instances of lists, it has to make sure that it does enough locking so that operations on different lists can proceed in parallel without interfering with each other. But it doesn't have to do locking in order to make this sort of access actually safe. And that's the default convention for libraries in C++. It's kind of interesting because when you look at the literature here, the people distinguish between thread safe and thread unsafe. Really, the default convention, I claim the convention you usually want is actually somewhere in between. It's precisely this convention. 
that the notion of a data race at the library level reflects what it would be at the scalar object level.

>>: Does this mean that every library object has to be completely self-contained? You couldn't have a library object that exposes some internal part as a separate object, because then you're no longer going to be able to follow [indiscernible] data race.

>> Hans Boehm: This is a default rule, so you can certainly have libraries that are exceptions to it. There are libraries that are going to export stronger properties and are designed for concurrent access; they're more analogous to atomic scalars than --

>>: For stronger, I understand.

>> Hans Boehm: For weaker, you're going to have to document that kind of information if people are going to use it.

>>: So it's not a hard rule.

>> Hans Boehm: It's not a hard rule. It's one that's followed by the standard library, except where it specifies otherwise. Yeah.

>>: Isn't the implication of [indiscernible] sufficient? It's okay to have a race on an object like [indiscernible].

>> Hans Boehm: That's okay, right. This is probably stronger than it should be here, you're right.

>>: Okay. Perform concurrent reads on objects in the library?

>> Hans Boehm: Yeah. Two concurrent reads on an object have to work without introducing a data race. So if you're implementing a splay tree, beware. Or specify that it is not thread-safe, and the client has to beware.

So there's a somewhat weird case here. What happens if I have an infinite loop, and after it I assign 1 to X, while the other thread assigns 2 to X? Is that a data race? Actually, in Java, it's not. In C++ it also turns out the answer is that it's not a data race -- it's very hard to define this to be a data race. On the other hand, there are various reasons why you would really like compilers to be able to move code across that infinite loop. So for that reason, and for other weird historical reasons, it turns out that in C++11 the rule is that infinite loops like this, which have no I/O effects and no synchronization effects, themselves invoke undefined behavior. So this does not technically have a data race, but it has the same semantics as if it did. But that's because the infinite loop itself is the bug.

So here's one that I'm not sure is particularly relevant to this audience. The important point is that what I'm doing is setting a variable X, and then I'm using a condition variable: I notify the condition variable when I'm done setting X. The other thread checks whether X is equal to zero -- this is all done inside a critical section here -- and if it happens to execute before the other critical section has executed, it waits for that critical section to set X equal to 42. The question is: does this have a data race? Can I safely read X after that? Do I know that X has been initialized? The answer is yes, it does have a data race. The reason is that in C++, like almost every other language, condition variable waits can wake up spuriously. So the fact that you executed a condition variable wait tells you absolutely nothing about the state of the computation. So this --

>>: So if you put it in a loop --
So I mention this primarily because I regularly see research paper submissions about how to do flow analysis, how to analyze programs with condition variable weights, and the answer is you don't. There's actually similar, somewhat less expected situation where try lock. This is a really weird program which uses locks backwards. It uses mutexes backwards. So what I do is I set X to 42. And when I'm done, I lock the mutex. The other thread waits for the mutex to be locked and then the question is can it conclude that X is now 42. The answer is no, and there's sort of a difference here between the official explanation and the real reason for it. The official explanation and the standard is that try lock, it can spuriously fail. So just the fact that try lock failed to acquire the lock doesn't mean it was actually available. The real explanation is that in order to -- you don't really want to implement try lock that way. On the other hand, in order to make this work correctly, you would have to prevent reordering of those two, and it turns out that's expensive on a bunch of hardware and useless for real code. It's only useful for code that you really don't want people to write. Double check locking, this is for many of you a well-known example. This used to be an advocated idiom for programming with threads. The issue here is I 20 want to initialize X on demand before I access it, but I want to do this in a way that I don't have to acquire a lock every time I access it. So I could obviously do this correctly if I just used code here, if I left off the conditional at the beginning here, if I just acquired the lock, protecting X, and then checked has it been initialized. If not, initialize it and so on. But it was recommended that you check at the beginning before you acquire the mutex. And then if it's not initialized, you reacquire the mutex so that only one thread can initialize it. And the answer is, as many of us know, I think, at this point, that's still incorrect. The problem is that this assignment to initialize the races with the initialized access outside the critical section. And, in fact, there's no real guarantee that this will work. In particular, I can interchange these, the compiler can interchange these and break the code. This one is sort of, I don't know how -- yeah? >>: So you advocate [inaudible]. >> Hans Boehm: Yeah, I should have mentioned the way out of that is to make the init flag atomic pool. Good point, yeah. So yeah, I was told that it's okay to run over here. Hopefully, that's -yeah. So here's another example, which is really of interest mostly to C++ programmers. It's much more C++ specific than the rest. It's sort of a trick question, do these things race. So what I'm doing here is while in a critical section, protected by a mutex M, I push something on to the front of a list and then I have an infinite loop which goes around and acquires the mutex occasionally and checks whether the list is entry and does something with the entry on the list if it's not empty. Does this have a race, a data race. The answer is yes, but not the one you expected, maybe. These two don't actually race because the accesses to X all here are inside the critical section protected by M. The problem is with having an infinite loop that's providing this service here and looking at the list regularly, the problem is at some point, X, when the program shuts down, X is going to be destroyed. 
The destructor for X is going to run, and this guy is still going to be running because, after all, it's an infinite loop. So you end up introducing a data race between X's destructor and the infinite loop. So in this model, having threads that run forever, until the process shuts down, is really not acceptable. It turns out C++11 provides some notion of detached threads which, to a first approximation, you just shouldn't use.

Let me quickly say something about the implementation consequences. I think we've already gone through a lot of this. The main implementation consequence is that implementations may not visibly introduce memory references that weren't there in the source. One example of that was reading and rewriting an adjacent structure field when you're assigning to one field of a structure -- and there are lots of implementations that actually do this. I'll give you another example really quickly. Other than that, this model substantially restricts reordering of memory operations around synchronization operations; the compiler has to be careful, and the synchronization operations have to include memory fences and so on to ensure that. On the other hand, within a region of code that contains no synchronization operations, the compiler is free to reorder basically at will, because those regions look atomic thanks to the data-race-free property.

>>: Is that true for both weak and strong atomics?

>> Hans Boehm: That's true for both weak and strong atomics. But the weak atomics count as synchronization operations in terms of determining the synchronization-free regions.

So, hardware requirements: we already said that we need byte stores. We also -- and this is a longer talk by itself -- need the hardware to be able to enforce sequential consistency. If I end up writing all of my code using atomics, I have to be able to implement that so it looks sequentially consistent. And it turns out that inserting fences between every pair of memory operations often doesn't work. For example, on Itanium, that's not sufficient. But those architectures generally have other mechanisms for enforcing it.

>>: Is it easy to say why it's not sufficient?

>> Hans Boehm: The basic problem -- yeah, do you know about the independent reads of independent writes example? Let me talk to you about it afterwards. I think that's a fairly long discussion.

So, compiler requirements. We don't get to introduce memory references. We don't get to introduce data races that weren't already there, and part of that is that struct fields and bit fields need very careful treatment. It turns out there are also other cases, though, where compilers naturally want to introduce stores that weren't there originally. This was really the example that motivated my looking at this in the first place. We have a loop here which, every once in a while, checks: am I multithreaded? If I'm multithreaded, I acquire a lock -- I should have said something dot lock here; this is an old slide. And again at the end, if I'm multithreaded, I release the lock. In between, I use some global variable G. The problem is that with fairly traditional compiler optimizations, as you sometimes find in textbooks, you can optimize this in the following way, especially if you have profile feedback information that tells you that this program is usually single-threaded.
And so you know that typically these lock and unlock operations are not executed. So what the compiler can do is promote the global variable to a register. As far as the compiler is concerned, lock and unlock are function calls that it knows nothing about. So what it's going to do is this: I load the global G into a register; around these function calls that I know nothing about, I store the register back into the global and reload it after the function call; I do the same thing down here; and at the end, I take the register value and assign it back to the global. For sequential code, where lock and unlock are just function calls that I know nothing about, this is a perfectly good optimization, and the code runs faster if, in fact, MT is usually false.

If I look at this as a multithreaded program, and I understand that this is checking whether we're multithreaded and that these are lock and unlock operations, the output here is complete gobbledygook, right? G was accessed only inside the critical section; now it's accessed repeatedly outside the critical section. So basically: don't do that. On the other hand, compilers did do that. They actually do it fairly frequently in this next case, where it's easy to understand why they would. If I have a loop that counts the number of positive elements in a list, and say count is a global, it's tempting to promote count to a register: set a register equal to count, increment the register in the loop, and then store it back at the end. Again, this is potentially introducing a reference: if there are no positive elements in the list, I've just introduced a store to count where there wasn't one, and I've introduced a data race. So don't do that either -- and compiler writers like that restriction less.

>>: So in the previous example, the implication is that the compiler has to be conservative, right? If you have a function call -- you could implement lock from an underlying atomic.

>> Hans Boehm: Right.

>>: And a function call could be a locking operation or not a locking operation.

>> Hans Boehm: Right.

>>: So compilers now have to be conservative with respect to whether calls might do synchronization operations?

>> Hans Boehm: In a sense. I mean, in general, compilers are not justified in introducing stores to variables where there was no store in the source. That's really the rule. So you have to be really careful about this sort of speculative register promotion. You don't get to --

>>: But if you have a CSE involving memory -- never mind.

>> Hans Boehm: I mean, actually, it's not too painful. The latest version of GCC actually does this correctly. I don't know about Visual Studio.

>>: So one way to fix it is to do it at the end -- when you do the write -- only if the register has changed.
So if you're writing a C++11 to C++11 transformation system, you're never allowed to add a data races, because that changes defined semantics to undefined semantics. But if you're exiling C++11 to some machine code, machine code does not have undefined semantics for data races, in general. It might effectively, because in some hypothetical ideal world, we might want to have the hardware checked for data races, but current hardware doesn't, unfortunately. So in a case like this, when we're only reading the global, it's actually acceptable to read it speculatively outside of the loop where we might not otherwise have read it, because we can show that sort of based on the semantics of the underlying hardware, this actually has no impact. The user can't see this. But as a C++11 to C++11 transformation, this transformation is not legal. The news isn't all bad, actually. I don't want to go into details here. As a result of clarity in the memory model, there are actually certain kinds of analyses that we traditionally haven't done which we now know for sure actually are correct. So, for example, if we have this program here, which assigns 2 to X and then executes this loop, which performs a critical section in here, but doesn't assign to X inside the loop, only references X there, we actually now know that X is constant. We can prove that based on the data race free assumption, in 25 spite of the fact that there's a critical section in here, so long as there's only one critical section in here. This is joint work with a student at the University of Washington and two of my colleagues at HP labs. So I'll conclude here with some sort of explanation. I've tried to convince you that data races are bad and the standard tells you basically, don't use them. However -- yeah? >>: If it needs a leading researcher and two grad students to figure out that a [indiscernible] is free of data races, how are we as programmers to manage our business? >> Hans Boehm: Not that this is free of data races. The fact that X actually is constant here. So this is something that the compiler will figure out. The programmer doesn't need to figure this out. It's an optimization problem. The other question you're asking is still a good one, though, which is how do you know that your program is actually free of data races. And the answer is, though I haven't put a good example of this, I think we actually should rely on data race checkers a lot more than we do. Personally, I think the right place, we should be headed, but it's going to be really difficult to get there, is to get to the state where the hardware actually does data race checking. But we're going to have to accept some small performance loss in order to do that and we're going to need hardware support in order to do it. So what I wanted to convince you is that even if you don't believe the language specification, there are actually all sorts of things that actually do go wrong in practice or could potentially go wrong in practice if you program with data races. The other way to look at it is these are the kinds of transformations that you may see that actually motivated the cache fire semantics for data races. So here's a simple example of where things can go wrong in very unexpected ways as a result of putting a data races in your program. I'm checking is X less than 3. If X is less than 3, I perform a switch on the three possible values of X. And now let's see what happens here when the compiler translates this. 
Let's say the compiler translates this relatively naively, with one exception: it transforms the switch into a branch table, which is common. So it will use the value of X to index into a table of branch targets and then branch to the right table entry. The one clever thing it decides to do is that it knows X is less than 3 from the test up here, so it gets rid of the bounds check on the branch table index. So now what happens if there's a race, and X changes in the middle here to, let's say, 3? I access a branch table entry that's out of bounds, and I branch to nowhere. So that's one of the things that can happen.

There are a bunch of other failure modes for code containing data races that we've already seen. The loop-invariant access moved out of a loop is one failure that's actually somewhat reproducible. In the presence of data races, we also have to worry about byte operations and unaligned operations being used to access variables, so we get fractional updates. And for something like a done flag here, if I don't declare done as atomic at all, what's going to happen in practice is that the compiler or, possibly, the hardware may reorder these two operations, so that when I see done set in the other thread, I'm not guaranteed that data has actually been set to 42.

One interesting case that people always bring up, of data races that are supposedly definitely benign and that we should be able to use, is the case of redundant writes. If I have two threads, both of which set X to 17, that should be even better than having one of them set X to 17, right? So I should definitely know that X is 17 at the end. The answer is: in this brave new world, not necessarily.

>>: This could make me unhappy as somebody who writes parallel code, because I might want to do this for some reason.

>> Hans Boehm: But you can still do it using atomics, right?

>>: But do I know that I don't pay for extra fences or --

>> Hans Boehm: Well, I mean, if you're willing to live dangerously, but not this dangerously, you can always say memory order relaxed, which basically --

>>: Yes, but then I cannot know what happens at all. What if I want this allowed, but nothing else bad?

>> Hans Boehm: I mean, you don't really see a performance difference there, right? I think it's very hard to come up with architectural cases where that actually makes a performance difference.

>>: So I could imagine that I have an array and I'm computing [indiscernible], and nobody has computed it, and then I compute it in [indiscernible], and I can't have stuff happening in the memory model on a real machine.

>> Hans Boehm: The problem is that it's very difficult to express that guarantee in a language, and it also ends up inhibiting some transformations that, in the long run, we probably want, for a performance benefit that's really very nebulous, I think.

>>: If I did not [indiscernible] in this case, it seemed to illustrate [indiscernible] as an algorithm.

>> Hans Boehm: Except there's no performance -- I think you'll be hard pressed to come up with a case on mainstream hardware in which there's actually a performance difference between memory order relaxed and the racy store.

>>: Yeah, yeah. I guess what you're saying is: use memory order relaxed, and it will essentially just drop down to the hardware in terms of the guarantees that I get.

>> Hans Boehm: It's not quite the hardware. You get cache coherence, which is why [indiscernible] a little bit.
But that's actually a useful guarantee. People are very surprised when they don't get cache coherence, and on some hardware, you don't. >>: Okay. I guess I didn't understand what you actually -- >> Hans Boehm: Okay. >>: [indiscernible] would be sufficient. >>: So did you explain the problem here? >> Hans Boehm: No, I am about to. Sorry, and then I'm pretty much done here. Sorry. So the problem here, the reason why assigning 17 to X twice isn't necessarily better, is that we've already seen a bunch of cases with self-assignments introduced by the compiler that caused problems. The compiler really wanted to introduce a self-assignment, but it introduces a race. So, like in the structure case, if we look at the other field, what we're doing is we're basically assigning the other field to itself under the covers. Normally in this memory model, it's not okay to do that, because if we introduce a self-assignment, as we saw earlier, we can hide an assignment in the other thread. So if I just have the compiler introducing X equals X while the other thread is assigning X equals 17, that's not okay, because the assignment to X there can be hidden, right. On the other hand, compilers sometimes want to do that. And the problem is that -- it actually turns out that in this memory model, if I see X equals 17, it becomes legal to introduce this self-assignment, because I know that there are no data races, nothing is racing with this, so therefore if I put X equals X after it, that's okay. So some of those dubious transformations that I told you about before actually can be re-enabled in cases in which I have a visible assignment to X without intervening synchronization. So X equals 17, again, can be safely transformed to this. >>: That's if you assume no data races? >> Hans Boehm: If I assume no data races, right. And the compiler will assume no data races, right. So the problem I have now is, if I have X equals 17 here, it's actually legal to transform this to a self-assignment X equals X in thread 1, followed by X equals 17. And the same way, but in the opposite order, here. And now I can select an interleaving where, as a result of the redundant assignments of 17 to X, I actually see neither. Actually, let me skip this, because I think we already motivated why self-assignments can be added. Let me just conclude with this quick slide here. So the other question is, if you want to introduce data races, what do they actually buy you, and this is sort of more or less asking that question here. This needs more explanation here. What I'm doing is I'm running a parallel Sieve of Eratosthenes program, which is sort of propagating different primes, eliminating multiples of different primes in different threads. It's doing the elimination by storing to a byte array each time. The top line has the store to each byte array element protected by a critical section, by a mutex lock-unlock. The middle line uses a sequentially consistent atomic operation. The bottom line uses a plain store operation, as you would get from either a race or a memory order relaxed operation. It turns out those probably would generate exactly the same code here, since this is run on x86. And, in fact, memory order release would also generate the same bottom line here.
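(A rough sketch of the three store variants being compared, under the assumption that the benchmark marks composites in a shared byte array; the names, sizes, and structure are illustrative, not the actual benchmark code.)

    #include <atomic>
    #include <cstddef>
    #include <mutex>

    constexpr std::size_t N = 10000000;
    char sieve_plain[N];                 // plain bytes, used with the mutex
    std::atomic<char> sieve_atomic[N];   // atomic bytes, used without a lock
    std::mutex m;

    // Top line: every byte store protected by a mutex.
    void mark_locked(std::size_t p) {
        for (std::size_t i = 2 * p; i < N; i += p) {
            std::lock_guard<std::mutex> g(m);
            sieve_plain[i] = 1;
        }
    }

    // Middle line: sequentially consistent atomic stores.
    void mark_seq_cst(std::size_t p) {
        for (std::size_t i = 2 * p; i < N; i += p)
            sieve_atomic[i].store(1);    // memory_order_seq_cst by default
    }

    // Bottom line: relaxed atomic stores, which compile to plain stores on x86
    // (a plain non-atomic store here would be the racy variant).
    void mark_relaxed(std::size_t p) {
        for (std::size_t i = 2 * p; i < N; i += p)
            sieve_atomic[i].store(1, std::memory_order_relaxed);
    }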
So what was interesting to me here, at least, is that if you look at the single-threaded performance, and I sort of cut off the top here so you can't really see it, but, in fact, there's a huge overhead associated with either a mutex or a sequentially consistent atomic, because this is really store-heavy code, and it turns out that on x86 a sequentially consistent store comes with a [indiscernible] fence. So the fence sort of dominates the running time at the single-threaded end. As you scale up to higher thread counts, the difference on this machine, at least, essentially disappears, which initially people might find surprising. But in retrospect it actually makes sense, because I think what's happening here is that as I scale this up to sufficiently many threads, it's completely memory-bandwidth limited. And it turns out that all of the synchronization overhead I'm incurring to protect the store with locks, or the additional memory fences I've introduced, primarily affects much more local behavior, which doesn't interfere with the other cores. So in some sense, the point here is that by going through all this effort, what I'm actually buying is primarily performance at low core counts, rather than at high core counts, at least based on this example. >>: Do you know why the blue dots are all over the place? >> Hans Boehm: This is a highly non-deterministic example, because there are no data races in the blue one, but there are definitely races as to who gets which prime. So that's my explanation. I didn't get very detailed profiles. But it's not too surprising. So the summary, basically, because I'm way over time here, is: don't use data races. Data races are evil. Any questions? And this is my usual list of references. The best description of the C++11 memory model is probably by the Cambridge group here, which is also the most mathematically intense. So depending on which variant you want, this is a lot more precise than what the standard actually says. >>: So in safe managed languages, like Java or dot-net, safe subsets thereof, anyway, we try hard to ensure that data races do not cause violations of type safety. And I guess there's two motivations. One is limiting what your program can do if it's erroneous. And the other is debuggability in the face of data races. How valuable do you think those are, and how problematic is giving them up in the catch-fire semantics, do you know? >> Hans Boehm: That's an interesting question. As you probably know, we have -- I didn't talk about Java here very much at all. So the state of the Java memory model is sort of clear in the data-race-free case. Everything works the same way as in C++. There was an attempt to define what data races mean, and I think that attempt was generally unsuccessful. So as you said -- but I'm not sure the motivation is just to preserve type safety. I mean, I think in the Java setting, the way people generally perceive it is that the security model allows you to run untrusted code inside your address space. Once you say you're going to run untrusted code inside your address space, I think you have to preserve type safety. I think we could probably ensure that you preserve type safety in the presence of data races just by mandating that explicitly, without saying anything terribly complicated. That itself is not sufficient for Java, because it turns out I also need the sort of absence-of-out-of-thin-air-results guarantee.
I need to be able to guarantee that somebody can't manufacture a perfectly type-safe pointer to a password string that they weren't supposed to have access to. And ensuring that turns out to be incredibly difficult, but I think it's also quite important unless we really give up completely on the Java approach to security. >>: So one of the ways you could measure whether the model is useful to programmers is to see if the programs written in this model are correct. And there's a clear path for the programs that stick to the strong stuff. Not so clear for the programs that really try to, you know, use more relaxed -- code with more relaxed accesses. Do you know of any successful attempts to prove correctness of such low-level code? >> Hans Boehm: I don't know of any, really. And I completely agree that's a concern. I mean, so long as you stick purely to the sequentially consistent subset and data-race-free programs, it's fine. And I think in many ways this simplifies matters a lot there, right, because of the interleaving granularity issue. >>: [indiscernible] for sequential consistency and ignore the rest of the problem. This would allow you to actually do that. >> Hans Boehm: Right. In some sense, it sort of gives you a solid footing for what everybody has been doing anyway, which is limiting the number of interleavings you need to consider to only interleavings of synchronization-free regions. And you can do that also for proving the absence of data races, I believe. So yeah, for that one, we probably have a good story, but I completely agree that for the general model, we don't really have a very good story. >> Jim Larus: Any other questions? Let's thank Hans.