>> Ben Zorn: So it's a great pleasure to introduce Emery Berger, an associate professor at the University of Massachusetts in Amherst. Emery got his PhD in 2002 from the University of Texas and I've been working with him since then on a number of different projects. I have to say Emery always surprises me by the creativity of his work and also by the fact that he takes ideas that you think are just out -- out of this world and he makes them work in practice. So I think this is another example of that work. And today Emery's going to talk about parallel programming in Grace. >> Emery Berger: Thanks, Ben. Hi, everybody. All right. So you see the title of the talk, Grace: Safe Multithreaded Programming for C and C++. But what I want to talk about first is garbage collection. This may seem a little surprising. But, you know, garbage collection is a topic that obviously -- well, I'm not sure if it's near and dear to my heart, but at least it was at one time. But it's a well established technology, it's something that now works, right? So here's the world's best garbage collector, Wall-E. He never stops. There's always garbage to be found. And you know, the idea of garbage collection, at least the idea for Wall-E, is you've got a robot and the robot is going to find your garbage and take care of it. And in a garbage collected language, right, the runtime takes care of your memory errors. And this is great. You can deposit your garbage wherever you feel like it. The runtime will come along after you and clean it up. And so this is actually -- so that's Hans Boehm, by the way, if you don't recognize him. I mean Hans is -- this is especially suitable here. So Hans built this conservative garbage collector for C that you all may have heard of, and really it's exactly what I'm getting at, which is it's garbage collection that is intended to deal with programmer error. So you're in the world of C. You make some mistake in your memory management, you throw Hans at the problem and the problem goes away. Hans-E. Right. So you know the result of all of this work on garbage collection is it's a success story, right? There are now plenty of languages where garbage collection is practically taken for granted. And you know, for many, many years it was considered very controversial. Who would ever adopt it? And now, you know, you can barely think of -- I think there are no dynamic scripting languages, for example, that don't have garbage collection: JavaScript of course, Python, Perl, Ruby, not to mention Java and C#. It only took 35 years. So I guess, you know, if you're patient like Wall-E, then all good things come to those who wait. Okay. So wouldn't it be nice if some day we could have garbage collection for concurrency. So this is sort of the vision of this talk. I would like to be in a world where I can take my program that has all kinds of bugs, right, I just, you know, write a concurrent program with just as many bugs as I want. I leave off locks, it's full of races, and then Wall-E comes and will clean up after me. I think that would be great. So I just want you to, you know, think of the rest of this talk through this sort of view of garbage collection for concurrency. All right? Okay. One question immediately comes to mind though, which is wait a second, what does that even mean? What would it mean to garbage collect concurrency errors? We know what it means to garbage collect memory errors. With concurrency it's not so clear. Okay. A lot of this is not going to be new, but we'll go through it anyway. 
You know, in the beginning there was the core. The core was good. And every year the core got faster and faster; life was wonderful. And so if you wanted to write a sequential program, right, you'd just write your sequential program. Every year it got faster and faster. You just write a very simple program. You're not meant to understand this, but it's just a very, very simple loop. And it was great until one day things got a little too hot and now we're out of luck, right, they leveled off. And so then they say wait, wait, you know, we know how to build the one core. What if we give you four. And we said we don't want four. And they said sorry, you're getting four. And so now we have four cores. And the promise of course is that, well, if you give me more cores then I can run faster because I can do more things at once. Right? The last little stripe has to lag along, [inaudible]. All right? So that's great. And if we could do this, if we could parallelize all our programs, right, it would be as if Moore's Law never stopped. Now, I'm being a little fast and loose with what Moore's Law means, but you all get the idea. Okay. The problem of course, as anybody in this room who has programmed concurrent programs knows -- how many of you have written concurrent programs? Okay. How many of you have written concurrent programs that have bugs? Okay. All right. Don is like, what, bug, me? So here is the kind of problem you get, right? You run the program once, it works fine. You run it again, something else happens. Sometimes it works, sometimes it doesn't, right, so these are races of course. And if you look at this code, you can see that there is indeed a race if you imagine this being multithreaded down in this loop down here, if this is a parallel for. The problem is that these are races, right? The updates to color and row and the call to draw box really need locks around them. So you then throw locks around the updates to color and row. You would think everything would work. And it will prevent races, but unfortunately sometimes you get this or this, which is rather surprising. And so these are atomicity violations. So here the problem is you put locks around things but you made them too fine grained. You really wanted to draw this whole rectangle in a color and then do the next one and the next one, and not change the color while you're drawing the row. So you need to move the lock up. And you know, for those of you who like transactional memory, you know, pretend that lock and unlock mean, you know, atomic and curly braces, all right? Same thing. And I should observe in passing, you know, that transactional memory does not solve any of these problems, right? It's really a question of where you put these braces. So if you make a mistake and you put them where they were here, you would have exactly the same problem; you need to put the braces in the right place, and transactional memory does not deal with this problem, right? It still relies on the programmer to get it right. Okay. So here's another kind of problem. The one thing transactional memory does deal with is deadlock; unfortunately sometimes it converts it into livelock, which is far, far worse, but we won't get into that. 
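Before moving on, here is a minimal sketch of the race and atomicity-violation pattern described above. The names color, row, and draw_box are illustrative stand-ins, not code from the talk: with no locks the shared updates race outright, and with locks this fine-grained a whole row can still come out in mixed colors.

```c
/* Hypothetical sketch of the race / atomicity-violation pattern described
 * above; color, row, and draw_box are illustrative names. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int color, row;                      /* shared state */

static void draw_box(int c, int r, int col) {
    printf("box: color=%d row=%d col=%d\n", c, r, col);
}

static void *draw_row(void *arg) {
    int my_row = (int)(long)arg;

    /* With no locks at all, these updates to shared color/row race. */
    pthread_mutex_lock(&lock);
    color = my_row % 2;                     /* pick this row's color */
    row   = my_row;
    pthread_mutex_unlock(&lock);

    for (int col = 0; col < 4; col++) {
        /* Locks this fine-grained still allow another thread to change
         * color/row between boxes: an atomicity violation. */
        pthread_mutex_lock(&lock);
        draw_box(color, row, col);
        pthread_mutex_unlock(&lock);
    }

    /* Fix: hold one lock (or one atomic block) from the color update
     * through the entire row of draw_box calls. */
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, draw_row, (void *)i);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```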
Okay. So here's another weird error, if you see this. Here's flag one, flag two. So this is what's called an order violation. It's a recently identified concurrency error, which I think is great. You know? Good, new errors. Fabulous. So the new error is: programmers wrote this code and assumed that when you spawned a thread, that thread would be delayed, that it wouldn't just start immediately. And so they then did some other stuff right after they spawned the thread, and almost all the time that other stuff preceded the actual execution of the new thread. But then they changed versions of the OS, the scheduler changed, and all of a sudden the thread started executing first and weird things happened. So, fun concurrency errors. So to recap, back in the day when we had the one core, it was great, everything was sequential, no bugs, right? Remember. All right. It may be easier, okay? And you know, it was a world of unicorns and rainbows, comparatively. So now we have races, atomicity violations, deadlock, order violations. So this is concurrency hell. So it's a bad situation. So what we'd like to be able to do is to come up with a world that takes us back to those unicorns and rainbows to some extent, and so the object of Grace, which is what I'm going to talk about today, and which you can view as a sort of garbage collector for concurrency, is to get rid of all of these errors, okay, and impose sequential semantics. So you'll take your multithreaded program, and it will run as if you had written a sequential program. Okay? So I'll explain what that all means in some detail. The idea behind Grace, however: it's not a new compiler, it's a runtime solution. Roughly speaking, to use Grace you swap out the pthreads library for the Grace library. And you don't need to make any other changes to your code. Okay? This guarantees sequential semantics, gets rid of these errors, and I'll show you how you can actually get some performance out of it. >>: Yes? >>: What do you mean by sequential semantics? [inaudible] contains spawn thread, I mean, there is no -- >> Emery Berger: Yeah. I'll -- that's exactly the next thing, right? So in the rest of the talk, the very first thing I'm going to talk about is going to address that question, which is what the hell do I mean? Right? I've got multithreaded programs that are going to run on multiple cores and now I'm saying they're sequential. What do I mean by that? I'll talk a little bit about what goes on inside, right, what's under the hood, and then present some results. Okay? All right. So to the first question. So for the purposes of this talk, I'm going to use the syntax of the Cilk programming language. For those of you who are unfamiliar, it's a C-like programming language that is extended with spawn, which is the moral equivalent of pthread_create, and sync, which is the moral equivalent of pthread_join for all of the preceding children. Syntactically it's just cleaner and makes the examples easier. But Grace actually implements the pthreads API. Okay? So here is a little program that says spawn a thread to run the function f of x and store the result in t1. Same thing with g of y and t2. And sync means wait for both of those threads to complete. Okay? Very simple multithreaded program. So the execution looks something like this. You say spawn f of x. f of x starts executing. It runs concurrently with the main program, which now spawns g. Now everything is running concurrently until they hit sync, and sync is like a barrier, okay? All right. So that's your concurrent program. 
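Since Grace implements the pthreads API rather than Cilk syntax, here is a hedged sketch of what that little spawn/sync program looks like written against pthreads. The functions f and g and the run_call wrapper are placeholder names, not code from the talk.

```c
/* The spawn/sync example above, written against the pthreads API that
 * Grace implements; f, g, and run_call are placeholder names. */
#include <pthread.h>
#include <stdio.h>

static int f(int x) { return x + 1; }           /* stand-ins for real work */
static int g(int y) { return y * 2; }

struct call { int (*fn)(int); int arg; int result; };

static void *run_call(void *p) {
    struct call *c = p;
    c->result = c->fn(c->arg);                  /* thread body */
    return NULL;
}

int main(void) {
    struct call c1 = { f, 1, 0 }, c2 = { g, 2, 0 };
    pthread_t t1, t2;

    pthread_create(&t1, NULL, run_call, &c1);   /* t1 = spawn f(x) */
    pthread_create(&t2, NULL, run_call, &c2);   /* t2 = spawn g(y) */

    pthread_join(t1, NULL);                     /* sync: wait for  */
    pthread_join(t2, NULL);                     /* both children   */

    printf("%d %d\n", c1.result, c2.result);
    return 0;
}
```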
So what I want to do is take this concurrent program and impose an order on the execution. What I'm going to impose in particular is a left-to-right, depth-first ordering. But what that corresponds to is actually much easier to explain syntactically. All we do is we get rid of the keywords spawn and sync. So this is sort of an elision of all of the concurrency keywords in the code, and you're left with a sequential program. So where you had a function call before that was supposed to be spawned asynchronously in another thread, now it's going to get executed synchronously. So f of x will be executed to completion, then g of y. So these are the semantics that Grace is going to provide. Okay? It's not really going to execute them this way, because that would defeat the whole purpose of having multiple threads and multiple cores. Okay? Yes. >>: So you chose that there's certain parallel programs to be [inaudible]. >> Emery Berger: That's exactly right. So first let me say what this does, what the implications are, and then I'll talk about the limitations. And you're absolutely right. So first it should be clear that most of these problems go away, right: as soon as you move into a sequential world, these sorts of concurrency errors by definition stop really mattering. So a race doesn't matter if threads aren't running concurrently. Same thing with atomicity violations. We don't need locks anymore because there aren't multiple threads. So that gets rid of deadlock. And there are no order violations. The order is fixed; it's program order. Okay? However, there are some limitations. If you have threads that are intended to execute forever, so some sort of a reactive thread -- if f of x and g of y, when you spawn them, were both supposed to run forever, and now we compose them, then f runs forever and g never runs, and you have a bit of a problem. So you can't have infinite thread execution. The other thing, which is perhaps a more crucial limitation, is they can't communicate either, okay? They can communicate in the sense that the effects of one thread become visible to the next thread or to a child thread, but they can't use condition variables. So in fact in Grace condition variables are not exposed in the library; if you try to link with the Grace library and you have condition variable calls, it will complain at compile time. All right? So this sort of communication is ruled out. Okay? >>: [inaudible]. >>: Couldn't you implement your own -- >>: Implement my own channels [inaudible] shared memory [inaudible]. >> Emery Berger: I mean, there is -- yeah, yeah. You can't implement -- I mean, any sort of thing that does that, you know, there's no way to actually implement it in Grace. Right? You're prevented from doing so. And the primitives aren't exposed. Okay? >>: [inaudible]. >> Emery Berger: Oh, well, I mean, okay. I see. So you want to violate the boundaries, right. Yeah. Okay. Sure. Yes. That's right. There are certain evil things you could do to try to get past this, and then, you know, if you roll your own or you include something that's not called pthread_cond or whatever it's called -- you know, manuel_cond -- then Grace doesn't know about it and it will not behave as you would like. Okay? What it's intended for is fork-join parallel code. 
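To make the sequential semantics from the start of this answer concrete, here is the elision of the earlier f/g example: delete spawn and sync (in pthreads terms, call the function instead of pthread_create and drop the join) and what remains is the sequential program whose behavior Grace promises. The function names are the same hypothetical stand-ins as in the previous sketch.

```c
/* Sequential elision of the spawn/sync example: f runs to completion,
 * then g -- these are the semantics Grace guarantees. */
static int f(int x) { return x + 1; }   /* same stand-ins as before */
static int g(int y) { return y * 2; }

int elided(int x, int y) {
    int t1 = f(x);                      /* was: t1 = spawn f(x); */
    int t2 = g(y);                      /* was: t2 = spawn g(y); */
                                        /* was: sync;            */
    return t1 + t2;
}
```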
So the whole object of writing these parallel programs, ostensibly, is to take advantage of all of these cores and take computations that were sequential and make them run faster, and fork-join is a popular paradigm that makes it fairly easy -- it's a nice, easy abstraction to think about how you break these things up, break something into pieces, maybe even have nested fork-join. That all works. And so there are a bunch of different libraries that exploit exactly this -- that provide this sort of API -- so map-reduce is a fork-join framework, Cilk is this programming language I described earlier, Intel's Threading Building Blocks library. I should add Java also has a new fork-join library that's coming out. Again, just to enable programmers to use this paradigm. Okay. So let's see. So what's going on? How is it possible for me to take code that is supposed to run sequentially and, as I'll tell you later, get any performance out of it? So: magic. Not much of an answer. There are two pieces here, two crucial pieces. One is some magic, which you could really think of as hackery, and the other is ordering. So we need ordering, and we need this hackery to make this work. So first I'll talk about the hackery. So this is going to sound terrifyingly evil. Or awesomely clever. You decide, okay? So here's what we do first. So we take these threads -- you know, you've got this program and it spawns threads -- and we're going to turn this program that was written using threads into a program that forks. That is, Unix-level process creation. All right? So you may be thinking to yourself, dear God, why would you do this? But it will become clear in a minute. And I should add right now this is implemented in Linux. Fork is actually quite efficient. In fact, thread creation uses the exact same code base as fork. So the difference in terms of performance, as you'll see, is not really a material problem. What we do this for, actually, is to provide isolation. So the great thing about processes is processes have separate address spaces. So what I can do is I can run one thread and run another thread and all of their local effects are isolated from each other. So we're going to use this to enable speculation. And that's how we're going to get performance. So for example, I can start running f of x and I can then run g of y completely in parallel, and if one of them does something that would violate this sequential ordering, I can always throw away its results and execute it again. So in the worst case, I'll end up with the sequential execution; I will hopefully not be in that case, and I'll get some parallelism out. All right? And it turns out nested threads are okay. I'm not going to explain the details here. But you may be asking yourself, wait a minute, these are in separate address spaces. You should be asking yourself this. They're in separate address spaces. You know, f does something, g does something, but they're totally separate processes, right? They can't see the results of what they did. It doesn't make any sense. So what we do is we actually use low-level memory mapping primitives to basically expose the shared space, and then each process has a private copy of the same mmapped region. So this is something you can do with mmap. When you go and you write to one of these pages, it's copy-on-write. So everybody will end up with their local versions of the heap, and they can always see, say, the heap and the globals that had been committed before. 
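A minimal, hedged sketch (not Grace's actual code) of the mechanism being described: the same mapping backs a MAP_SHARED "committed" view and a MAP_PRIVATE copy-on-write "working" view, and fork gives each "thread" a process whose writes to the working view stay private until they are explicitly copied back.

```c
/* Sketch of fork plus mmap giving thread-like shared memory with
 * process-level isolation; file name and layout are illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define REGION_SIZE 4096

int main(void) {
    int fd = open("/tmp/grace_demo", O_RDWR | O_CREAT | O_TRUNC, 0600);
    ftruncate(fd, REGION_SIZE);

    /* Committed state: visible to everyone who maps it MAP_SHARED.     */
    char *committed = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
    /* Working copy: copy-on-write, so writes stay private until they
     * are explicitly copied back ("committed") to the shared view.     */
    char *working   = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE, fd, 0);

    strcpy(committed, "initial");

    if (fork() == 0) {                        /* "thread" = child process */
        strcpy(working, "child's speculative write");
        printf("child sees:  %s\n", working); /* private to the child     */
        fflush(stdout);
        _exit(0);
    }
    wait(NULL);
    printf("parent sees: %s\n", committed);   /* still prints "initial"   */

    munmap(committed, REGION_SIZE);
    munmap(working, REGION_SIZE);
    close(fd);
    return 0;
}
```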
So it's a way of giving you thread-like convenience in terms of having shared memory but still getting isolation. Now, when these threads are running, like I said, they're running in isolation; they get their own private versions. Eventually this memory has to be shared, right? All the threads need to be able to see the results of preceding threads. So what we do is we impose ordering on how the results are committed. So there's no ordering of the execution. The execution runs completely in parallel. But f of x precedes g of y in program order, and this is the sequentiality we want to observe. So f of x's results need to happen first logically, then g of y's. So what we do is we say, well, f of x is first, and it's going to get committed before g, even if g finishes first. f of x then checks all of the preceding -- any pages that it's read -- to make sure they haven't been made stale. If they haven't, it can commit. And then thread two can do the same thing. When g of y goes to commit its writes, if it discovers that some of the pages that it's read have now been invalidated -- which it does with some version numbering that's maintained down here as well -- it has to roll back. Okay? So in the worst case you get back to serial. But the goal here again is to make sure that correctness comes first. Right? We want the program to behave correctly. Think of it as an early, unoptimized, pre-generational garbage collector, all right? We want it to be correct. We've got 35 years to spend making it run fast, okay? Well, hopefully it won't take that long. Okay? Well, if we can just get the TM people to stop working on that, then maybe things will go faster. So, all right. I'm sorry. Okay. So there are some other issues to make this run faster, and I'm not going to go into much detail about them. If you have a heap that is essentially just one big heap and it's a centralized data structure, this completely will not work, because one allocation by one thread and one allocation by another will always conflict. So the heap data structure itself needs to be made decentralized. In addition, you don't want memory allocations that are adjacent -- like one thread calls malloc, the other thread calls malloc -- because everything here is at a page granularity. We're using mmap, right? And we're going to be tracking these things at a page level. If they're all intermingled, then you get a lot of false sharing. So what we do is we make sure that every thread has its own set of pages that it allocates from. So if you have two threads and they're allocating stuff, they're not going to conflict, because they're all on distinct pages. We do something very, very simple with the GNU linker that allows us to relocate whole sections of where things end up in the image. And so with the linker we basically make the globals page aligned, so there's a page before it and a page after it. This prevents some false conflicts that arise from code being right next to globals. Now, you can be in a situation where you have false sharing because two globals are next to each other in memory, and there's an update to one and an update to the other by another thread. We have ways of dealing with that. But I'm not going to go into it. Crucially, we have to deal with I/O. So I mentioned that the idea here is basically optimistic concurrency, but optimistic concurrency means you're going to roll things back. And as we know, it's hard to roll back certain operations. 
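Backing up a step, here is a hedged sketch, not Grace's implementation, of the ordered-commit check described above: shared metadata keeps a version number per page, a speculating thread records the version it saw for every page it read, and when its turn in program order comes it either revalidates and publishes its dirty pages or reports that it must roll back. All names and sizes here are illustrative.

```c
/* Sketch of the commit check; struct names and fields are hypothetical. */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NPAGES    1024
#define PAGE_SIZE 4096

/* Lives in a MAP_SHARED region so all processes see it. */
struct shared_meta {
    unsigned long version[NPAGES];   /* bumped on every committed write */
};

struct speculation {
    bool          read[NPAGES];      /* pages this "thread" has read    */
    bool          dirty[NPAGES];     /* pages this "thread" has written */
    unsigned long seen[NPAGES];      /* version observed at first read  */
};

/* Called only when all program-order predecessors have committed. */
bool try_commit(struct shared_meta *meta, struct speculation *spec,
                char *committed, const char *working)
{
    /* 1. Validate: has anything we read been committed by someone else? */
    for (size_t p = 0; p < NPAGES; p++)
        if (spec->read[p] && meta->version[p] != spec->seen[p])
            return false;            /* stale read: caller must roll back */

    /* 2. Publish: copy dirty pages into the shared view, bump versions.  */
    for (size_t p = 0; p < NPAGES; p++)
        if (spec->dirty[p]) {
            memcpy(committed + p * PAGE_SIZE,
                   working   + p * PAGE_SIZE, PAGE_SIZE);
            meta->version[p]++;
        }
    return true;
}
```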
So Maurice Herlihy's favorite example is launching the missile, right? You launch a missile. Maurice clearly is a child of the Cold War. So you have this missile that you launch and you say oops, oops, x was 2. That means Russia's not attacking, or whoever our enemy of the year is. So you -- and that's too late. Darn. Right? So what do we do? First, there's a very straightforward, easy thing to do for certain operations, which is to buffer. Right? But there are some things that are just out-and-out irrevocable. So for example, if you have a network connection, right, so I want to initiate a connection to some remote server and get a response. I can't buffer that, right? I need that response. Right? So I really, really do need to execute it. So what do we do? So the trick here: all we have to do is take advantage of the ordering. So we know, for example, that if we have no predecessors, we are going to commit our state. So whenever we get to an irrevocable point we say, hey, are there any predecessor threads that are supposed to commit before me? No. Okay. Is all my state completely clean? Yes. I can go. Because I know I'm not going to roll back. I'm guaranteed to commit. All right. So that allows us to actually handle all of this sort of arbitrary I/O difficulty, even though we're doing speculation. Okay. >>: [inaudible]. >> Emery Berger: Yeah? >>: [inaudible] with respect to exceptions, like one thread throws an exception and another thread tries to commit after that, so at that time you guarantee that -- and I see the exception I go to sequencing [inaudible]. >> Emery Berger: So if an exception happens right now, the exception -- and it's not caught or -- I'm trying to understand exactly the scenario that you're concerned with. >>: So I mean if I'm debugging this program, that is good, right, and I introduce some of the -- no one can [inaudible] I'm not saying memory or something [inaudible]. >> Emery Berger: Okay. >>: And I see, you know, a segmentation fault. Would the same -- I mean, if you run this sequentially you would see it at a certain point. With Grace do you end up seeing the same state that one does execute? >> Emery Berger: Okay. Okay. So let's be careful to distinguish an exception being thrown from, you know, a segmentation violation, right? Because one's a programming language construct and the other is the result of something happening below the PL, right? So like dereferencing null or something, right? So we could talk about the former, but in the latter case what will happen is the thread will detect it. Currently the way Grace works is it actually uses segmentation violations as read and write barriers. So the very first time you read a page and the very first time you write a page, you take segmentation violations, so you know which pages you've read and written, and after that the pages are unprotected. If you take a subsequent segmentation violation, then Grace will say, wait a minute, you actually had a program error. And it will abort that thread. Which is actually a process. So weirdly it won't take down the program, it will just take down the thread. So it's arguably better than sequential in this case, in that the program will actually survive. So if you have a thread, imagine you spawn a thread and the thread immediately dereferences null. It will be as if that thread had no effect. Right? Because none of its changes will ever get committed. 
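A sketch of the page-protection trick just described, written under the assumption that this is roughly how such a barrier works; it is illustrative code, not Grace's. Pages start out inaccessible, the first access to a page traps into a SIGSEGV handler that records the page in a read or write set and relaxes the protection, and a fault outside the tracked region stands in for a real program error.

```c
/* Illustrative read/write barriers built from mprotect and SIGSEGV. */
#include <signal.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define NPAGES    16
#define PAGE_SIZE 4096

static char *region;                  /* the tracked region           */
static bool  read_set[NPAGES];        /* pages this thread has read   */
static bool  write_set[NPAGES];       /* pages this thread has written */

static void segv_handler(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    uintptr_t addr = (uintptr_t)si->si_addr;
    uintptr_t base = (uintptr_t)region;
    if (addr < base || addr >= base + (uintptr_t)NPAGES * PAGE_SIZE)
        abort();                      /* real fault, e.g. a null dereference */

    size_t page  = (addr - base) / PAGE_SIZE;
    void  *start = region + page * PAGE_SIZE;

    if (!read_set[page]) {            /* first touch: record a read, allow   */
        read_set[page] = true;        /* reads so a later write traps again  */
        mprotect(start, PAGE_SIZE, PROT_READ);
    } else {                          /* second trap: must be the first write */
        write_set[page] = true;
        mprotect(start, PAGE_SIZE, PROT_READ | PROT_WRITE);
    }
}

int main(void) {
    region = mmap(NULL, NPAGES * PAGE_SIZE, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = segv_handler;   /* calling mprotect in the handler */
    sa.sa_flags     = SA_SIGINFO;     /* works in practice on Linux      */
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    char c = region[0];               /* traps once: recorded as a read  */
    region[PAGE_SIZE] = c;            /* traps twice: read, then write   */
    return 0;
}
```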
>>: [inaudible]. >> Emery Berger: Yeah. It's an interesting question. Yeah, yeah, yeah. So we thought about this. We said, do we want to enforce the semantics where when bad things happen we just abort everything, or not? And it's a flag. It's trivial. We basically either say, you know, basically we have died, and you send the signal to your parent and you basically return, in Unix terms, negative one instead of zero. It discovers that something bad has happened and everybody fails. Or you just fail silently. And this is really a policy decision. >>: [inaudible] get the same segmentation violation in a sequential execution? >> Emery Berger: Yes. Absolutely. >>: So I don't understand the issue. >> Emery Berger: I think that Cartic [phonetic] is concerned about repeatability, or preserving the semantics of execution. And I think it's an interesting question only -- well, not only but especially to me -- in this issue of: if I've got code and it has some sort of an error, what do I want to do about it? And in the sequential case the normal thing is you crash. Right? And if you have a multithreaded case, the same thing happens. Something bad happens and -- but it happens at an undetermined time. It would be easy to tell -- like I said, the current runtime system actually behaves more nicely, but it's literally a matter of saying return zero or return negative one. Yeah? >>: [inaudible] your comment before about the negative effect of [inaudible] into a live -- I mean, silently skipping it seems worse than -- okay. >> Emery Berger: So it's not turning deadlock into livelock, right? Okay. >>: No. >> Emery Berger: Okay. >>: In the sense that you're taking -- at least with deadlock you know your machine's frozen. >> Emery Berger: Yes, absolutely. Absolutely. >>: So now you're taking something where you would have known it crashed [inaudible] it crashes. >> Emery Berger: Right. So you know -- >>: I'm surprised that you're so willing to [inaudible] policy decision not want -- since you seem to not want [inaudible]. >>: I'm sure it has to do with performance. >> Emery Berger: No, it does not have to do with performance. No, I mean actually, you know, I thought that this was -- I think that this is a philosophical argument. I guess I could put on my Martin Rinard hat here and say, you know, if you have code and the code is going to abort, what do you want to do? Do you want the code to abort or do you want it to try to continue to execute? And it depends. Right? This is one of those classic situations where in some cases you really want fail-stop behavior, and in other cases you don't. >>: [inaudible] surprised to see you [inaudible]. >> Emery Berger: I'm so cavalier. Yeah. You know, I mean -- we can be as anal as we want, and we can crash all the time, and that's not a problem. Okay? It's not really something that I focused on, but Cartic asked this question, so. Okay. So, some results. The results here are mostly to show you that this is promising. There are some surprising results. They were certainly surprising to me. So let me explain this graph. This graph was meant to see what the overhead was of using processes instead of using threads. So it's a very simple benchmark where the code just, you know, spawns a bunch of threads, the threads do some work, it's running on an eight-core machine and it spawns eight threads. Okay? So the top line is pthreads; everything is normalized to pthreads. So here, this means that Grace is performing better than threads. It's running in less time. Okay? 
And as the granularity increases -- on a log scale, right, yeah, yeah, bear with me -- so as the granularity increases, Grace gradually sort of asymptotes out, which makes sense, right? You've got a process and a thread. You would think, as I thought, that, well, there's the initial penalty of spawning a process because, you know, you shoot down the TLB, you have to copy file handles, you have stuff that you have to do. And then as the thread gets longer and longer, it would be amortized. Only that's not what happens. Okay? What happens is that for short-lived threads, you actually get better performance. And it came basically -- this doesn't happen if you do, you know, one thread at a time. But it happens when you do eight threads. All right? So my student came and gave me this graph, which as you can imagine -- I expected this bar to be up there and not down here. And I said -- all right, I've done this to students several times now and I really need to start learning -- I said, you did something wrong, go find out what you did wrong and fix it. Right? Like you got the graph axes backwards or something. And he ran it again. He got the exact same results. And I said, all right. Let's try this. I'll run it myself. So I ran the code and got the exact same results. And I said, huh. Okay. Something is very, very wrong, right? It's like I'm on candid camera. Something weird is happening. So it turns out we did everything. We said, all right, let's look at the performance counters, let's see what the hell's going on. Eventually my student, who's very, very good in the Linux kernel, dove into the kernel to see what the hell was happening. And so it turns out that it's actually the Linux kernel that's doing it, but I spoke to a distinguished engineer at Sun, and Solaris does the same thing. And so I feel like, all right, it's not just an artifact of 14-year-olds writing code. Okay? And the problem here, this is the idea. Linux has this intelligent thing that it does when you spawn threads. If you spawn threads and you don't say that there's a specific CPU that they have affinity to, then it keeps them on the same CPU where they were spawned for some period of time. The idea is, you know, the cache is all warmed up, it's a thread, it shares the same address space, let's optimize for that case, and then as the thread runs for longer we'll load balance it off. All right? With processes, processes don't share an address space. They should be distributed immediately. So in fact, this actually works in our favor, because the processes are immediately load balanced. They're spread across all available cores as fast as possible. And so that's why we get a speedup. So I'm certainly not advertising this as some technique to improve the speed of your programs; it was intended to just measure what the overhead was of using these processes, and it turns out that there really is, effectively, because of this artifact, not even no overhead, it's negative overhead. Which is not bad. Yes? >>: Most fork-join parallelism systems will have some set of worker threads whose startup time is amortized once. >> Emery Berger: Absolutely. Absolutely. So if you think -- >>: [inaudible] in this graph? >> Emery Berger: So I mean if you were to use Cilk, right, so these threads -- I mean, it would basically be, you know, consider the situation out here at the extreme edge, right, where the thread has been running for a long time, here it's a second. Right? 
All of the startup overhead is gone, everything has been load balanced out; you know, these are the standard sort of, you know, kernel threads that are sitting there with their own, like, deques, right? And so, you know, you end up at the same place. And here the axes are coincident. So I think that's where you'd be. Okay. >>: When [inaudible]. >> Emery Berger: So I mean at one millisecond I think that it would be -- you know, it might be around, I don't know -- I mean, right here the processes are faster because they're load balanced. I would expect it to be about the same, right, that the gap would go away. You know, if they've already been running a long time, then they've already been load balanced out, right? So they're all spread out. >>: Mentioning the overhead of actually [inaudible] I think what Dave is asking is that if you have these worker threads stick around, the [inaudible] of starting new threads is actually -- >> Emery Berger: Absolutely. Absolutely. >>: [inaudible] still have the [inaudible]. >> Emery Berger: That's right. >>: [inaudible] cost of starting up a new -- >> Emery Berger: That's right. That's right. And well, I can talk about this some more when I get to some of the other results. I'm going to talk a little bit about Cilk. But you're absolutely right. I mean, if you think of a framework like Cilk, or essentially its follow-on, which is Intel Threading Building Blocks, the idea is you have some set of kernel threads, and then thread spawns are really very, very lightweight continuations. So there's no OS involvement at all. So they're quite cheap. And the goal there was to optimize fine-grained thread behavior. Okay. So that said, all the issues of process creation and stuff, with the exception of very, very fine-grained stuff, which I can talk about in a minute, are really not the overheads of Grace. The overhead of Grace is about rollback. And so we wrote a little microbenchmark that just simulates the effect of different rollback rates, and as the rollback rates increase, unsurprisingly, performance declines. The ideal is eight -- the dotted line here is pthreads, which of course never rolls back -- and then here you get Grace, so as rollbacks increase you eventually head more and more toward serial. The good news is that you can withstand some percentage of rollbacks and still get reasonable speedup. But, you know, all of the steps that we take to eliminate false sharing are really crucial. And if you have true sharing, of course, it's not a good situation. Yes? >>: So when you're comparing your -- >> Emery Berger: It is apples to oranges in this sense. Here -- it's the exact same program, okay. One of them is using pthreads. We're using locks to modify some shared data. >>: Well, you already -- >> Emery Berger: Yes. Yes. And so Grace ignores the locks and just treats the thing as sort of a transaction. >>: So they are computing the same thing? >> Emery Berger: They are. >>: Like the pthreads just -- >> Emery Berger: No, no, no. It's the exact same program. Absolutely. And in both cases these are the same programs. Just one is linked with pthreads, one's linked with Grace. Okay. So, right. So here's my disclaimer. So these are the benchmarks that we're using. It turns out that it's actually quite hard to find benchmarks that exclusively rely on fork-join. 
A lot of the classic -- also crappy -- benchmarks like SPLASH rely on barriers, and barriers, while they are conceptually at a high level like fork-join, actually break up the threads. So it doesn't make sense for them -- you know, if you think about it, you've got thread one and thread two and they have a barrier, they join, and then you have the next threads. If they had been written as spawn two threads, wait for them, spawn two threads, wait for them, that would be a perfect fit for Grace. But with a barrier breaking them up, the sequential composability thing doesn't work. All right? So these are drawn from two sources. Mostly they're from a suite of benchmarks called Phoenix, which are the benchmarks that are used to test a shared-memory version of Map/Reduce. Matmul here is from Cilk. It turned out that matmul was the only one of the Cilk benchmarks that scaled at all with pthreads, because the Cilk benchmarks are designed to stress super, super fine-grained thread creation. So we subsequently came up with a clever hack that allows Grace to run very well even for those, but I'm not going to present those results. So the blue bars here are pthreads. The red bars are Grace. These are speedups on an eight-core machine. You can see that in some cases you get super-linear speedups, and these are cache effects. The good thing to see, of course -- this is a little bit surprising -- and the reason this happens, basically, is that lock overhead goes away with Grace. So if the memory spaces don't conflict and you have no rollback, and you have also turned the locks into no-ops, then you get a performance benefit. Which is nice. Not what we would expect in practice. So the speedups are good. I have to note, though, we had to make some minor modifications to these programs. The bulk of the modifications -- that's the mods that we had to make to most of them -- were for this particular issue, which is, you know, as I described before, Grace follows the thread spawns as they appear in the code. But if you do this -- spawn a thread, update a global variable, then spawn a thread and update a global variable -- Grace will say, oh, I have to do the whole thread execution first and then do the update of the global variable, then I can spawn more threads. So it serializes all the thread spawns. So that was a real killer. So all we did was we hoisted that out of the loop. We put these all in an array, spawned everything into an array. They each had separate result fields or something, and everything worked out fine. So that was the most important thing. If we extended the pthreads API with, like, you know, create multiple threads, then that problem would go away. Okay. So here Kmeans is kind of an interesting case. Kmeans, the well-known clustering algorithm, has a benign race. It's basically counting how many things belong to a particular cluster. It converges. So the race is not really a problem. It's just a number, an atomic value. But for us, this is a killer, because Grace sees these racing updates and doesn't know how to distinguish a benign from a malignant race, and serializes everything. Okay. 
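Here is a hedged before/after sketch of the "hoist the spawns out of the loop" change described a moment ago. The names worker, total, and results are hypothetical, not the actual benchmark code: in the first version the global update between spawns forces Grace to serialize, while in the second each thread writes into its own array slot and shared state is only touched after the joins.

```c
/* Illustrative before/after for the hoisting modification. */
#include <pthread.h>

#define NTHREADS 8

static int total;                         /* shared global               */
static int results[NTHREADS];             /* one private slot per thread */

static void *worker(void *arg) {          /* stand-in for real work      */
    long i = (long)arg;
    results[i] = (int)i;
    return NULL;
}

/* Before: spawn, touch a global, spawn again.  Under Grace's sequential
 * semantics each spawned thread must run and commit before the update
 * that follows it, so the spawns end up serialized. */
static void spawn_serialized(pthread_t *t) {
    for (long i = 0; i < NTHREADS; i++) {
        pthread_create(&t[i], NULL, worker, (void *)i);
        total += 1;                       /* global update inside the loop */
    }
}

/* After: spawn everything first into disjoint result slots, and only
 * touch shared state after the joins. */
static void spawn_hoisted(pthread_t *t) {
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);
        total += results[i];              /* combine afterwards           */
    }
}
```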
>>: [inaudible]. >> Emery Berger: Yeah? >>: [inaudible] that in the same way, that is, instead of putting all the values in the same place, to -- >> Emery Berger: Yup? >>: To put them in -- >> Emery Berger: Yes, so -- >>: The end? >> Emery Berger: That's right. So one of the things, you know, we just decided, for our purposes, we said we're not going to make pervasive modifications to these codes. Clearly we could have changed Kmeans to avoid this problem, right? But I said, well, you know, let's see what we can do with local changes, where we barely understand what the program is doing. Okay? And that was really the goal. So in fact, one of my students was like, ah, I think we can fix Kmeans by doing X, Y, Z. And I said no, no, no. That's not the object here. The object here is to come up with minimal changes. And in fact what we're doing now is we're trying to come up with ways of avoiding even making those minimal changes, right? So the goal here would be you don't change your code at all and it still performs well. Right? Okay. All right. So that's the benign race. All right. That's basically it. So Grace, I think, again, is really about this sort of idea of garbage collection for concurrency, in a way, right? I want to get rid of all of these errors just like I used to get rid of dangling pointer errors with garbage collection, and that turns out to be a big win for a number of reasons. Getting rid of these errors may be more costly than running on the bare metal all the time. There are going to be different ways of programming, potentially. But it seems like a laudable goal. These are promising initial results, right? So if you use a cleverly tailored algorithm to implement this kind of GC for concurrency, then you actually can get pretty good performance. You know, I should note, this is sort of transactions, right, but there's no overhead on the main line, right? So you have these very coarse-grained transactions. There's no logging. There's no compiler involvement. There are no read barriers beyond the initial have-I-read-this-page-once, and all of that overhead is amortized. So that's how you get this performance. And there's hopeful Wall-E. Yeah, happy to take more questions. Manuel? >>: This is very nice. One [inaudible] your benchmarks is that you started with benchmarks that are already parallelized. >> Emery Berger: So this is a story of -- >>: [inaudible]. >> Emery Berger: Multithreaded [inaudible]. >>: Make everybody be able to use threads and so on, so I would have to start really with this sequential program and start putting spawns and syncs into places, right? >> Emery Berger: That's right. >>: And now the question [inaudible]. >> Emery Berger: Yeah. I mean, I think, you know, there are deep problems, right, with writing concurrent code, beyond just the errors. Right? So getting it right is hard enough. But getting it to scale is a whole other question. So this is one of the reasons why I personally remain skeptical of the sort of vision of automatic parallelization: it's very, very difficult -- it boils down to algorithm design, right? Automatic algorithm synthesis, right? So I'm going to take, you know -- you go ahead and you write your sequential, you know, random sort algorithm that works like this: randomly permute everything and see if it's sorted. Okay? I will discover that that's inefficient, and I'll turn that into quicksort. To me it's tantamount to, you know, this sort of: you start with the sequential program, I'll generate a parallel program that scales. And so, you know, given that that seems very, very far off, if it's even possible at all, I figure start with the concurrent program. But the concurrency here is you spawn stuff. 
But you're still required -- there are two -- I mean, the bad part here is there's an additional requirement. Because not only do the data structures largely have to be distinct, they actually have to be distinct at sort of a page-level granularity, which is a drawback of this particular implementation. But in the end, you have to break up data sharing. I mean, that's part of the whole goal. The especially bad part about this, though, is that one shared -- one conflict can cause an entire rollback, and that's actually something that we're working on. I'd be happy to talk to you offline about this. Yeah, Carter. >>: So you started off by [inaudible] all the problems in concurrent programs like [inaudible]. >> Emery Berger: Maybe not all. >>: Well, okay. A lot of them. So do you have any -- and you get the [inaudible]. >> Emery Berger: Yes. >>: So do you have any data on how -- what frequency of these [inaudible] this classic program? >> Emery Berger: Oh, this is a good question. So I mean, I don't have, you know, data data. I have anecdotal evidence. So the Cilk folks have been working on this for a very long time, right? So Cilk is a fork-join model of parallelism. And they have written I think three different race detectors. And in one of the papers where they talk about it, they call it the Nondeterminator. And in one of the Nondeterminator papers they say they gave all these programming assignments to a bunch of MIT undergrads. And almost all of them had race conditions. So that to me is suggestive that it's not much easier, given that MIT undergrads as a whole are not too bad. So, Dave? >>: Some of them [inaudible] was surprising and astounding things that you can get these sorts of results at page-level granularity of false sharing. >> Emery Berger: Yeah. >>: I remember reading the paper. And did you do something in matrix multiply to make sure there was like a block matrix multiply and you -- >> Emery Berger: Yeah, yeah, yeah. So the one change that we did in matrix-matrix multiply, which is I guess the most pervasive change, was the base case for the matrix-matrix multiply, which was some arbitrarily chosen number. It was like, I forget, 16 by 16 or 32 by 32. And we initially fingered that as being a potential scalability problem because, you know, it's very, very fine-grained, but in fact it turns out to have been tuned for a particular cache configuration. So we made it bigger. And making it bigger actually improved the pthreads-based code, but it also helped us because it meant that the blocks were -- it was basically 4K by 4K or something. The result was that you were accessing things in page-size chunks at the base. And the way that the memory allocator works, when you allocate a large object, it guarantees that it starts on a page boundary. So it makes the math very easy. So you know for sure this starts at a page, so I know that this index range to this index range always lies on a page. >>: So [inaudible] a little bit of knowledge about -- I'd like to make things like a page size [inaudible] the code you -- >> Emery Berger: That's right. I mean, the real problem with things like matrix-matrix multiply is that you're dealing with arrays. And you have very limited freedom to move arrays around. With other objects we actually have a lot of leverage. 
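A small, hypothetical sketch of the page-granularity layout ideas discussed here; it is not code from Grace, and the struct and helper names are made up. The point is simply to keep data that different threads update on separate pages, and to get heap blocks that start on page boundaries so index ranges map cleanly onto whole pages.

```c
/* Illustrative page-alignment and padding to avoid page-level false sharing. */
#include <stdlib.h>

#define PAGE_SIZE 4096

/* Bad at page granularity: two counters updated by two different threads
 * share one page, so page-level conflict detection sees false sharing. */
struct shared_counters {
    long counter_a;        /* updated by thread A */
    long counter_b;        /* updated by thread B */
};

/* Better: pad and align so each thread's data occupies its own page. */
struct per_thread {
    long counter;
    char pad[PAGE_SIZE - sizeof(long)];
} __attribute__((aligned(PAGE_SIZE)));

/* For heap data, ask for page-aligned blocks, mirroring the guarantee
 * mentioned above that large allocations start on a page boundary. */
static void *page_aligned_alloc(size_t size) {
    void *p = NULL;
    return posix_memalign(&p, PAGE_SIZE, size) == 0 ? p : NULL;
}
```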
So one of the things we can do -- we have a prototype that does this right now -- is you run your program, it detects false sharing, and then it says here's where the false sharing happened. So with globals, this basically means, you know, oh, add some padding. With heap objects it means segregating things from different call sites, for example. >>: [inaudible] about to [inaudible] languages? >> Emery Berger: Doing what? >>: [inaudible]. >> Emery Berger: Yeah. Yeah. We have. Although -- yeah, I'm not sure if you're on my schedule, but yeah, I think, you know, this is certainly -- there's nothing that precludes this from being incorporated in a GC'd language. And GC gives you extra advantages, as you described. And the ability to move objects around, to sort of shuffle them out, or to say, you know, this is conflicting and then let's tease those apart, would make things tremendously easier. So we're in a worse situation because we're in C and C++ than we would be if we could actually move things. Yeah? >>: Just [inaudible] suppose that you have a binary tree where the leaves of the tree happen to sit in different pages. And then you're marching down the tree and you're spawning threads at every internal node, and as you're getting to the bottom, at that point you update at least the [inaudible]. In that situation, are you going to do -- you start [inaudible] are you going to do N or 2N commits? That is, do you commit with every level of the tree -- >> Emery Berger: So each -- so let's see. So if I spawn the -- I'm trying to think of exactly what you're asking. I mean, the commits are directly related to the number of threads and when they spawn a thread. So when you spawn a thread, that's a commit point. And when a thread ends, that's a commit point. >>: Okay. >> Emery Berger: Yes? >>: So most of the benchmarks are from [inaudible]. >> Emery Berger: Yes. >>: And did you compare the lines of code with the number of lines if we're using [inaudible]. >> Emery Berger: That's an interesting question. So I spoke to Christos [phonetic] about this. It turns out that using their Map/Reduce implementation mostly impairs performance. And the number of lines of code is not materially different; it's actually often larger for Map/Reduce. It's not really a great story for Map/Reduce on shared memory. I mean, the idea of course is, well, this will make it easier to write programs. I mean, I corresponded with them. He basically thinks it's easier for software engineering reasons and it makes it easier to reason about the concurrency. But the performance story is not that great. So in fact, I didn't include any of the Phoenix Map/Reduce-based versions for comparison, because it seemed unfair. Because they're all worse than the pthreads version. And pthreads to me is the gold standard. Yes? >>: So this is a solution when you have mostly [inaudible] right? >> Emery Berger: Yeah. >>: But how do you kind of extend this when you [inaudible] benchmarks are more [inaudible]. >> Emery Berger: Right. Right. So that's an excellent question. And I mentioned I was going to get back to this, and now I will. So, you know, one of the things that we encountered when we tried to do this is I basically had a student change all the Cilk benchmarks to use pthreads. Okay? It wasn't that hard. It's mostly mechanical. You see, you know, every place there's a spawn, it turns into a pthread_create, and then you have to, you know, save the results somewhere. That's about it, right? And the syncs become pthread_joins, okay. And then we ran them, and none of them scaled at all except for matrix-matrix multiply. 
And when I say it didn't scale, I don't mean with Grace, I mean they didn't scale with pthreads. Because they were so coarse. Okay? So then I couldn't include them and I was like, damn, this sucks, right, this is a disaster. Well, it turns out we can run them now. And the way that we run them is through, again, either, you know, a devilishly clever application of genius or a crappy hack. You know, depends on the eye of the beholder. The insight basically is this. If you're doing nested fork-join, sort of nested divide and conquer, just the way Cilk works, that means that you spawn a thread, you spawn another thread, they spawn threads, they spawn threads, they spawn threads, right? So we reasoned, you know, once you spawn enough threads and the number of threads exceeds the number of cores by enough, just don't spawn anymore, just run the stuff. So we use the nesting depth, and we say, all right, two to the nesting depth: once two to the nesting depth is greater than twice the number of cores, we don't spawn anything else. You just run the function. And that means now we can run -- for small versions, even things like fib, we actually scale now. And you can run almost as fast as Cilk. Because at some point there's no [inaudible]. Now, this doesn't work if there's load imbalance. Right? If there's extreme load imbalance so that one of the cores, you know, has all the work and all the others are idle -- you know, in Cilk that gets resolved because of work stealing. For us it would not get resolved. There are possible approaches that we have not implemented. One, we could detect when this is happening and roll back to the spawn point and then really spawn things. Or you could learn for subsequent invocations not to do that. Yeah? >>: Do you have another [inaudible] right, you have two [inaudible] and if you are [inaudible] very fine granularity then you are [inaudible]. >> Emery Berger: Okay. So I'm not sure what you're asking. In the -- >>: So. >> Emery Berger: I mean, when you say fine-grained, I think fine-grained means short-lived threads. But are you saying something else? >>: [inaudible]. >> Emery Berger: Sure. >>: So [inaudible]. >> Emery Berger: Yeah. Yeah. >>: Then you have to [inaudible]. >> Emery Berger: Okay. So if you write a program that does this, right -- spawn eight threads, wait for them, spawn eight threads, wait for them, spawn eight threads -- and those eight threads do hardly anything, then the approach I just described will not work. But if you do recursive divide and conquer, where you say I'm going to carve this big space into half and then half and then half, and then the leaves are very, very fine-grained computation, what we do is we turn that whole sort of forest below a certain point -- the whole subtree rooted right here -- into one big task. And so it works great. >>: So there's some underlying assumption that the data associated with that tree is contiguous and disjoint -- >> Emery Berger: Disjoint. Contiguous not required. >>: Okay. >> Emery Berger: That's right. And that the conflicting -- potentially conflicting -- pieces are disjoint. >>: Right. >> Emery Berger: Right. Yeah. And if that doesn't happen, then it's a problem. And, you know, that's a threat. You know, there's a tension between how long you make these threads last and what the likelihood of rollback is. 
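A hedged sketch of the depth cutoff just described; maybe_spawn is a hypothetical helper, not part of Grace or the pthreads API. Once two to the nesting depth exceeds twice the number of cores, "spawn" simply calls the function inline, coarsening the fine-grained subtree into one big task.

```c
/* Illustrative depth-cutoff helper for nested fork-join. */
#include <pthread.h>
#include <unistd.h>

static long ncores;                       /* number of cores, cached      */

/* Returns 1 if a real thread was created (caller joins it later), or 0 if
 * the work ran inline because 2^depth already exceeds twice the cores. */
static int maybe_spawn(pthread_t *t, void *(*fn)(void *), void *arg,
                       int depth) {
    if (ncores == 0)
        ncores = sysconf(_SC_NPROCESSORS_ONLN);

    if ((1L << depth) > 2 * ncores) {
        fn(arg);                          /* coarsen: run synchronously   */
        return 0;
    }
    return pthread_create(t, NULL, fn, arg) == 0;
}
```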
So one of the things that we've done that's not in this paper is we've implemented checkpointing. And the checkpointing is even more crazy, because what we do is we basically -- here's the very bad case that we didn't want to have screw us: do a whole bunch of work, update a global variable, the end. Okay? Right? And then oops, we have a conflict, roll back, waste the whole computation. Right? We do not want to do that. So we have a way around it, and the way it works is we periodically take a checkpoint by calling fork again, and we leave that forked child as a sort of placeholder sitting there. So we fork, we fork, we fork, we get to a certain point and we discover a conflict, we roll back to the last fork. And that fork is reawakened. If we successfully commit, we just kill all those children. So, yeah. >>: So [inaudible] successful I think was that first we had the [inaudible] said okay, forget the [inaudible] and we have infinite memory. >>: Yes. >>: That's your model, and now [inaudible] right? >>: Yeah. >>: Now, if this were the [inaudible] you spend most of the talk about the technique, how do you implement it, but at the beginning, what's the model? So your model is, well, all these things are no-ops -- spawn, sync, and so on -- and sequential semantics. Is that the right model that [inaudible] same kind of research that happened in GC to actually succeed, or do we still need to treat the model -- >> Emery Berger: Okay. So I'm by no means convinced that this is the end at all. But what we were striving to come up with was: what does it mean for a concurrent program that has bugs to be correct? For a program that has dangling pointer errors, to make it correct you turn off delete. Right? And you have this infinite memory abstraction. And so that's very clear, easy to understand, easy to explain, and, you know, completely solves the problem. The question of what makes a concurrent program correct seemed to be a difficult one. All right? What is the correct version of a concurrent program? And so this is what we ended up with. We said let's find something where there's an isomorphism to sequential programming. And that's the correct program. But clearly this doesn't include all possible forms of concurrency. Right? We've restricted ourselves to fork-join. And so it's a question whether we want to extend this to include condition variables and signaling, for example, but, you know, what is a correct program once you've basically injected message passing, right? I mean, you know, condition variables are just messages. And you know, the correct version of a message-passing program, I'm not sure what that is. So -- but it's a great question. Certainly one worth discussing over beer. All right. Thanks, everybody. [applause]