>> Ben Zorn: So it's a great pleasure to introduce Emery Berger, an associate
professor at the University of Massachusetts in Amherst. Emery got his PhD in
2002 from the University of Texas and I've been working with him since then on a
number of different projects. I have to say Emery always surprises me by the
creativity of his work and also by the fact that he takes ideas that you think are just
out -- out of this world and he makes them work in practice. So I think this is
another example of that work. And today Emery's going to talk about parallel
programming in Grace.
>> Emery Berger: Thanks, Ben. Hi, everybody. All right. So you see the title of
the talk, Grace: Safe Multithreaded Programming for C and C++. But what I
want to talk about first is garbage collection. This may seem a little surprising.
But, you know, garbage collection is a topic that obviously -- well, I'm not sure if
it's near and dear to my heart, but at least it was at one time.
But it's a well established technology, it's something that now works, right, so
here's -- here's the world's best garbage collector, Wall-E. He never stops.
There's always garbage to be found. And you know, the idea of garbage
collection, at least the idea for Wall-E is you've got a robot and the robot is going
to find your garbage and take care of it. And in a garbage collected language,
right, the runtime takes care of your memory errors. And this is great. You can
deposit your garbage wherever you feel. The runtime will come along after you
and clean it up.
And so this is actually -- so that's Hans Boehm by the way, if you don't recognize
him. I mean Hans is -- this is especially suitable here. So Hans built this
conservative garbage collector for C that you all may have heard of and really it's
exactly what I'm getting at, which is it's garbage collection that is intended to deal
with programmer error. So you're in the world of C. You make some mistake in
your memory management, you throw Hans at the problem and the problem
goes away. Hans-E. Right. So you know the result of all of this work on
garbage collection is it's a success story, right? There are now plenty of languages
where garbage collection is now practically taken for granted. And you know, for
many, many years it was considered very controversial. Who would ever adopt
it. And now, you know, you can barely think of -- I think there are no dynamic
scripting languages for example that don't have garbage collection, JavaScript of
course, Python, Perl, Ruby, not to mention Java and C#. It only took 35 years.
So I guess you know if you're patient like Wall-E, then all good things come to
those who wait. Okay. So wouldn't it be nice if some day we could have
garbage collection for concurrency. So this is sort of the vision of this talk. I
would like to be in a world where I can take my program that has all kinds of
bugs, right, I just throw, you know, write a concurrent program with just as many
bugs as I want. I leave off locks, it's full of races and then Wall-E comes and will
clean up after me. I think that would be great.
So I just want you to, you know, think of the rest of this talk through this sort of
view of garbage collection for concurrency. All right? Okay. One question
immediately comes to mind though is wait a second, what does that even mean?
What would it mean to garbage collect concurrency errors? We know what it means
to garbage collect memory errors. For concurrency it's not so clear. Okay. A lot of
this is not going to be new, but we'll go through it anyway.
You know, in the beginning there was the core. The core was good. And every
year the core got faster and faster, life was wonderful. And so if you wanted to
write a sequential program right you'd just write your sequential program. Every
year it got faster and faster. You just write a very simple program. You're not
meant to understand this, but it's just a very, very simple loop. And it was great
until one day things got a little too hot and now we're out of luck, right, they
leveled off. And so then they say wait, wait, you know, we know how to build the
one core. What if we give you four. And we said we don't want four. And they
said sorry, you're getting four.
And so now we have four cores. And the promise of course is that well, if you
give me more cores then I can run faster because I can do more things at once.
Right? The last little stripe has to lag along, [inaudible]. All right? So that's
great. And if we could do this, if we could parallelize all our programs, right, it
would be as if Moore's Law never stopped.
Now, I'm playing a little fast and loose with what Moore's law means, but you all
get the idea.
Okay. The problem of course is known to anybody in this room who has programmed
concurrent programs. How many of you have written concurrent programs? Okay.
How many of you have written concurrent programs that have bugs? Okay. All
right. Don is like what, bug, me?
So here is the kind of problem you get, right? You run the program once, it
works fine. You run it again, something else happens. Sometimes it works,
sometimes it doesn't, right, so these are races of course. And if you look at this
code, you can see that there is indeed a race if you imagine this being multi
threaded down in this loop down here, if this is a parallel for. The problem is that
these are races, right? The updates to color and row and the call to draw box
really need locks around them.
So you then throw locks around the updates to color and row. You would think
everything would work. And it will prevent races, but unfortunately sometimes
you get this or this, which is rather surprising. And so these are atomicity
violations. So here the problem is you put locks around things but you made
them too fine grained. You really wanted to draw this whole rectangle in a color
and then do the next one and the next one and not change the color while you're
drawing the row. So you need to move the lock up.
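The slide's code isn't reproduced in this transcript, but the shape of the problem is roughly the following (a hypothetical pthreads sketch; color, row, next_color, next_row, and draw_box are illustrative stand-ins for the example on the slide):

    #include <pthread.h>

    /* Illustrative stand-ins for the slide's example. */
    extern pthread_mutex_t m;
    extern int color, row;
    extern int next_color(void);
    extern int next_row(void);
    extern void draw_box(int color, int row);

    /* Too fine-grained: there are no data races on color or row, but another
       thread can change color between the two critical sections, so a row can
       be drawn in a half-updated state. */
    void draw_one_box_too_fine(void) {
        pthread_mutex_lock(&m);
        color = next_color();
        pthread_mutex_unlock(&m);

        pthread_mutex_lock(&m);
        row = next_row();
        pthread_mutex_unlock(&m);

        draw_box(color, row);
    }

    /* "Moving the lock up": the whole pick-color, pick-row, draw sequence is
       now one atomic unit. */
    void draw_one_box_atomic(void) {
        pthread_mutex_lock(&m);
        color = next_color();
        row = next_row();
        draw_box(color, row);
        pthread_mutex_unlock(&m);
    }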
And you know, for those of you who like transactional memory, you know,
pretend that lock and unlock means, you know, atomic and curly braces, all
right? Same thing.
And I should observe in passing, you know, that transactional memory does not
solve any of these problems, right? It's really a question of where you put these
braces. So if you make a mistake and you would put them where they were
here, you would have exactly the same problem, you need to put the braces in
the right place, and transactional memory does not deal with this problem, right?
It still relies on the programmer to get it right. Okay.
So here's another kind of problem. The one thing transactional memory does
deal with, which is deadlock, it unfortunately sometimes converts into livelock,
which is far, far worse, but we won't get into that. Okay.
So here's another weird error if you see this. Here's the flag one, flag two. So
this is what's called an order violation. It's a recently identified concurrency error,
which I think is great. You know? Good, new errors. Fabulous. So the new
error is programmers wrote this code, assumed that when you spawned a thread
that that thread would be delayed, that it wouldn't just spawn immediately. And
so they then did some other stuff right after they spawned the thread, and almost
all the time that other stuff preceded the actual execution of the next thread but
then they changed versions of the OS, the scheduler changed, all of a sudden
the thread started executing first and weird things happened.
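A minimal sketch of that pattern (illustrative pthreads code, not the code from the study):

    #include <pthread.h>
    #include <stddef.h>

    extern void *child_thread(void *arg);   /* illustrative */
    extern void init_shared_state(void);    /* illustrative */

    void parent(void) {
        pthread_t t;
        /* The programmer assumed the new thread would not start right away... */
        pthread_create(&t, NULL, child_thread, NULL);
        /* ...so this initialization usually ran first. Nothing guarantees that
           ordering; after a scheduler change the child ran first and the
           hidden assumption broke. */
        init_shared_state();
        pthread_join(t, NULL);
    }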
So fun concurrency errors. So to recap, back in the day when we had the one
core, it was great, everything was sequential, no bugs, right? Remember? All
right, maybe it was just easier, okay? And you know, it was a world of unicorns and
rainbows, comparatively. So now we have races, atomicity violations, deadlock,
order violations. So this is concurrency hell. So it's a bad situation. So what we'd
like to be able to do is to come up with a world that takes us back to those
unicorns and rainbows to some extent, and so the object of Grace, which is what
I'm going to talk about today and which you can view as a sort of garbage collector
for concurrency, is intended to get rid of all of these errors, okay, and impose
sequential semantics. So you'll take your multi-threaded program, and it will run
as if you had written a sequential program. Okay? So I'll explain what that all
means in some detail. The idea behind Grace, however, it's not a new compiler,
it's a runtime solution. Roughly speaking to use Grace you swap out the P
threads library for the Grace library. And you don't need to make any other
changes to your code. Okay? This guarantees sequential semantics, gets rid of
these errors, and I'll show you how you can actually get some performance out
of it.
>>: Yes?
>>: What do you mean by sequential semantics? [inaudible] contains spawn
thread, I mean, there is no --
>> Emery Berger: Yeah. I'll -- that's exactly the next thing, right? So in the rest
of the talk, the very first thing I'm going to talk about is going to address that
question, which is what the hell do I mean? Right? I've got multi-threaded
programs that are going to run on multiple cores and now I'm saying they're
sequential. What do I mean by that?
I'll talk a little bit about what goes on inside, right, what's under the hood, and
then present some results. Okay? All right. So to the first question. So for the
purposes of this talk, I'm going to use the syntax of the Cilk programming
language. For those of you who are unfamiliar, it's a C like programming
language that is extended with spawn, which is the moral equivalent of P thread
create and sync, which is the moral equivalent of P thread join for all of the
preceding children. Syntactically, it's just cleaner and makes the examples
easier. But Grace actually implements the P threads API. Okay?
So here is a little program that says spawn thread to run the function F of X and
store the result in T1. Same thing with G of Y and T2. And sync means wait for
both of those threads to complete. Okay? Very simple multi-threaded program.
So the execution looks something like this.
You say spawn F of X. F of X starts executing. It runs concurrently with the
main program which now spawns G. Now, everything is running concurrently
until they hit sync and sync is like a barrier, okay? All right. So that's your
concurrent program.
So what I want to do is I want to take this concurrent program and I want to
impose an order on the execution. What I'm going to impose in particular is a
left-to-right, depth-first ordering. But what that
corresponds to is actually much easier to explain syntactically. All we do is we
get rid of the keywords spawn and sync. So this is sort of an elision of all of the
concurrency constructs in the code and you're left with a sequential program. So
where you had a function call before that was supposed to be spawned
asynchronously in another thread, now it's going to get executed synchronously.
So F of X will be executed to completion, then G of Y.
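To make that concrete, here is a minimal sketch in Cilk-style syntax of the program being described, followed by its sequential elision (the function names come from the example above; everything else is illustrative):

    cilk int f(int x);      /* defined elsewhere */
    cilk int g(int y);

    /* Concurrent version: spawn f and g, then wait for both. */
    cilk int compute(int x, int y) {
        int t1, t2;
        t1 = spawn f(x);    /* runs concurrently with the rest of compute */
        t2 = spawn g(y);    /* runs concurrently with f(x) */
        sync;               /* barrier: wait for all spawned children */
        return t1 + t2;
    }

    /* Sequential elision: delete cilk, spawn, and sync, and you are left with
       an ordinary sequential program. These are the semantics Grace promises. */
    int compute(int x, int y) {
        int t1, t2;
        t1 = f(x);          /* f runs to completion first */
        t2 = g(y);          /* then g */
        return t1 + t2;
    }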
So these are the semantics that Grace is going to provide. Okay? It's not really
going to execute them this way because that would defeat the whole purpose of
having multiple threads and multiple cores. Okay? Yes.
>>: So you chose that there's certain parallel programs to be [inaudible].
>> Emery Berger: That's exactly right. So first let me say what this does, what
the implications are, and then I'll talk about the limitations. And you're absolutely
right. So first it should be clear that most of these problems go away, right, as
soon as you move into a sequential world these sort of concurrency errors by
definition stop really mattering. So a race doesn't matter if threads aren't
running concurrently. Same thing with atomicity violations. We don't need
locks anymore because there aren't multiple threads. So that gets rid of
deadlock. And there's no order violations. The order is fixed, it's program order.
Okay? However, there's some limitations.
If you have threads that are intended to execute forever, so some sort of a
reactive thread. So if F of X and G of Y when you spawn them they were both
supposed to run forever now we compose them and F runs forever and G never
runs, you have a bit of a problem. So you can't have infinite thread execution.
The other thing which is perhaps a more crucial limitation is they can't
communicate either, okay? They can communicate in the sense that the effects
of one thread become visible to the next thread or to a child thread, but they can't
use condition variables. So in fact in Grace condition variables are not exposed
in the library; if you try to link with the Grace library and you have condition variable
calls, it will complain at compile time. All right? So this sort of communication is
ruled out. Okay?
>>: [inaudible].
>>: You could still implement your own --
>>: Implement my own channels [inaudible] shared memory [inaudible].
>> Emery Berger: I mean, there is -- yeah, yeah. You can't implement -- I mean
any sort of thing that does that, you know, there's no way to actually implement it
in Grace. Right? You're prevented from doing so.
And the primitives aren't exposed. Okay?
>>: [inaudible].
>> Emery Berger: Oh, well, I mean, okay. I see. So you want to violate the
boundaries, right. Yeah. Okay. Sure. Yes. That's right. There are certain evil
things you could do to try to get past this and then, you know, if you roll your
own or you include something that's not called pthread_cond but, whatever it's
called, you know, manuel_cond, then Grace doesn't know about it and it will not
behave as you would like. Okay?
What it's intended for is fork-join parallel code. So the whole object of writing
these parallel programs ostensibly is to take advantage of all of these cores and
take computations that were sequential and make them run faster and fork join is
a popular paradigm that makes it fairly easy -- it's a nice easy abstraction to think
about how you break these things up, break something into pieces, maybe even
have nested fork join. That all works. And so there's a bunch of different
libraries that exploit exactly this -- provide this sort of API so map-reduce is a
fork-join framework. Cilk is this programming language I described earlier.
Intel's Threading Building Blocks library. I should add Java also has a new fork-join
library that's coming out. Again to just enable programmers to use this paradigm.
Okay.
So let's see. So what's going on? How is it possible for me to take a code that
used to run sequentially and as I'll tell you later get any performance out of it?
So magic. Not much of an answer. There are two pieces here, two crucial
pieces. One is some magic which is really you could think of as hackery and the
other is ordering. So we need ordering, and we need this hackery to make this
work.
So first I'll talk about the hackery. So this is going to sound terrifyingly evil. Or
awesomely clever. You decide okay? So here's what we do first. So we take
these threads, you know, you got this program and it spawns threads. We're
going to turn this program that was written using threads into a program that
forks. That is Unix level process creation. All right? So you may be thinking to
yourself dear God, why would you do this? But it will become clear in a minute.
And I should add right now this is implemented in Linux. Fork is actually quite
efficient. In fact, thread creation is -- uses the exact same code base as fork. So
the difference in terms of performance as you'll see is not really a material
problem. What we do this for actually is to provide isolation.
So the great thing about processes is processes have separate address spaces.
So what I can do is I can run one thread and run another thread and all of their
local effects are isolated from each other. So we're going to use this to enable
speculation. And that's how we're going to get performance.
So for example, I can start running F of X and I can then run G of Y completely in
parallel and if one of them does something that would violate this sequential
ordering, I can always throw away its results and execute it again. So in the
worst case, I'll end up with the sequential execution, I will hopefully not be in that
case, and I'll get some parallelism out. All right? And it turns
out nested threads are okay. I'm not going to explain the details here. But you
may be asking yourself, wait a minute, these are in separate address spaces.
You should be asking yourself this. They're in separate address spaces. You
know, F does something, G does something, but they're totally separate
processes, right? They can't see the results of what they did. It doesn't make
any sense. So what we do is we actually use a low-level memory mapping
primitive to basically expose the shared space and then each guy has a private
copy of the same mmap region. So this is something you can do with mmap.
When you go and you write to one of these pages it's copy on write. So
everybody will end up with their local versions of the heap and they can always
see say the heap and the globals that had been committed before. So it's a way
of giving you threadlike convenience in terms of having shared memory but still
get isolation.
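A minimal sketch of that mapping trick, assuming a fixed-size heap backed by a shared memory object (the object name, size, and layout here are illustrative, not Grace's actual scheme):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define HEAP_SIZE (64 * 1024 * 1024)   /* illustrative size */

    static void *committed_heap;   /* shared view: the last committed state */
    static void *working_heap;     /* private, copy-on-write view for this process */

    static void map_heaps(void) {
        int fd = shm_open("/grace_demo_heap", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, HEAP_SIZE);

        /* MAP_SHARED view: copying committed pages here is the publish step. */
        committed_heap = mmap(NULL, HEAP_SIZE, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);

        /* MAP_PRIVATE view of the same object: the process reads the data that
           was there when it started, but its writes are copy-on-write, so all
           local updates stay isolated until they are explicitly committed. */
        working_heap = mmap(NULL, HEAP_SIZE, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE, fd, 0);
        close(fd);
    }

After a fork, the child inherits both mappings, so every speculative thread sees the previously committed heap and globals, while its own writes to the private view stay invisible to everyone else until they are copied into the shared view at commit time.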
Now, when these threads are running, like I said, they're running in isolation, they
get all their private versions. Eventually this memory has to be shared, right? So
all the threads need to be able to see all the results of preceding threads. So
what we do is we impose ordering on how the results are committed. So there's
no ordering of the execution. The execution runs completely in parallel. But F of
X precedes G of Y in program order, and this is the sequentiality we want to
observe. So F of X's results need to happen first logically then G of Ys.
So what we do is we say, well, F of X is first, and it's going to get committed before
G even if G finishes first. F of X then checks all of the preceding -- any pages
that it's read to make sure they haven't been made stale. If they haven't, it can
commit. And then G of Y can do the same thing.
When G of Y goes and writes if it discovers that some of the pages that it's read
have now been invalidated, which it does with some version numbering that's
maintained down here as well, it has to roll back. Okay? So in the worst case
you get back to serial. But the goal here again is to make sure that correctness
comes first. Right? We want the program to behave correctly. Think of it
as an early, unoptimized, pre-generational garbage collector, all right? We want it
to be correct. We got 35 years to spend making it run fast, okay? Well, hopefully
it won't take that long. Okay? Well, if we can just get the TM people to stop
working on that, then maybe things will go faster.
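Roughly, the commit step just described looks like this (a hypothetical sketch; the per-page version array and the helper functions are assumed bookkeeping, not Grace's actual interfaces):

    /* Per-thread log of what this speculative thread touched. */
    typedef struct {
        int      *read_pages;      /* page indices this thread read */
        unsigned *read_versions;   /* version observed at first read */
        int       n_read;
        int      *dirty_pages;     /* pages this thread wrote (private copies) */
        int       n_dirty;
    } thread_log_t;

    extern unsigned global_version[];        /* per-page versions, in shared memory */
    extern void write_page_to_shared(int page);
    extern void wait_for_predecessors(void); /* enforce program (spawn) order */
    extern void restart_thread(void);        /* discard private state, re-execute */

    void try_commit(thread_log_t *log) {
        wait_for_predecessors();             /* earlier threads commit first */

        for (int i = 0; i < log->n_read; i++) {
            int p = log->read_pages[i];
            if (global_version[p] != log->read_versions[i]) {
                restart_thread();            /* a page we read went stale: roll back */
                return;
            }
        }
        for (int i = 0; i < log->n_dirty; i++) {
            int p = log->dirty_pages[i];
            write_page_to_shared(p);         /* publish the private copy */
            global_version[p]++;             /* later readers will see it is stale */
        }
    }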
So all right. I'm sorry. Okay. So there are some other issues to make this run
faster. So and I'm not going to go into much detail about them. If you have a
heap that is essentially just one big heap and it's a centralized data structure, this
completely will not work because one allocation by one thread and one allocation
by another will always conflict. So the heaps themselves, the data structure itself
needs to be made decentralized. In addition you don't want memory allocations
that are sort of adjacent, like one thread calls malloc, the other thread
calls malloc; everything here is at a page granularity. We're using mmap, right? And
we're going to be tracking these things on a page level.
If they're all intermingled, then you get a lot of false sharing. So what we do is
we make sure that every thread has its own set of pages that it allocates from.
So if you have two threads and they're allocating stuff, they're not going to
conflict because they're all on distinct pages.
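As an illustration of the idea (this is not Grace's allocator, and it omits freeing and bounds checks), a per-thread, page-aligned arena along these lines keeps two threads' allocations on distinct pages:

    #include <stdlib.h>
    #include <unistd.h>

    #define ARENA_PAGES 256

    static __thread char  *arena_base;   /* each thread gets its own pages */
    static __thread size_t arena_used;

    void *thread_local_alloc(size_t size) {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        if (arena_base == NULL) {
            void *mem = NULL;
            /* Page-aligned and private to this thread: allocations from two
               different threads can never land on the same page, so there is
               no page-level false sharing between them. */
            posix_memalign(&mem, page, ARENA_PAGES * page);
            arena_base = mem;
            arena_used = 0;
        }
        size = (size + 15) & ~(size_t)15;    /* keep objects 16-byte aligned */
        void *p = arena_base + arena_used;   /* no overflow handling in this sketch */
        arena_used += size;
        return p;
    }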
We do something very, very simple with the GNU linker, which allows us to relocate
whole sections of where things end up in the image. And so with the linker we
basically make the globals page aligned so there's a page before it and a page
after it. This prevents some false conflicts that arise from code being right next to
globals.
Now, you can be in a situation where you have false sharing because two globals
are next to each other in memory, and there's an update to one and an update to
another by another thread. We have ways of dealing with that. But I'm not going
to go into it. Crucially we have to deal with I/O. So I mention that the idea here
is basically optimistic concurrency, but optimistic concurrency means you're
going to roll things back. And as we know, it's hard to roll back certain
operations. So Maurice Herlihy's favorite example is launching the
missile, right? You launch a missile. Maurice clearly is a child of the Cold War.
So you have this missile that you launch and you say oops, oops, X was 2. That
means Russia's not attacking or whoever our enemy of the year is. So you --
and that's too late. Darn. Right? So what we do, first there's a very
straightforward easy thing to do for certain operations which is to buffer. Right?
But there's some things that are just out and out irrevocable. So for example, if
you have a network connection, right, so I want to initiate a connection to some
remote server and get a response. I can't buffer that, right? I need that
response. Right? So I really, really do need to execute it. So what do we do?
So the trick here, all we have to do is we take advantage of the ordering. So we
know for example if we're the -- if we have no predecessors that we are going to
commit our state. So whenever we get to an irrevocable point we say hey, are
there any predecessor threads that are supposed to commit before me? No.
Okay. Is all my state completely clean? Yes. I can go. Because I know I'm not
going to roll back. I'm guaranteed to commit. All right. So that allows us to
actually handle all of this sort of arbitrary I/O difficulty, even though we're doing
speculation.
Okay.
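In outline, the check before an irrevocable operation is just this (a sketch with assumed helper names; the real system has more machinery):

    /* Assumed helpers for ordering and read-set bookkeeping. */
    extern int  have_uncommitted_predecessors(void);
    extern int  read_set_is_clean(void);
    extern void yield_briefly(void);
    extern void roll_back_and_restart(void);   /* does not return */

    /* Called just before an operation that cannot be buffered or undone,
       such as sending a request on a network connection. */
    void before_irrevocable_operation(void) {
        /* Wait until every thread ordered before us has committed... */
        while (have_uncommitted_predecessors())
            yield_briefly();

        /* ...and make sure nothing we read has gone stale in the meantime. */
        if (!read_set_is_clean())
            roll_back_and_restart();   /* re-execute; we reach this point again */

        /* From here on this thread is guaranteed to commit, so the
           irrevocable operation can safely be performed for real. */
    }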
>>: [inaudible].
>> Emery Berger: Yeah?
>>: [inaudible] with respect to exceptions like one thread throws an exception
and the other thread tries to commit after that, so at that time do you guarantee that --
and I see the exception I go to sequencing [inaudible].
>> Emery Berger: So if an exception happens right now, the exception -- and it's
not caught or -- I'm trying to understand exactly the scenario that you're
concerned with.
>>: So I mean if I'm debugging this program, that is good, right, and I introduce
some of the -- no one can [inaudible] I'm not saying memory or something
[inaudible].
>> Emery Berger: Okay.
>>: And I see, you know, a segmentation fault. Would the same -- I mean, if you
run this sequentially you would see it at a certain point. With Grace do you end
up seeing the same state that one does execute?
>> Emery Berger: Okay. Okay. So let's -- let's be careful to distinguish
exception being thrown from you know, segmentation violation, right? Because
one's a programming language construct and the other is the result of something
happening below the PL, right? So like de-referencing null or something, right?
So we could talk about the former, but the latter case what will happen is the
thread will detect it. Currently the way Grace works is it actually uses
segmentation violations as read and write barriers. So the very first time you read
a page and the very first time you write a page, you take segmentation violations
so you know which pages you've read and written, and after that the pages are
unprotected.
If you take a subsequent segmentation violation, then Grace will say wait a
minute, you actually had a program error. And it will abort that thread.
Which is actually a process. So weirdly it won't take down the program, it will just
take down the thread.
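In outline, that barrier mechanism looks something like this (a sketch; the page-state table and record_* helpers are assumptions about bookkeeping, not Grace's actual interfaces):

    #include <signal.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    enum page_state { UNTOUCHED, READ_ONLY, READ_WRITE };

    extern enum page_state page_state_of(void *page);
    extern void set_page_state(void *page, enum page_state s);
    extern void record_read(void *page);    /* add to this thread's read set  */
    extern void record_write(void *page);   /* add to this thread's write set */
    extern long page_size;

    static void segv_handler(int sig, siginfo_t *si, void *ctx) {
        (void)sig; (void)ctx;
        void *page = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(page_size - 1));
        switch (page_state_of(page)) {
        case UNTOUCHED:                 /* first access: treat it as a read; a
                                           write will simply fault once more */
            record_read(page);
            set_page_state(page, READ_ONLY);
            mprotect(page, page_size, PROT_READ);
            break;
        case READ_ONLY:                 /* first write to a page we have read */
            record_write(page);
            set_page_state(page, READ_WRITE);
            mprotect(page, page_size, PROT_READ | PROT_WRITE);
            break;
        case READ_WRITE:                /* fully unprotected, so this fault is a
                                           real program error: abort this thread,
                                           which is really a process, so the rest
                                           of the program survives */
            _exit(1);
        }
    }

    static void install_barrier(void) {
        struct sigaction sa = {0};
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
    }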
So it actually is arguably better than sequential in this case -- in
that the program will actually survive. So if you have a thread, imagine you
spawn a thread and the thread immediately de-references null. It will be as if that
thread had no effect. Right? Because none of its changes will ever get
committed.
>>: [inaudible].
>> Emery Berger: Yeah. It's an interesting question. Yeah, yeah, yeah. So we
thought about this. We said do we want to enforce the semantics where when
bad things happen we just abort everything or not? And it's a flag. It's trivial.
We basically either say, you know, basically we have died and you send the
signal to your parent and you basically return, you know, in Unix terms you
return negative one instead of zero. It discovers that something bad has
happened and everybody fails. Or you just fail silently. And this is really a policy
decision.
>>: [inaudible] get the same segmentation violation in a sequential execution?
>> Emery Berger: Yes. Absolutely.
>>: So I don't understand the issue.
>> Emery Berger: I think that Cartic [phonetic] is concerned about repeatability
or preserving the semantics of execution. And I think it's an interesting question
only -- well, not only but especially to me in this issue of if I've got code and it has
some sort of an error, what do I want to do about it? And in sequential case the
normal thing is you crash. Right? And if you have a multi-threaded case, the
same thing happens. Something bad happens and -- but it happens at an
undetermined time. It would be easy to tell -- like I said, the current runtime
system actually behaves more nicely, but it's literally a matter of saying return
zero or return negative one. Yeah?
>>: [inaudible] your comment before about the negative effective [inaudible] into
a live -- I mean, silently skipping it seems worse than -- okay.
>> Emery Berger: So it's not turning deadlock into livelock, right? Okay.
>>: No.
>> Emery Berger: Okay.
>>: In the sense that you're taking -- at least deadlock you know your machine's
frozen.
>> Emery Berger: Yes, absolutely. Absolutely.
>>: So now you're taking something where you would have known it crashed
[inaudible] it crashes.
>> Emery Berger: Right. So you know --
>>: I'm surprised that you're so willing to [inaudible] policy decision not want --
since you seem to not want [inaudible].
>>: I'm sure it has to do with performance.
>> Emery Berger: No, it does not have to do with performance. No, I mean
actually, you know, I thought that this was -- I think that this is a philosophical
argument. I guess I could put on my Martin Rinard hat here and say, you
know, if you have a code and the code is going to abort, what do you want to do?
Do you want the code to abort or do you want it to try to continue to execute?
And it depends. Right? This is one of those classic situations where in some
cases you really want fail stop behavior, and in other cases you don't.
>>: [inaudible] surprise to see you [inaudible].
>> Emery Berger: I'm so cavalier. Yeah. You know, I mean -- we can be -- we
can be as anal as we want, and we can crash all the time, and that's not a
problem. Okay? It's not really something that I focused on, but Cartic asked this
question, so. Okay.
So some results. The results here are mostly to show you that this is promising.
There's some surprising results. They were certainly surprising to me. So let me
explain this graph.
This graph was meant to see what the overhead was of using processes instead
of using threads. So we're at a very simple benchmark where the code just, you
know, it spawns a bunch of threads, the threads do some work, it's an eight --
running on an eight core machine and so basically -- and it spawns eight threads.
Okay?
So the top line which is P threads, everything is normalized to P threads.
So here this means that Grace is performing better than threads. It's running in
less time. Okay? And as the granularity increases on a log scale, right, yeah,
yeah, bear with me, so as the granularity increases, gradually Grace eventually
sort of asymptotes out, which makes sense, right? You've got a process
and a thread. You would think, as I thought, that, well, there's the initial penalty
of spawning a process because you -- you know, you shoot down the TLB. You
have -- you know, you have to copy file handles, you have stuff that you have to
do.
And then as the thread gets longer and longer, it would be amortized. Only that's
not what happens. Okay? What happens is that for short lived threads, you
actually got better performance. And it came basically -- this doesn't happen if
you do, you know, one thread at a time. But it happens when you do eight
threads. All right? So my student came and gave me this graph which as you
can imagine I expected this bar to be up there and not down here. And I said, all
right, I've done this to the students several times now and I really need to start
learning. I said you did something wrong, go find out what you did wrong and fix
it. Right? Like you got the graph axes backwards or something. And he ran it
again. He got the exact same results.
And I said all right. Let's try this. I'll run it myself. So I ran the code and got the
exact same results. And I said huh. Okay. Something is very, very wrong,
right? It's like I'm on candid camera. Something weird is happening. So it turns
out we did everything. We said all right, let's look at the performance counters,
let's see what the hell's going on. Eventually my student who's very, very good in
the Linux kernel dove into the kernel to see what the hell was happening.
And so it turns out that it's actually the Linux kernel that's doing it, but I spoke to
a distinguished engineer at Sun, and Solaris does the same thing. And so I feel
like, all right, it's not just an artifact of 14 year olds writing code. Okay? And the
problem here, this is the idea. Linux has this intelligent thing that it does when
you spawn threads. If you spawn threads and you don't say that there's a
specific CPU that they have affinity to, then it keeps it on the same CPU where it
was spawned for some period of time. The idea is, you know, the cache is all
warmed up, it's a thread, it shares the same address space, let's optimize for that
case and then as the thread runs for longer we'll load balance it off. All right?
With processes, processes don't share address space. They should be
distributed immediately.
So in fact, this actually works in our favor because the processes are
immediately load balanced. They're spewed across all available cores as fast as
possible. And so that's why we get a speedup.
So I'm certainly not advertising this as some technique to improve the speed of
your programs, it was intended to just measure what the overhead was of using
these processes, and it turns out that there really is, effectively, because of this
artifact, not even zero overhead -- it's negative overhead. Which is not bad. Yes?
>>: Most fork-join parallelism systems will have some set of worker threads
whose startup time is amortized once.
>> Emery Berger: Absolutely. Absolutely. So if you think --
>>: [inaudible] in this graph?
>> Emery Berger: So I mean if you were to use Cilk, right, so these threads -- I
mean, it would basically be, you know, consider the situation out here at the
extreme edge, right where the thread has been running for a long time, here it's a
second. Right? All of the startup overhead is gone, everything has been load
balanced out, you know, this is the standard sort of, you know, kernel threads
that are sitting here with their own, like, deques, right? And so, you know, you end
up at the same place. And here the axes are coincident. So I think that's where
you'd be. Okay.
>>: When [inaudible].
>> Emery Berger: So I mean one millisecond I think that it would be -- you know,
it might be around, I don't know, I mean right here the processes are faster
because they're load balanced. I would expect it to be about the same, right, that
the gap would go away. You know, if they're already running for a -- if they've
already been running a long time, then they've already been load balanced out,
right? So they're all spread out.
>>: Mentioning the overhead of actually [inaudible] I think that Dave is asking is
that if you have these worker threads stick around, the [inaudible] of starting new
threads is actually --
>> Emery Berger: Absolutely. Absolutely.
>>: [inaudible] still have the [inaudible].
>> Emery Berger: That's right.
>>: [inaudible] cost of starting up a new --
>> Emery Berger: That's right. That's right. And well, I can talk about this some
more when I get to some of the other results. I'm going to talk a little bit about
Cilk. But you're absolutely right. I mean, if you think of a framework like Cilk or
essentially its follow-on, which is Intel Threading Building Blocks, the idea is you
have some set of kernel threads and then thread spawns really very, very
lightweight continuations. So there's no OS involvement at all. So they're quite
cheap.
And the goal there was to optimize fine grain thread behavior. Okay. So that
said, all the issues of process creation and stuff with the exception of very, very
fine grain stuff, which I can talk about in a minute, are really not the overheads of
Grace. The overhead of Grace is about rollback. And so we wrote a little
microbenchmark that just simulates the effect of different rollback rates and as
the rollback rates increase, unsurprisingly performance declines. The ideal is
eight -- the dotted line here is with P threads which of course never roll back, and
then here you get Grace so as rollbacks increase you eventually start heading
more and more serial.
The good news is that you can withstand some number -- some percentage of
rollback rate and still get reasonable speedup. But, you know, all of the steps
that we take to eliminate false sharing are really crucial. And if you have true
sharing of course, it's not a good situation. Yes?
>>: So when you're comparing your --
>> Emery Berger: It is apples to oranges in this sense. Here -- it's the exact
same program, okay. One of them is using P threads. We're using locks to
modify some shared data.
>>: Well, you already --
>> Emery Berger: Yes. Yes. And so Grace ignores the locks and just treats the
thing as sort of a transaction.
>>: So they are computing the same thing?
>> Emery Berger: They are.
>>: Like the P threads just --
>> Emery Berger: No, no, no. It's the exact same program. Absolutely. And in
both cases these are the same programs. Just one is linked with P threads,
one's linked with Grace. Okay.
So, right. So here's my disclaimer. So these are the benchmarks that we're
using. It turns out that it's actually quite hard to find benchmarks that exclusively
rely on fork-join. A lot of the classic, also crappy, benchmarks like SPLASH rely on
barriers, and barriers, while they are conceptually at a high level like fork-join,
they actually break up the threads. So it doesn't make sense for them to be --
you know, if you think about it, you've got thread one and thread two and they
have a barrier, they join and then you have the next threads. If it had been
spawn two threads, wait for them, spawn two threads, wait for them, that would
be a perfect fit for Grace. But with a barrier breaking them up, the sequential
composability thing doesn't work. All right?
But so these are drawn from two sources. Mostly they're from a suite of
benchmarks called Phoenix which are meant -- they're the benchmarks that are
used to test Map/Reduce, a shared memory version of Map/Reduce. Matmul
here is from Cilk. It turned out that the Matmul was the only one of the Cilk
benchmarks that scaled at all with P threads. Because the Cilk benchmarks are
designed to stress super, super fine grain thread creation.
So we subsequently came up with a clever hack that allows Grace to run very
well even for those, but I'm not going to present those results. So the blue bars
here are P threads. The red bars are Grace. These are speedups on an eight
core machine. You can see that in some cases you get super linear speedups
and these are cache effects. The good -- you know, the good thing to see of
course, this is a little bit surprising. And the reason this happens basically is that
lock overhead goes away with Grace. So if they -- if the memory spaces don't
conflict and you have no rollback and you also have turned the locks into no-ops
then you get a performance benefit. Which is nice. Not what we would expect in
practice.
So the speedups are good. I have to note, though, we had to make some minor
modifications to these programs. The bulk of the modifications, that's the mods
that we had to make to most of them, were this particular issue, which is, you
know, as I described before, Grace preserves the order in which the thread spawns appear in the
code. But if you do this, spawn a thread, update a global variable, then spawn a
thread and update a global variable, Grace will say, oh, I have to do the whole
thread execution first and then do the update of the global variable, then I can
spawn more threads.
So it serializes all the thread spawns. So that was a real killer. So all we did was
we hoisted that out of the loop. So we put these all in an array, spawned
everything into an array. They each had separate result fields or something and
everything worked out fine. So that was the most important thing.
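Concretely, the change was along these lines (illustrative pthreads code rather than the benchmarks' actual source; the commented-out version shows the problematic original shape):

    #include <pthread.h>

    #define N 8
    extern void *worker(void *arg);   /* illustrative */
    extern long results[N];           /* one result slot per thread */

    /* Problematic shape: touching shared state between spawns forces Grace to
     * serialize every spawn behind the previous thread's commit:
     *
     *     for (int i = 0; i < N; i++) {
     *         pthread_create(&t, NULL, worker, &args[i]);
     *         global_total += something;   // shared update between spawns
     *     }
     */

    /* Hoisted version: spawn everything first, give each thread its own result
       slot, and only combine the results after all the threads have joined. */
    void spawn_all(void *args[N]) {
        pthread_t tids[N];
        for (int i = 0; i < N; i++)
            pthread_create(&tids[i], NULL, worker, args[i]);
        for (int i = 0; i < N; i++)
            pthread_join(tids[i], NULL);

        long total = 0;
        for (int i = 0; i < N; i++)
            total += results[i];          /* sequential combine at the end */
        (void)total;
    }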
If we extended the P threads API with like, you know, create multiple threads,
then that problem would go away. Okay. So here Kmeans is kind of an
interesting case. Kmeans has a benign race, so Kmeans the well known
clustering algorithm. It has a benign race where it throws stuff. It basically is
counting how many things belong to a particular cluster. It converges. So
the race is not really a problem. It's just a number, so it's an atomic value. But
for us, this is a killer because Grace sees these things and doesn't know how to
distinguish a benign from a malignant race and sequentializes everything.
Okay.
>>: [inaudible].
>> Emery Berger: Yeah?
>>: [inaudible] that in the same way, that is instead of putting all the values in the
same place to --
>> Emery Berger: Yup?
>>: To put them in --
>> Emery Berger: Yes, so --
>>: The end?
>> Emery Berger: That's right. So one of the things you know, we just decided,
you know, for our purposes we said we're not going to make pervasive
modifications to these codes. Clearly we could have changed Kmeans to avoid
this problem, right? But I said, well, you know, let's see what we can do with
local changes where we barely understand what the program is doing. Okay?
And that was really the goal. So in fact, one of my students was like, ah I think
we can fix Kmeans by doing X, Y, Z. And I said no, no, no. That's not the object
here.
The object here is to come up with minimal changes. And in fact what we're
doing now is we're trying to come up with ways of avoiding even making those
minimal changes, right? So the goal here would be you don't change your code
at all and it still performs well. Right? Okay. All right. So that's the benign
race. All right. That's basically it. So Grace I think again this is really about this
sort of idea of garbage collection for concurrency in a way, right? I want to get
rid of all of these errors just like I used to get rid of dangling pointer errors with
garbage collection and that turns out to be a big win for a number of reasons.
Getting rid of these errors may be more costly than running on the bare
metal all the time. There are going to be different ways of programming
potentially. But it seems like a laudable goal.
I'm -- you know, these are promising initial results, right? So if you use a cleverly
tailored algorithm to implement this kind of GC for concurrency, then you actually
can get pretty good performance. You know, I should note, you know, this is sort
of transactions, right, but there's no overhead on the main line, right? So you
have these very coarse grained transactions. There's no logging. There's no
compiler involvement. There's no read barriers beyond the initial have I read this
page once, and all of that overhead is amortized. So that's how you get this
performance.
And there's hopeful Wall-E. Yeah, happy to take more questions. Manuel?
>>: This is very nice. One [inaudible] your benchmarks is that you started with
benchmarks that are already parallelized.
>> Emery Berger: So this is a story of --
>>: [inaudible].
>> Emery Berger: Multithreaded [inaudible].
>>: Make everybody be able to use threads and so on, so I would have to start
really with this sequential program and start putting spawns and syncs into
places, right?
>> Emery Berger: That's right.
>>: And now the question [inaudible].
>> Emery Berger: Yeah. I mean, I think, you know, there are deep problems,
right, with writing concurrent code, beyond just the errors. Right? So getting it
right is hard enough. But getting it to scale is a whole other question. So this is
one of the reasons why I personally remain skeptical of the sort of vision of
automatic parallelization is that it's very, very difficult -- it boils down to algorithm
design, right? Automatic algorithm synthesis, right?
So I'm going to take, you know -- you go ahead and you write your sequential,
you know, random sort algorithm that works like this, randomly permute
everything and see if it's sorted. Okay? I will discover that that's inefficient, and
I'll turn that into quicksort. To me that's -- it's tantamount to, you know, this sort
of you start with the sequential program, I'll generate a parallel program that
scales. And so I -- you know, given that that seems very, very far off, if it's even
possible at all, I figure start with the concurrent program. But the concurrency here
is you spawn stuff.
But you're still required -- there are two -- I mean, the bad part here is there's an
additional requirement. Because not only do the data structures largely have to
be distinct, they actually have to be distinct at sort of page level grain, which is a
drawback of this particular implementation.
But in the end, you have to break up data sharing. I mean that's part of the
whole goal. The especially bad part about this, though, is that one shared -- one
conflict can cause an entire rollback and that's actually something that we're
working on. We have -- I'd be happy to talk to you offline about this. Yeah,
Carter.
>>: So you started off by [inaudible] all the problems in concurrent programs like
[inaudible].
>> Emery Berger: Maybe not all.
>>: Well, okay. A lot of them. So do you have any -- and you get the [inaudible].
>> Emery Berger: Yes.
>>: So do you have any data on how -- what frequency of these [inaudible] this
classic program?
>> Emery Berger: Oh, this is a good question. So I mean I don't have, you
know, data data. I have anecdotal evidence. So the Cilk folks have been
working on this for a very long time, right? So Cilk is a fork-join model of
parallelism. And they have written I think three different race detectors. And in
one of the papers where they talk about it, they call it the Nondeterminator. And in
one of the Nondeterminator papers they say oh, they gave all these programs to
a bunch of -- programming assignments to a bunch of MIT undergrads. And
almost all of them had race conditions.
So that's -- that to me is suggestive that it's not that easy, given that MIT
undergrads as a whole are not too bad. So Dave?
>>: Some of them [inaudible] was surprising and astounding things that you can
get these sort of results at page level granularity of false sharing.
>> Emery Berger: Yeah.
>>: I remember reading the paper. And did you do something in matrix
multiplying to make sure there were like a block matrix multiply and you --
>> Emery Berger: Yeah, yeah, yeah. So -- yeah, so the one change that we did
in matrix matrix multiply which is I guess the most pervasive change was the
base case for the matrix matrix multiply was some arbitrarily chosen number. It
turns out actually -- so it was like I forget, 16 by 16 or 32 by 32. And we initially
fingered that as being a potential scalability problem because, you know, it's a
very, very fine, but in fact it turns out to have been tuned for a particular cache
configuration. So we made it bigger. And making it bigger actually improved the P
threads based code but it also helped us because it meant that the blocks were --
it was basically 4K by 4K or something. The result was that you were accessing
things in page size chunks at the base. And the way that the memory allocator
works, the memory allocator when you allocate a large object, it guarantees that
it starts on a page boundary. So it makes the math very easy. So you know for
sure this starts at a page so I know that this index range to this index range
always lies on a page.
>>: So [inaudible] a little bit of knowledge about I'd like to make things like a
page size [inaudible] the code you --
>> Emery Berger: That's right. I mean, the real problem with things like matrix
matrix multiply is that you're dealing with arrays. And you have very limited
freedom to move arrays around. With other objects we actually have a lot of
leverage. So one of the things we can do so we have a prototype that does this
right now. You run your program. It detects false sharing and then it says here's
where the false sharing happened.
So with globals, this basically means, you know, oh, add some padding. With
heap objects it means segment things from different call sites, for example.
>>: [inaudible] about to [inaudible] languages?
>> Emery Berger: Doing what?
>>: [inaudible].
>> Emery Berger: Yeah. Yeah. We have. Although -- yeah, I'm not sure if
you're on my schedule, but yeah, we -- I think, you know, this is certainly --
there's nothing that precludes this from being incorporated in a GC language.
And GC gives you extra advantages as you described. And the ability to move
objects around to sort of shuffle them out or say, you know, this is conflicting and
then let's tease those apart would make things tremendously easier. So we're in
a worse situation because we're in C and C++ than we would be if we could
actually move things. Yeah?
>>: Just [inaudible] suppose that you have a binary tree where the leaves of the
tree happen to sit in different pages. And then you're marching down the
tree and you're spawning the threads with every internal node and as you're
getting to the bottom at that point you update at least the [inaudible]. In that
situation, are you going to do -- you start [inaudible] are you going to do N or 2N
commits? That is, do you commit with every level of the tree --
>> Emery Berger: So each -- so let's see. So if I spawn the -- I'm trying to think of
exactly what you're asking. I mean, the commit points are directly related to the
number of threads and when they spawn a thread. So when you spawn a thread,
that's a commit point. And when a thread ends, that's a commit point.
>>: Okay.
>> Emery Berger: Yes?
>>: So most of the benchmark from [inaudible].
>> Emery Berger: Yes.
>>: And did you compare the lines of code with the number of lines if you're using [inaudible].
>> Emery Berger: That's an interesting question. So I spoke to Christos
about this. It turns out that using their Map/Reduce implementation
mostly impairs performance. And the number of lines of code is not materially
different, it's actually often larger for Map/Reduce. It's not really a great story for
Map/Reduce on shared memory. I mean, the idea of course is well this will make
it easier to write programs. I mean, I corresponded with them. He basically
thinks it's easier for software engineering reasons and it makes it easier to
reason about the concurrency. But the performance story is not that great.
So in fact, I didn't include any of the Phoenix Map/Reduce based versions for
comparison because it seemed unfair. Because they're all worse than the P
threads version. And P threads to me are the gold standard. Yes?
>>: So this is a solution when you have mostly [inaudible] right?
>> Emery Berger: Yeah.
>>: But how do you kind of extend this when you [inaudible] benchmarks are
more [inaudible].
>> Emery Berger: Right. Right. So that's an excellent question. And I
mentioned I was going to get back to this and now I will. So, you know, one of
the things that we encountered when we tried to do this is I basically had a
student change all the Cilk benchmarks to use P threads. Okay? It wasn't that
hard. It's mostly mechanical. You see, you know, every place there's a spawn turns
into a P thread create and then you have to, you know, save the results
somewhere. That's about it, right? And the syncs become P thread joins, okay.
And then we ran them and none of them scaled at all except for matrix matrix
multiply.
And when I say it didn't scale, I don't mean with Grace, I mean they didn't scale
with P threads. Because they were so coarse. Okay? So then I couldn't include
them and I was like damn, this sucks, right, this is a disaster.
Well, it turns out we can run them now. And the way that we run them is through
again either, you know, devilishly clever, you know, application of genius or a
crappy hack. You know, depends on the eye of the beholder. The insight
basically is this. If you're doing nested, you know, this nested fork join, sort of
nested divide and conquer just the way Cilk works, that means that you spawn a
thread, you spawn another thread, they spawn threads, they spawn threads, they
spawn threads, right? So we reasoned, you know, once you spawn enough
threads and the number of threads exceeds the number of cores by enough, just
don't spawn anymore, just run the stuff.
So we used the nesting depth, and we say all right, two to the nesting depth.
Once two to the nesting depth is greater than twice the number of cores, we don't
spawn anything else. You just run the function. And that means now we can run
-- even for small versions of things like fib, we actually scale now. And
you can run almost as fast as Cilk. Because at some point there's no [inaudible].
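The cutoff itself is simple enough to sketch (the names here are illustrative):

    #include <unistd.h>

    extern int current_nesting_depth(void);   /* depth of nested spawns so far */

    /* Decide whether a spawn should really create a new thread (process) or
       just run the function inline in the caller. */
    int should_really_spawn(void) {
        long cores = sysconf(_SC_NPROCESSORS_ONLN);
        /* Once 2^depth exceeds twice the number of cores there is already
           plenty of parallelism, so run the rest of the subtree sequentially. */
        return (1L << current_nesting_depth()) <= 2 * cores;
    }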
Now, this doesn't work if there's load imbalance. Right? If there's extreme load
imbalance so that one of the cores, you know, has all the work and all the others
are idle, you know, in Cilk that gets resolved because of work stealing. For us it
would not get resolved. There are possible approaches that we have not
implemented. One we could detect when this is happening and roll back to the
spawn point and then really spawn things. Or you could learn for subsequent
invocations not to do that. Yeah?
>>: Do you have another [inaudible] right, you have two [inaudible] and if you are
[inaudible] very fine granularity then you are [inaudible].
>> Emery Berger: Okay. So I'm not sure what you're asking. In the --
>>: So.
>> Emery Berger: I mean when you say fine grain, I think fine grain means short
lived threads. But are you saying something else?
>>: [inaudible].
>> Emery Berger: Sure.
>>: So [inaudible].
>> Emery Berger: Yeah. Yeah.
>>: Then you have to [inaudible].
>> Emery Berger: Okay. So if you write a program that does this, right, spawn
eight threads, wait for them, spawn eight threads, wait for them, spawn eight
threads and those eight threads do hardly anything, then the approach I just
described will not work. But if you do recursive divide and conquer where you
say I'm going to carve this big space into half and then half and then half, and
then the leaves are very, very fine grained computation, what we do is we turn
that whole sort of forest below a certain point, the whole tree rooted right here, into
one big task.
And so it works great.
>>: So there's some underlying assumption that the data associated with that
tree is contiguous and disjoint --
>> Emery Berger: Disjoint. Contiguous not required.
>>: Okay.
>> Emery Berger: That's right. And that the conflicting -- potentially conflicting
pieces are disjoint.
>>: Right.
>> Emery Berger: Right. Yeah. And if that doesn't happen, then it's a problem.
And, you know, that's a threat.
You know, there's a tension between how long you make these threads last and
how -- what the likelihood of rollback is. So one of the things that we've done
that's not in this paper is we've implemented a checkpointing. And the
checkpointing is even more crazy because what we do is we basically -- here's
the very bad case that we didn't want to have screw us. Do a whole bunch of
work, update a global variable, the end. Okay? Right? And then oops, we have
a conflict, rollback, waste the whole computation. Right? We do not want to do
that.
So we have a way around it, and the way it works is we periodically take a
checkpoint by calling fork again and we leave that forked child as a sort of place
holder sitting there. So we fork, we fork, we fork, we get to a certain point and
we discover a conflict, we roll back to the last fork. And the fork is reawakened.
If we successfully commit, we just kill all those children.
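A very rough sketch of that checkpointing trick, with the signaling and bookkeeping heavily simplified (an illustration of the idea, not the actual implementation):

    #include <signal.h>
    #include <stdlib.h>
    #include <unistd.h>

    static pid_t last_checkpoint = -1;

    static void wake(int sig) { (void)sig; }   /* just interrupts pause() */

    /* Periodically fork a child and park it as a placeholder checkpoint. */
    void take_checkpoint(void) {
        signal(SIGUSR1, wake);
        pid_t pid = fork();
        if (pid == 0) {
            pause();         /* child sleeps here as the saved state... */
            return;          /* ...and resumes from this point if woken */
        }
        last_checkpoint = pid;   /* parent keeps speculating */
    }

    /* On a conflict: wake the placeholder instead of redoing all the work,
       and discard this speculative execution. */
    void roll_back_to_checkpoint(void) {
        kill(last_checkpoint, SIGUSR1);
        _exit(EXIT_FAILURE);
    }

    /* On a successful commit the placeholder is no longer needed. */
    void checkpoint_no_longer_needed(void) {
        if (last_checkpoint > 0)
            kill(last_checkpoint, SIGKILL);
    }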
So, yeah.
>>: So [inaudible] successful I think was that first we had the [inaudible] said
okay forget the [inaudible] and we have infinite memory.
>>: Yes.
>>: That's your model and now [inaudible] right?
>>: Yeah.
>>: Now, if this were the [inaudible] you spend most of the talk about the
technique how do you implement it, but at the beginning what's the model? So
your model is, well, all these things are no-ops, spawn, sync, and so on, and
sequential semantics, is that the right model that [inaudible] same kind of
research that happened in GC to actually succeed or do we still need to treat the
model --
>> Emery Berger: Okay. So I'm by no means convinced that this is the end at
all. But what we were striving to come up with was what does it mean for a
concurrent program that has bugs to be correct? So for a program that has
dangling pointer errors to make it correct you turn off delete. Right? And you
have this infinite memory abstraction. And so that's very clear, easy to
understand, easy to explain and, you know, completely solves the problem.
The question of what makes a concurrent program correct seemed to be a
difficult one. All right? What is the correct version of a concurrent program. And
so this is what we ended up with. We said let's find something where there's an
isomorphism to sequential programming. And that's the correct program. But
clearly this doesn't include all possible forms of concurrency. Right? We've
restricted ourselves to fork join. And so it's a question whether we want to
extend this to include condition variables and signaling, for example, but, you
know, what is a correct program now that you basically have injected message
passing, right? I mean, you know, condition variables are just messages. And
you know, the correct version of a message passing program, I'm not sure what
that is. So but it's a great question. Certainly one worth discussing over beer.
All right. Thanks everybody.
[applause]