>> Ben Zorn: It's my pleasure to introduce Emery Berger. Emery Berger is a professor at the
University of Massachusetts in Amherst. Got his Ph.D. at the University of Texas. He's been
working in the areas of performance, parallelism, concurrency, as you see in this talk today. He's
also working on security and predictability of systems. He's got a wide range of skills, interesting
ideas. You've seen him here many times before. So happy to have him speak again.
>> Emery Berger: Thanks, Ben. I'll add correctness and operating system stuff and turtles. So
you can go to my Web page and see what I mean.
Okay. So let's see. So I'm not going to bore you with this. Right? So this is the whole multicore
yada yada yada. Everybody's on board, right? Good.
So it's bad, right, why is it bad? So it's bad for a variety of reasons. For correctness, right, it's notoriously difficult to get right. You have races. You have atomicity violations. You can have deadlock, order violations. People are actually discovering new ways of talking about ways your program can be wrong, which is kind of fun and exciting.
But, of course, getting the multi-threaded program correct is really only half the battle. You don't
write a multi-threaded program just to get the thing right, just to overcome these errors, otherwise
you wouldn't have written a multi-threaded program. You actually do it, of course, for
performance. This is a plug for some other work that addresses all of this stuff called Grace. I'm
not going to talk about that much today.
So you've got this multi-core rocket and you're trying to make something fast. If it's not fast it just
doesn't matter. If you have a multi-threaded program and it doesn't take advantage of your cores,
then who cares? You did something wrong.
So how can it go wrong? So there's a bunch of ways your program can go wrong in a
performance sense. So you can have too many -- here you can have too few threads. So too
few threads means say you have eight cores and you have four threads.
Clearly you can't take advantage of all the cores. You can have too many threads. So you can
have hundreds or thousands of threads. You can be spawning threads all the time, and you pay
the overhead of scheduling and context switches and you pay the overhead also of thread
start-up and thread tear down. So this is like a very short primer for those who have not had to
deal with performance tuning concurrent programs.
What else can happen? You can have contention. So that's a lock. So if you have one big lock,
you have a problem, right? Because all the threads are constantly contending for this lock.
Only one can get it at a time. Only the big dog, right? So only one can get it at a time, which
means everything becomes sequential. So that's no good. So solution: lots of locks. Except that's bad, too, because in fact lock acquisition and lock release is a relatively expensive process, and atomic updates are fairly expensive as well.
So you can see that there's all of these issues with performance tuning, concurrent code. But the
good thing about all these issues is that they're sort of obvious in the code. Right? Or they're
easily -- they're easily exposable with existing tools. So if you see one big lock and you run some
performance tuning thing, you'll discover, hey, there's a ton of contention on lock 12.
All right. Likewise, if you have this death by small cuts, you'll see thousands and thousands of
lock acquisitions. You look in your code. You'll be like, okay, I see those locks.
All things that you can address. Oh, man, that's really dark. So it is appropriate, I guess. So
there are more insidious errors. And the insidious performance error I'm going to talk about today
is false sharing.
So why is false sharing insidious? As opposed to these others, which are just bad. So it's insidious because you can end up sharing something that you didn't really realize you were sharing. So how many of you have seen the movie "Brazil"? This is bad; I'm showing my age, I guess. All of you should immediately rent/stream "Brazil", the Terry Gilliam film, fantastic dystopian quasi sci-fi. Totally up your alleys. Immediately rent it. You're culturally deprived not having rented it.
There's an amazing scene where he's put into this -- he basically joins this company put in this
incredibly insanely tiny office. It's like one-12th the size of a Microsoft intern space.
He gets in. He starts setting things up. He sets things on his desk and his desk starts
disappearing into the wall. And then he's -- that shouldn't happen. Things start falling off. So he
starts pulling on it. Then you see on the other side of the wall there's another guy sharing the
same desk.
And so they're both basically playing tug of war trying to get more desk space. The problem is,
right, he came in and he thought he had a desk but in fact he was sharing it with some other guy,
and he had no idea. This is pretty much what happens with false sharing. Although it's maybe less entertaining.
So in false sharing, you've got the thing the dogs are fighting over, right? This is a cache line.
And cache lines are getting relatively large. 64 bytes or 128 bytes are reasonable cache line
sizes.
And you know to sort of remind everybody of their architecture class, sharing and communication
across core/processors happens at the cache line level.
So you don't share a byte. You don't say I'm going to alter this byte and so on. You actually alter
a whole cache line. If I make a change to a cache line that you have in your cache, your copy
gets invalidated. So this is the whole cache coherence thing.
The problem occurs when I've got a piece of a cache line and you've got a piece of the same
cache line. And as we make updates, secretly performance is going to hell because when I
access my cache line if you've already modified your piece, my piece is now invalid.
So I have to go all the way down to the bus and get the thing. It's like a cache miss all the way
down and then fetch it back again. So, of course, this is one to two orders of magnitude slower
than the direct access that you thought you had, and every single time you make a modification,
you will also have to fetch the invalidated copy all the way back.
All right. So this can cause a program to silently serialize itself. So if I have a program that's doing a lot of updates to shared state, with no locks, right? No visible contention, the program can actually in effect run at serial speed or slower. Right? Because individually these things were actually hitting in the L1 cache all the time; now it's missing all the way down to RAM.
So here are some examples. This is going to be a pretty breezy high level talk, by the way. So I
encourage you to ask questions if you want some details, because the details are not on the
slides.
That's what papers are for, by the way. Yes, I know it's shocking. So this work, by the way, is in
submission. It's a tech report. I'll put it on my Web page when we get back.
All right. So here's some examples of ways you can get false sharing. So one of them is you
have these globals. So in each of these cases I have something and you have something. And
that means I'm a thread and you're a different thread.
So here I make an update to this global. You make an update to this global. Surprise, globals
tend to be co-located. All right. So it's not like it lays them out 64, 128 bytes apart. Two integers,
for example, will almost certainly be laid out directly next to each other in memory which makes it
very likely they'll be on the same cache line.
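[Editor's note: a minimal sketch of the global-variable case just described. The names and loop counts are illustrative, not from the talk. Two adjacent global counters will almost certainly land on the same cache line, so these threads falsely share it even though neither ever reads the other's data:]

    #include <pthread.h>

    /* Two globals laid out next to each other in memory: very likely
       on the same 64-byte cache line. */
    int x = 0;
    int y = 0;

    void *thread_a(void *arg) {              /* "me" */
        for (long i = 0; i < 100000000; i++)
            x++;                             /* invalidates the other core's copy */
        return NULL;
    }

    void *thread_b(void *arg) {              /* "you" */
        for (long i = 0; i < 100000000; i++)
            y++;                             /* ...and vice versa */
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, thread_a, NULL);
        pthread_create(&b, NULL, thread_b, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }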
Similarly, or perhaps a little bit more surprisingly, this can happen, where two separate heap
allocations can actually end up being co-located on the same cache line. If you use the Windows
allocator, this is more or less guaranteed.
And from some point of view this makes a lot of sense, right? I have memory. You asked for
memory. I'll give you a piece. Somebody else asks for memory, I'll give them the next piece.
There are memory allocators, including one that I've written that tends to avoid this. But nothing
eliminates it all together.
Here is, I think, something that's more likely to happen, where you have a class -- I shouldn't say
more likely to happen. It's something that's a little trickier to deal with.
You have a class. One thread is updating one field and another thread is updating another field.
It's also the case that if you just have one big object, and you share out pieces of it to different
threads then those will of course all be next to each other, and therefore could lead to false
sharing.
Finally, and this is something that was actually, has been dealt with long ago in the Fortran world
is array index-based false sharing. So especially when you do automatic parallelization of Fortran programs, which is more or less reasonable, surprising as that may sound, because Fortran is this tremendously easy programming language from an analysis point of view: if you have these arrays and it can resolve where these indices come from, or if you're parallelizing a loop, you can basically do a static analysis and get the chunks that are going to run on different threads to not end up accessing the same cache lines. I'm not really going to talk about arrays today. I'm just going to talk about this other stuff.
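[Editor's note: a small sketch of the array-index case, in C rather than Fortran; the names are illustrative, not from the talk. Each thread updates its own slot of sum, but the slots are adjacent, so every update invalidates the cache line holding the other threads' slots. Padding each slot to a full cache line, or accumulating into a local variable, removes the false sharing:]

    #include <pthread.h>

    #define NTHREADS 4
    #define N 10000000

    static double a[N];
    static double sum[NTHREADS];   /* adjacent doubles: one or two cache lines */

    static void *partial_sum(void *arg) {
        int tid = (int)(long)arg;
        /* Strided loop, as an auto-parallelizer might produce. Every write to
           sum[tid] invalidates the line holding the other threads' slots. */
        for (int i = tid; i < N; i += NTHREADS)
            sum[tid] += a[i];
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, partial_sum, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }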
>>: So isn't it a matter of design? I mean, for example, if you have two threads updating adjacent fields in a record, structure, whatever, that seems like just a bad idea, right? And in Fortran, for example, the standard way to acknowledge that is you have two parallel arrays, and so each array would represent one field of the record, rather than an array of structures.
>> Emery Berger: Right. Right. I mean, is it a bad design? I think that the real question is
whether this is something that's obvious. And I should add that you know this cache line size
business and the way things get laid out also means that on some architectures you won't get
false sharing and some you will.
So there are programs that used to not exhibit false sharing, because the cache line size was 32
bytes. Now it's 64 or 128. And actually, I mean, the heap-based false sharing is definitely an issue that's very difficult to resolve, because this happens below the programmer's level of abstraction. So the programmer just says new. They have no control over where the memory goes.
>>: There are mechanisms like heaps, right? In Windows you could have a separate heap for foo and bar.
>> Emery Berger: You could. That's right. So I would refer you to my classic 2000 paper on Hoard, which discusses these issues in depth. It's a problem, right, if you have separate heaps, because the amount of memory you consume grows with the number of threads.
Unless you have some way of balancing. With Windows heaps you really can't do this. There's
another kind of false sharing that can occur from recycling. I don't want to like go down the rabbit
hole on this. But in short, if I have an object, suppose I allocate something. I'm me. You allocate
something and conveniently they're very far apart. Then you allocate a bunch of stuff and it's
close together, that's good, right? On your thread.
Then you pass those things to me. Then I will end up falsely sharing them with you. And this kind of thing is very difficult to deal with. And you know, I should add that coping with it at the programming language level, for the programmer, is dicey; you really want to, to the extent possible, resolve it at run time, if you can.
All right.
>>: I noticed at some level, I'm curious about your philosophy about the question of should this
all be done automatically or should the programmer have some abstraction and be involved in the
process?
>> Emery Berger: I always lean on the side of automatic. The programmers have it hard
enough. I would rather -- as programming language people, right, in theory we should all be
saying, you know, make life easy for the programmer. Make them more productive. Hide the
grubby details down below. That's definitely my philosophy.
I think there's issues with portability and complexity. It's easy to say put this here and this here
but when you have very difficult programs, this becomes difficult.
Okay. So what happens with previous tools? So there's not much work in this area, actually. Somewhat surprisingly. So one of the approaches is basically to instrument every access. And, you know, serialize all the threads and watch what all the threads do on every single read and write. In theory this will catch everything. It's slow.
>>: To detect what, though?
>> Emery Berger: To detect false sharing. Specifically, yes, for false sharing. Nothing exists to prevent it.
That's a good point. This is a detection story. Nothing to prevent false sharing except the static analysis for Fortran programs that I described earlier, when you're doing auto-parallelization.
One of the big issues is false positives. When you get a ton of reports of sharing, it doesn't really know, for example, in most of these tools, if you call malloc and then you free it and somebody else calls malloc, it sees the same address being used by different threads and it will report that as sharing when in fact it's not. There's no sharing happening at all.
So that's the issue with instrumentation. There is a tool called PTU. Don't ask me what it stands
for. Performance something. From Intel.
This is actually what happens on one of our test programs. You get a dump of all of the cache
lines that were responsible for roughly speaking coherence traffic. And so it does this at the
address level. You get offset something, thread something. And at best it gives you a function,
which is relatively good.
You get a bunch of these messages. It can't distinguish between false and true sharing either.
So really you just get a ton of stuff and you have to sift through it and hopefully find out what
happened.
So that's the state of the art.
>>: So now what do you think? Do you think that reflects that it's a hard problem, or that there's no interest?
>> Emery Berger: Okay. I think it is -- it's certainly interesting. I mean, way, way back, a long, long time ago, there was some company that licensed Hoard specifically because of the false sharing. So what happened was quite interesting. It was a scientific code. It was a boundary element method solver, which turned out to have lots of applications for a bunch of different sorts of physics and engineering problems.
So it would allocate a bunch of memory and it would churn on it. It would do a bunch of computations. It never called malloc in that section of the code. It called a bunch of mallocs, and for a bunch of the execution time, which was a lot, it would spread it around multiple threads and do chunk, chunk, chunk. The problem was it was using a conventional allocator and had massive false sharing.
So the scaling did this. Which wasn't good. And he plugged in Hoard; he had no idea what his problem was. This is the take-home message of all the false sharing stuff. Nobody knows if they have false sharing. They suspect the other things, but false sharing is this invisible performance killer that's very difficult to get a handle on. Even if you go and plug in one of these tools, it just reports a bunch of sharing stuff. It doesn't really tell you here's some false sharing.
So I think people have not had a good handle on even being aware of the problem. And then
how on earth to track it down.
>>: But you know Intel at least ostensibly has made the case that it's providing performance tools
for concurrency, right? They want to improve. They're giving you multicores, want to make sure
you use them. You would think they would have a vested interest in making the quality of the tools better. You can do better than this without rocket science, right?
>> Emery Berger: It's hard to see how, actually. So this is using performance counters.
>>: Right.
>> Emery Berger: It has all these performance counters. It doesn't know about malloc. It doesn't know about fields. It just sees cache lines. So if you operate below a certain level of abstraction, it's very difficult to know what the hell's going on.
It's cheap, right? And it gathers these statistics at very low cost. But in this particular case it's not
very useful.
>>: They have people working there?
>> Emery Berger: Sure.
>>: They could figure out how to raise these things up to more meaningful --
>> Emery Berger: Let me put it this way. I'm going to present what I've done. There might be
other ways of doing these things. But I don't know what those are yet. Especially with low
overhead, which is where I'm going. So here's Sheriff, which is what I'm going to talk about
today. Sheriff has two modes. Sheriff has a detective mode where it will find these false sharing
problems. And it will report them pretty precisely. So it will arrest the culprit. In particular it has
no false positives.
I will say up front, there is the possibility of false negatives, which I can get into some detail on,
but in general it's quite good at catching these things.
And when it catches something, it narrows it down to the object that was the culprit. If it's a
global, it will actually tell you X and Y are responsible for false sharing. There's also a prevention mode. And prevention mode will actually eliminate false sharing altogether, which seems like a bizarre thing. Yeah. So I view this as something -- this is sort of call in the cavalry, I don't know what to do. Maybe it's a very complicated program and you don't want to try to use these detection mechanisms to help you find it and fix it.
You can actually, for some programs, just plug in Sheriff and your program will run faster. But I would recommend this mostly for catastrophic cases, like O.K. Corral kind of situations.
So how does it work? As Ben alluded, there's some magic involved. This magic partially reuses some magic from Grace. So I'm going to talk about the Grace magic and the new extra magic that's involved.
Okay. So how many of you are familiar with Grace? Okay. All right. So there's enough people
who are not raising their hands. So just briefly the idea behind Grace is take a program that is
concurrent, make it run without any errors.
That is, no concurrency errors. If you say print 12 and it should have been 13, can't help you.
But no concurrency errors. It does this through a variety of mechanisms that enforce a
deterministic execution that corresponds to a serial version of your program.
So if you said P thread creates something, then it would be as if you just synchronously called
that function. That's the effect you get out of Grace. You actually can get good performance.
It's specifically designed for fork-join style parallelism. It was not, at least as reported, able to handle general-purpose parallelism. All right. So one of the tricks that Grace relies on -- so here's a multi-threaded program. I'm using syntax which is borrowed from Cilk, instead of the gory pthreads API or the far gorier Windows API. But here this just means create an asynchronous thread.
So F of X starts running. The result will eventually be stored in T1. Likewise with G of Y and T2. And sync means wait for all spawned threads to complete. This is a very simple multi-threaded
program.
What Grace does, and by extension what Sheriff does, is actually convert these threads into processes. So all of these threads will, roughly speaking, be turned into actual calls to fork in Linux. And they'll go and they'll run. Now, clearly this doesn't make sense as I described it, because processes don't share memory.
So we use mmap hackery to provide a shared image, and in Grace you actually keep track of which pages you modified, and there's a sequential commit protocol for actually writing those results back into the shared space so they become visible to everybody.
The beauty of doing this is that every process is totally independent. And so the changes that
happen in one process are isolated from those of the others. All right. So Sheriff extends this mechanism to work for general-purpose programs. That is, programs with barriers, programs with condition variables. Not just fork-join programs.
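[Editor's note: a rough sketch of the threads-as-processes idea just described, not the actual Grace or Sheriff code; the names are invented. A MAP_SHARED anonymous mapping set up before fork gives all the processes one common image to commit into, while everything else stays private, copy-on-write, in each child. The page tracking and the sequential commit protocol are omitted; the twin/diff sketch further below covers that part:]

    #include <sys/mman.h>
    #include <unistd.h>

    /* The shared "commit" space, visible to every process-as-thread. */
    static char *shared_image;

    static void setup_shared(size_t len) {
        /* A MAP_SHARED | MAP_ANONYMOUS region survives fork(): parent and
           children all see the same pages here. */
        shared_image = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    }

    /* "spawn" becomes fork: the child runs f(x) in its own address space,
       so its writes stay isolated until an explicit commit into shared_image. */
    static pid_t spawn(void (*f)(void *), void *x) {
        pid_t pid = fork();
        if (pid == 0) { f(x); _exit(0); }
        return pid;
    }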
So now I'm going to go through an example of the other key mechanism of Sheriff. And so what
we're going to do, I'm going to use this totally stupid multi-threaded example to explain to you
how it works. All right.
The idea is I spawn two threads. They do nothing but write one into X and two into Y over and
over again. It's manifestly not a real program. But it's a good example of what would really kill
you with false sharing. So suppose X and Y are on the same cache line. If you run these
programs, they will definitely run far slower than if you just ran them in sequence, because it will
just be cache miss, cache miss, all the time.
All right. Okay. So that's the threaded version. And the Sheriff version, what we're going to do,
we're going to run these things as separate processes. And here I'm going to say that initially the
values of X and Y were 0. Just for the purposes of this example.
So when this thread starts running, it's going to repeatedly be writing 1 into X. Over and over
again. And likewise, when this guy starts running, he'll be repeatedly writing 2 into Y. Now, note
these are separate processes. So what happens with cache coherence in separate processes?
Nothing. Okay. Nothing. There's no cache coherence because there's no shared memory. So
in fact all of these executions happen privately. So there's no false sharing at this point. So it will
just happily run. These will all be cache hits. 1, 1, 1. Boom, boom, boom. Same thing here.
So this gives you sort of a hint as to how we can prevent the false sharing by using these
processes. All right. So now we've done all the stuff with the processes and it's time to commit
the results. So the threads have ended. So what we do is we take this page -- this is actually a twin. It's a copy of the page before we ever wrote it. So we use page protection, and as soon as the signal handler is triggered, it copies this page and saves it off to the side.
So it goes and it looks for every page that's written. It compares its contents to the twin, and it discovers, hey, this is different. This one is different from a 0. And then it just writes that value. It writes the diff into memory.
Okay. The same thing is going to happen here. It's going to discover that the 2 is different from
the 0. And then it's going to write the 2 in.
Okay. So this actually, in essence, is the prevention story. So you basically run all of your regions of code essentially between lock and unlock, unlock and the next lock, any of these commit boundaries -- if you want to think about it as where there would be edges in a happens-before relationship, that's where these boundaries are.
All the writes happen locally. And then at the end of one of those things they're all committed.
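[Editor's note: a sketch of the twin-and-diff commit just described, heavily simplified, with invented names. Pages start write-protected; the first write triggers a handler that saves a pristine "twin" copy and unprotects the page; at a commit boundary (lock, unlock, barrier, thread exit) the dirty page is compared byte by byte against its twin and only the changed bytes go into the shared image:]

    #include <string.h>

    #define PAGE_SIZE 4096

    /* Called from the protection-fault handler on the first write to a page:
       save a pristine copy (the "twin") before making the page writable. */
    void make_twin(char *twin, const char *page) {
        memcpy(twin, page, PAGE_SIZE);
    }

    /* At a commit boundary, write only the bytes that differ from the twin
       into the shared copy of the page, leaving everyone else's bytes alone. */
    void commit_diffs(const char *local, const char *twin, char *shared) {
        for (size_t i = 0; i < PAGE_SIZE; i++)
            if (local[i] != twin[i])
                shared[i] = local[i];
    }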
Yes?
>>: [inaudible].
>> Emery Berger: Excellent observation. It does. It assumes that the program -- let's put it this
way. It assumes that the observed execution is race-free. All right. If it is not race-free, then it
could -- so if you had something making modification here, another thing making modification
here to the same item, clearly one will win.
So the results are just as undefined in Sheriff as they would be in a conventional multi-threaded
program, with the caveat that you're more likely to see the race.
So imagine that in this program -- so I have a program where, say, this big section lasts for a week. Okay. On Monday, I write X equals 1. On Tuesday, you write X equals 2. On Friday we commit. X equals 2.
Okay. So you will definitely get some sort of a race. And it's going to happen in a quasi
deterministic way. But we're going to assume that your execution for the purposes of this, for the
tool, whether for prevention or detection, you don't observe races.
>>: So that doesn't invalidate any correctness claim, because you're already starting with a semantically ambiguous program?
>> Emery Berger: That's right. You're exactly right. It could, nonetheless, have a practical
impact. Because you could have a race that happens sort of very, very rarely in the real
execution, but because the effects are delayed, the race could become manifest where otherwise
it would be hidden.
It might be a good thing. In fact, we initially were looking at this as a way of detecting data races as well, and it does detect the data races. The problem is it says you had a data race last week. And, you know, here's the object you have the data race on, which is something -- it's sort of useful, but it's not really what you want.
>>: How does true sharing work? I guess in this context? You have --
>> Emery Berger: True sharing is not going to appear here, because in a correct program, when you have true sharing, you have a lock. So when you have a lock, at the end of the lock, the diff is committed. And only one actually gets to do it at a time.
>>: Okay. So when you're inside a lock, only one of these things -- you don't have to spawn
this?
>> Emery Berger: If they're actually sharing, they're sharing X over here. No, no, you don't
spawn new processes.
>>: It's only one process --
>> Emery Berger: Yeah, every thread equals one process. Not more processes. All right. So
this is essentially the prevention story. Yeah?
>>: Is there anything you're doing here that the hardware couldn't do? For instance?
>> Emery Berger: The hardware is helping me.
>>: So you're delaying merging these two different states. Why can't the hardware just buffer the writes on each process [inaudible].
>> Emery Berger: I mean, I think buffering won't work. You would have to do something more like this, because what happens with buffering is you would overflow a buffer. This could last for millions or billions of writes. You'll only record -- the changes are happening here. Suppose it's writing 1 to a bazillion and back over and over again, right? So when it does that, it will just constantly be making these modifications.
If you buffered them, at some point you'd run out of memory. Here the writes just happen directly.
>>: Memory consumption increases with the number of threads.
>> Emery Berger: Here it increases in order of the number of pages modified.
>>: Oh. Then you run a process.
>> Emery Berger: Yes. But it's copy on write.
>>: Oh.
>> Emery Berger: So basically all of the read pages, the pages you haven't modified, are shared.
There's just one copy. But as soon as you modify one, you get your own copy. That's a good
question.
Okay. So that's essentially the prevention story. The detection story is a very simple extension.
What you do is whenever you go to write something into memory, you look at the cache line
around it. So you look at the other contents.
And if the other contents differ from what you saw before, it means that there was some false
sharing.
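[Editor's note: a sketch of that check, with invented names. When a thread commits its own writes, the rest of each written cache line is compared against what the thread saw before; if bytes outside the thread's own writes changed, someone else wrote to the same line:]

    #include <stdbool.h>
    #include <string.h>

    #define CACHE_LINE 64   /* a runtime parameter in Sheriff; fixed here */

    /* 'before' is the saved copy of the cache line, 'now' is the shared copy,
       and [lo, hi) marks the bytes this thread itself wrote. */
    bool other_bytes_changed(const char *before, const char *now,
                             size_t lo, size_t hi) {
        return memcmp(before, now, lo) != 0 ||
               memcmp(before + hi, now + hi, CACHE_LINE - hi) != 0;
    }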
>>: That could manifest as a false negative, right? If a process wrote 1 to X and 0 to X again, you wouldn't see --
>> Emery Berger: That's exactly right. So for the sort of ABA-type situation, you know, you couldn't -- you couldn't, what should I say -- your observation is basically correct: the only way we'll detect things is by a difference in value. If you don't have a difference in value, we won't detect anything.
The prevention story still works the same way but the detection story, if everybody -- if this thread
actually wrote a 0 in here instead of a 1, we're just comparing the contents. And so if the
contents haven't changed, we assume that nothing has happened.
>>: Could you detect that by saying, you know, that process had to do a copy-on-write, so something must have changed?
>> Emery Berger: What we actually do is -- because that's too coarse a grain, because of [inaudible] -- what we do in fact is we actually don't delay until the end. We don't wait for the end of the week. We do this every ten milliseconds. So every ten milliseconds you do a check. And we
use this for two reasons. One, to detect things that change and change back, but, two, more
importantly, to find things that are false sharing examples that matter.
If I do this one time, who cares. Right? I don't want to report this. This would be crazy if I only
did this one time. Because I had one invalidation for a week. Don't care. Okay. But if I have
millions of invalidations over a week, then we're talking something real. So what we do is
essentially sample it by comparing these results over and over again at these intervals.
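[Editor's note: a sketch of the sampling idea; the interval, the threshold, and the helper names are illustrative, not from the talk. Instead of one comparison at the very end, the check runs periodically, and only cache lines that keep seeing interleaved writes get reported:]

    #include <stdio.h>
    #include <stddef.h>

    #define SAMPLE_INTERVAL_MS 10        /* the ten milliseconds mentioned above */
    #define REPORT_THRESHOLD   1000      /* ignore lines invalidated only rarely */
    #define MAX_TRACKED_LINES  (1 << 20)

    /* Hypothetical hook: did this cache line see someone else's write since the
       last sample? (Uses the byte-comparison check sketched earlier.) */
    extern int line_was_interleaved(size_t line);

    static unsigned long interleavings[MAX_TRACKED_LINES];

    void sample_once(void) {             /* called every SAMPLE_INTERVAL_MS */
        for (size_t line = 0; line < MAX_TRACKED_LINES; line++)
            if (line_was_interleaved(line))
                interleavings[line]++;
    }

    void report(void) {
        for (size_t line = 0; line < MAX_TRACKED_LINES; line++)
            if (interleavings[line] > REPORT_THRESHOLD)
                printf("possible false sharing on cache line %zu\n", line);
    }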
>>: How long does it take to do a comparison, approximately?
>> Emery Berger: I mean, it's all based on the number of pages modified. It just does a quick memcmp.
>>: You sample on a regular basis -- could you try to make it more random?
>> Emery Berger: That's a good question. In reality you should do things more randomly but
now -- I think random sampling is the only real -- the only way you can go where you can get a
provable result. But regular sampling works pretty well, too. But you're right, we should actually
approach that.
Okay. So basically that's the story in terms of the mechanism. So, of course, there are complexities; there are some other details I've alluded to. One key detail is that we also hijacked the memory allocator. So we know things about the memory allocator. We deliberately segregate all objects from the same call site into the same set of pages. This means whenever we get false sharing on a page, we know it's from objects drawn from the same call site.
So this allows us to actually identify heap objects by call site trivially. And like I said, it identifies globals by name, through ordinary sort of ELF tracking.
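[Editor's note: a sketch of the call-site segregation idea, not the actual Sheriff allocator; the pool functions are hypothetical. If every malloc call site draws from its own set of pages, then a page that exhibits false sharing immediately names the call site responsible:]

    #include <stddef.h>

    /* Hypothetical per-call-site pool: all objects allocated from one call site
       share a set of pages, and nothing from any other site lands there. */
    typedef struct site_pool site_pool_t;

    extern site_pool_t *pool_for_site(void *callsite);
    extern void *pool_alloc(site_pool_t *pool, size_t size);

    void *tracked_malloc(size_t size) {
        /* Identify the caller; a GCC/Clang builtin, used here for illustration. */
        void *callsite = __builtin_return_address(0);
        return pool_alloc(pool_for_site(callsite), size);
    }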
So performance overhead. So, like I said, it's going to be breezy and quick. But that's okay. So
this is a collection of two benchmark suites. One is called Phoenix, the other is called PARSEC.
The PARSEC benchmarks are bigger. Phoenix is essentially -- they're programs that were meant to show the value of this MapReduce for multicore. So it's a bunch of these little programs.
And then these are a bunch of big programs specifically designed to exercise multicore in complex ways. So the black bar -- this is on eight cores -- the black bar is with pthreads. The yellowish bar, which is supposed to be green, is the runtime of Sheriff in prevention mode. And the red bar is with detection.
So there's added overhead for sampling which slows things down a bit. The interesting things to
see -- so first you see that the geometric mean overhead is very low. But this includes one pretty
big improvement.
The pretty big improvement is this benchmark called linear regression. So I should add we found
false sharing in a number of programs. Here's one. Right?
If you see that it goes faster, there's false sharing. Here things went really fast. So this sped up
by, I think, 11X, which is a lot. And so just the act of using Sheriff fixes the problem. But we then used the detection thing. And it turns out that they were mallocing an object, basically a chunk of memory, and parceling it out to each thread, and each thread was updating those pieces.
So they had no idea. I just communicated this to Christos Kozyrakis, head of the project at Stanford, and said, hey, you have this false sharing problem, and sent them the code. So one-line fix. Just malloc the object separately, with some padding, and then you're good. You can also just increase the size of the struct they were sharing, throw in 64 bytes, and you also dramatically improve performance. So the performance you get once you fix it is a little bit faster than the performance that you get with Sheriff. I think Sheriff's improvement is actually on the order of 9X and when you change the code it's 11X.
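[Editor's note: a sketch of the kind of one-line fix being described; the struct and its fields are invented, not the actual Phoenix code. Padding each thread's slot out to a full cache line (or allocating each thread's piece separately) keeps per-thread updates on distinct lines:]

    /* Before: one malloc'ed array of per-thread accumulators, all adjacent. */
    struct per_thread {
        long long count;
        double sum;
    };

    /* After: pad each entry to a full cache line so neighbors never share one. */
    struct per_thread_padded {
        long long count;
        double sum;
        char pad[64 - sizeof(long long) - sizeof(double)];
    };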
So not using Sheriff is even better than using Sheriff in that case. So that's basically it. We get killed here in ferret, but killed relative to a tool that takes 200X, so I'm not too upset. But as a runtime system you probably don't want to slow down your program by a factor of 4. What's happening in ferret is that ferret has tons of critical sections, which is probably a bad thing, in fact, because it's acquiring and releasing locks over and over again.
But that's particularly expensive for Sheriff. Because what Sheriff has to do at the start of any transaction is protect all memory -- say it's all read-only. Which means a system call, means updating the TLB, and you take protection faults for any updates. So for a lock there's for sure going to be some update. Almost certainly. And then the locks that we use are cross-process locks, which are also more expensive.
So you can see -- although not that bad. Linux is -- I wish -- there's an OS person over here. Windows is very, very inefficient in a number of regards. Its locking is very, very slow. And Linux has these things called futexes, which essentially in the fast path do nothing more than an atomic check.
Even though you're doing -- so it avoids a kernel crossing, which is good. Yeah. All right. So I
hope you're convinced by the performance results and the mechanism stuff. Let me give you a
few caveats.
>>: I'm really surprised by the low overhead. It seems like --
>> Emery Berger: It's awesome, right.
>>: These things don't do much locking and unlocking at all. I mean, almost all --
>> Emery Berger: They do locking. Right, it's a question of how frequently they do it. So
basically there's this big, giant pain that you take when you acquire a lock. But if you then do a
bunch of computation outside of lock, or in the lock, which would be stupid, you amortize the cost
of the pain.
>>: But, okay, it's just an interesting result. Is it a reflection that there's a lot of -- I mean, most of these programs are highly unshared to begin with. In other words, to be scalable, they can't --
>> Emery Berger: That's right.
>>: They can't have too much lock contention.
>> Emery Berger: Yeah, I mean, this is really like -- excuse me, these programs were designed to scale. They weren't written by idiots. They were written by Stanford and Princeton grad students.
Draw your own conclusions. But, no, I think they're all perfectly capable people. I know that they recently discovered a race in one of the PARSEC benchmarks, and we discovered a race that we've not communicated to them yet. But, you know, these things happen, right?
But in terms of performance, they tend to scale pretty well. So they've got that part down. The linear regression story -- you know, they have results in the Phoenix paper. It's like, wow, Phoenix kills on linear regression. Not when you get rid of the false sharing. The MapReduce thing looks great for linear regression. If you go back and put this result in, then that advantage goes away. It was one of their big wins.
>>: So we can fix linear regression by [inaudible] or something else. But when somebody just increases the cache line size, it's all that --
>> Emery Berger: That's right.
>>: So do you mean that basically we have to run Sheriff all the time, or just change the way we
do things?
>> Emery Berger: So well, I don't know about change the way we do things. It would be great if
we could run Sheriff all the time. I would like to get this to run faster. We have ideas on doing
this.
One of the things about Sheriff is that the cache line size is just a parameter. It's a runtime
parameter. So you can easily sort of future-proof your programs by running it with Sheriff and
doing false sharing detection for 256, 512. That's just how much it actually checks for false
sharing.
So I think that that probably makes more sense from a practical point of view. Use this to say, all right, I'm going to make sure that there's no false sharing when Intel comes out with its Jimanji processor that has a 124,000-byte cache line size.
>>: If you think about it, the argument you make along the lines that the overhead is so low on some of these kind of makes sense. The more scalable a program, the less it's going to be doing locking and unlocking to begin with. The less overhead you're going to have. Right? So there's no reason -- if you're string_match or word_count, whatever, there's no reason to run --
>> Emery Berger: There are two things. There's the lock stuff. The other issue is -- it basically is
the size of the transactions in the end.
So if I go and I touch a whole bunch of pages very, very sparsely, that will kill Sheriff, because it will incur a page fault for every one, and the comparison phase -- right now it's serialized. The comparison phase is it goes and scans all the pages looking for diffs, for detection, and for the commit it still has to go through all the pages and compare to see what it's going to write.
So that stuff we could parallelize, and that would -- I believe that's actually the problem in reverse_index; the problem is not the locks. But it's tons and tons of strings. And that's what kills us.
>>: [inaudible].
>> Emery Berger: Yeah, well, yeah, it's certainly embarrassingly parallel, no doubt about that. One of the things -- so the student on this, Tongping, he said to me, hey, we can parallelize it, because they're all separate processes; let's just do multiple threads in the processes to do this.
I said, like, stop, I want to get this actually published because I want it to work first. So I said we'll tune it a little more. The results are still pretty good.
>>: So my question would be, can you think of some improvement in the current, say, operating system services or hardware services which would help us somehow out of the false sharing business altogether?
>> Emery Berger: Right. So there are -- there have been hardware proposals. It's been a while. Actually, Doug Burger wrote a paper many years ago here at MSR about dealing with false sharing and making it cheaper. Essentially moving around this.
But there are some things you could do. I think one of the more interesting questions is, if you look -- if you believe that the future is pointed to by this new Intel processor, this prototype that has no cache coherence, so every core has its own memory and then you commit to a -- there's a shared bus but there's no cache coherence across them. Cache coherence is a problem. It's always been a problem for scaling SMPs.
It's certain to become a problem if we really insist on going up to gigantic numbers of cores. One
of the cool things about Sheriff is that you actually don't need cache coherence at all.
It doesn't rely on cache coherence. Everything is a separate process. So this in effect puts
cache coherence in for you. Pretty cheaply. So if you had no cache coherence, this would
perform the same. I don't think P thread versions would perform as well.
So that's an interesting question. We were thinking about -- part of the problem I have writing
papers on this topic is that if, say, I have this great mechanism. Look, it's a great mechanism. It's
good for this. That's a whole paper. It's also good for this. That can't possibly be a whole paper.
I mean, for me it seems wrong. Oh, yeah, just use this mechanism. Look at how awesome it is.
I don't know. Maybe I should be more LPU focused. I can't imagine.
>>: If you had mechanisms that were finer grained than the page size -- a cache-line-sized way to do protection and things -- would that be helpful?
>> Emery Berger: Finer grained is always helpful. The problem with finer grain, obviously, is you have to store that information somewhere. If you make everything fine grained then it's impossible. So, of course, there's the Mondrian stuff that does it sort of adaptively and at different sizes. If you do everything fine-grained you're stuck; you have to have --
>>: There's some results that basically you can use -- you can use ECC bits. There's things --
>> Emery Berger: Sure. That's a great paper.
>>: That's with stock hardware. Imagine putting the same mechanism into the hardware directly.
>> Emery Berger: Yes, of course. So that paper, which I think is called SafeMem, by Yuanyuan Zhou's group, is a wonderful, careful trick to use the ECC bits -- to steal the bits and, basically, just get some sort of memory safety out of it.
Very, very clever. I highly recommend you read it. I don't know how you would use it in this context. In their context, they deliberately flip the ECC bits so that you get an ECC complaint. And that's this kind of low-level, very lightweight interrupt.
But it doesn't have anything to say about threads, thread sharing.
>>: Can you imagine a language-level thing? So you have volatile, the register keyword -- can you imagine something like that --
>> Emery Berger: It says the opposite, volatile does.
>>: That's right.
>> Emery Berger: I know you know that.
>>: Could you annotate variables and say, look, I know these are going to be -- these are going to be divvied out to a bunch of threads, make sure they don't -- so we don't have to be constantly chasing the next cache line size, panic.
>> Emery Berger: Right. I've not really considered it. I mean, the way that people get around this right now is they actually -- they align things. So there's pragmas or declspec things you can do in G++ or in C++ that allow you to say I want this aligned on a certain boundary. That's useful for SSE things, but it's also useful for preventing false sharing.
So you can tell a field I want this field to be aligned at this boundary, this field to be aligned at this
boundary, but this does not address your concern about being future-proof. It seems reasonable
enough to say, right, I want to declare some sort of conflict. I want to say this field and this field
are going to be shared.
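[Editor's note: examples of the alignment declarations being referred to; exact syntax varies by compiler, and the field names are illustrative. C11/C++11 alignas pins a variable or field to a cache-line boundary; GCC/Clang have __attribute__((aligned(64))) and MSVC has __declspec(align(64)) for the same purpose:]

    #include <stdalign.h>   /* C11; in C++11 alignas is built into the language */

    /* Pin each field to its own 64-byte cache line so two threads updating
       them never contend on one line. */
    struct counters {
        alignas(64) int a;   /* updated by thread 1 */
        alignas(64) int b;   /* updated by thread 2: guaranteed a different line */
    };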
But on the other hand we could probably do some of this with static analysis. Right? Certainly we could have thread-aware static analyses that actually look at sharing patterns. The challenge is coming up with the false sharing things that matter. And I think that some of this stuff that Sumit has done on SPEED, which allows you to figure out how often certain chunks of code are running, might actually be useful as a way to weigh the importance of these interleavings.
>>: Does this work with the Java runtime and such?
>> Emery Berger: So this is all a C++ story. Would it work with a Java runtime? I mean, I have
to imagine it would immediately crash because it's such a gigantic mess. But, I'm not sure it
would help much even if it did work, because it would only discover things that are happening at
the Java runtime level. You'd have to expose the heap. You really need to expose the Java
heap. It needs to be aware of what objects are what, have to connect it up with a garbage
collector. You would have to plug it in.
>>: Garbage collector, separate process.
>> Emery Berger: Maybe that's where it should be. So, yeah, that part doesn't really matter,
right? It will still work.
>>: That's true.
>> Emery Berger: But the real problem is it could report tons of false sharing so you think of the
nursery, for example. If you're doing a nursery and you're allocating objects and you're
constantly recycling the same buffer of memory, all of those things would be considered, well,
potentially could be -- let me take that back. In fact, most Java runtimes nowadays have per
thread nurseries.
So actually that would be okay. So it's only when you do a collection. But you're definitely going
to collect the same regions of memory. It's going to discover things and it's going to say false
sharing.
>>: So you need [inaudible] for say the Java runtime?
>> Emery Berger: I think you'd have to do a lot more engineering. It would be much more
intrusive, because Java is this monolith of stuff. And to get in edge-wise, I think you would
actually have to put in hooks into the garbage collector, at minimum.
>>: [inaudible] the language [inaudible] virtual machine monitor, so wherever you have this abstraction layer between the heaps that you care about and the heaps that the system sees, this translation in between.
>> Emery Berger: Right. So your point basically is as soon as you have some extra thing hiding,
what a heap is, from anything lying below, you would have to get in edge-wise and put in
something. I'm not going to do that.
All right. So that's the end. So there's Sheriff, right. So detection, prevention, reasonable performance, and -- you can barely see it -- Darth said yes, but we want to get rid of the insidious false sharing. Anyway, thanks for coming.
[applause]