>>Todd Mowry: I'm going to be talking about this thing that I thought would be very
interesting and timely, which is how we manage to take a large piece of legacy software
and make it run twice as fast on a quad core. Before I get into that, I thought I would just
provide a little context for where this project fits in with the other research I've done.
The initial projects I did in my group were focused on performance things, in particular
how do we extract thread level parallelism. In fact, that's what this talk is about today.
The other thing we worked on was how do we make better use of caches.
Having worked on both of these things generically, we then focused specifically on how to do it for databases: how do we redesign databases to make better use of caches? That's something I've given talks about here at Microsoft in the past.
So Shimin Chen was one of my students and he was a Microsoft fellow and Microsoft
intern and all that stuff.
Today I'm actually working on two projects. So the first is we call log-based
architectures, where the goal is to detect and fix software bugs on production systems
without slowing down the application. So I gave a talk on this project the last time I was
up here, about a year and a half ago, and I'm hoping to come back sometime in the next
couple months and give you an update on some more results we have on this project.
Finally I'm working on this sort of far-out science-fiction-sounding project that I'll be
happy to talk to you about after the talk if you're curious where we want to physically
render moving objects and build a Holodeck.
But what I'm talking about today is the sort of capstone of a project we did at CMU that
we called Stampede. This was really Chris Colohan's Ph.D. thesis work. This is the first
time I've given a talk that's not on something we're actively working on at this moment.
This is something we completed a couple years ago. The reason I'm talking about it now
is I happen to never have talked about it here at Microsoft and I think it's also very
relevant today with multicore and people trying to parallelize software. Also it's very
timely because people at hardware companies are making decisions right now about
including the hardware support that I'll be talking about. So if people from, say,
Microsoft thought that support was interesting, that might affect the world.
So the starting point for this is multicore. So today we have quad-cores and the
projection is that we'll have more and more cores over time. So we can't crank up clock
rates any more, but we can stamp out more and more processors on a chip. So clearly if
you're interested in increasing throughput, there are more cores to run things on, and it's
good for that. But what about simply making things run faster? One view of the world is
don't worry, everything will be fine with multicore because from now on everyone will
write parallel software and it will all speed up perfectly. So that's an optimistic view.
It would be great if that was true, but I've spent 14 plus years teaching people how to
write parallel software. The empirical data suggest it's difficult to do that, to make it both
correct and fast. So hopefully people will get much better at this, but that's a challenging
thing to do. We don't expect that to be fixed overnight. And there's also this issue of
legacy software. So what about things that are important that have already been written
that aren't parallel or what about pieces of things that maybe they have been parallelized
but there are important parts that haven't been parallelized yet? So that's actually what I'll
discuss today.
Another sort of rosy view of the world is don't worry, multicore is fine because we'll have
these fancy compilers that will automatically parallelize everything so it will all speed up
perfectly. So I've actually worked on projects before where we were trying to do this,
and that was the original motivation of the thing I'm talking about today. It's very
difficult for a compiler to do that. Other than for sort of scientific programs, where the
dependences are very straightforward to think about, this is very difficult, in particular
ambiguous data dependences are a major stumbling block to try to do this type of thing.
So this motivated us -- question?
>>: I often hear don't worry, these chips are used for web servers, and everything is
inherently divided into so many chunks that we can keep these cores busy or they're
running so many processes we're going to keep these cores busy.
>>Todd Mowry: Fortunately, there are enough of those things around already that there
will be some immediate value to having these processors. But I think over time it will be
nice to have more and more things taking advantage of them and not the embarrassingly
parallel work loads. Fortunately, there are embarrassingly parallel work loads, so that
gives us something that's ready to go on all these processors.
So our motivation was what do you do about those other things, what about the things
you care about that aren't embarrassingly parallel, those things do exist.
So the idea behind our Stampede project at CMU was, through a combination of hardware and compiler support, we wanted to create this capability where you could run things in parallel without knowing for sure whether that was safe.
Several other groups, like Illinois, Stanford, Wisconsin, other places, have looked at
similar ideas. And the idea is you create the threads, you hope that there aren't
dependences, but the hardware will monitor whether there really are, and if there are,
things will roll back. It's like a transaction. People have been talking a lot about
transactional memory. But the difference is with transactions you already have
concurrent threads and you're just trying to get them not to trample each other. Here we
start off with serial execution and we know the order in which everything should occur so
if there's any conflict between two things, you know which one is supposed to win. It's
the one that is supposed to have executed earlier. But there's actually a lot of similarity in
terms of the hardware to make the stuff work.
So our initial motivation was the previous slide, which is we're going to automatically
parallelize everything in the world and it will all speed up nicely. We did use this support
to parallelize the SPEC benchmarks and found they did run about 20 to 35% faster,
which is actually respectable if you compare that with typical architectural optimizations.
Usually an architectural idea is successful if you give 10% improvement. Compared to
that it's not too bad, but it's not exactly the linear speedup you were hoping for.
What I'm talking about today is something different, which is rather than this fully
automatic approach, we're looking at, first of all, a much larger and more interesting
piece of software. We're looking at BerkeleyDB, which is a database, part of a database,
and rather than doing something fully automatic, we're doing something semi-automatic.
So the programmer actually is involved. There's this iterative process I'll talk about
where the system tries something, it doesn't work very well, gives feedback to the
programmer, they make a decision to tweak something small, and they go back and try
again.
And I think that this approach is actually extremely promising and you'll see that it
actually worked quite well in this case. So this is what I want people to hear about, this
sort of semi-automatic approach.
So the application that we're looking at, it is BerkeleyDB, so it's a database engine. Now
databases are already run in parallel. What we're trying to do here is run them in parallel
in a new way, which is we want to take individual transactions and split them into parallel chunks.
The way that you typically get parallelism, you get parallelism across separate
transactions. We're still doing that, but we wanted even more parallelism than that, so
we're breaking them up. The reason why I think this is interesting beyond databases is
really you can just think of this as an example of a big, complicated piece of software,
where the thing that we're trying to parallelize was not written with parallelism in mind.
In fact, it was written under the explicit assumption that you would never try to run
pieces of it in parallel, within a given transaction. It is not safe to do that at all.
So if you tried to actually sit down and rewrite the code so that you could exploit
intratransaction parallelism, this would take a long time, it would be painful. You would
have to understand a large amount of code and it would probably take years of effort I
think to do that. I think you'd actually really have to start over almost from scratch with a
lot of fundamental structures. But we're not going to do that.
All right. So switching into -- so now we're going to dive into BerkeleyDB and talk a
little bit about transactions in particular. So here is my student Chris who did the work. I
guess this is his idea of what database users look like. We have our cavemen here who
are writing their SQL code and sending it into this database server. So here come their transactions. And then what happens is we've got this, you know, SQL server, big
DBMS here, data stored. How do we get improved performance? Today what happens
is separate transactions get to run in parallel on different cores, and we get throughput
from that, and that's good for performance.
Now, why do we want to make transactions run faster? Well, first of all, people do care
about latency. If it takes too long to run my transaction, I get annoyed. Also, if one of the
transactions is grabbing a lock, that prevents other transactions that need the same lock
from executing, so that actually does hurt throughput.
So what can we do about transaction latency? If you look at TPC-C, which is a benchmark for transaction processing, the transaction within TPC-C that takes up the most time is something called New Order. So you're placing a New Order. Here
is a pseudocode version of that transaction. You're basically, you have a basket of items
and you're going to update the stock information for all those items.
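Roughly, that per-item loop has the following shape. This is just a sketch based on the standard TPC-C New Order definition; the exec_sql_* helpers are illustrative names, not the actual benchmark or BerkeleyDB code.

/* Sketch of the New Order hot loop: for each item in the order, read the
 * stock row, adjust the quantity, write it back, and record an order line. */
int  exec_sql_select_stock(int item_id, int warehouse_id);
void exec_sql_update_stock(int item_id, int warehouse_id, int quantity);
void exec_sql_insert_order_line(int order_id, int line, int item_id, int qty);

void new_order_items(int order_id, int item_count,
                     int item[], int supply_wh[], int qty[])
{
    for (int i = 0; i < item_count; i++) {
        /* SELECT s_quantity FROM stock WHERE ... (one short query) */
        int s_quantity = exec_sql_select_stock(item[i], supply_wh[i]);

        s_quantity -= qty[i];
        if (s_quantity < 10)            /* standard TPC-C restocking rule */
            s_quantity += 91;

        /* UPDATE stock SET s_quantity = ... WHERE ... */
        exec_sql_update_stock(item[i], supply_wh[i], s_quantity);

        /* INSERT INTO order_line VALUES (...) */
        exec_sql_insert_order_line(order_id, i, item[i], qty[i]);
    }
}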
So here is one query within it. We're going to do this select. One thing that you can do
that databases do today is try to extract parallelism within a particular query. I can take
this select and break it across multiple processors. This idea works really well for
decision support. So if I'm launching queries that are going to take a long time, then I
can get some really nice performance out of this. But unfortunately for transaction
processing, these queries tend to be very short-lived. They don't run for very long, so I don't get a whole lot of performance this way. That's not so great. What I'd prefer to do is actually get parallelism across separate queries within one transaction. So, for
example, I could take different iterations of this loop and try to run those iterations in
parallel. So that sounds good. But this doesn't work. This would break BerkeleyDB and
probably a lot of other systems because they're not written with this in mind. None of the
internal structures like locks and latches and all those things are set up to work this way.
But as you'll see in this talk, the safety net of thread level speculation makes it much
easier to fix this problem. So let me talk just briefly about thread level speculation and
how that works.
So if you imagine some original thread, so this thing is executing in time, and when we
try to run it in parallel, what we're really doing is taking something that was supposed to
occur later and we're running it concurrently with this logically earlier chunk of work, and
we're dereferencing pointers. We want to label all of these with some logical timestamp
and the system does that automatically, so we realize that this one is supposed to occur
before this one.
But sometimes bad things happen. So, for example, we do a load from a particular
address, and we loaded it before it was stored over here. So we saw the wrong value, and the same thing happened here. So this is bad because that couldn't have happened in the sequential execution, so what
happens is the hardware detects this, it notices that this bad thing happened, it squashes
the thread and restarts it. The second time through good things happen. So now the data
dependences get forwarded correctly and it works.
What we see is we use these logical timestamps called epochs to number the different
threads and the hardware is going to detect problems and restart things. This is a little bit
like recovering from a transaction, like I mentioned before.
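Just to make the kind of ambiguity concrete -- this is a made-up C fragment, not code from the talk -- the situation is something like this:

/* Two iterations of this loop conflict only if p[i] and q[j] happen to alias.
 * A compiler usually can't prove they don't, so it must serialize; with TLS
 * each iteration runs as a speculative epoch, and a later epoch gets squashed
 * and restarted only if the hardware actually sees its load conflict with an
 * earlier epoch's store. */
void update_all(int **p, int **q, int n)
{
    for (int i = 0; i < n; i++) {
        int v = *p[i];       /* speculative load                        */
        *q[i] = v + 1;       /* store that a later iteration might read */
    }
}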
Yes?
>>: Is your experience that most of them do not have data dependences between
iterations?
>>Todd Mowry: That's a really good question. And the next couple slides are going to
dive into that, 'cause that actually -- the answer to that I think is very interesting, so... so
hold that thought for one second.
The idea with the mechanism is hopefully in the worst case we run about as fast as sequential execution -- actually we run a little bit slower than that because there's some
overheads. In the best case, if there's no dependences, we get parallelism and speedup.
How often do these dependences exist?
Now, this is going to start addressing Ed's question here. The motivation for this hardware support was actually SPEC-type programs. So, little things where we have
relatively small threads from loops, and in those cases dependences tend to be fairly
infrequent and the threads tend to be relatively small. So if it's painful to recover from
speculation failing somewhere, it's not that big of a deal because most of the time there
aren't any problems.
But when we look at databases, when we look at transactions, parallelizing them, the
threads are much larger. These are maybe hundreds or maybe a thousand or so
instructions, and these are tens and maybe hundreds of thousands of instructions here in
these things. And there are many dependences. There are typically dozens of real
dynamic dependences between these threads, so we see in contrast to the easier case, we
have lots of dependences. So we can't -- it's not okay if we only win when there are no
dependences. We'll never win. They're always there. So we have to find a way to
tolerate the dependences. So I'll talk about that. So we actually learn something new
about how to design the system from doing this. We also have to have much more
buffering to store these larger threads. And in the second part of the talk, I'll discuss how
we address those things.
We also have concurrent transactions that we're speculatively parallelizing, so that has to
work and there's a nice way to make that work, too. Okay, so there are a lot of
dependences. But there's some good news here, which is the hardware is already tracking
dependences. It has to do that to realize when it needs to restart things. And the
information that it's capturing happens to be exactly what you'd like to know if you were
going to fix a problem. So it knows, oh, that didn't work because of this particular load
and store pair. So the idea is when you see a problem, you can actually give feedback to
the user and say, that's the immediate problem right there. This speculation failed
because of this. And if you can fix that, then you'll get to see the next one, and then the
next one and the next one.
So if you choose to, and this is what we did in this study, you can incrementally fix
dependence by dependence. You don't have to make them all go away; each time you fix one, you might see slightly better and slightly better performance. So that's the idea.
Okay. Now, there's one non-intuitive thing that can happen, which is sometimes
eliminating a dependence can make things worse. So in this case we have on the left this
original case where there's a dependence between P and Q, and P triggers a problem first.
So if we eliminate the problem with P, and now Q triggers the restart, we actually restart
later, it takes us longer to realize there's a problem, and we actually have worse
performance. So that's annoying. So one of the things that we did is we fixed this. And
really the fundamental problem is that having all-or-nothing as the mode of execution isn't so nice.
So to fix that one of the new things that we modified in the system is we essentially
added an extremely super lightweight checkpointing mechanism, where we can keep
these different checkpoints such that if something goes wrong, we only have to back up
to the most recent checkpoint. That was really important for making any of this work.
I'll show you numbers on that later. So now we have a more robust system design that
will work much better because of this.
Okay, so this was critical; the relative speedup from having this was really important. With this support, we could get speedups of 1.9 to 2.9.
>>: So do you synchronize the frequent checkpoints across threads? You could have a
case where the thread on the left runs behind and ends up conflicting with something
previous?
>>Todd Mowry: Multiple of these things?
>>: Yes. Can you guarantee that doesn't happen?
>>Todd Mowry: I do have some backup slides on that. But basically we do have this
issue where you have a whole chain of things you have to cancel. I don't know if that's
what you're asking about. We do have a way to -- we can selectively rollback to the most
recent checkpoint in each of the threads if we have a chain of things that we have to
cancel.
>>: Rollback more than one checkpoint and one thread?
>>Todd Mowry: So what we do is we're lazy about rolling back threads. So we don't
roll them back aggressively, so we only rollback once, if at all. We don't rollback
multiple times, because you could imagine discovering more and more dependences and
eventually rolling the whole thing back. So we didn't do it aggressively, we did it lazily.
It's only when we would commit a thread that -- we only triggered it at commit time, so...
this picture looks a little funny. This isn't actually quite accurate. It doesn't happen at the
moment this occurs, it actually happens a little later.
>>: I assumed that. Could I have a violation that would roll me back three of the
checkpoints on the right-hand side?
>>Todd Mowry: Oh, I see. Yes.
>>: Yes, okay. So you have to rollback to any number of them?
>>Todd Mowry: Yes.
>>: Okay, great.
>>Todd Mowry: We recognize -- so in this picture, what it's trying to show is that you
actually -- it's hard to see, but we actually rolled back to this checkpoint. This load was
above this checkpoint. It's right here, and here is the checkpoint. We skipped this one,
went back to this one. We know which checkpoint is the most recent one ahead of you.
>>: Okay.
>>Todd Mowry: Yes.
>>: [Indiscernible]?
>>Todd Mowry: Yes. It's a trivial mechanism for checkpointing that works because
we're already supporting thread level speculation. We bump a counter somewhere.
That's all we do.
Okay, so to make all of this work, there are different layers here. There are transactions
running on top of a DBMS running on top of a system. There's an operating system in
here, too, of course. This is an operating systems company and I forgot to mention the operating system -- I'll be shown to the door now.
In particular, what we're running is TPC-C on top of BerkeleyDB -- because we have the code -- on top of a simulated machine. The hardware support didn't exist.
This involves different people changing what they're doing a little bit. So the transaction
programmer, we don't want them to change anything that they're doing. If they choose
to, they can explicitly indicate where they want to break their transactions into different
threads. In our experiments we simply took every loop in a transaction and ran loop
iterations in parallel, and they didn't need to do anything to do that.
So there are a lot of people who do this, so we don't want them to -- we don't want to put
any of the burden on them. There's a lot more interesting stuff that happens for database
developers. So they do have to change things. That's actually most of what the talk is
about, is how we fix that.
Now, that was the part that took a month. So there's more work to do here than here, but
this still isn't a lot of work. And then there is some work for hardware people to do,
which is to put this into the hardware. That's been sort of very well studied. Right now it's just a value-proposition argument that people are actively debating.
>>: Would you say that TLS hardware research is mostly done or are there new
capabilities you think might pop out? It's hard to predict the future. But you've worked
in the area a long time.
>>Todd Mowry: I don't think anyone's really actively pushing on it right now. I think
it's mostly cooled down at the moment, but it could become interesting again. Yeah, so...
For the moment, it's paused I think.
>>: But in terms of -- do you see significant unsolved problems remaining from the TLS
implementation side?
>>Todd Mowry: No.
>>: Or it's basically most of the problems are solved, if you can find the right sort of
stuff in the higher layers, it will take off?
>>Todd Mowry: The reason why it's not in today's hardware is not because designers
don't think that there's some major problem that would prevent them from implementing
it. It's just because they haven't been convinced yet that people want it. But they know
how to design it. I've seen many different designs that all seem reasonable. So people
aren't worried about whether it will work, just whether they should bother to do it.
I think once it really exists in hardware, it will probably make the research interesting
again potentially because there will be much more interesting software built around it and
then that might make life interesting again.
Okay, so if you're, say, looking at the world from a database perspective, the new and
interesting thing is we have a way to extract parallelism in a transaction without changing
the transactions, with small changes to the database management system. You don't have
to change anything about locking. We didn't introduce any bugs. We got nice
performance. We made transactions run twice as fast on four cores.
Okay, so here is an outline of the rest of the talk. So there are really two parts to this.
The first part I'm going to talk about how the database changed, how the database layers
changed to take advantage of thread level speculation, then I'll briefly talk a bit about the
hardware support. So these are the new things we learned by doing this as opposed to
spec benchmarks.
But the first part is the transaction programmer. So we have to divide the transactions
into separate threads. So looking at New Order again, it has this loop, and if you look at
this loop, it takes up -- so New Order is the most important transaction. This loop takes
up about 80% of the time in the transaction. So this is the hotspot. And if we can make
this run in parallel, that's a good thing to do.
And, in fact, if you focus on the code, it looks like this is almost embarrassingly parallel -- well, almost, not quite. So it's possible that different iterations will depend on each other
if the same item appears multiple times as separate entries in the list. That's possible.
For the benchmark, given the way we configured it, it turns out that this
will happen once every hundred thousand times. So this looks like a great opportunity
for thread level speculation because it will only fail once every hundred thousand times.
At this level it looks very good. So all that you need to do is decide that you want to run
this parallel -- this loop in parallel, and the programmer can either explicitly annotate that
or not, or you could just have the system decide to do that. So there's not much work for
the transaction programmer to do, and that's good.
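For instance, marking the loop could be as lightweight as something like this -- a hypothetical annotation style, not real TPC-C, BerkeleyDB, or compiler syntax:

/* Hypothetical annotation: ask the compiler/runtime to turn each iteration
 * into a speculative epoch, committed in the original loop order. The pragma
 * name is made up, and the loop body is untouched. */
#pragma tls speculate_loop_iterations
for (int i = 0; i < item_count; i++)
    process_order_line(order, i);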
The next part is what about the database developers? How do we remove problems from
the database management system? So what we'll see is, in fact, just simply doing this
didn't work very well at all. There were a lot of problems, and they didn't occur at the
sequel level, they occurred inside of the database itself. So I had this picture before
where I showed that what happens is we try to run things in parallel, and they don't run
very well in parallel because there really are a lot of dependences. What we're going to
do is use the hardware support, which gives us profiling feedback where we can tell exactly which pieces of our code caused the most critical dependence, then
we can do hopefully a small simple thing to fix it, we get a little bit better performance, a
little bit better, so on. So that's the idea.
Yes?
>>: At the beginning of the talk, you mentioned transactional memory, you still have a
version of your code that is in a way explicitly parallelized by the programmers.
Nevertheless, at the end of the day it seems you are going to make modifications to
existing applications. It would seem you would take this database and try to parallelize it
[indiscernible].
>>Todd Mowry: I think the amount of effort to take this -- okay, I think there's a very
big difference in terms of the amount of effort involved because transactional memory,
the starting point is that you already have a concurrent program, and that it's mostly -- it's
correct, and you're just trying to take a critical section and get better performance out of
it. That's an important thing to do. But you're starting with something that's already
concurrent.
In this case we're starting with something that's sequential. We never get to the point
where we really have a properly threaded program. Instead, we're just fixing -- we're
relaxing some aspect of the program and making it a little more parallel, but we're not
bothering to fix all the other things in the thread. So I didn't do the actual study of trying
to rewrite BerkeleyDB where we would rewrite the threads. But, as I said before, I think
that involves almost starting from scratch in terms of writing the code. It's a really
fundamental assumption throughout the software that there are never multiple threads
within one transaction. I think that would be measured in years of effort, not a month of
effort, to do that.
So transactional memory is a good thing, but it's for a different purpose than this I think.
So what I'm going to do now is one concrete example of something -- of a case where we
found a problem and fixed it, just to illustrate the types of things we found here. So one
important part of the database is the buffer pool manager. So this is a little bit like the
virtual memory system of the database -- the buffer pool is the section of physical memory it manages. If you access something that's on disk, you say get_page in BerkeleyDB; it grabs the page off the disk, puts it in memory, and then to make sure
that it doesn't get displaced until you're finished, it bumps a reference counter so we
realize somebody is actively using this. Then you get to access it. When you're finished,
to release it, you say put_page, now that page can be reclaimed by the buffer pool
manager if you need more physical memory.
Okay. So what happens when we try to run the code in parallel, looking at these two primitives? So let's say we were originally going to run this thread, and this other thread is supposed to run after it in the original program. We're
optimistically running them concurrently. They both want to access page five. So they
do a get_page, they do some work, then they do a put_page. It turns out that speculation
will always fail here. And the reason is because of this reference counter. So when we
decrement the reference count here, that's going to cause the reference count value that
we read here to look like it was wrong because it just changed and we read an older
value. It will squash speculation and this will never succeed. If you stop and think about
it for a minute, that's a little -- although that is strictly what the sequential execution
would have done, we don't really need it to behave this way. We want buffer pool
management to work, but we don't care about this artifact here.
So I don't really care -- in particular, I don't care which of these threads grabs the page
first. I just want to make sure that they have it pinned in memory the right way when
they access it.
So what we can do in this case to fix this is we can temporarily turn off the speculation
tracking mechanism. We can say, we're about to do something and we don't want you to
treat this as something that's a problem, so we're going to go ahead and change the
reference count. In order for this to be okay, we have to realize that it's possible that this
thread may get squashed for another reason, and if that happens, we have to have a way
to undo the side effect we just did.
So fortunately that's easy to do in this case. To undo a get_page, all we have to do is
decrement the counter and everything will be fine. So what we do is we escape
speculation. We do the operation. We store an undo function. We turn speculation back
on again.
Yes?
>>: [Indiscernible]?
>>Todd Mowry: I meant [indiscernible].
So to show you in the actual code, here's what we did. We implemented this again by
hand. This isn't automatic. Chris decided that this was the thing to do. So we put a
wrapper function around get_page. In the middle of it it calls get_page with this ID that
we pass in. In addition, as I mentioned, we turn off speculation. We have an instruction
to our hardware that says turn off speculation, don't track it, then we turn it back on here.
We also do some extra checking of the arguments to make sure they're okay because
when you're running speculatively, you may get a bad ID and we want to make sure we
don't do anything bad inside this routine, so we do some extra checking here. Also we
don't know how to do multiple get_pages concurrently, so we add a mutex to make sure
there's only one active call to get_page at a time.
Then finally we say that the way that you undo this is you need to call put_page in order
to undo get_page. And this works because the effects here are isolated. It's not going to
cause a cascading chain of aborts if we do this. We have a way to undo all of our side
effects. This could be used for other things like memory cursor management or malloc,
for example.
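Putting the pieces from above together, a sketch of roughly what that get_page wrapper looks like is below. The TLS primitives (tls_escape_begin/end, tls_register_undo), the sanity check, and the pool/page types are assumed stand-ins for the actual interface used in the study, not the real code.

#include <pthread.h>

/* Sketch of the get_page wrapper described above. */
static pthread_mutex_t get_page_lock = PTHREAD_MUTEX_INITIALIZER;

page_t *wrapped_get_page(pool_t *pool, page_id_t id)
{
    /* A speculative caller may have computed a garbage page id; check it
       before touching anything inside the buffer pool. */
    if (!page_id_is_sane(pool, id))
        return NULL;                        /* this epoch will get squashed anyway */

    pthread_mutex_lock(&get_page_lock);     /* only one real get_page at a time    */
    tls_escape_begin();                     /* don't treat the ref-count bump as
                                               a dependence violation              */
    page_t *page = get_page(pool, id);      /* pins the page: ref_count++          */
    if (page != NULL)
        tls_register_undo(put_page, pool, page);  /* run put_page if this epoch aborts */
    tls_escape_end();                       /* resume normal speculation tracking  */
    pthread_mutex_unlock(&get_page_lock);
    return page;
}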
Yes?
>>: Do you run into problems if the operating system deschedules one of the threads in
this speculative team?
>>Todd Mowry: Yes. So we don't want that -- I mean, well, it's not that anything crashes and burns. What happens is speculation will fail. That's actually really the only thing that goes wrong, I think. I mean, we want the OS to be able to deschedule things if it wants to. In general, if you have a parallel application and you swap out one of the threads, unpleasant things can happen if you're holding a lock.
>>: You'll still be able to detect conflicts?
>>Todd Mowry: Yes. Well, okay, what happens is we have a limited ability to detect
conflicts. If things ever get knocked out of the cache, we say there's conflicts. Our
default is we squash everything. So we can always fall back to that. That's the reason
why it still works if the operating system reclaims something. It's not fast anymore, but
it's functional.
We know how to fix the problem with get_page, which we just described. Now what
about put_page? So here we are decrementing the reference count. Can we play the
same trick here? We can't, 'cause get_page is not a proper way to undo -- you can't undo put_page with get_page. It may end up going somewhere else and this
could cause very different behavior. So instead what we do is we just postpone doing the
put_page until the thread receives this token that says you are no longer speculative
because everyone ahead of you has already committed, and it's definitely safe to do that.
The downside of that is we're actually increasing the lifetime in which this is pinned in
memory. That's a tradeoff that seemed okay. We're increasing it by a couple thousand
instructions and memory is large, so we're not too worried about that in this case.
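A sketch of the put_page side is then just to defer the call; tls_when_nonspeculative here is an assumed runtime hook, not the actual interface:

/* Sketch: instead of decrementing the reference count while still
 * speculative, queue the put_page and run it once every earlier epoch has
 * committed and this epoch holds the no-longer-speculative token. */
void wrapped_put_page(pool_t *pool, page_t *page)
{
    /* The page simply stays pinned a little longer than it strictly needs to. */
    tls_when_nonspeculative(put_page, pool, page);
}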
If we do these two things, we can now actually run code with get_pages and put_pages in
parallel, and that won't cause speculation to fail. So this is one example of the type of
thing that we were doing.
Okay, so in general delaying something until the end of the epoch when you're not
speculative is another trick. So Chris went through this -- as the title of the talk says, he went through a month's worth of effort finding the next thing and fixing it. The other
thing I forgot to mention is that Chris had never seen BerkeleyDB before he started doing
this, and he doesn't know much about databases. He's a systems guy, not a database guy.
He never took a database course or anything like that. So to him this was just a big giant
piece of software that he didn't understand.
But all of the fixes that he made basically fell into three buckets. There are the two things we just saw, escaping speculation and delaying an operation, and then the other bucket was that there were several cases where there was a central structure, and the solution was simply to distribute it -- have a parallel version of it across the different cores. For things like, say, gathering statistics, you could gather local statistics and then just add them together later rather than having one counter. That's one simple example here. These are the things that he did.
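As a sketch of that third bucket -- illustrative names, not the actual BerkeleyDB statistics code -- the change is basically this:

#define NUM_CORES 4            /* assumed: one slot per core on the quad-core */

/* One shared counter means every speculative epoch writes the same word and
 * speculation always fails; per-core slots let each epoch touch only its own
 * word, and the total is computed lazily when somebody reads the statistic. */
struct dist_counter {
    long count[NUM_CORES];     /* a real version would pad these to separate
                                  cache lines to avoid false sharing */
};

static inline void counter_bump(struct dist_counter *c, int core)
{
    c->count[core]++;
}

static long counter_read(const struct dist_counter *c)
{
    long total = 0;
    for (int i = 0; i < NUM_CORES; i++)
        total += c->count[i];
    return total;
}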
Okay, so now I'm going to show you some performance results here. So we simulated a
quad-core processor. It was an out-of-order superscalar. Here is the memory hierarchy we were modelling. We were running TPC-C in-core. We weren't worried about disk
accesses. Yeah, we're also measuring the latency of a transaction, not its throughput.
Okay, so here are results. So first I'm showing you the original unmodified version of the
application running on one core on a quad-core. So that means three of the cores are
doing nothing because it's a sequential piece of software, at least one transaction is. So
the way this is color coded is that black means that there's a CPU that's doing nothing. So
three of the processors -- this is normalized execution time -- are sitting idle. The one
processor that's busy, the blue part is the time when it's doing useful work and the green
part is the time when it's stalled for cache misses.
Okay, so now we've got this wonderful thread level speculation mechanism that we spent
years developing and we turned it on and parallelized this embarrassingly parallel
sequential loop and we got a slowdown. So things were worse than they were originally
by a little bit. By about 8 or so percent. This is without any of the optimizations that I
just mentioned. The reason is that the speculation is always failing. So the red part is
time when you've done work that you have to throw away because it wasn't useful. It's a
little bit worse in part because the cache miss time actually went up a little bit because
previously all of the data was on one core's cache and now we've spread it across all the
cores, so now our cache misses are more expensive. We lose a bit here.
So then Chris started going through this process of iteratively fixing things. So first he
fixed the problem with the latches, and things actually got even a little bit worse. And he
fixed the problem with locks, and we're still not speeding up. Then he fixed the problem
with malloc and things still are not good. Then he fixed the problem with the buffer pool,
suddenly we saw a nice jump in performance. More than a quarter of the latency
disappeared. You might wonder why he bothered to do the first three things rather than jumping straight to the fourth step. The answer is you had to do all of them. There were four problems lurking underneath each other. When you eliminated all of them, you saw this
improvement. Then he kept going, did something with the cursors, fixed some false
sharing, fixed logging. We eventually got to nearly a twofold speedup here.
Yes?
>>: Are all of the techniques on that?
>>Todd Mowry: They keep adding on.
>>: I assumed that. But are all of them necessary to get close to that final performance?
You said they were all lurking. Are every single one of them important?
>>Todd Mowry: Yes. So the way we get the feedback is that the hardware always tells
us the most critical one. So if we didn't fix that one, it wouldn't matter if we fixed
anything else underneath. It might matter if we eventually fixed the critical one, but until
you fix the critical one, you won't see any improvement.
>>: I understand. But if you take the end point that has everything.
>>Todd Mowry: These last two didn't improve upon this one, if that's what you're
asking. We didn't really need to go all the way out to this point. Like once we got to that
point, it kind of levelled off.
>>: [Indiscernible]?
>>Todd Mowry: Uh-huh.
>>: Could you, for example [indiscernible]?
>>Todd Mowry: No. I mean, you wouldn't do any better than that. I mean, if you
wanted to stop here, fixing these things without fixing that would not improve anything
beyond this. They're sort of strictly ordered in terms of being the more critical
dependences. If you don't fix them, you don't get to move to improve the performance.
Yes?
>>: [Indiscernible]?
>>Todd Mowry: Yes.
>>: So I have a comment about this.
>>Todd Mowry: Right. Unfair is fine. I may give you an unfair answer [laughter].
>>: [Indiscernible]?
>>Todd Mowry: Yeah, so the one big caveat I think for interpreting this work is there
actually are legitimate arguments you can make about why you may not like this feature
all the time, certainly in a real database system. If you didn't have enough transactions to
keep the entire system busy, there's certainly no reason to turn this on. You would only
want to turn this on if you actually had idle resources and you're willing to pay the power
cost of turning them on.
So but, yeah, you have to keep that in mind. You don't want to make non-speculative
work slower by doing this or it will hurt throughput and potentially even latency.
But the other way I like to look at it is, pretend this isn't a database system. Just take a
transaction and say, well, ignore what it's doing, this is a pretty complicated piece of
software, and we managed to make it run a lot faster.
>>: How heavily loaded are typical systems that run this application?
>>Todd Mowry: It's very bursty. That's my impression. Sometimes they are saturated,
but most of the time it's not saturated. It's very bursty. So you want to monitor the bursts and turn this off during a burst. It's easy to turn this off. You can turn this off with one
instruction, so...
>>: How much does your programmer need to worry, A, about the underlying
microarchitecture versus just the code? For example, false sharing is an example,
hardware details, then, B, about the specifics of the speculative threading protocol?
>>Todd Mowry: Okay, false sharing was really the only one about the memory system
in this case. Other than that, there weren't any other like system parameters that you
would need to know other than like your ability to buffer threads, you would want to
realize that you shouldn't make threads that are way larger than your system's ability to
buffer things. But we never actually thought about it. We just tried to design something.
I mean, we just used the -- well, I'll talk later about the hardware support for this, why we
can buffer enough.
But, yeah, when Chris did this, he wasn't thinking at all about any system parameters. He
just realized that, oh, that's not parallel code. I could make that parallel code by doing
this little thing. And the key thing that I probably didn't emphasize is the reason why this
works so well and you can do it so quickly is you don't actually have to go off and
understand all of the code when you make one of these changes, you can just zoom in on
this specific thing you're working on, and you can still have lots of other problems
going on. In fact, there are. Even as he's doing this, there's dozens of dependences sitting
underneath this. We never really get full parallelism. We get partial parallelism. Things
are somewhat overlapped. That's why we get a speedup of two as opposed to four on four cores. It may be possible to push it further, but we didn't work on that.
Okay, so at this point we actually can't improve this much more because the only part of
execution time that gets better with better speculation is this violation time, and that's
already fairly small. The reason why we aren't running faster is the number of iterations
in this loop isn't very large. So we just don't have enough parallel work to keep all the
cores busy in many cases. That's really the bottleneck. We tried an artificial experiment
where we increased the number of iterations artificially and we got much better speedup.
But that's what you would expect.
Again, the important part is it took Chris 30 days of effort and he changed less than 1200
lines of code out of 200,000 and he didn't understand the application and didn't really
know -- I don't want to make Chris sound like he's not a bright guy, he's a very bright
guy, but you don't need to be an expert on this to get this kind of improvement.
So we looked at the other -- New Order is one of the five transactions in TPCC. We did
the same process to the other four. Two of them sped up very nicely, even more than
New Order. So delivery is running -- we've eliminated two-thirds of the execution time
here and stock level is running more than twice as fast. The other two cases did not
improve. And in these cases, there weren't loops. There were a lot of dependences. We
couldn't find any obvious way to break them into smaller chunks. So this doesn't always
work. But it worked well in these three cases.
Okay, so for a database type audience, the conclusions from this are that we found this
new form of parallelism. We can now break up transactions into different chunks. We did
have to change the database management system, but it didn't take very much effort to do
that. We think this could be applied to more than transactions, not only within databases,
but other types of software.
Okay, so that's the main part of the talk. I've got another little part.
Yes?
>>: Did you look at energy efficiency?
>>Todd Mowry: No.
Okay. So, I mean, you just have to decide what your preferences are. If you are incredibly stingy about energy, then don't speculate. But you have to weigh that against your preference for latency, for improving performance. There's a tradeoff there and you always have to decide
how much you care about one thing versus the other. Without this, it's not obvious to me
at least how you would make transactions run faster 'cause they're not concurrent, and
clock rates aren't getting any faster.
Yes?
>>: [Indiscernible] when you turn speculation off, how much extra do you pay for this, if
you have a peak, for example, you don't want to have speculation running?
>>Todd Mowry: Let's see. So, well, you can figure that out actually by just looking at
the height of the blue bar. The software overhead shows up in this blue section. It's
actually increasing a little bit. But it's only a couple percent. So you're not paying very
much. So, yeah, that's the basic answer.
Yes?
>>: [Indiscernible] single client?
>>Todd Mowry: Yes.
>>: Did Chris look as you increase the number of clients [indiscernible]?
>>Todd Mowry: We didn't look at that. That would be interesting to look at, but we didn't do that, so...
Yes?
>>: [Indiscernible]?
>>Todd Mowry: That's a good -- we did not do that, no. I don't -- yeah. I don't know
what the answer would be. I believe that the things that we did here -- the nice thing,
they're all generally useful, I think. I didn't make this clear. But what Chris did is he
started with New Order and he made these improvements, and then he simply ran these
cases without changing the database further, database management system. So that the
fixes for New Order are also very useful for these two cases. There weren't any other
bottlenecks he had to fix. TPC-H, I don't know how it would stress the database
management system. There's a decent chance that the improvements here would be
useful for it. I mean, they have much longer running queries. In a sense, part of the
reason why we didn't do it is because intraquery parallelism works so well generally that
we didn't think it was as interesting, but there's no reason not to try it. That would be
another experiment to do.
Okay. So the second part of the talk, I'll just go through this quickly, but this is about what we learned about changing the underlying design of the system to make this work. I
know that the majority of you aren't architects, but you might find this a little bit
interesting.
So, okay, we have to buffer large speculative threads, and we have to deal with
subthreads. Those are the main two things. When we were parallelizing [indiscernible],
our threads were tens to hundreds or thousands of instructions long. We were just
parallelizing loops in C code, individual iterations typically aren't that big. But here our
thread sizes, even though we're parallelizing a loop, it's a loop in SQL. Beneath that is
a database management system engine that's invoking lots and lots of other code. That
turns into tens of thousands of instructions. Delivery was almost half a million
instructions. This is the number of dynamic data dependences on average between those
threads. So these are 75 or so dependences. We start off with 75, and he fixed about a dozen of them. So there are still about 60 or so dependences that he didn't fix, but they just happened to be less critical and we could still get some performance
improvement.
So one of the challenges is we have to buffer larger threads. Well, so, there's related
work here. The big difference is in our original design, we buffered everything in the
first level caches and we now realize we need to buffer it deeper in the memory
hierarchy. This involved adding another bit to our first level cache to keep track of some
extra state and adding a few bits to the tags of our second level caches. Also, when there are collisions, an interesting thing happens: when you're physically sharing the cache and different threads are writing to different copies of things, you end up with different versions of something. So what we did is we would actually allow replication within the set-associative cache. So that was another change. We added a small victim cache in case
we had some pathologically bad cache mapping. That worked well.
We also had to support subthreads. So the only trick for supporting subthreads is we
already have, external to the cache, these epoch numbers. These are logical timestamps.
They sit on the side because there are only a handful of them at a time.
>>: [Indiscernible]?
>>Todd Mowry: Yes. This is how we implemented incremental checkpointing. All we
had to do was add support for some more virtual timestamps. The trick is you just bump
the timestamp number, that's all you have to do. That takes no time basically. Just adds a
few more bits to the tags.
Another question that we thought might be -- something we thought might be an
interesting research question was where do we take the checkpoints? So our intuition was that the optimal answer would be -- let's say that this is what our code looked like and we're going to have a dependence
violation here, the optimal thing would be to take a checkpoint just before a very risky
load, one that's likely to fail. So we thought maybe we need to build some predictor to
predict which are the risky loads and take the checkpoint just before them. We were
excited to write a paper on this brilliant idea. But then we discovered just actually doing
something very simpleminded works quite well. If you just take a checkpoint every five
thousand instructions, it works really well, just as well as this.
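In simulator-style pseudocode, the simpleminded policy amounts to something like this; the structure and names are assumptions about how you might express it, not our actual implementation:

#define SUBTHREAD_INTERVAL 5000    /* retired instructions between checkpoints */

struct epoch {
    unsigned sub_timestamp;        /* which subthread/checkpoint we are in     */
    unsigned insns_since_ckpt;     /* retired instructions since the last one  */
    unsigned subthreads_left;      /* hardware only supports a handful         */
};

/* Called as instructions retire; starting a new subthread is just bumping a
 * counter, so a violation only rolls back to the most recent sub_timestamp
 * older than the conflicting access, not to the start of the whole epoch. */
void maybe_take_checkpoint(struct epoch *e)
{
    if (++e->insns_since_ckpt >= SUBTHREAD_INTERVAL && e->subthreads_left > 0) {
        e->sub_timestamp++;        /* "bump a counter somewhere" */
        e->subthreads_left--;
        e->insns_since_ckpt = 0;
    }
}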
Okay, so here are some results here. So going back to -- oh, these are all the different
things, plus this artificial version of New Order where we increased the number of
iterations by a factor of 10.
Now, this is not the number I showed you before. This is what the world would look like
if we did not have our checkpointing support. Remember, New Order was running twice
as fast before. Now it's running faster, but not nearly as fast. Delivery was running three
times faster, and now it's only running a little bit faster. Sorry, this delivery outer was
running three times faster out here. Two different versions of this. So subthreads are
important. If we had no dependence violations, if this was perfect, we would only get the
bars on the right. So the checkpointing allowed us to cover much of the distance between
the baseline case and the perfect case. So there's not a huge amount of room for further
improvement.
Okay, so that's important. And subthreads didn't actually affect the cache performance,
not that we would expect it to affect it very much.
>>: Let me ask you about that. If you have a violation which ends up rolling back a
speculative thread through a large number of what you call subthreads or checkpoints, you
would expect to see a large performance gap between that and no dependence violations.
So what you're saying here is that --
>>Todd Mowry: Yeah, that shows up as this -- the red part in the middle bar is that time.
>>: Yeah. So what that is saying, with this technique, almost all of the time you rollback
a small number of subthreads.
>>Todd Mowry: Yes.
>>: So this idea we discussed today of doing selective replay and only playing the stuff
that hasn't changed in terms of data wouldn't be all that useful because there's not very
much more to get?
>>Todd Mowry: Yes, that's right. Yeah, the difference between the height of the red
bars is the amount we've got -- actually, in some cases it didn't help very much. Well,
these cases are never very good on the right. But in these cases, yeah, there's not much to
get. In these cases we've gotten much of what there is to get.
>>: [Indiscernible]?
>>Todd Mowry: Gee, I have to remember. I have to look it up. Something like four or
eight. Just multiply that number by two and that's the number of bits you have in the
cache tag. That's the cost of it.
>>: [Indiscernible]?
>>Todd Mowry: No. We -- oh. Let's see. Let's see. How did we do that? I forget the
answer to that. I'll have to look that up after the talk. I know we did that properly.
>>: [Indiscernible]?
>>Todd Mowry: Yeah. I don't remember the answer to that. I remember discussing
that, and I forget what the answer is. I'll look it up at the end of the talk. But I don't
remember off the top of my head how we did that. I remember how we did it generally,
but I know we didn't do it that way here.
I said we have a victim cache. The question here is, how large would the victim cache
need to be with different amounts of associativity to capture everything that gets
evicted? This is a one megabyte L2 cache. You would need dozens of entries here. If
it's eight or 16, then you don't need very many entries. So it can be relatively small and
work.
This graph just varies the size of the subthreads, using a fixed size, starting from 250 dynamic instructions going out to 25,000. The sweet spot happens to be at around 5,000,
for some reason. So if you make this too small, the tradeoff is you end up using up all of
your checkpoints early on in the thread and then you run for a long time and you can't
checkpoint anymore. If you make them too large, you have to go back further to get to a
checkpoint. So given just the particular sizes of our threads, where the dependences were
occurring, this is where it was happening, you could easily imagine changing this
dynamically. This is something in the hardware that it decides how often to do this. So
this could be dynamically adjusted.
>>: Were all of your -- this is a case where reasoning about the microarchitecture, the
size of the L2, would be something that the programmer really had to do if you weren't
going to do some sort of automatic adaptation of the subthread size? In other words,
turning that around, if I supported a certain subthread size in hardware, the programmer
would need to reason about that in terms of choosing the --
>>Todd Mowry: We didn't bother to do this, but if the hardware just tracked maybe after
the first iteration the dynamic size of each thread, and just divided that evenly between
the number of checkpoints that you have, I think that would work well, probably pretty
well. And it turned out that with 5,000 instructions per checkpoint, I think we had maybe
eight checkpoints, that was enough instructions that you got 40,000 into the thread, and
that was -- you gained enough with that that everything was good.
A question?
>>: Did you think about recycling checkpoints? You run out of check points, you throw
away the first one, stick it on the end?
>>Todd Mowry: I'd have to go back and think about that. I think there may have been a
reason why that wouldn't have worked, why that would have been painful to implement.
But it seems like a good idea. Other than there being some detail in our design why that
would be hard to do, it would make a lot of sense to sort of exponentially back off the
checkpoints and keep them more recently and fewer of them as you get back in the past.
That's probably another good improvement on this.
But it was already working well enough that -- oh, yeah, these are the numbers of subthreads. Here we go: two, four and eight subthreads. So we found it was working well enough that, you know, it seemed like there wasn't that much to gain by being
fancier than we were being.
Okay, so, you know, my last two slides here. So for an architecture audience what we
learned are that the lightweight checkpoints or subthreads are important to get speedup in
cases where there are a lot of dependences between the threads. The fact that we get this
feedback about the critical dependence is important because that allows the programmer
to iteratively fix things.
Now, the reason why I mention that is it's possible to build TLS and not make this visible
to software. This is something that we would want hardware people to make visible to
software, because if you don't do that, you're losing a big opportunity to improve your
code. You might even argue that even if TLS doesn't functionally work, if you simply profiled this and made that visible to the programmer, that would be very interesting information. It would tell you, just dynamically, what goes wrong if I try to run this
in parallel.
So we found a way to deal with large threads, and this is a pretty simple set of extensions
to existing design.
So final slide. I think we got a fairly compelling amount of speedup on a large piece of
software with a fairly small amount of effort. And basically what we need are subthreads and the feedback that I just mentioned. There's hope for dusty deck codes on multicore
without having to invest years of time to go back and rewrite something from scratch.
So if you like this idea, you might want to mention it to your friends at Intel. Hint, hint.
>>: Don't you have ties into Intel?
>>Todd Mowry: I consult for Intel. I'd be happy to introduce you to the right people.
>>: [Indiscernible]?
>>Todd Mowry: Yeah. Actually we had a paper several years ago showing exactly that.
So we can -- we designed our support so that the chip boundaries didn't mean anything.
So it just worked on top of coherence, however coherence is implemented. It will just
work on any scale of system.
Now, the only thing that limits that being interesting is how much speedup you get. There
are cases where you can get interesting speedup across multiple chips where it's
speculating across the whole thing.
>>: Do you have to do anything special with inflight messages when taking your
checkpoints?
>>Todd Mowry: No. You just need to -- there's a logical timestamp, when a thread is
executing, it has a logical timestamp. The trick we're doing is we're bumping its logical
timestamp. We gave it a range of timestamps that are contiguous. We say, now you're
the first one, second one, next one. All you need to do is to make sure you atomically
change that with respect to any cache actions. The messages can still be coming in, you
just need to hold them off for a minute while you change the count and let it proceed,
then everything should be okay.
>>: [Indiscernible]?
>>Todd Mowry: Yes.
>>: [Indiscernible]? Do you think the situation might change when you implement it in real hardware? Adding bits to the cache might be a problem in hardware because size increases
and latency increases and it might look like a good idea in software simulation where it's
cheap, but in hardware it might not be cheap?
>>Todd Mowry: I don't believe -- let's see. I don't believe the absolute performance of
our simulator. Actually I wrote a lot of the initial part of the simulator and we tried to
design everything so that, whenever we were cutting corners on something, we didn't think it would change any of the relative performance gains of anything. That's why I
always present normalized execution time. Because I think the ballpark speedup
improvement, I do believe that, plus or minus 5%, whatever.
I wouldn't expect it to suddenly be the case that this was slowing down a lot rather than
speeding up a lot. In terms of going from the bar where we slow down a little bit in our
initial -- that's not the one I want. It's back here. So, like, in terms of how much is this
slowdown, I don't know exactly. It could very well be worse than that because we didn't
model the extra latency to the primary cache because we had more bits in it or something
like that. That's something -- there were things that we didn't model.
Great, thanks.