>>Todd Mowry: I'm going to be talking about this thing that I thought would be very
interesting and timely, which is how we manage to take a large piece of legacy software
and make it run twice as fast on a quad core. Before I get into that, I thought I would just
provide a little context for where this project fits in with the other research I've done.
The initial projects I did in my group were focused on performance things, in particular
how do we extract thread level parallelism. In fact, that's what this talk is about today.
The other thing we worked on was how do we make better use of caches.
Having worked on both of these things generically, we then focused specifically on how to do it for databases: how do we redesign databases to make better use of caches? That's something I've given talks about here at Microsoft in the past.
So Shimin Chen was one of my students and he was a Microsoft fellow and Microsoft
intern and all that stuff.
Today I'm actually working on two projects. So the first is we call log-based
architectures, where the goal is to detect and fix software bugs on production systems
without slowing down the application. So I gave a talk on this project the last time I was
up here, about a year and a half ago, and I'm hoping to come back sometime in the next
couple months and give you an update on some more results we have on this project.
Finally I'm working on this sort of far-out science-fiction-sounding project that I'll be
happy to talk to you about after the talk if you're curious where we want to physically
render moving objects and build a Holodeck.
But what I'm talking about today is the sort of capstone of a project we did at CMU that
we called Stampede. This was really Chris Colohan's Ph.D. thesis work. This is the first
time I've given a talk that's not on something we're actively working on at this moment.
This is something we completed a couple years ago. The reason I'm talking about it now
is I happen to never have talked about it here at Microsoft and I think it's also very
relevant today with multicore and people trying to parallelize software. Also it's very
timely because people at hardware companies are making decisions right now about
including the hardware support that I'll be talking about. So if people from, say,
Microsoft thought that support was interesting, that might affect the world.
So the starting point for this is multicore. So today we have quad-cores and the
projection is that we'll have more and more cores over time. So we can't crank up clock
rates any more, but we can stamp out more and more processors on a chip. So clearly if
you're interested in increasing throughput, there are more cores to run things on, and it's
good for that. But what about simply making things run faster? One view of the world is
don't worry, everything will be fine with multicore because from now on everyone will
write parallel software and it will all speed up perfectly. So that's an optimistic view.
It would be great if that was true, but I've spent 14 plus years teaching people how to
write parallel software. The empirical data suggest it's difficult to do that, to make it both
correct and fast. So hopefully people will get much better at this, but that's a challenging
thing to do. We don't expect that to be fixed overnight. And there's also this issue of
legacy software. So what about things that are important that have already been written
that aren't parallel or what about pieces of things that maybe they have been parallelized
but there are important parts that haven't been parallelized yet? So that's actually what I'll
discuss today.
Another sort of rosy view of the world is don't worry, multicore is fine because we'll have
these fancy compilers that will automatically parallelize everything so it will all speed up
perfectly. So I've actually worked on projects before where we were trying to do this,
and that was the original motivation of the thing I'm talking about today. It's very
difficult for a compiler to do that. Other than for sort of scientific programs, where the
dependences are very straightforward to think about, this is very difficult, in particular
ambiguous data dependences are a major stumbling block to try to do this type of thing.
So this motivated us -- question?
>>: I often hear don't worry, these chips are used for web servers, and everything is
inherently divided into so many chunks that we can keep these cores busy or they're
running so many processes we're going to keep these cores busy.
>>Todd Mowry: Fortunately, there are enough of those things around already that there
will be some immediate value to having these processors. But I think over time it will be
nice to have more and more things taking advantage of them and not the embarrassingly
parallel work loads. Fortunately, there are embarrassingly parallel work loads, so that
gives us something that's ready to go on all these processors.
So our motivation was what do you do about those other things, what about the things
you care about that aren't embarrassingly parallel, those things do exist.
So the idea behind our Stampede project at CMU was, through a combination of hardware and compiler support, we wanted to create this capability where you could run things in parallel without knowing for sure whether that was safe.
Several other groups, like Illinois, Stanford, Wisconsin, other places, have looked at
similar ideas. And the idea is you create the threads, you hope that there aren't
dependences, but the hardware will monitor whether there really are, and if there are,
things will roll back. It's like a transaction. People have been talking a lot about
transactional memory. But the difference is with transactions you already have
concurrent threads and you're just trying to get them not to trample each other. Here we
start off with serial execution and we know the order in which everything should occur so
if there's any conflict between two things, you know which one is supposed to win. It's
the one that is supposed to have executed earlier. But there's actually a lot of similarity in
terms of the hardware to make the stuff work.
So our initial motivation was the previous slide, which is we're going to automatically
parallelize everything in the world and it will all speed up nicely. We did use this support
to parallelize the SPEC benchmarks and found they did run about 20 to 35% faster,
which is actually respectable if you compare that with typical architectural optimizations.
Usually an architectural idea is successful if you give 10% improvement. Compared to
that it's not too bad, but it's not exactly the linear speedup you were hoping for.
What I'm talking about today is something different, which is rather than this fully
automatic approach, we're looking at, first of all, a much larger and more interesting
piece of software. We're looking at BerkeleyDB, which is a database, part of a database,
and rather than doing something fully automatic, we're doing something semi-automatic.
So the programmer actually is involved. There's this iterative process I'll talk about
where the system tries something, it doesn't work very well, gives feedback to the
programmer, they make a decision to tweak something small, and they go back and try
again.
And I think that this approach is actually extremely promising and you'll see that it
actually worked quite well in this case. So this is what I want people to hear about, this
sort of semi-automatic approach.
So the application that we're looking at, it is BerkeleyDB, so it's a database engine. Now
databases are already run in parallel. What we're trying to do here is run them in parallel
in a new way, which is we want to take individual transactions and split them into parallel chunks.
The way that you typically get parallelism, you get parallelism across separate
transactions. We're still doing that, but we wanted even more parallelism than that, so
we're breaking them up. The reason why I think this is interesting beyond databases is
really you can just think of this as an example of a big, complicated piece of software,
where the thing that we're trying to parallelize was not written with parallelism in mind.
In fact, it was written under the explicit assumption that you would never try to run
pieces of it in parallel, within a given transaction. It is not safe to do that at all.
So if you tried to actually sit down and rewrite the code so that you could exploit
intratransaction parallelism, this would take a long time, it would be painful. You would
have to understand a large amount of code and it would probably take years of effort I
think to do that. I think you'd actually really have to start over almost from scratch with a
lot of fundamental structures. But we're not going to do that.
All right. So switching into -- so now we're going to dive into BerkeleyDB and talk a
little bit about transactions in particular. So here is my student Chris who did the work. I
guess this is his idea of what database users look like. We have our cavemen here who
are writing their SQL code and sending it into this database server. So here come their transactions. And then what happens is we've got this, you know, SQL server, big
DBMS here, data stored. How do we get improved performance? Today what happens
is separate transactions get to run in parallel on different cores, and we get throughput
from that, and that's good for performance.
Now, why do we want to make transactions run faster? Well, first of all, people do care
about latency. If it takes too long to run my transaction, I get annoyed. Also, if one of the
transactions is grabbing a lock, that prevents other transactions that need the same lock
from executing, so that actually does hurt throughput.
So what can we do about transaction latency? If you look at TPC-C, which is a benchmark for transaction processing, the transaction within TPC-C that takes up the most time is something called New Order. So you're placing a New Order. Here
is a pseudocode version of that transaction. You're basically, you have a basket of items
and you're going to update the stock information for all those items.
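Roughly, that per-item loop has the following shape. This is just a sketch based on the standard TPC-C New Order definition; the exec_sql_* helpers are illustrative names, not the actual benchmark or BerkeleyDB code.

/* Sketch of the New Order hot loop: for each item in the order, read the
 * stock row, adjust the quantity, write it back, and record an order line. */
int  exec_sql_select_stock(int item_id, int warehouse_id);
void exec_sql_update_stock(int item_id, int warehouse_id, int quantity);
void exec_sql_insert_order_line(int order_id, int line, int item_id, int qty);

void new_order_items(int order_id, int item_count,
                     int item[], int supply_wh[], int qty[])
{
    for (int i = 0; i < item_count; i++) {
        /* SELECT s_quantity FROM stock WHERE ... (one short query) */
        int s_quantity = exec_sql_select_stock(item[i], supply_wh[i]);

        s_quantity -= qty[i];
        if (s_quantity < 10)            /* standard TPC-C restocking rule */
            s_quantity += 91;

        /* UPDATE stock SET s_quantity = ... WHERE ... */
        exec_sql_update_stock(item[i], supply_wh[i], s_quantity);

        /* INSERT INTO order_line VALUES (...) */
        exec_sql_insert_order_line(order_id, i, item[i], qty[i]);
    }
}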
So here is one query within it. We're going to do this select. One thing that you can do
that databases do today is try to extract parallelism within a particular query. I can take
this select and break it across multiple processors. This idea works really well for
decision support. So if I'm launching queries that are going to take a long time, then I
can get some really nice performance out of this. But unfortunately for transaction
processing, these queries tend to be very short-lived. They don't run for very long, so I don't get a whole lot of performance this way. That's not so great. What I'd prefer to do is actually get parallelism across separate queries within one transaction. So, for
example, I could take different iterations of this loop and try to run those iterations in
parallel. So that sounds good. But this doesn't work. This would break BerkeleyDB and
probably a lot of other systems because they're not written with this in mind. None of the
internal structures like locks and latches and all those things are set up to work this way.
But as you'll see in this talk, the safety net of thread level speculation makes it much
easier to fix this problem. So let me talk just briefly about thread level speculation and
how that works.
So if you imagine some original thread, so this thing is executing in time, and when we
try to run it in parallel, what we're really doing is taking something that was supposed to
occur later and we're running it concurrently with this logically earlier chunk of work, and
we're dereferencing pointers. We want to label all of these with some logical timestamp
and the system does that automatically, so we realize that this one is supposed to occur
before this one.
But sometimes bad things happen. So, for example, we do a load from a particular
address, and we loaded it before it was stored over here. So we saw the wrong value, and the same thing happened here. So this is bad because that couldn't have happened in the sequential execution, so what
happens is the hardware detects this, it notices that this bad thing happened, it squashes
the thread and restarts it. The second time through good things happen. So now the data
dependences get forwarded correctly and it works.
What we see is we use these logical timestamps called epochs to number the different
threads and the hardware is going to detect problems and restart things. This is a little bit
like recovering from a transaction, like I mentioned before.
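Just to make the kind of ambiguity concrete -- this is a made-up C fragment, not code from the talk -- the situation is something like this:

/* Two iterations of this loop conflict only if p[i] and q[j] happen to alias.
 * A compiler usually can't prove they don't, so it must serialize; with TLS
 * each iteration runs as a speculative epoch, and a later epoch gets squashed
 * and restarted only if the hardware actually sees its load conflict with an
 * earlier epoch's store. */
void update_all(int **p, int **q, int n)
{
    for (int i = 0; i < n; i++) {
        int v = *p[i];       /* speculative load                        */
        *q[i] = v + 1;       /* store that a later iteration might read */
    }
}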
Yes?
>>: Is your experience that most of them do not have data dependences between
iterations?
>>Todd Mowry: That's a really good question. And the next couple slides are going to
dive into that, 'cause that actually -- the answer to that I think is very interesting, so... so
hold that thought for one second.
The idea with the mechanism is hopefully in the worst case we run about as fast as sequential execution -- actually we run a little bit slower than that because there's some
overheads. In the best case, if there's no dependences, we get parallelism and speedup.
How often do these dependences exist?
Now, this is going to start addressing Ed's question here. The motivation for this hardware support was actually SPEC-type programs. So, little things where we have
relatively small threads from loops, and in those cases dependences tend to be fairly
infrequent and the threads tend to be relatively small. So if it's painful to recover from
speculation failing somewhere, it's not that big of a deal because most of the time there
aren't any problems.
But when we look at databases, when we look at transactions, parallelizing them, the
threads are much larger. These are maybe hundreds or maybe a thousand or so
instructions, and these are tens and maybe hundreds of thousands of instructions here in
these things. And there are many dependences. There are typically dozens of real
dynamic dependences between these threads, so we see in contrast to the easier case, we
have lots of dependences. So we can't -- it's not okay if we only win when there are no
dependences. We'll never win. They're always there. So we have to find a way to
tolerate the dependences. So I'll talk about that. So we actually learn something new
about how to design the system from doing this. We also have to have much more
buffering to store these larger threads. And in the second part of the talk, I'll discuss how
we address those things.
We also have concurrent transactions that we're speculatively parallelizing, so that has to
work and there's a nice way to make that work, too. Okay, so there are a lot of
dependences. But there's some good news here, which is the hardware is already tracking
dependences. It has to do that to realize when it needs to restart things. And the
information that it's capturing happens to be exactly what you'd like to know if you were
going to fix a problem. So it knows, oh, that didn't work because of this particular load
and store pair. So the idea is when you see a problem, you can actually give feedback to
the user and say, that's the immediate problem right there. This speculation failed
because of this. And if you can fix that, then you'll get to see the next one, and then the
next one and the next one.
So if you choose to, and this is what we did in this study, you can incrementally fix
dependence by dependence. You don't have to make them all go away; each time you fix one, you might see slightly better and slightly better performance. So that's the idea.
Okay. Now, there's one non-intuitive thing that can happen, which is sometimes
eliminating a dependence can make things worse. So in this case we have on the left this
original case where there's a dependence between P and Q, and P triggers a problem first.
So if we eliminate the problem with P, and now Q triggers the restart, we actually restart
later, it takes us longer to realize there's a problem, and we actually have worse
performance. So that's annoying. So one of the things that we did is we fixed this. And
really the fundamental problem is that having all-or-nothing as the mode of execution isn't so nice.
So to fix that one of the new things that we modified in the system is we essentially
added an extremely super lightweight checkpointing mechanism, where we can keep
these different checkpoints such that if something goes wrong, we only have to back up
to the most recent checkpoint. That was really important for making any of this work.
I'll show you numbers on that later. So now we have a more robust system design that
will work much better because of this.
Okay, so this was critical; the relative speedup from having this was really important. With this support, we could get speedups of 1.9 to 2.9.
>>: So do you synchronize the frequent checkpoints across threads? You could have a
case where the thread on the left runs behind and ends up conflicting with something
previous?
>>Todd Mowry: Multiple of these things?
>>: Yes. Can you guarantee that doesn't happen?
>>Todd Mowry: I do have some backup slides on that. But basically we do have this
issue where you have a whole chain of things you have to cancel. I don't know if that's
what you're asking about. We do have a way to -- we can selectively rollback to the most
recent checkpoint in each of the threads if we have a chain of things that we have to
cancel.
>>: Rollback more than one checkpoint and one thread?
>>Todd Mowry: So what we do is we're lazy about rolling back threads. So we don't
roll them back aggressively, so we only rollback once, if at all. We don't rollback
multiple times, because you could imagine discovering more and more dependences and
eventually rolling the whole thing back. So we didn't do it aggressively, we did it lazily.
It's only when we would commit a thread that -- we only triggered it at commit time, so...
this picture looks a little funny. This isn't actually quite accurate. It doesn't happen at the
moment this occurs, it actually happens a little later.
>>: I assumed that. Could I have a violation that would roll me back three of the
checkpoints on the right-hand side?
>>Todd Mowry: Oh, I see. Yes.
>>: Yes, okay. So you have to rollback to any number of them?
>>Todd Mowry: Yes.
>>: Okay, great.
>>Todd Mowry: We recognize -- so in this picture, what it's trying to show is that you
actually -- it's hard to see, but we actually rolled back to this checkpoint. This load was
above this checkpoint. It's right here, and here is the checkpoint. We skipped this one,
went back to this one. We know which checkpoint is the most recent one ahead of you.
>>: Okay.
>>Todd Mowry: Yes.
>>: [Indiscernible]?
>>Todd Mowry: Yes. It's a trivial mechanism for checkpointing that works because
we're already supporting thread level speculation. We bump a counter somewhere.
That's all we do.
Okay, so to make all of this work, there are different layers here. There are transactions
running on top of a DBMS running on top of a system. There's an operating system in
here, too, of course. This is an operating systems company and I forgot to mention the operating system -- I'll be shown to the door now.
In particular, what we're running is TPC-C on top of BerkeleyDB -- because we have the code -- on top of a simulated machine. The hardware support didn't exist.
This involves different people changing what they're doing a little bit. So the transaction
programmer, we don't want them to change anything that they're doing. If they choose
to, they can explicitly indicate where they want to break their transactions into different
threads. In our experiments we simply took every loop in a transaction and ran loop
iterations in parallel, and they didn't need to do anything to do that.
So there are a lot of people who do this, so we don't want them to -- we don't want to put
any of the burden on them. There's a lot more interesting stuff that happens for database
developers. So they do have to change things. That's actually most of what the talk is
about, is how we fix that.
Now, that was the part that took a month. So there's more work to do here than here, but
this still isn't a lot of work. And then there is some work for hardware people to do,
which is to put this into the hardware. That's been sort of very well studied. Right now it's just a value-proposition argument that people are actively debating.
>>: Would you say that TLS hardware research is mostly done or are there new
capabilities you think might pop out? It's hard to predict the future. But you've worked
in the area a long time.
>>Todd Mowry: I don't think anyone's really actively pushing on it right now. I think
it's mostly cooled down at the moment, but it could become interesting again. Yeah, so...
For the moment, it's paused I think.
>>: But in terms of -- do you see significant unsolved problems remaining from the TLS
implementation side?
>>Todd Mowry: No.
>>: Or it's basically most of the problems are solved, if you can find the right sort of
stuff in the higher layers, it will take off?
>>Todd Mowry: The reason why it's not in today's hardware is not because designers
don't think that there's some major problem that would prevent them from implementing
it. It's just because they haven't been convinced yet that people want it. But they know
how to design it. I've seen many different designs that all seem reasonable. So people
aren't worried about whether it will work, just whether they should bother to do it.
I think once it really exists in hardware, it will probably make the research interesting
again potentially because there will be much more interesting software built around it and
then that might make life interesting again.
Okay, so if you're, say, looking at the world from a database perspective, the new and
interesting thing is we have a way to extract parallelism in a transaction without changing
the transactions, with small changes to the database management system. You don't have
to change anything about locking. We didn't introduce any bugs. We got nice
performance. We made transactions run twice as fast on four cores.
Okay, so here is an outline of the rest of the talk. So there are really two parts to this.
The first part I'm going to talk about how the database changed, how the database layers
changed to take advantage of thread level speculation, then I'll briefly talk a bit about the
hardware support. So these are the new things we learned by doing this as opposed to
spec benchmarks.
But the first part is the transaction programmer. So we have to divide the transactions
into separate threads. So looking at New Order again, it has this loop, and if you look at
this loop, it takes up -- so New Order is the most important transaction. This loop takes
up about 80% of the time in the transaction. So this is the hotspot. And if we can make
this run in parallel, that's a good thing to do.
And, in fact, if you focus on the code, it looks like this is almost embarrassingly parallel -- well, almost, not quite. So it's possible that different iterations will depend on each other
if the same item appears multiple times as separate entries in the list. That's possible.
For the benchmark, given the way we configured it, it turns out that this
will happen once every hundred thousand times. So this looks like a great opportunity
for thread level speculation because it will only fail once every hundred thousand times.
At this level it looks very good. So all that you need to do is decide that you want to run
this parallel -- this loop in parallel, and the programmer can either explicitly annotate that
or not, or you could just have the system decide to do that. So there's not much work for
the transaction programmer to do, and that's good.
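For instance, marking the loop could be as lightweight as something like this -- a hypothetical annotation style, not real TPC-C, BerkeleyDB, or compiler syntax:

/* Hypothetical annotation: ask the compiler/runtime to turn each iteration
 * into a speculative epoch, committed in the original loop order. The pragma
 * name is made up, and the loop body is untouched. */
#pragma tls speculate_loop_iterations
for (int i = 0; i < item_count; i++)
    process_order_line(order, i);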
The next part is what about the database developers? How do we remove problems from
the database management system? So what we'll see is, in fact, just simply doing this
didn't work very well at all. There were a lot of problems, and they didn't occur at the
sequel level, they occurred inside of the database itself. So I had this picture before
where I showed that what happens is we try to run things in parallel, and they don't run
very well in parallel because there really are a lot of dependences. What we're going to
do is use the hardware support, which gives us profiling feedback where we can tell exactly which pieces of our code caused the most critical dependence, then
we can do hopefully a small simple thing to fix it, we get a little bit better performance, a
little bit better, so on. So that's the idea.
Yes?
>>: At the beginning of the talk, you mentioned transactional memory, you still have a
version of your code that is in a way explicitly parallelized by the programmers.
Nevertheless, at the end of the day it seems you are going to make modifications to
existing applications. It would seem you would take this database and try to parallelize it
[indiscernible].
>>Todd Mowry: I think the amount of effort to take this -- okay, I think there's a very
big difference in terms of the amount of effort involved because transactional memory,
the starting point is that you already have a concurrent program, and that it's mostly -- it's
correct, and you're just trying to take a critical section and get better performance out of
it. That's an important thing to do. But you're starting with something that's already
concurrent.
In this case we're starting with something that's sequential. We never get to the point
where we really have a properly threaded program. Instead, we're just fixing -- we're
relaxing some aspect of the program and making it a little more parallel, but we're not
bothering to fix all the other things in the thread. So I didn't do the actual study of trying
to rewrite BerkeleyDB where we would rewrite the threads. But, as I said before, I think
that involves almost starting from scratch in terms of writing the code. It's a really
fundamental assumption throughout the software that there are never multiple threads
within one transaction. I think that would be measured in years of effort, not a month of
effort, to do that.
So transactional memory is a good thing, but it's for a different purpose than this I think.
So what I'm going to do now is one concrete example of something -- of a case where we
found a problem and fixed it, just to illustrate the types of things we found here. So one
important part of the database is the buffer pool manager. So this is a little bit like the
virtual memory system of the database -- the buffer pool is the section of physical memory it manages. If you access something that's on disk, you say get_page in BerkeleyDB; it grabs the page off the disk, puts it in memory, and then to make sure
that it doesn't get displaced until you're finished, it bumps a reference counter so we
realize somebody is actively using this. Then you get to access it. When you're finished,
to release it, you say put_page, now that page can be reclaimed by the buffer pool
manager if you need more physical memory.
Okay. So what happens when we try to run the code in parallel, looking at these two primitives? So let's say we were originally going to run this thread, and this other thread is supposed to run after it in the original program. We're
optimistically running them concurrently. They both want to access page five. So they
do a get_page, they do some work, then they do a put_page. It turns out that speculation
will always fail here. And the reason is because of this reference counter. So when we
decrement the reference count here, that's going to cause the reference count value that
we read here to look like it was wrong because it just changed and we read an older
value. It will squash speculation and this will never succeed. If you stop and think about
it for a minute, that's a little -- although that is strictly what the sequential execution
would have done, we don't really need it to behave this way. We want buffer pool
management to work, but we don't care about this artifact here.
So I don't really care -- in particular, I don't care which of these threads grabs the page
first. I just want to make sure that they have it pinned in memory the right way when
they access it.
So what we can do in this case to fix this is we can temporarily turn off the speculation
tracking mechanism. We can say, we're about to do something and we don't want you to
treat this as something that's a problem, so we're going to go ahead and change the
reference count. In order for this to be okay, we have to realize that it's possible that this
thread may get squashed for another reason, and if that happens, we have to have a way
to undo the side effect we just did.
So fortunately that's easy to do in this case. To undo a get_page, all we have to do is
decrement the counter and everything will be fine. So what we do is we escape
speculation. We do the operation. We store an undo function. We turn speculation back
on again.
Yes?
>>: [Indiscernible]?
>>Todd Mowry: I meant [indiscernible].
So to show you in the actual code, here's what we did. We implemented this again by
hand. This isn't automatic. Chris decided that this was the thing to do. So we put a
wrapper function around get_page. In the middle of it it calls get_page with this ID that
we pass in. In addition, as I mentioned, we turn off speculation. We have an instruction
to our hardware that says turn off speculation, don't track it, then we turn it back on here.
We also do some extra checking of the arguments to make sure they're okay because
when you're running speculatively, you may get a bad ID and we want to make sure we
don't do anything bad inside this routine, so we do some extra checking here. Also we
don't know how to do multiple get_pages concurrently, so we add a mutex to make sure
there's only one active call to get_page at a time.
Then finally we say that the way that you undo this is you need to call put_page in order
to undo get_page. And this works because the effects here are isolated. It's not going to
cause a cascading chain of aborts if we do this. We have a way to undo all of our side
effects. This could be used for other things like memory cursor management or malloc,
for example.
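Putting the pieces from above together, a sketch of roughly what that get_page wrapper looks like is below. The TLS primitives (tls_escape_begin/end, tls_register_undo), the sanity check, and the pool/page types are assumed stand-ins for the actual interface used in the study, not the real code.

#include <pthread.h>

/* Sketch of the get_page wrapper described above. */
static pthread_mutex_t get_page_lock = PTHREAD_MUTEX_INITIALIZER;

page_t *wrapped_get_page(pool_t *pool, page_id_t id)
{
    /* A speculative caller may have computed a garbage page id; check it
       before touching anything inside the buffer pool. */
    if (!page_id_is_sane(pool, id))
        return NULL;                        /* this epoch will get squashed anyway */

    pthread_mutex_lock(&get_page_lock);     /* only one real get_page at a time    */
    tls_escape_begin();                     /* don't treat the ref-count bump as
                                               a dependence violation              */
    page_t *page = get_page(pool, id);      /* pins the page: ref_count++          */
    if (page != NULL)
        tls_register_undo(put_page, pool, page);  /* run put_page if this epoch aborts */
    tls_escape_end();                       /* resume normal speculation tracking  */
    pthread_mutex_unlock(&get_page_lock);
    return page;
}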
Yes?
>>: Do you run into problems if the operating system deschedules one of the threads in
this speculative team?
>>Todd Mowry: Yes. So we don't want that -- I mean, well, it's not that anything crashes and burns. What happens is speculation will fail. That's actually really the only thing that goes wrong, I think. I mean, we want the OS to be able to deschedule things if it wants to. In general, if you have a parallel application and you swap out one of the threads, unpleasant things can happen if you're holding a lock.
>>: You'll still be able to detect conflicts?
>>Todd Mowry: Yes. Well, okay, what happens is we have a limited ability to detect
conflicts. If things ever get knocked out of the cache, we say there's conflicts. Our
default is we squash everything. So we can always fall back to that. That's the reason
why it still works if the operating system reclaims something. It's not fast anymore, but
it's functional.
We know how to fix the problem with get_page, which we just described. Now what
about put_page? So here we are decrementing the reference count. Can we play the
same trick here? We can't, 'cause get_page is not a proper way to undo -- you can't undo put_page with get_page. It may end up going somewhere else and this
could cause very different behavior. So instead what we do is we just postpone doing the
put_page until the thread receives this token that says you are no longer speculative
because everyone ahead of you has already committed, and it's definitely safe to do that.
The downside of that is we're actually increasing the lifetime in which this is pinned in
memory. That's a tradeoff that seemed okay. We're increasing it by a couple thousand
instructions and memory is large, so we're not too worried about that in this case.
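A sketch of the put_page side is then just to defer the call; tls_when_nonspeculative here is an assumed runtime hook, not the actual interface:

/* Sketch: instead of decrementing the reference count while still
 * speculative, queue the put_page and run it once every earlier epoch has
 * committed and this epoch holds the no-longer-speculative token. */
void wrapped_put_page(pool_t *pool, page_t *page)
{
    /* The page simply stays pinned a little longer than it strictly needs to. */
    tls_when_nonspeculative(put_page, pool, page);
}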
If we do these two things, we can now actually run code with get_pages and put_pages in
parallel, and that won't cause speculation to fail. So this is one example of the type of
thing that we were doing.
Okay, so in general delaying something until the end of the epoch when you're not
speculative is another trick. So Chris went through this -- as the title of the talk says, he went through a month's worth of effort finding the next thing and fixing it. The other
thing I forgot to mention is that Chris had never seen BerkeleyDB before he started doing
this, and he doesn't know much about databases. He's a systems guy, not a database guy.
He never took a database course or anything like that. So to him this was just a big giant
piece of software that he didn't understand.
But all of the fixes that he made basically fell into three buckets. There are the two things we just saw, escaping speculation and delaying an operation, and then the other bucket was that there were several cases where there was a central structure, and the solution was simply to distribute it -- have a parallel version of it across the different cores. For things like, say, gathering statistics, you could gather local statistics and then just add them together later rather than having one counter. That's one simple example here. These are the things that he did.
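As a sketch of that third bucket -- illustrative names, not the actual BerkeleyDB statistics code -- the change is basically this:

#define NUM_CORES 4            /* assumed: one slot per core on the quad-core */

/* One shared counter means every speculative epoch writes the same word and
 * speculation always fails; per-core slots let each epoch touch only its own
 * word, and the total is computed lazily when somebody reads the statistic. */
struct dist_counter {
    long count[NUM_CORES];     /* a real version would pad these to separate
                                  cache lines to avoid false sharing */
};

static inline void counter_bump(struct dist_counter *c, int core)
{
    c->count[core]++;
}

static long counter_read(const struct dist_counter *c)
{
    long total = 0;
    for (int i = 0; i < NUM_CORES; i++)
        total += c->count[i];
    return total;
}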
Okay, so now I'm going to show you some performance results here. So we simulated a
quad-core processor. It was an out-of-order superscalar. Here is the memory hierarchy we were modelling. We were running TPC-C in-core. We weren't worried about disk
accesses. Yeah, we're also measuring the latency of a transaction, not its throughput.
Okay, so here are results. So first I'm showing you the original unmodified version of the
application running on one core on a quad-core. So that means three of the cores are
doing nothing because it's a sequential piece of software, at least one transaction is. So
the way this is color coded is that black means that there's a CPU that's doing nothing. So
three of the processors -- this is normalized execution time -- are sitting idle. The one
processor that's busy, the blue part is the time when it's doing useful work and the green
part is the time when it's stalled for cache misses.
Okay, so now we've got this wonderful thread level speculation mechanism that we spent
years developing and we turned it on and parallelized this embarrassingly parallel
sequential loop and we got a slowdown. So things were worse than they were originally
by a little bit. By about 8 or so percent. This is without any of the optimizations that I
just mentioned. The reason is that the speculation is always failing. So the red part is
time when you've done work that you have to throw away because it wasn't useful. It's a
little bit worse in part because the cache miss time actually went up a little bit because
previously all of the data was on one core's cache and now we've spread it across all the
cores, so now our cache misses are more expensive. We lose a bit here.
So then Chris started going through this process of iteratively fixing things. So first he
fixed the problem with the latches, and things actually got even a little bit worse. And he
fixed the problem with locks, and we're still not speeding up. Then he fixed the problem
with malloc and things still are not good. Then he fixed the problem with the buffer pool,
suddenly we saw a nice jump in performance. More than a quarter of the latency
disappeared. You might wonder why he bothered to do the first three things rather than jumping straight to the fourth step. The answer is you had to do all of them. There were four problems lurking underneath each other. When you eliminated all of them, you saw this
improvement. Then he kept going, did something with the cursors, fixed some false
sharing, fixed logging. We eventually got to nearly a twofold speedup here.
Yes?
>>: Are all of the techniques on that?
>>Todd Mowry: They keep adding on.
>>: I assumed that. But are all of them necessary to get close to that final performance?
You said they were all lurking. Are every single one of them important?
>>Todd Mowry: Yes. So the way we get the feedback is that the hardware always tells
us the most critical one. So if we didn't fix that one, it wouldn't matter if we fixed
anything else underneath. It might matter if we eventually fixed the critical one, but until
you fix the critical one, you won't see any improvement.
>>: I understand. But if you take the end point that has everything.
>>Todd Mowry: These last two didn't improve upon this one, if that's what you're
asking. We didn't really need to go all the way out to this point. Like once we got to that
point, it kind of levelled off.
>>: [Indiscernible]?
>>Todd Mowry: Uh-huh.
>>: Could you, for example [indiscernible]?
>>Todd Mowry: No. I mean, you wouldn't do any better than that. I mean, if you
wanted to stop here, fixing these things without fixing that would not improve anything
beyond this. They're sort of strictly ordered in terms of being the more critical
dependences. If you don't fix them, you don't get to move to improve the performance.
Yes?
>>: [Indiscernible]?
>>Todd Mowry: Yes.
>>: So I have a comment about this.
>>Todd Mowry: Right. Unfair is fine. I may give you an unfair answer [laughter].
>>: [Indiscernible]?
>>Todd Mowry: Yeah, so the one big caveat I think for interpreting this work is there
actually are legitimate arguments you can make about why you may not like this feature
all the time, certainly in a real database system. If you didn't have enough transactions to
keep the entire system busy, there's certainly no reason to turn this on. You would only
want to turn this on if you actually had idle resources and you're willing to pay the power
cost of turning them on.
So but, yeah, you have to keep that in mind. You don't want to make non-speculative
work slower by doing this or it will hurt throughput and potentially even latency.
But the other way I like to look at it is, pretend this isn't a database system. Just take a
transaction and say, well, ignore what it's doing, this is a pretty complicated piece of
software, and we managed to make it run a lot faster.
>>: How heavily loaded are typical systems that run this application?
>>Todd Mowry: It's very bursty. That's my impression. Sometimes they are saturated,
but most of the time it's not saturated. It's very bursty. So you want to monitor the bursts and turn this off during a burst. It's easy to turn this off. You can turn this off with one
instruction, so...
>>: How much does your programmer need to worry, A, about the underlying
microarchitecture versus just the code? For example, false sharing is an example,
hardware details, then, B, about the specifics of the speculative threading protocol?
>>Todd Mowry: Okay, false sharing was really the only one about the memory system
in this case. Other than that, there weren't any other like system parameters that you
would need to know other than like your ability to buffer threads, you would want to
realize that you shouldn't make threads that are way larger than your system's ability to
buffer things. But we never actually thought about it. We just tried to design something.
I mean, we just used the -- well, I'll talk later about the hardware support for this, why we
can buffer enough.
But, yeah, when Chris did this, he wasn't thinking at all about any system parameters. He
just realized that, oh, that's not parallel code. I could make that parallel code by doing
this little thing. And the key thing that I probably didn't emphasize is the reason why this
works so well and you can do it so quickly is you don't actually have to go off and
understand all of the code when you make one of these changes, you can just zoom in on
this specific thing you're working on, and you can still have lots of other problems
going on. In fact, there are. Even as he's doing this, there's dozens of dependences sitting
underneath this. We never really get full parallelism. We get partial parallelism. Things
are somewhat overlapped. That's why we get a speedup of two as opposed to four on four cores. It may be possible to push it further, but we didn't work on that.
Okay, so at this point we actually can't improve this much more because the only part of
execution time that gets better with better speculation is this violation time, and that's
already fairly small. The reason why we aren't running faster is the number of iterations
in this loop isn't very large. So we just don't have enough parallel work to keep all the
cores busy in many cases. That's really the bottleneck. We tried an artificial experiment
where we increased the number of iterations artificially and we got much better speedup.
But that's what you would expect.
Again, the important part is it took Chris 30 days of effort and he changed less than 1200
lines of code out of 200,000 and he didn't understand the application and didn't really
know -- I don't want to make Chris sound like he's not a bright guy, he's a very bright
guy, but you don't need to be an expert on this to get this kind of improvement.
So we looked at the other -- New Order is one of the five transactions in TPCC. We did
the same process to the other four. Two of them sped up very nicely, even more than
New Order. So delivery is running -- we've eliminated two-thirds of the execution time
here and stock level is running more than twice as fast. The other two cases did not
improve. And in these cases, there weren't loops. There were a lot of dependences. We
couldn't find any obvious way to break them into smaller chunks. So this doesn't always
work. But it worked well in these three cases.
Okay, so for a database type audience, the conclusions from this are that we found this
new form of parallelism. We can now break up transactions into different chunks. We did
have to change the database management system, but it didn't take very much effort to do
that. We think this could be applied to more than transactions, not only within databases,
but other types of software.
Okay, so that's the main part of the talk. I've got another little part.
Yes?
>>: Did you look at energy efficiency?
>>Todd Mowry: No.
Okay. So, I mean, you just have to decide what your preferences are. If you are incredibly stingy about energy, then don't speculate. But you have to weigh that against your preference for latency, for improving performance. There's a tradeoff there and you always have to decide
how much you care about one thing versus the other. Without this, it's not obvious to me
at least how you would make transactions run faster 'cause they're not concurrent, and
clock rates aren't getting any faster.
Yes?
>>: [Indiscernible] when you turn speculation off, how much extra do you pay for this, if
you have a peak, for example, you don't want to have speculation running?
>>Todd Mowry: Let's see. So, well, you can figure that out actually by just looking at
the height of the blue bar. The software overhead shows up in this blue section. It's
actually increasing a little bit. But it's only a couple percent. So you're not paying very
much. So, yeah, that's the basic answer.
Yes?
>>: [Indiscernible] single client?
>>Todd Mowry: Yes.
>>: Did Chris look as you increase the number of clients [indiscernible]?
>>Todd Mowry: We didn't look at that. That would be interesting to look at, but we didn't do that, so...
Yes?
>>: [Indiscernible]?
>>Todd Mowry: That's a good -- we did not do that, no. I don't -- yeah. I don't know
what the answer would be. I believe that the things that we did here -- the nice thing,
they're all generally useful, I think. I didn't make this clear. But what Chris did is he
started with New Order and he made these improvements, and then he simply ran these
cases without changing the database further, database management system. So that the
fixes for New Order are also very useful for these two cases. There weren't any other
bottlenecks he had to fix. TPC-H, I don't know how it would stress the database
management system. There's a decent chance that the improvements here would be
useful for it. I mean, they have much longer running queries. In a sense, part of the
reason why we didn't do it is because intraquery parallelism works so well generally that
we didn't think it was as interesting, but there's no reason not to try it. That would be
another experiment to do.
Okay. So the second part of the talk, I'll just go through this quickly, but this is about what we learned about changing the underlying design of the system to make this work. I
know that the majority of you aren't architects, but you might find this a little bit
interesting.
So, okay, we have to buffer large speculative threads, and we have to deal with
subthreads. Those are the main two things. When we were parallelizing [indiscernible],
our threads were tens to hundreds or thousands of instructions long. We were just
parallelizing loops in C code, individual iterations typically aren't that big. But here our
thread sizes, even though we're parallelizing a loop, it's a loop in SQL. Beneath that is
a database management system engine that's invoking lots and lots of other code. That
turns into tens of thousands of instructions. Delivery was almost half a million
instructions. This is the number of dynamic data dependences on average between those
threads. So these are 75 or so dependences. We start off with 75, and he fixed about a dozen of them. So there are still about 60 or so dependences that he didn't fix, but they just happened to be less critical and we could still get some performance
improvement.
So one of the challenges is we have to buffer larger threads. Well, so, there's related
work here. The big difference is in our original design, we buffered everything in the
first level caches and we now realize we need to buffer it deeper in the memory
hierarchy. This involved adding another bit to our first level cache to keep track of some
extra state and adding a few bits to the tags of our second level caches. Also, when there are collisions, an interesting thing happens: when you're physically sharing the cache and different threads are writing to different copies of things, you end up with different versions of something. So what we did is we would actually allow replication within the set-associative cache. So that was another change. We added a small victim cache in case
we had some pathologically bad cache mapping. That worked well.
We also had to support subthreads. So the only trick for supporting subthreads is we
already have, external to the cache, these epoch numbers. These are logical timestamps.
They sit on the side because there are only a handful of them at a time.
>>: [Indiscernible]?
>>Todd Mowry: Yes. This is how we implemented incremental checkpointing. All we
had to do was add support for some more virtual timestamps. The trick is you just bump
the timestamp number, that's all you have to do. That takes no time basically. Just adds a
few more bits to the tags.
Another question that we thought might be -- something we thought might be an
interesting research question was where do we take the checkpoints? So our intuition was that the optimal answer would be -- let's say that this is what our code looked like and we're going to have a dependence
violation here, the optimal thing would be to take a checkpoint just before a very risky
load, one that's likely to fail. So we thought maybe we need to build some predictor to
predict which are the risky loads and take the checkpoint just before them. We were
excited to write a paper on this brilliant idea. But then we discovered just actually doing
something very simpleminded works quite well. If you just take a checkpoint every five
thousand instructions, it works really well, just as well as this.
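In simulator-style pseudocode, the simpleminded policy amounts to something like this; the structure and names are assumptions about how you might express it, not our actual implementation:

#define SUBTHREAD_INTERVAL 5000    /* retired instructions between checkpoints */

struct epoch {
    unsigned sub_timestamp;        /* which subthread/checkpoint we are in     */
    unsigned insns_since_ckpt;     /* retired instructions since the last one  */
    unsigned subthreads_left;      /* hardware only supports a handful         */
};

/* Called as instructions retire; starting a new subthread is just bumping a
 * counter, so a violation only rolls back to the most recent sub_timestamp
 * older than the conflicting access, not to the start of the whole epoch. */
void maybe_take_checkpoint(struct epoch *e)
{
    if (++e->insns_since_ckpt >= SUBTHREAD_INTERVAL && e->subthreads_left > 0) {
        e->sub_timestamp++;        /* "bump a counter somewhere" */
        e->subthreads_left--;
        e->insns_since_ckpt = 0;
    }
}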
Okay, so here are some results here. So going back to -- oh, these are all the different
things, plus this artificial version of New Order where we increased the number of
iterations by a factor of 10.
Now, this is not the number I showed you before. This is what the world would look like
if we did not have our checkpointing support. Remember, New Order was running twice
as fast before. Now it's running faster, but not nearly as fast. Delivery was running three
times faster, and now it's only running a little bit faster. Sorry, this delivery outer was
running three times faster out here. Two different versions of this. So subthreads are
important. If we had no dependence violations, if this was perfect, we would only get the
bars on the right. So the checkpointing allowed us to cover much of the distance between
the baseline case and the perfect case. So there's not a huge amount of room for further
improvement.
Okay, so that's important. And subthreads didn't actually affect the cache performance,
not that we would expect it to affect it very much.
>>: Let me ask you about that. If you have a violation which ends up rolling back a
speculative thread through a large number of what you call subthreads or checkpoints, you
would expect to see a large performance gap between that and no dependence violations.
So what you're saying here is that --
>>Todd Mowry: Yeah, that shows up as this -- the red part in the middle bar is that time.
>>: Yeah. So what that is saying, with this technique, almost all of the time you rollback
a small number of subthreads.
>>Todd Mowry: Yes.
>>: So this idea we discussed today of doing selective replay and only playing the stuff
that hasn't changed in terms of data wouldn't be all that useful because there's not very
much more to get?
>>Todd Mowry: Yes, that's right. Yeah, the difference between the height of the red
bars is the amount we've got -- actually, in some cases it didn't help very much. Well,
these cases are never very good on the right. But in these cases, yeah, there's not much to
get. In these cases we've gotten much of what there is to get.
>>: [Indiscernible]?
>>Todd Mowry: Gee, I have to remember. I have to look it up. Something like four or
eight. Just multiply that number by two and that's the number of bits you have in the
cache tag. That's the cost of it.
>>: [Indiscernible]?
>>Todd Mowry: No. We -- oh. Let's see. Let's see. How did we do that? I forget the
answer to that. I'll have to look that up after the talk. I know we did that properly.
>>: [Indiscernible]?
>>Todd Mowry: Yeah. I don't remember the answer to that. I remember discussing
that, and I forget what the answer is. I'll look it up at the end of the talk. But I don't
remember off the top of my head how we did that. I remember how we did it generally,
but I know we didn't do it that way here.
I said we have a victim cache. The question here is, how large would the victim cache
need to be with different amounts of associativity to capture everything that gets
evicted? This is a one megabyte L2 cache. You would need dozens of entries here. If
it's eight or 16, then you don't need very many entries. So it can be relatively small and
work.
This graph just varies the size of the subthreads, using a fixed size, starting from 250 dynamic instructions going out to 25,000. The sweet spot happens to be at around 5,000,
for some reason. So if you make this too small, the tradeoff is you end up using up all of
your checkpoints early on in the thread and then you run for a long time and you can't
checkpoint anymore. If you make them too large, you have to go back further to get to a
checkpoint. So given just the particular sizes of our threads, where the dependences were
occurring, this is where it was happening, you could easily imagine changing this
dynamically. This is something in the hardware that it decides how often to do this. So
this could be dynamically adjusted.
>>: Were all of your -- this is a case where reasoning about the microarchitecture, the
size of the L2, would be something that the programmer really had to do if you weren't
going to do some sort of automatic adaptation of the subthread size? In other words,
turning that around, if I supported a certain subthread size in hardware, the programmer
would need to reason about that in terms of choosing the --
>>Todd Mowry: We didn't bother to do this, but if the hardware just tracked maybe after
the first iteration the dynamic size of each thread, and just divided that evenly between
the number of checkpoints that you have, I think that would work well, probably pretty
well. And it turned out that with 5,000 instructions per checkpoint, I think we had maybe
eight checkpoints, that was enough instructions that you got 40,000 into the thread, and
that was -- you gained enough with that that everything was good.
A question?
>>: Did you think about recycling checkpoints? You run out of check points, you throw
away the first one, stick it on the end?
>>Todd Mowry: I'd have to go back and think about that. I think there may have been a
reason why that wouldn't have worked, why that would have been painful to implement.
But it seems like a good idea. Other than there being some detail in our design why that
would be hard to do, it would make a lot of sense to sort of exponentially back off the
checkpoints and keep them more recently and fewer of them as you get back in the past.
That's probably another good improvement on this.
But it was already working well enough that -- oh, yeah, these are the numbers of subthreads. Here we go: two, four and eight subthreads. So we found it was working well enough that, you know, it seemed like there wasn't that much to gain by being
fancier than we were being.
Okay, so, you know, my last two slides here. So for an architecture audience what we
learned are that the lightweight checkpoints or subthreads are important to get speedup in
cases where there are a lot of dependences between the threads. The fact that we get this
feedback about the critical dependence is important because that allows the programmer
to iteratively fix things.
Now, the reason why I mention that is it's possible to build TLS and not make this visible
to software. This is something that we would want hardware people to make visible to
software, because if you don't do that, you're losing a big opportunity to improve your
code. You might even argue that even if TLS doesn't functionally work, if you simply profiled this and made that visible to the programmer, that would be very interesting information. It would tell you, just dynamically, what goes wrong if I try to run this
in parallel.
So we found a way to deal with large threads, and this is a pretty simple set of extensions
to existing design.
So final slide. I think we got a fairly compelling amount of speedup on a large piece of
software with a fairly small amount of effort. And basically what we need are subthreads and the feedback that I just mentioned. There's hope for dusty deck codes on multicore
without having to invest years of time to go back and rewrite something from scratch.
So if you like this idea, you might want to mention it to your friends at Intel. Hint, hint.
>>: Don't you have ties into Intel?
>>Todd Mowry: I consult for Intel. I'd be happy to introduce you to the right people.
>>: [Indiscernible]?
>>Todd Mowry: Yeah. Actually we had a paper several years ago showing exactly that.
So we can -- we designed our support so that the chip boundaries didn't mean anything.
So it just worked on top of coherence, however coherence is implemented. It will just
work on any scale of system.
Now, the only thing that limits that being interesting is how much speedup you get. There
are cases where you can get interesting speedup across multiple chips where it's
speculating across the whole thing.
>>: Do you have to do anything special with inflight messages when taking your
checkpoints?
>>Todd Mowry: No. You just need to -- there's a logical timestamp, when a thread is
executing, it has a logical timestamp. The trick we're doing is we're bumping its logical
timestamp. We gave it a range of timestamps that are contiguous. We say, now you're
the first one, second one, next one. All you need to do is to make sure you atomically
change that with respect to any cache actions. The messages can still be coming in, you
just need to hold them off for a minute while you change the count and let it proceed,
then everything should be okay.
>>: [Indiscernible]?
>>Todd Mowry: Yes.
>>: [Indiscernible]? Do you think the situation might change when you implement it in real hardware? Adding bits to the cache might be a problem in hardware because size increases
and latency increases and it might look like a good idea in software simulation where it's
cheap, but in hardware it might not be cheap?
>>Todd Mowry: I don't believe -- let's see. I don't believe the absolute performance of
our simulator. Actually I wrote a lot of the initial part of the simulator and we tried to
design everything so that, whenever we were cutting corners on something, we didn't think it would change any of the relative performance gains of anything. That's why I
always present normalized execution time. Because I think the ballpark speedup
improvement, I do believe that, plus or minus 5%, whatever.
I wouldn't expect it to suddenly be the case that this was slowing down a lot rather than
speeding up a lot. In terms of going from the bar where we slow down a little bit in our
initial -- that's not the one I want. It's back here. So, like, in terms of how much is this
slowdown, I don't know exactly. It could very well be worse than that because we didn't
model the extra latency to the primary cache because we had more bits in it or something
like that. That's something -- there were things that we didn't model.
Great, thanks.