>>: All right, good morning. It is my pleasure to introduce Chris Rossbach, who is here from UT Austin interviewing with us. And he -- I will say this: He's the first candidate we've had interview who has brought his guitar along with him. He has a gig on Friday. So I suppose if somebody really wanted -- >> Chris Rossbach: If things go really badly -- >>: Will you play a request? >> Chris Rossbach: [Laughter]. -- I can always play a song. >>: All right. So -- >>: [Indiscernible]. >> Chris Rossbach: So in fact -- [laughter]. >>: So it really is great to have you here, and I'll let him get into his talk. >> Chris Rossbach: All right. Great. So good morning, thanks for coming, and clearly, I'm going to try to spend some time talking to you about how we can make concurrency more accessible through better abstractions. So the motivation for this talk and in fact for most of my research is an abiding interest in concurrency, and particularly in finding abstractions and mechanisms that can make it easier to manage and exploit concurrency. And I think this is an urgent problem currently because we are in a position where parallelism is really the only way forward in terms of performance. So we see this in a couple of domains. Chip manufacturers have stopped scaling things like clock frequency and process as a way of improving single-threaded performance and have shifted the burden on to the shoulders of the programmer by starting to scale the number of cores on a chip. And we see also, you know, a proliferation of massively parallel hardware in the form of GPUs or graphics processing units. And I think while there has been, you know, a trend where parallel hardware is increasingly abundant, tools that make it easy to program that hardware have not really enjoyed the same rate of growth. Okay. And particularly in the CPU programming world, locks and threads remain the state of the art.
They come with a long list of really well-lamented issues like deadlock, livelock and so on. And while GPU-based programming has also seen a lot of big leaps forward as we've seen new parallelization frameworks like CUDA and OpenCL come around, at the end of the day, programming these devices is still something of a black art. It requires systems-level knowledge of the device. And so I put it to you that better programming tools are urgently needed. And at the heart of the task of finding better programming tools is deciding what abstractions those tools are going to support. And I think choosing the right abstractions for concurrency can impact a lot of different levels of the technology stack. My interest focuses on operating systems and architectural support, particularly on the interaction of the two. And in this talk, I'm going to talk about two seemingly separate areas. I'm going to be talking about transactional memory support in the OS and -- I'm going to talk about my new preliminary work on GPU support in the OS. The high-level theme here is we're looking at places where the operating system and the hardware can collaborate to provide better abstractions to make it easy to get at the concurrency that these devices enable. Okay. So brief outline for the rest of the talk. I'm going to spend the first bulk of the talk talking about transactional memory. I'll move on, talk about OS support for GPUs, touch briefly on future work and conclude. Okay. So let me posit for your consideration that programming with locks is still very much a necessary evil. And why is it evil? Well, probably -- sorry. >>: I said you probably don't really need to tell us, but you can -- >> Chris Rossbach: Probably don't need to tell you. So I will briefly breeze through these points. They deadlock, they livelock, they compose poorly. They have poor complexity and performance tradeoffs and ultimately it's just hard to get right.
Now ostensibly, transactions, and particularly as supported by transactional memory, are free of these problems. And as sort of a more concrete illustration of what we'd like to be able to do with transactional memory, this is a 50-line comment from the Linux 2.6 memory manager's filemap.c. And this describes the lock ordering discipline for the locks in that file. So if you're going to write code in this file or alter the code, you need to understand this. And I submit to you that this is a lot of complexity. What we'd like to be able to do with TM, after we make it shimmer and shake, is sweep all this complexity away. All right. So I'm going to devote two slides to background on transactional memory for those who do not spend a lot of time thinking about it or don't spend as much time as I spend thinking about it. There are a few key ideas and key abstractions you need to understand. And the key idea, the key idea really is that critical sections are going to execute speculatively and concurrently. So this is in strict contrast to locks, which enforce mutual exclusion, which causes critical sections to serialize. What we want to be able to do with transactional memory is say, hey, everybody go for it. If a sharing pattern occurs that violates correctness or endangers correctness, we'll detect that dynamically using the TM hardware and use checkpoint and rollback mechanisms to retry until we know we can get a correct answer. Now, the abstractions that you need to support this, they're really -- I'm showing you six primitives here. The ones in blue are ones that you can expect in any transactional memory implementation. Primitives to begin, end, and retry transactions. The ones in red are machine-level instructions that I'm going to convince you hopefully in the next several slides are additional primitives you need if you're going to support transactional memory in an operating system. Now, a conflict -- and I'll define them in greater detail subsequently as well.
You just need red instructions. [Laughter]. A conflict between two transactions occurs when there's a non-null intersection between the write set of one and the read and write set of another. Or more informally, if two transactions access the same datum and at least one is a write, you have a conflict, at which point you will invoke this thing called a contention manager, which comes in and enforces some hopefully performance-preserving policy about which transactions need to restart and which can continue to execute. And there's been a big body of research that has shown we need flexible policy in this area in order to improve performance. So that was kind of the verbose take on it. Here's a more visual take, and I'm going to show you how hardware transactional memory can allow two critical sections to execute concurrently on CPU 0 and CPU 1. These CPUs are simplified so they have a program counter and a read and write set. And what we see is that as these CPUs step into the critical section and begin reading variables, the variables appear in the read set that is maintained on the processor. Now when we get to line three, and CPU 0 reads C but CPU 1 writes it, at this point we have a conflict under the definition on the previous slide. So we'll invoke the contention manager, which comes in and makes a decision about which of these CPUs gets to continue executing and which has to roll back. So for the sake of the argument, let's say that the contention manager decides in favor of CPU 1. This means that CPU 0 has to roll back, and its read and write sets are cleared. CPU 1 gets to commit and continue. And by the way, I do encourage you to stop me and ask questions, although I suspect you don't need a lot of encouragement on this front, but . . . So I also want to stop and -- >>: [Indiscernible] [laughter]. >> Chris Rossbach: >>: What? What? [Indiscernible] previous slide. >> Chris Rossbach: Sure.
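The conflict definition given in the talk (the write set of one transaction intersects the read or write set of another) can be sketched in software as follows. This is only an illustrative model, not how the hardware tracks sets: `struct tx_sets`, `tx_read`, `tx_write`, `tx_conflict`, and the small-integer "addresses" are all hypothetical names chosen for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SET 8  /* arbitrary capacity for this sketch */

/* Hypothetical per-transaction read and write sets, tracking addresses as ints. */
struct tx_sets {
    int reads[MAX_SET];
    size_t n_reads;
    int writes[MAX_SET];
    size_t n_writes;
};

/* Record a transactional read of addr. */
void tx_read(struct tx_sets *t, int addr) {
    assert(t->n_reads < MAX_SET);
    t->reads[t->n_reads++] = addr;
}

/* Record a transactional write of addr. */
void tx_write(struct tx_sets *t, int addr) {
    assert(t->n_writes < MAX_SET);
    t->writes[t->n_writes++] = addr;
}

bool set_contains(const int *set, size_t n, int addr) {
    for (size_t i = 0; i < n; i++)
        if (set[i] == addr)
            return true;
    return false;
}

/* True if any address written by w was read or written by other. */
bool writes_hit(const struct tx_sets *w, const struct tx_sets *other) {
    for (size_t i = 0; i < w->n_writes; i++) {
        int a = w->writes[i];
        if (set_contains(other->reads, other->n_reads, a) ||
            set_contains(other->writes, other->n_writes, a))
            return true;
    }
    return false;
}

/* Conflict: same datum accessed by both, and at least one access is a write. */
bool tx_conflict(const struct tx_sets *a, const struct tx_sets *b) {
    return writes_hit(a, b) || writes_hit(b, a);
}
```

In the slide's scenario, both CPUs read A and B without conflict; once CPU 0 reads C and CPU 1 writes C, `tx_conflict` reports a conflict and a contention manager would have to pick a winner.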
>>: In this case, if seven was empty in there, you'd have a conflict by your definition, but either serialization would be correct. Is that right? >> Chris Rossbach: If seven was empty? >>: If there's nothing in line seven there, if you just did the read or write and then you did XM. >> Chris Rossbach: Oh, so what you're suggesting is that there is actually an ordering which could yield the right answer in this case? >>: I'm suggesting that any of the orderings that you could come up with for those things produces a serialize that will result for that particular ->> Chris Rossbach: >>: In other words, it's not really a conflict. >> Chris Rossbach: >>: Well -- Well, actually, I'm not sure -- I'm not sure I buy that. It's a conflict but [indiscernible] -- >> Chris Rossbach: I think it's a conflict -- I think the a conflict depending on what order you commit these transactions in. So the order I -- right? So ->>: [Indiscernible]. >> Chris Rossbach: Well, because it might be -- if you can serialize all of the writes of one after all of the reads of another, you do get a correct result regardless, right? So are you asking about if we leave this interleaving without serializing these critical segments? Or are you ->>: What I'm saying is that the only thing that -- the only interaction between those two isn't C, right? >> Chris Rossbach: >>: A and B are read only. >> Chris Rossbach: >>: Yes. Okay. Mm-hmm. Because I'm positive that -- >> Chris Rossbach: >>: Okay. So you're talking about a blind -- blind -- right -- So what I'm saying is that -- >> Chris Rossbach: Yes, I agree. >>: -- no matter what you do with this, if you actually -- there's no real conflict here. >> Chris Rossbach: these ->>: Yes, I agree because there's a blind write in one of Yes. >> Chris Rossbach: I totally agree, yes. simplified example. >>: So that's kind of an artifact of a You asked -- >> Chris Rossbach: No, no, you -- [laughter]. And I concede that you're right. [Laughter]. 
I should probably -- it would be probably a good idea to insert a read of C here in both and then that would get rid of the problem you're talking about. >>: So the bigger question, the more interesting question is: Does this kind of thing happen in practice? Is the definition that you're using actually generating conflicts in practice that aren't really necessary? >> Chris Rossbach: >>: Yeah. So -- [Indiscernible] -- >> Chris Rossbach: So the answer is absolutely. And you know, I really wish that I had chosen to talk about a different paper because that would be a perfect [laughter] setup. >>: Well, guess what. You get an hour this afternoon to talk about [indiscernible]. [Laughter]. >> Chris Rossbach: So yeah. It happens. It's really workload dependent. So there are a lot of sharing patterns where essentially any time you have variables that are -- have a lot of write sharing like counters, this definition is too conservative. It's definitely possible that the interleaving that you execute that has a conflict under this definition can still produce the correct result. Okay. And it's possible also to use a technique that I call dependence awareness that allows you to essentially keep speculating in the presence of that kind of conflict and then sort it out at a later time. I did this actually -- it wasn't safe to tolerate this conflict. And if so, then commit and otherwise -then roll back. And in some cases, if you have things like shared counters in environments where like garbage collection, a lot of statics, like lists have these kind of patterns, if your workload features a lot of that, you can get significant speedups. What we found is that in two of the stamp benchmarks, we got speedups of up to 30 percent. However, in a lot of cases, you don't. You know. 
If you don't have blind writes, if you don't have things that are essentially, you know, single points of write sharing, then you wind up investing a lot of complexity in accelerating that particular scenario. Great question. Okay. So up until this year, it has been sort of de rigueur at a TM talk to say, well, this is free of livelock and deadlock; therefore, it's easier and we need it and everybody should build it. And I can claim to have benefitted actually from that blind acceptance of that dogma, but I do want to dedicate a slide to talking about a paper I had in this year's PPoPP where we actually questioned this assumption and tried to bring some empirical evidence to bear on the problem. And what we did was we took five semesters' worth of UT's OS undergrad students and had them write the same programs using locks with fine and coarse grain, condition variables, monitors, and transactional memory. And essentially what we did is say okay, you're going to write the same programs nine different times and we're going to control the order in which you write them so that we can normalize for experience and what you learn in the last case. And we surveyed them afterwards and said, well, you know, was it easier, and then I read all their code and classified the synchronization errors. And what we found -- [laughter] -- yeah, that was a good time. [Laughter]. What we found is that, you know, the survey, you know, there were 36 questions in the survey, so this is maybe a little bit unfair to try to condense it into a single partial order, but ultimately this is kind of the take-away. If we asked them, did you find it easier, we didn't really get the stark, you know, flag-waving support we might have hoped for. What we found was that coarse-grain locks were actually the easiest to use. Condition variables, nobody liked them. No one could get them right, and it was kind of a tie between fine-grain locks and TM.
However, there was a very different story that was revealed if we go and look at the synchronization error rates across all 1300 programs. And what I found is that if -- you know, for all the programs that use coarse-grain locks, 27 percent had at least one synchronization error. A whopping 62 percent had errors for fine-grain locks, which might be kind of an indictment of our pedagogy, but this is in fact what we found. And then TM had error rates on average of less than ten percent. So you know, what I think this is ultimately saying is that if your goal is to write correct programs, which I hope it is, TM is in fact easier. And with that, I'm going to move on and talk about TM support in the OS. And what I believe is that the operating system really is the killer application for transactional memory. And among the reasons I believe this is that operating systems are among the most complex parallel programs around. And they can really benefit from simplification both in terms of reducing the number of locks and getting rid of the lock ordering disciplines. We need our operating systems to be correct, right? And hopefully I've given you some reason to believe that we can get things correct more easily with TM. And of course we need our operating systems to be fast. And by using optimistic concurrency, there's a good reason to believe we can get better performance. A lot of applications spend a lot of time in the kernel. If we can make the kernel faster, everybody -- everybody benefits. So you should use TM in an operating system. And the question remains, how to go about it. And we -- when we started looking at this problem, sort of the conventional wisdom about how to take a lock-based program and use transactions in it is you find some polling primitive like spinlocks, and you map acquires and releases to transaction begins and transaction ends. And this is in fact exactly what we did when we first started trying to use transactions in Linux 2.6.
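The mapping described above (lock acquire becomes transaction begin, lock release becomes transaction end) might look roughly like the sketch below. It is a model under stated assumptions: `xbegin` and `xend` here are hypothetical software stand-ins that only track whether a transaction is open, not the real hardware instructions, and `xspinlock_t`, `xspin_lock`, `xspin_unlock`, and `increment` are illustrative names rather than the actual kernel code.

```c
#include <assert.h>

/* Hypothetical stand-ins for the hardware begin/end primitives.
 * Here they only track nesting depth; real hardware would also
 * checkpoint state and track read and write sets. */
static int tx_depth = 0;

void xbegin(void) { tx_depth++; }

void xend(void) {
    assert(tx_depth > 0);  /* ending a transaction that was never begun is a bug */
    tx_depth--;
}

int in_transaction(void) { return tx_depth > 0; }

/* The lock word is kept so call sites need not change, but it is never touched. */
typedef struct { int unused; } xspinlock_t;

/* Acquire maps to transaction begin. */
void xspin_lock(xspinlock_t *lock) { (void)lock; xbegin(); }

/* Release maps to transaction end. */
void xspin_unlock(xspinlock_t *lock) { (void)lock; xend(); }

/* An existing critical section: the body is unchanged, only the
 * lock implementation underneath it differs. */
int counter = 0;

void increment(xspinlock_t *lock) {
    xspin_lock(lock);
    counter++;           /* would execute speculatively under real TM hardware */
    xspin_unlock(lock);
}
```

The appeal of this approach is that call sites stay as they are; its weakness, as the talk goes on to explain, is that a bare transaction has no answer for critical sections that perform IO.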
We posited this primitive called an xspinlock which essentially changed the implementation of spin_lock and spin_unlock to do exactly what I'm showing you up here. And we were able to change nine subsystems in Linux this way to manage about 30 percent of the dynamic lock calls over the workloads we looked at. And it took six of us a year to do this, which is, you know, it was kind of a long and painful year. So why did it take us this long? Well, we ran into problems that we didn't -- some of them we might ought to have foreseen; some we did not. IO is a big one. We of course want to be able to do IO in an operating system. Kind of important. And it's well known that IO is not always a good fit for transactions. Also idiosyncratic locking, so we saw a lot of cases where people were kind of creative with how they used locks. A great example is a spinlock that protects the scheduler's runqueue in Linux 2.6.16, which is acquired in one process context and released in another on a context switch, which is kind of a tweaky place to use a transaction. So what this eventually led us to is the conclusion that if you're going to use transactions, locks and transactions need to be able to cooperate. There's other great arguments for this. There's a vast body of lock-based code out there that you don't want to have to throw away just because you've drunk the transactional Kool-Aid and you've decided you want to -- yes? >>: [Indiscernible] you kind of just vastly increased the complexity of the operating system. Now have not only different [indiscernible] transactional memory discipline as well you have to incorporate. You have to raise both those together rather than saying what -- if -- >> Chris Rossbach: Right. >>: -- transactional memory is safer and easier [indiscernible], then [indiscernible]. >> Chris Rossbach: Sure. So I'm tempted to -- I'm tempted to respond to that as if you know what's in the coming slides. Is that true? No. Okay.
Maybe you should hold off on that question because if you are making the point that by adding yet another synchronization technique we've kind of increased the diversity in the ecosystem in a way that is likely to increase complexity, that's one thing. In the coming slides, you're going to be even more convinced. And I want to wait until you've seen that before I answer that question. >>: [Indiscernible]. >> Chris Rossbach: Yeah. >>: [Indiscernible] literally not allowed to read. true but it's close. [Laughter]. Well, that's not quite Let's put it that I haven't read it. You're allowed to do IO operations while your holding spinlocks and [indiscernible]? >> Chris Rossbach: Yes. Yes. >>: Okay. And you can take page faults while you're holding spinlocks [indiscernible]? >> Chris Rossbach: Yes. >>: Wow, okay. That's very foreign [indiscernible]. I'm not going to [indiscernible] design now. I just wanted to clarify that because it seems weird. >> Chris Rossbach: Yes, it does. And you can conditionally do line up with the spinlock out which is even weirder and essentially is -- what this point is about. So ->>: [Indiscernible]. >> Chris Rossbach: So just because you have acquired a spinlock and you're going to release it in that critical section, you might not always do the IO in the critical section. It might be something [indiscernible] ->>: [Indiscernible] -- >> Chris Rossbach: >>: Right. -- if some condition does IO, otherwise don't, right? [Indiscernible]. >> Chris Rossbach: And so, I mean, ultimately, you know, what I'm getting at with this argument is the idea that, well, if you -- why do you need a special tool to handle IO, Chris? Why can't you just look at the critical sections and say, oh, this one does IO, this one does not use a lock with this IO -- and you can't because this conditional -- conditional IO things nest at very deep levels and you can take page faults ->>: [Indiscernible] small step after spinning a while blocked on IO. okay. 
But >>: So kind of a related question is so Linux kernel [indiscernible]; is that correct? >> Chris Rossbach: I believe that is correct. >>: Okay. So they're less likely to have this problem, but they could for instance take a spinlock on block and copy it. [Indiscernible]. >> Chris Rossbach: >>: Thank you. [Laughter]. The -- [Indiscernible]. >> Chris Rossbach: Finally, the last argument for this really is that transactions perform well when -- I mean, the whole idea here is the common case is no contention. That's where we're going to get our performance benefits. That's when you want programmers to use this primitive. If there is contention, locks might be a better fit, and we want people to have still the ability to choose the right primitive for the right environment. And the abstraction that we eventually chose to address this situation is called the cxspinlock, which stands for cooperative transactional spinlock. And the idea is that we're going to dynamically choose between a lock or a transaction, depending on what's going on in the system. So most critical sections will optimistically try to execute a transaction, and if the hardware detects an IO attempt, I mean, it's hardware, right? You can definitely see the IO happening there if you can't see it anyplace else. If the IO attempt is detected, we're going to roll back and acquire a lock. And the trick here is involving the contention manager in the lock acquisition, because what this does introduce is the need to be able to serialize transactional execution against lock-based execution. And this is what we need these additional instructions for that I showed you in the previous slide. We need an xtest, which is a transactional version of a test instruction. Xcas, transactional version of a compare and swap. And those of you who spend a lot of time looking at spinlock implementations will recognize these as the basic building blocks of the spinlock.
And then this extra instruction called xgettxid, which allows a CPU to query for -- a thread to query for the existence of an active transaction on that CPU. And the final addition is we're going to let this xbegin instruction return a reason for the retry, so that if it rolls back because it detected IO, we get a return code that says IO occurred, and then we can decide what to do on the rollback from there. One nice thing about using cxspinlocks instead of bare transactions is that using preprocessor mangling, we were able to convert the entire kernel in a month with this abstraction instead of -- yes? >>: What do you classify as IO? >> Chris Rossbach: Yes. So this is at the CPU level, right? >>: So [indiscernible] and writes to memory -- >> Chris Rossbach: Right, DMA. >>: -- [indiscernible]. >> Chris Rossbach: Okay. So this is the cxspinlock API. Consists of three functions. The first is cx_optimistic. And this is the version that is going to optimistically attempt a transaction. And what it will do is if the xbegin returns this flag that says we need to -- we need mutual exclusion to execute this critical section, then it will use the cx_exclusive API to try to get a lock on the critical section instead. Now, cx_exclusive, as I've said, acquires a lock, but it's going to use this transactional cas instruction, which makes the writes and reads that it does to a lock variable visible to the transaction subsystem. And then the last one is cx_end, which releases a critical section, whether it's protected by a transaction or by a lock. So this is where we use that xgettxid instruction. If we're in a transaction, end it. Otherwise write the lock variable to release it. So I want to show you how we can use this primitive to serialize access to a critical section between transactional execution and lock-based execution. So CPU 0 on the left is going to try to use cx_optimistic to enter into a transaction. CPU 1 is going to use cx_exclusive.
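The three-function API described above can be modeled in a single-threaded simulation, shown below as a rough sketch. Everything about the "hardware" here is an assumption made for illustration: `xbegin`, `xend`, `xgettxid`, `xtest`, and `xcas` are plain C stand-ins, `struct sim_cpu` and `sim_io_rollback` are hypothetical helpers, and the simulated `xbegin` simply does not start a transaction when it reports that exclusion is needed. The lock convention (1 means unlocked, 0 means locked) follows the Linux convention the talk mentions.

```c
#include <assert.h>
#include <stdbool.h>

enum xstatus { X_STARTED, X_NEED_EXCLUSIVE };

/* Simulated per-CPU transactional state (hypothetical; real hardware keeps this). */
struct sim_cpu {
    int txid;       /* 0 means no active transaction */
    bool io_seen;   /* previous attempt rolled back because IO was detected */
};
static struct sim_cpu cpu;

/* Begin a transaction, or report why the retry must use mutual exclusion. */
enum xstatus xbegin(void) {
    if (cpu.io_seen) {
        cpu.io_seen = false;
        return X_NEED_EXCLUSIVE;
    }
    cpu.txid = 1;
    return X_STARTED;
}

void xend(void) { cpu.txid = 0; }
int xgettxid(void) { return cpu.txid; }

/* Stand-in for xtest: real hardware also adds addr to the read set on success. */
bool xtest(const int *addr, int val) { return *addr == val; }

/* Stand-in for xcas: real hardware makes this write visible to conflict detection. */
bool xcas(int *addr, int expect, int newval) {
    if (*addr != expect)
        return false;
    *addr = newval;
    return true;
}

/* Simulate the hardware detecting IO inside a transaction and rolling it back. */
void sim_io_rollback(void) {
    cpu.txid = 0;
    cpu.io_seen = true;
}

/* Always take the lock: spin until the lock word goes from 1 (free) to 0 (held). */
void cx_exclusive(int *lock) {
    while (!xcas(lock, 1, 0)) {
        /* spin; in this single-threaded model only call when the lock is free */
    }
}

/* Try a transaction first; fall back to the lock if exclusion is required. */
void cx_optimistic(int *lock) {
    if (xbegin() == X_NEED_EXCLUSIVE) {
        cx_exclusive(lock);
        return;
    }
    while (!xtest(lock, 1)) {
        /* spin until no lock holder is inside the critical section */
    }
}

/* Leave the critical section, whichever way it was entered. */
void cx_end(int *lock) {
    if (xgettxid())
        xend();        /* transactional path: commit */
    else
        *lock = 1;     /* lock path: release */
}
```

A caller would enter with `cx_optimistic(&lock)`, and if the simulated hardware rolls the attempt back for IO (`sim_io_rollback()`), the retry of `cx_optimistic` takes the lock instead. The sketch leaves out interrupt handling and the contention manager, both of which the real design involves.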
And initially we have a lock which in the Linux parlance has a value of one. That means it's unlocked. Okay. So in the interleaving we're going to consider, CPU 0 enters and starts the transaction. This is reflected by a new txid of one on the processor. A txid of 0, by the way, means no active transaction. Now, that's of course relevant when we watch what CPU 1 does. It's going to execute this xgettxid instruction, which says oh, I don't have an active transaction, so I'm going to come down into this while loop where I'm going to start spinning and trying to use a transactional cas to acquire the lock. Now, because no IO has occurred, the status check for exclusivity here fails in cx_optimistic, and we come down and use an xtest instruction to read the value of the lock variable. Now, xtest has the same semantics as test. It succeeds if the value in the lock matches the other parameter, but it has the additional semantics that if it succeeds, that variable is entered into the read set for that CPU's transaction. Now what this enables is that when CPU 1 subsequently goes and tries to write the lock variable in a transaction, you can think of xcas as essentially a mini transaction, a one-instruction transaction. This makes its writes visible and we have a read/write conflict under the definition that we've been considering in this talk. And this allows us to invoke the contention manager which can then arbitrate and decide who gets to keep going. So let's assume that the contention manager decides CPU 1 wins. This means that CPU 0 rolls all the way back to its xbegin and keeps retrying. The xcas instruction succeeds and the lock variable gets written and is now locked. And you can see that the test instruction down here will subsequently fail until we're out of the critical section under lock-based -- >>: [Indiscernible]. >> Chris Rossbach: >>: No, it doesn't. [Indiscernible] roll back? >> Chris Rossbach: have to roll back.
>>: That the exactly the -- So it has to [indiscernible]. >> Chris Rossbach: >>: That's right. Well, let me show you [laughter]. So -- If in fact -- it doesn't [Indiscernible]. >> Chris Rossbach: Essentially, yeah. So if the contention manager decides, you know what, CPU 0 is -- we failed the xcas instruction, this guy enters a spin that looks a lot like a T cas loop and winds up waiting. >>: Xtest. How is that any different from a read in this case -- oh, you said [indiscernible] puts it as a read [indiscernible]? >> Chris Rossbach: Oh, so, if we did not have it, we can also use it outside of the transaction. >>: Oh, okay. In this example -- >> Chris Rossbach: In the other -- but the more important part is that if it fails, it doesn't enter it in the read set. >>: Oh, okay. >> Chris Rossbach: >>: So if this failed and we entered it into the read set -- Then it would -- >> Chris Rossbach: -- this guy would always have a conflict and we would wind up [indiscernible] ->>: Now I'm getting it. >> Chris Rossbach: once -- okay. Great. Makes much more sense that way. Any more? Okay. Cxspinlock questions? No? No? Going So let me tell you a little bit about our evaluation of TxLinux. So we started with a MetaTM, which is our hardware transactional memory implementation, which essentially extends x86 instruction set with these instructions that I've been talking about. There's a paper about MetaTM in ISCA 2007. We used Simics simulation environment with these hopefully convincing machine parameters. 16 and 32 CPUs. And the benchmarks we looked at were parallel make, bonnie++, which is a file system stress test. Modified Andrew benchmark, parallel configured, parallel find. Now, the thing I want to stress about these is that they're user mode benchmarks. They do not use transactions directly, right? They're exercising the kernel which is using transactions internally for its own synchronization. 
What we found in our initial work starting with 2.6.16, is that -- oh, also recall we -- you know, we sort of went through this exercise twice, right? The xspinlock version was using bare transactions and the cxspinlock version where we actually converted all the subsystems to use transactions -- >>: [Indiscernible]. >> Chris Rossbach: Right, exactly. So there's nine subsystems. All the highest contending subsystems. And then here, absolutely every spinlock in the kernel is converted. And what we found is that with 16 CPUs, we had a two percent slowdown and a two percent speedup on 32 with xspinlocks. And then using cxspinlocks, we had 2.5 percent and 1 percent speedup on 16 and 32 -- >>: [Indiscernible]. Spinlocks the main lock that you're using [indiscernible]? >> Chris Rossbach: They're certainly, in terms of the -- they vastly outnumber, but I mean, what does main really mean, right? >>: So there are other locking [indiscernible]? >> Chris Rossbach: Oh, yeah. There's 9 or 10 of them. Mutexes, semaphores, read-copy-update, sequence locks. I mean, there's definitely -- >>: And you're not touching any of those historically. >> Chris Rossbach: Well, so any of them that are built on spinlocks we do touch. Okay. So reader-writer spinlocks, there are spinlocks used in RCU. There are spinlocks used in sequence locks. We do not touch mutexes and semaphores because those are blocking. Right? So transactions, when we emphasize -- >>: Don't want to block. >> Chris Rossbach: You don't want to block. I mean, you can argue that there are ways to do it, and in our ASPLOS paper in 2009, we came up with a transactional variant on the mutexes. I'm not sure whether I believe you want to block, even after -- great, we got a paper. But you know, I think you -- >>: [Indiscernible]. >> Chris Rossbach: >>: That's great. Anyway, the point of all -- yeah? So are you ready to come back to [indiscernible] yet or -- >> Chris Rossbach: Oh, yeah, yeah, yeah.
got [indiscernible] and forgot. Thank you. Thank you. I completely So the question is, isn't -- you know, not just is this introducing another primitive making the world more complex. Suddenly I'm telling you, well, we want to get rid of locks because of lock ordering and deadlock and so on, and yet now you're telling me that your transactions are going to roll back and get a lock. Right? And so isn't that the worst of both worlds? And you know, the answer is no. And the reason is that you can drastically reduce the number of locks with this. Even if you still have to use lock ordering for things that might do IO. So an example of how you might do this is you might take every critical section in the kernel that conditionally does IO, and map it to one lock. So you get rid of -- you know, I think our estimates at the time were ->>: [Indiscernible]. >> Chris Rossbach: Yes, exactly. Coarser locking. And also, you can separate -- you can -- under the current regime, a lock is really tightly coupled with the data structure it protects. And this is -- this makes sense with locks, right? With transaction -- [laughter]. However, you don't need this with transactions, right, because you're speculated. Particularly if you -- you know, if it's not the common case that you need to roll back and acquire a lock, it's totally acceptable to share a lock among data structures that might not be related. >>: [Indiscernible]? >> Chris Rossbach: Hmm? >>: [Indiscernible]. >> Chris Rossbach: Effectively. Effectively. Accept, you know, the problem with BKL is that it's highly contended, right? So we don't want to return to like one highly contended lock that can -- you know, cause even more ->>: [Indiscernible] we do. You said go to -- [indiscernible] coarser direction. That's the benefit of this. We can forward coarser locks, but not back [indiscernible]? Is that the -- where's the sweet spot [indiscernible]? >> Chris Rossbach: That coarse, but maybe not that lock. 
That lock is highly contended. So what you really want to do is find critical sections that conditionally do IO and map all of those to one coarse lock. Right? What you don't want to do is take a lock -- mm-hmm? >>: Aren't those the ones that you're likely to be at the longest and hence the ones most likely to be contended? [Indiscernible]. I mean unless conditionally means almost never. >> Chris Rossbach: >>: I mean, that really is the point. Right. It leaves that -- >> Chris Rossbach: I mean, if in fact you're always -- if the common case is that you are going to do IO and roll back, you're essentially -- it's a waste. It's not just -- it's not that you just don't get the performance benefits. You actually make things worse because you're going to speculate ->>: [Indiscernible] fine-grade locks and turn them into coarse-grain locks that are contended. Right? Which can ->> Chris Rossbach: No, no. I'm saying take fine-grain locks and turn them into coarse-grain locks that are not contended. Okay. So ->>: So how are they not [indiscernible]? >> Chris Rossbach: >>: Okay. I mean, unless they -- >> Chris Rossbach: section ->>: Consider the following. So we have a [indiscernible] [Indiscernible] just checking. >> Chris Rossbach: -- that does -- let's just broadly call sharing, let's just have an instruction, share. Okay. A. do some IO. Okay. Now, we have another one which I'll let in. Share something else completely unrelated. Check some maybe two. You know, do some other IO. read and write And then if maybe you kind of fill other condition, if The point here is that in the common case, we don't have contention because we're using the same lock variable to synchronize disjoint data structures, in the common case you get no conflict. Of course you only want to do this if it's also the common case that maybe and maybe two are not true. Okay. >>: So how often are -- I don't have any sense for how often conditional IO and [indiscernible] happens. So, I mean, literally none. 
And -- >> Chris Rossbach: Yeah. So it's -- I don't believe I have the number. Intuitively, I want to tell you that it's between 50 and 70 percent of the time for some locks. But it depends which lock, right? It depends which lock, which data structure. So it's not -- you do lose the property that you can go and say -- you can blindly choose the primitive. You do need to understand the IO profile of the critical section that you're doing this with. >>: [Indiscernible]. I thought you said was that roughly -- you're going to take anything that does this conditional IO and if it's actually doing the conditional IO you're going to fall back on one [indiscernible]. >> Chris Rossbach: That's right. >>: Only if the [indiscernible]. >>: [Indiscernible] so they're not going to conflict with [indiscernible]. >>: You won't [indiscernible]. >> Chris Rossbach: The common case is because these -- so I used to have -- >>: No, no, no. >>: Now I don't get it. >>: You always -- you always have to grab the lock if you're doing -- >> Chris Rossbach: >>: If you're actually going to do -- >> Chris Rossbach: >>: You always have to if you're going to do IO. But if the common case is that you don't -- Yes. >> Chris Rossbach: -- and you're using, you know, and you don't touch the same memory, you have essentially created a situation where you don't need as many lock variables. >>: Sure. If you don't do the IO and you don't touch the same memory, then you clearly win. >> Chris Rossbach: Yes. >>: But I thought you just told me that between 50 and 75 percent of the time you do the [indiscernible]. >> Chris Rossbach: Right. So just like what I'm telling you I guess is that it depends on the profile of the particular lock [indiscernible]. Some things do IO rarely; some things do IO most of the time. >>: So you can go [indiscernible]. >> Chris Rossbach: >>: That's exactly what I'm saying. Okay. >> Chris Rossbach: You do need to go and understand the profile of -- >>: Nothing is free.
>> Chris Rossbach: Nothing is free, unfortunately. Yeah? >>: [Indiscernible] so that you can [indiscernible] locks that you use [indiscernible], have you actually done that or you just [indiscernible]? >> Chris Rossbach: Yes. What we did do -- you know, what I wanted to say in that last slide is that I wasn't expecting anyone to, you know, cheer my 2.5 percent speedups and what I really didn't want people to come away from this talk with was the idea that in fact we -- you know, the performance benefits of TM were not [indiscernible]. And so we did repeat this exercise with Linux 2.4 for ASPLOS last year. And what I'm showing you here is a snapshot from this paper. And the reason I moved on to this is of course 2.4 has the big kernel lock, much coarser-grain synchronization. And what we were able to do is go and essentially replace coarser synchronization with transactions and what I'm showing is the scalability. The top line is 2.6 unmodified. Bottom line is 2.4 unmodified. And the middle is TxLinux 2.4. We were transactionalizing not just spinlocks, but also mutexes, cxmutexes [phonetic], the new primitive in that paper. And what we find is that where there actually is synchronization overhead to eliminate, we can eliminate it. And we were able to make up a significant fraction of the performance benefits that took kernel developers years to achieve in the 2.4 to 2.6 transition. Yes? >>: And in 2.4 cases you follow what you've been saying and actually place the fewer [indiscernible] spinlocks, or are you -- [indiscernible]. >> Chris Rossbach: >>: See what happened [indiscernible] locks. >> Chris Rossbach: >>: No, we did not rewrite [indiscernible]. Yes. [Indiscernible] them in the kernel and said [indiscernible]? >> Chris Rossbach: Yes.
I mean, you know, ultimately, you know, one thing that I would really -- like I said at the beginning, I think that this -- you know, no matter where you come down on this TM hype that we've enjoyed over the last 4 or 5 years or not enjoyed, I do believe that the operating system is the only place that is likely to really be able to use a hardware-based primitive like this. And I would love to see, or do, some work where we actually were able to go and restructure significant portions of it to take advantage of that. And that's something that like, you know, me in my cube at UT having already sort of demonstrated, you know, the idea, doesn't make sense. Might make more sense to do it here. Yes? >>: I'm really surprised that the map doesn't speed up [indiscernible]. 2.4 shows nothing, and you're getting it 2x because the file systems community has largely abandoned it because we all decided that it's essentially just measuring the compiler [phonetic] running, which isn't a very great file system benchmark, but it ought to parallelize wonderfully. So any intuition for why it's so cruddy? >> Chris Rossbach: Why it's so cruddy? Why in general does the benchmark not -- >>: Why is it not getting much closer to 1 to 1 [indiscernible]? >>: It's not [indiscernible], is it? >>: Oh, maybe that's it. >>: [Indiscernible]. >>: It depends on -- I mean, when we ran it, [indiscernible]. Parallelized? >> Chris Rossbach: [Indiscernible]. You're asking did we actually run an entire separate benchmark [indiscernible]? >>: Instances [indiscernible]? >> Chris Rossbach: No, no, no. >>: How does it parallelize it? >>: Okay. I mean, what are you parallelizing. Now I understand [indiscernible]. [Laughter]. >>: [Indiscernible]. >>: That's why it speeds up at all. [Laughter]. >> Chris Rossbach: You know, the real reason we did -- the reason we can see this much is that in a simulation environment, we can make [indiscernible] 0. All right.
So at this point I want to spend just a little more time talking about transactional memory and I want to talk about how we can use it to eliminate priority inversion. Now, in this audience, I probably don't need to spend a lot of time coming up with background on priority inversion, so I'm going to kind of skip through this slide. Essentially it occurs if -- if we have some sharing, we have a higher-priority process sharing a resource with a low-priority process. We wind up with the high-priority process descheduled. It's a drag if we get a medium-priority process coming in doing totally unrelated work, everybody winds up asleep and then there's of course the additional watchdog timer kind of problem that can come in and say, oh, is nobody making any progress? Reset the system at times when you don't want it. And this is a real and expensive problem. Sorry to put you through this slide, but I got to get through it. This happened with Mars Pathfinder in '97. Fortunately they found it before they launched it. So they got to keep their $150 million. But the real point here is that existing solutions such as priority inheritance are fundamentally Band-Aids because they do not make the problem go away. All they do is enforce an upper bound on how much time you can spend executing under priority inversion. Now, here's what I really wanted to get to, which is this dogma that TM gets rid of all the problems of locks. Priority inversion is not one of those problems that it makes go away. Okay. In fact, priority inversion can happen with transactions and it boils down to the policy that the contention manager chooses. And so to convince you of this, I'm essentially showing you the same scenario that I showed you in the previous slide where we have a low-priority process that is older, having a conflict with a higher-priority process that is younger.
Now, the -- one of the findings of the body of research on contention management in transactional memory is that you want the older transaction to win. This is a nice thing because it usually provides good performance and it's also free of livelock for obvious reasons. Imposes a total order. Unfortunately, in this case, it causes priority inversion. And in fact, what we -- yes? >>: [Indiscernible]? Tell me why this thing [indiscernible] why I'm having sorting order being major sort on the priority and minor sort on the [indiscernible]. >> Chris Rossbach: That is exactly what we're getting to. But before we did this work, no one had thought of this. I mean, it's just profoundly obvious, right? [Laughter]. I mean, I remember walking into my advisor's office and saying, this is so obvious, we can't publish this, right? Like -- >>: But nobody else did. >> Chris Rossbach: But no one else -- >>: [Indiscernible]. >> Chris Rossbach: So in fact we could and we did. But absolutely. You jumped straight to the punch line, which I will show you after I convince you that this actually does happen in our benchmarks. And across all the benchmarks we looked at, almost ten percent had this problem. And the way we solved it was by providing a register that is writable in kernel mode only, that the OS can use to say this current thread has this priority. And that way when the hardware that actually decides [indiscernible] a conflict gets invoked, it can say, okay, higher-priority process wins unless we've got a tie, in which case we're going to default to some other livelock-free policy like time stamp. And what this does in our benchmarks is eliminate a hundred percent of the priority inversion. Now, the reason I have the caveat in our benchmarks is that because the priority that the hardware uses is a snapshot at the time the transaction begins, we essentially provide an instruction that says you know, write the priority here, and then start speculating.
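The policy just described (major sort on priority, minor sort on age) can be sketched in a few lines. This is only an illustration under stated assumptions, not the TxLinux hardware logic: the names Txn, resolve_conflict and resolve_conflict_timestamp_only are hypothetical, a larger priority number is assumed to mean a more important thread, and the priority field stands in for the snapshot the OS writes when the transaction begins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    """Hypothetical record of what the conflict logic knows about a transaction."""
    name: str
    priority: int    # assumed: OS priority snapshot at begin, larger = more important
    start_time: int  # timestamp at begin, smaller = older


def resolve_conflict_timestamp_only(a: Txn, b: Txn) -> Txn:
    """Baseline policy: the older transaction wins (a total order, so no livelock)."""
    return min(a, b, key=lambda t: (t.start_time, t.name))


def resolve_conflict(a: Txn, b: Txn) -> Txn:
    """Priority-aware policy: higher priority wins; ties fall back to timestamp."""
    if a.priority != b.priority:
        return a if a.priority > b.priority else b
    return resolve_conflict_timestamp_only(a, b)


# The scenario from the talk: an older low-priority transaction conflicts
# with a younger high-priority one.
low = Txn("low", priority=1, start_time=10)
high = Txn("high", priority=5, start_time=20)

inverted = resolve_conflict_timestamp_only(low, high)  # older wins: inversion
fixed = resolve_conflict(low, high)                    # higher priority wins
```

Because the priority is a snapshot, a sketch like this says nothing about what happens if the kernel changes a thread's dynamic priority mid-transaction.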
It's totally possible that the kernel will change the dynamic priority while we're in a transaction. So it's truth in advertising. You know, it is possible that priority inversion can still occur. But we didn't see it. And negligible performance costs, which is great. So I want to sort of briefly wrap up the basic lessons learned with TxLinux. Obviously, as I've just told you, priority inversion can be eliminated with TM. This is a good thing, right? We want that. Locks and transactions need to be able to cooperate if you're going to use transactions in an operating system. I think this is true even if you're going to use transactions in a user mode program. Fortunately, this new abstraction, the cxspinlock, makes it possible to do that and makes it possible to handle IO sort of more gracefully. Transactions can reduce synchronization overhead, but only if it's there to begin with. Now, these are the conclusions that you might see in a paper. I want to sort of step back and tell you what I actually really think you should take away from the TxLinux work. So most importantly, TxLinux I think remains the most realistic benchmark for transactional memory to date. So at the time we started this work, most TM research was based on micro benchmarks like RB trees, hash tables, SPLASH-2, things like this. And being able to actually go and apply this to a real system exposed a lot of myopic designs, both in terms of how people thought about how we should use transactions and in terms of what a transactional subsystem should support. So for example, the contention management policy, there's papers about, hey, you know, here's my new whiz-bang policy that is a blend of these other policies and gives us speedups under such and such conditions. Not one of them thought about, you know, what -- integration with an operating system.
Additionally, you know, we needed new primitives from the hardware to be able to use this in an OS that none of the existing designs thought about. And ultimately, you know, you really need to be able to do research that crosses layers in the technology stack to come to these kinds of conclusions. So I'm going to change channels here and start talking about my preliminary work, which is -- shares the theme of OS abstractions but is no longer going to be about transactional memory. At all. So I want to give you an opportunity if you want to dig in about TM before I move on. This might be a more tasteful place for me to give you that opportunity. >>: One more question. So the designs I've seen from hardware TM support tend to work by having sort of limited sets of every [indiscernible] sets and then when they spill over, you fall back into some sort of horrible [indiscernible]. [Laughter] software transactional memory. Did your simulator consider that? It sounds like these Linux [indiscernible] touch a lot of stuff. >> Chris Rossbach: Yeah. So great question. And our simulator does handle this. So you know, essentially what I had to do was, there's a lot of different ways to build an HTM. One of them is to use the L1 cache as essentially your write buffer. So if you write a line, you mark it in your cache and this way when you abandon the speculation, you can just invalidate those lines. So that's the kind of design we looked at. Other approaches use a store buffer, which is a smaller write set, but at the end of the day, you're limited by the size of your cache, so the number of writes and reads you can do. So in this system, I built a cache model that essentially does exactly this and then asserts a line to the CPU when any line that has been touched transactionally falls out. So this can happen not just if you touch so much data that it doesn't fit in the cache. This can also happen when you have an associativity eviction.
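A toy version of such a cache model may help make the overflow condition concrete. This is a minimal sketch, not the simulator from the talk: the class name TxCacheModel, the LRU replacement policy and the tiny geometry (4 sets, 2 ways, 64-byte lines) are all assumptions chosen for illustration. It marks lines touched transactionally and raises an overflow flag when one of them is evicted, which is the point where the OS would roll back and acquire a lock.

```python
from collections import OrderedDict


class TxCacheModel:
    """Hypothetical set-associative cache that tracks transactional lines."""

    def __init__(self, num_sets=4, ways=2, line_size=64):
        self.num_sets = num_sets
        self.ways = ways
        self.line_size = line_size
        # One LRU-ordered dict per set: line number -> touched transactionally?
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.overflowed = False

    def access(self, addr, transactional):
        line = addr // self.line_size
        cache_set = self.sets[line % self.num_sets]
        if line in cache_set:
            cache_set[line] = cache_set[line] or transactional
            cache_set.move_to_end(line)  # mark as most recently used
            return
        if len(cache_set) >= self.ways:
            _victim, was_transactional = cache_set.popitem(last=False)  # evict LRU
            if was_transactional:
                self.overflowed = True  # a transactional line fell out
        cache_set[line] = transactional

    def abort(self):
        """Drop every transactional line, as a flash clear would."""
        for cache_set in self.sets:
            for line in [l for l, tx in cache_set.items() if tx]:
                del cache_set[line]
        self.overflowed = False


# Three transactional accesses that all map to set 0 of a 2-way cache:
# the cache is far from full, yet the third access evicts a transactional line.
cache = TxCacheModel()
for addr in (0, 256, 512):
    cache.access(addr, transactional=True)
associativity_overflow = cache.overflowed
```

The example touches only three of the eight lines the cache can hold, which is the associativity case mentioned above rather than a capacity overflow.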
And, yeah, we did, you know, all our experiments model this. And it's also a big motivation behind the cxspinlock. If in fact the data that your transaction touches, you know, are such that you can't actually make a transaction succeed because it's always going to have an associativity conflict with something like this, you need to be able to roll back and acquire a lock. >>: Okay. So the [indiscernible] you would fall back and take a lock if you [indiscernible]. >> Chris Rossbach: >>: However -- That's [indiscernible]. >> Chris Rossbach: Well, but you know, in fairness, in that ASPLOS paper, we also looked at other strategies like rolling -- you know, defaulting to an STM. Turns out to be a nightmare to get right. We looked at simplified strategies of, well, you know, can we do something like allow just one software transaction with any number of concurrent hardware transactions. And it turns out yes, there are tricks you can play with per CPU variables and things like this, and simple commit protocols that allow you to do this. But again, it kind of turns out to be a nightmare to get right. And doesn't always give you the performance benefits that you want. >>: [Indiscernible] transaction is afforded [indiscernible] running out of cache lines? >> Chris Rossbach: It really depends. With the kernel benchmarks, it's very rare. Three percent tops, because these are spinlocks. >>: [Indiscernible]. >> Chris Rossbach: Okay. So you know, especially like when you start with 2.6.16, the average size of a critical section is a hundred instructions. So you're almost never going to overflow [indiscernible] cache. When we got to 2.4 where we're dealing with much bigger read/write sets, then we're looking at more in the 5 to 10 percent range. And then the de facto standard for TM research, user mode benchmarks, I don't know if you're familiar with STAMP, that's kind of what everybody uses. And that suite has a whole range of profiles of critical section size.
Some things are essentially designed to stress this part of the design and others are not. >>: [Indiscernible] you run STAMP on the top of TxLinux? >> Chris Rossbach: Yes, not on the work I'm talking about here but definitely [indiscernible] in the MICRO paper where we tried to use -- the question you asked earlier about do all [indiscernible]. >>: So you have user mode transactions, hardware transactions have to be [indiscernible]. >> Chris Rossbach: >>: Yes. And dealt with the conflicts that come out of that. >> Chris Rossbach: Yeah. I mean, it's very rare for a user mode transaction to conflict with a kernel -- I mean, they're not supposed to share their memory, right? And additionally, let me say, the only time you might actually have that kind of sharing is in a system call like the [indiscernible] two user but a system call is effectively IO. So -- >>: [Indiscernible]. >> Chris Rossbach: That's right. Yeah. I mean, unless you have got Don's stuff, you might try to do that. >>: So how can your performance [indiscernible] measure -- I guess any hardware simulation would be kind of free, cost little. How would you measure the performance of [indiscernible]? >> Chris Rossbach: Same way. I mean, it's execution-driven simulation. So you know -- >>: Actually you really measure -- >> Chris Rossbach: Absolutely. I built a cache model that aborts a hardware transaction in the simulator when it overflows and then the operating system reacts by coming back and acquiring [indiscernible] -- >>: You gave that XC some reasonable, you know -- >> Chris Rossbach: >>: -- [indiscernible] like aborting the cache. >> Chris Rossbach: >>: Reasonable -- You have -- Yeah, well, so, okay -- Now add some [indiscernible]. >> Chris Rossbach: You do although it's cache clearing [indiscernible]. Right? So to abort, you essentially need to make every line invalid, which winds up being a write to one -- you can essentially flash clear. So you can do that in some very small number of cycles.
That's not the main cost. The main cost is in the fact that you suddenly have a cold cache for everything that you've just thrown away, which by the way, turns out to be kind of a major neglected cost in using hardware transactions is that -- >>: [Indiscernible] build a machine, it would probably perform the same. >> Chris Rossbach: I believe that, yes. >>: This Linux CPU is pretty -- it's in order, right? >> Chris Rossbach: Yeah. No, we did not -- we did not do out of order, anything like that. >>: Yeah. Okay. But you [indiscernible] cache flow. >> Chris Rossbach: Yes. So I mean, I rewrote, you know -- >>: [Indiscernible]. >> Chris Rossbach: Oh, okay. >>: I had one more question. >> Chris Rossbach: Yes. >>: So apart from the priority [indiscernible], did you see other progress problems like in cases where -- >> Chris Rossbach: Oh, all the time. Not so much with contention, but more often in cases where the implicit, you know, I guess I want to say protocol, but you know, the way you're supposed to use the lock was not obvious to us until we, you know, broke it by using a transaction. So one great case is sequence locks have this assumption that they can actually do sharing on the same stack, but a reader always has to be higher on the stack than the -- than a writer or you deadlock. We didn't know that. And so we replaced it with something that could roll back and acquire a lock that caused the whole system to freeze. So you'd like to hear about GPUs? >>: Yes. [Indiscernible]. >> Chris Rossbach: I'm going to run out of able to. So I'm going to breeze through this then. So it's good that I can breeze through it because it's preliminary work. So the motivation for really this work is what I believe in my experience of trying to leverage GPUs for certain workloads is that OS-level abstractions are what's limiting GPU applicability in certain application domains. So there are certain application domains where it's definitely a good fit.
One of them and most obvious is the graphics and gaming community, right? Better be -- these people better have the right programming tools. The devices were built for that. The tools these people have are things like shader languages, HLSL, graphics libraries like DirectX and OpenGL. And then there's this other community called the GP-GPU, general purpose GPU community. These people are largely focused on parallelizing high latency scientific algorithms like protein folding, and there are some great tools for them like CUDA and OpenCL which are user mode parallelization frameworks that provide a C-like interface. However, I believe that the application ecosystem is considerably more diverse than this, and what I'm going to try to convince you in the next few slides is that lack of good OS-level abstractions is what's kind of sequestering GPU usage into these kinds of -- these two domains. And to sort of focus the argument I'm making, I want to look at what I'm calling interactive applications. So examples of this are gestural interface, waving your hands at your computer hoping for a result, brain-computer interface, which my advisor is a huge believer in, by the way. He went and got one of these helmets. Spatial audio and real time image recognition. So what these share with the applications on the previous slide is the high level of data parallelism, making them a good fit for a GPU model of computation. What makes them different is that in most cases, they're fundamentally processing user input, which means that while they need concurrency to get good performance, they also need low latency. And because they're acting as a logical device, they need to be multiplexed by the OS across applications. And I want to focus in particular on the problem of gestural interface, because I've spent a lot of time looking at this particular problem. Examples of are of course you guys have.
At a high level, I'm showing you, in the upper left of the screen, a basic decomposition of this problem. I'm interested in the problem when you implement the system with cameras. And so you know, a basic system to build this would capture some number of images from cameras which might support ranging like a distance to objects in the field. You need to be able to do some image filtering and geometric transformation so this is to transform the data that you've captured from the perspective of the camera to the perspective of the user or the screen. And the result of this step is some cloud of points which you can feed to a gesture recognition algorithm that looks for hands and eventually feeds that to the OS as HID input. So it can translate to, you know, gesture messages or mouse messages and so on. Now, what characterizes this workload is of course high data rates. If you have a lot -- if you have multiple cameras producing large images, you know, 60 to a hundred hertz, and because we want to be able to use commodity hardware, we might have noisy input. And to convince you of that, I'm showing you this blue blob in the lower right of the screen. And if you kind of squint and tilt your head to the side, you can see this is my hand in front of my screen. This is actually captured from the prototype I've been building. Very, very noisy. So we need some pretty heavyweight algorithms to be able to denoise this. Now, here's how I wish I could build this system. What I wish I could do is write four separate programs and connect them with POSIX pipes. It would be nice, modular. The four programs would be catusb, cat /dev/usb to capture images from the device. Xform, which transforms image data and does noise filtering, puts it in the perspective of the screen. Detect takes a point cloud and looks for hands. And then hidinput takes the output of detect and sends it to the OS. Now, some observations. Capturing data from a camera, sending mouse events to an OS.
These are inherently sequential. Noise filtering, geometric transformations, and potentially detection are inherently data parallel. Now, with the OS abstractions I'm illustrating here, mainly pipes, I could do this on a -- if I were going to implement this with essentially just CPUs. However, what I want to convince you is that I don't want to do that. In fact, I can't really do that. And what I'm showing you here is the performance in terms of frame rate for the xform step in the prototype that I've been building. I built it two ways. One using fork-join parallelism and running -- the numbers here are captured on a four CPU -- or four core CMP machine. And then the blue bars are for a GPU implementation that is using a 256 core video card. Now, the idea here is that there's a difficult tradeoff between the quality of noise filtering and the amount of computation that you're willing to invest in it. And the ultimate point is that for any acceptable level of quality of noise filtering, we wind up with frame rates on the CPU implementation that are below one per second. So the system winds up being unusable. And the ultimate point is that not only do we want to use the GPU, in this case we need to. John, is there a concern? >>: [Indiscernible] things get worse. >> Chris Rossbach: >>: Sorry. Oh, okay. Thank you. We can dig in. Essentially I'm using -- [Laughter]. >> Chris Rossbach: That was much more succinct than -- [laughter]. I was going to dig into the algorithm. Great. You might say to me, okay, Chris, fine, use the GPU. What's the big deal? You can still use your pipes. Why don't you just rewrite your transform and your detect program so that, you know, after it's reading, reading input from the pipe, you pack it up, send it to the GPU, do your computation there, and hey, have you heard of these new GPU -- GPU -- GP-GPU frameworks you can use? Make it simpler to write.
Now, before I convince you that this is going to fall down, I need to give you some background on how GPUs execute a shader program. The first and probably most important property is that in general -- and there are exceptions like Larrabee, of course -- a GPU can't run an OS. This is because it has a different instruction set. It lacks features like interrupts that are critical for implementing -- for running an operating system. Typically, GPUs have a disjoint memory space and are not -- and that memory space is not coherent with main memory. Right? And so ultimately what we wind up with is a situation where some process on the CPU has to orchestrate the execution on the GPU. In the current regime, user mode applications have to implement this per application in a sort of ad hoc application-dependent way. Now, let's take a look at the way I decomposed this problem from the technology stack point of view with all of this in mind. And what I want to show you is how data is going to move back and forth across the kernel boundary. I've got some cameras connected to a USB port and I've got all my different components in the design up above the user boundary. And the first thing that happens is we capture some image data from a camera. We read it into user space. What do we do? We write it back into kernel space to send it into a pipe. Now, the next step in the program is this transform step. And we need to run that on a GPU. So we're going to read data out of a pipe, write it back into the kernel through all these parallelization frameworks. We run the program on the -- we run the shader program on the device, and then repeat the exercise in reverse to write it into the pipe for the next step. Hopefully you can all see where this is going. You don't need me to like play it out in great detail. We wind up with this kind of tennis match of data that is going back and forth across the user kernel boundary that we would really like to avoid.
And in this naive design, we wind up with 12 kernel crossings and six copy to and six from for fairly large image buffers. And hopefully I'll convince you also in subsequent slides that there are performance tradeoffs introduced by using these additional layers that we'd like to be able to get rid of. Now. You might say to me, okay, so you don't want to cross the user kernel boundary. What's your problem? Why don't you just do this? So in fairness, I will admit that this is actually the design that I started with when I thought this is how I want to build this. But it really is kind of a nonstarter because there are no high-level abstractions. This is not where you want to be able to -- you don't want to be writing code here if you can avoid it. If you're Microsoft or nVidia, this might be tenable because you can actually get the documentation you need to write this. But if you're me in your cube at UT, this is definitely, you know, a challenge. And but ultimately, the solution winds up being specialized. You lose the modularity. Right? But even under this design, if you're willing to accept all of that, there's still a data migration problem. To convince you of that, I'm showing you kind of the hardware view. We have a GPU connected on a PCI express bus, main memory, south bridge, north bridge, CPU. And again, regardless of how data moves across the user kernel boundary, we wind up with some sort of unattractive communication patterns. So when we capture data from the USB from the cameras, we wind up writing into main memory. To execute on the GPU, we then need to copy it across into GPU memory space. After the shader runs, we copy it back. And the rest of course is left as an exercise for the reader here, right? What we wind up with is data traversing a labyrinth through the system. These are big buffers. They're happening with a lot -- at high frequency. It's avoidable, and -- or I believe it's avoidable. It's not avoidable yet. But why do we want to avoid it?
It wastes bandwidth. It wastes power. PCI transfers have to be coherent with memory on the CPU, so we can even cause cache pollution here. What we really want is to be able to simplify the data path. We'd like to be able to read straight out of, you know, straight out of the south bridge, go straight into the GPU memory where we can run two steps without any additional memory transfers and only move the memory -- only move buffers at the very end to main memory when we want to perform the final step. And, you know, the ultimate point here is that I think the machine can do this, but the OS interfaces to make it possible are not there. >>: [Indiscernible]? >> Chris Rossbach: You could, and then you lose the modularity. But yes, absolutely. Two passes on the shader. Okay. So what I'm proposing and what I've just started to build as extensions to Linux are some new OS abstractions to address this problem. The first is a ptask, stands for parallel task. And it's like a process or thread, but it has this additional stipulation that it can exist without a dedicated user mode process managing it. It has a list of mappable input and output resources that you can think of as analogous to standard in, standard out, standard error. Next abstraction is an endpoint which is a -- it's an object in the kernel name space that you should think of as a data source or a sink. So examples are USB bus, buffers in GPU or CPU memory. And then a channel which allows us to connect endpoints and orchestrate how data moves through the system. And essentially it's an IPC analog that is similar to a pipe and has also these properties of being able to have 1-to-1, 1-to-many kinds of relationships. Now, at the end of the day, I want to make it clear that I'm fundamentally proposing that we expand the system call interface to have analogs for IPC, for execution of processes on a GPU. And even scheduling it so that we can bring the GPU under the domain of the OS scheduler.
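To make the three abstractions concrete, here is a small user-level sketch of how they might compose for the gesture pipeline. Everything in it is hypothetical: the classes Endpoint, Channel and PTask, their methods, and the stand-in xform and detect functions are illustrations of the idea only, not the proposed system call interface, and nothing here runs on a GPU. The one property it tries to capture is that data arriving at an endpoint triggers the connected ptask, with no dedicated process orchestrating each step.

```python
class Endpoint:
    """Hypothetical named data source or sink."""

    def __init__(self, name):
        self.name = name
        self.received = []   # everything that has arrived here
        self.listeners = []  # callbacks fired on each arrival

    def push(self, data):
        self.received.append(data)
        for listener in self.listeners:
            listener(data)


class Channel:
    """Hypothetical pipe analog: one source endpoint, one or more destinations."""

    def __init__(self, src, *dsts):
        def forward(data):
            for dst in dsts:
                dst.push(data)
        src.listeners.append(forward)


class PTask:
    """Hypothetical parallel task with one input and one output endpoint."""

    def __init__(self, name, kernel):
        self.inp = Endpoint(name + ".in")
        self.out = Endpoint(name + ".out")
        # Arrival of data on the input triggers the computation.
        self.inp.listeners.append(lambda data: self.out.push(kernel(data)))


def xform(frame):
    # Stand-in for noise filtering: keep samples at or above a threshold.
    return [sample for sample in frame if sample >= 3]


def detect(points):
    # Stand-in for hand detection: just count the surviving points.
    return len(points)


usb_src = Endpoint("usb_src")    # would be fed by catusb
hid_sink = Endpoint("hid_sink")  # would be drained by hidinput
xform_task = PTask("xform", xform)
detect_task = PTask("detect", detect)
Channel(usb_src, xform_task.inp)
Channel(xform_task.out, detect_task.inp)
Channel(detect_task.out, hid_sink)

usb_src.push([0, 5, 9, 1])  # one "frame" flows through the whole graph
```

In this sketch a channel is the only place that knows where data physically moves, which is where an implementation could choose to skip a round trip through main memory.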
Now, if we revisit this problem with these new abstractions, what we wind up with is essentially a graph. Now, to sort of hopefully very quickly talk you through this, recall that the catusb, hidinput, are sequential programs, so I've left them as traditional processes. But I've introduced ptasks for the data parallel process, the xform and the detect. I have endpoints for each of the fundamental sources and sinks of data in the system so USB source raw image input, and I'm connecting them using these channel abstractions. Now, obviously I'm not the first guy to say that a graph is a good way of thinking about concurrency. There's a lot of people who have come before me saying the same thing. So why is this a better way to solve this problem? Well, first of all, we can eliminate unnecessary communication by implementing our channels such that they avoid transfers to and from main memory where they can be avoided. >>: [Indiscernible]? >> Chris Rossbach: >>: This -- >> Chris Rossbach: >>: This one, absolutely. The one not sure. (Inaudible) possible? >> Chris Rossbach: Yeah, I don't think it is yet either. I'm willing to say we should have it. Okay. I'm willing to -- And of course we can eliminate unnecessary user kernel crossings and eliminate the need to have all these dedicated user mode processes that orchestrate by having the arrival of new data at a particular endpoint trigger the computation on the ptask it's connected to. Now, as I said, I have then -- I have only just started implementing this in Linux, but I did spend a good deal of time prototyping, using this little camera here and Windows 7 to see, is this in fact a reasonable research direction. So I want to show you a snapshot of what I've -- what I learned doing that. And again what I'm showing you is the performance of my xform program and I'm comparing a simple implementation on top of CUDA against what I'll call a ptask analog. So what's a ptask analog? Obviously I can't modify Windows 7.
I can't modify the drivers that nVidia supplies, but what I could do is build a kernel mode driver that deals with the USB and maps memory that is shared with a user mode driver whose only task is to call the CopyResource API from DirectX. So essentially I'm minimizing the migration across the user kernel boundary and I'm completely minimizing the user mode work. And in cases -- even in cases where I have to migrate data back and forth to the hardware on every frame, I can see significant speedups this way. In cases where I can eliminate communication from the host to the device, which I consider kind of representative of this case of being able to transfer straight in from USB, we can see speedups of up to ten percent -- sorry, ten percent? 10x. Ten percent plus a lot more. All right. So a brief note about related work. There's obviously a lot of it and much of it done by people in this room. Helios in particular is a -- you know, very much related, although because you guys were largely looking at Larrabee-like GPUs, I think it's kind of a different problem domain. Also I think these are essentially synergistic ideas. Okay. Graph-based programming models: Synthesis is a big inspiration there, the I/O model. Dryad, StreamIt. You know, DirectShow. This looks a lot like DirectShow, which has been around for quite a while now. And anyway, with that, I'll try to wrap things up and move on. A brief word about future work. I don't want to say too much because I've just spent the last ten minutes really talking about future work. I do think there are a lot of interesting problems still open in transactional memory. You know, I just think that that's something that we're going to want in some form or another going forward. One problem I'm very interested in is integration with hypervisors, virtual machine monitors. I'd like to be able to expand my horizons and look a little more at distributed parallelism.
I spent a lot of time in the late '90s building dot-com kinds of things and would like to be able to leverage some of that experience again. And finally, I think as we come to expect more and more from how we interact with our machines, we're going to need to essentially be able to parallelize more and more complex algorithms. You know, that's -- you can sort of read this bullet as code for that. I think there's a lot of interesting opportunities for research there. So oh, yes, the de rigueur selected publications slide. Please take away from this that I have broad interests. I'm published in operating systems, architecture, programming languages, and you know, I'm essentially interested in anything provided it's cool. So in conclusion, I do believe parallelism is the way forward. And it's hard. It's probably going to remain hard for quite a while. There's going to be plenty of interesting research opportunities there. And in order to take advantage of concurrency, we really need the right abstractions. And this requires being able to do research that looks at multiple layers of the technology stack. So thanks for listening.