>>: All right, good morning. It is my pleasure to introduce Chris Rossbach
who is here from UT Austin interviewing with us. And he -- I will say this:
He's the first candidate we've had interview who has brought his guitar along
with him. He has a gig on Friday. So I suppose if somebody really wanted --

>> Chris Rossbach: If things go really badly --

>>: Will you play a request?

>> Chris Rossbach: [Laughter]. -- I can always play a song.

>>: All right. So --

>>: [Indiscernible].

>> Chris Rossbach: So in fact -- [laughter].

>>: So it really is great to have you here, and I'll let him get into his talk.
>> Chris Rossbach: All right. Great. So good morning, thanks for coming, and
clearly, I'm going to try to spend some time talking to you about how we can
make concurrency more accessible through better abstractions. So the
motivation for this talk and in fact for most of my research is an abiding
interest in concurrency, and particularly in finding abstractions and
mechanisms that can make it easier to manage and exploit concurrency.
And I think this is an urgent problem currently because we are in a position
where parallelism is really the only way forward in terms of performance. So
we see this in a couple of domains.
Chip manufacturers have stopped scaling things like clock frequency and process
as a way of improving single-threaded performance and have shifted the burden
onto the shoulders of the programmer by starting to scale the number of cores
on a chip instead. And we also see, you know, a proliferation of massively
parallel hardware in the form of GPUs, or graphics processing units.
And I think while there has been, you know, a trend where parallel hardware is
increasingly abundant, tools that make it easy to program that hardware have
not really enjoyed the same rate of growth. Okay. And particularly in the CPU
programming world, locks and threads remain the state of the art, despite a
long list of well-lamented issues like deadlock, livelock, and so on.
And while GPU-based programming has also seen a lot of big leaps forward as
we've seen new parallel programming frameworks like CUDA and OpenCL come around, at
the end of the day, programming these devices is still something of a black
art. It requires systems-level knowledge of the device. And so I put it to
you that better programming tools are urgently needed.
And at the heart of the task of finding better programming tools is deciding
what abstractions those tools are going to support. And I think choosing the
right abstractions for concurrency can impact a lot of different levels of the
technology stack.
My interest focuses on operating systems and architectural support,
particularly on the interaction of the two. And in this talk, I'm going to
talk about two seemingly separate areas. I'm going to be talking about
transactional memory support in the OS, and I'm going to talk about my new
preliminary work on GPU support in the OS.
The high-level theme here is we're looking at places where the operating system
and the hardware can collaborate to provide better abstractions to make it easy
to get at the concurrency that these devices enable.
Okay. So brief outline for the rest of the talk. I'm going to spend the first
bulk of the talk talking about transactional memory. I'll move on, talk about
OS support for GPUs, touch briefly on future work and conclude.
Okay. So let me posit for your consideration that programming with locks is still
very much a necessary evil. And why is it evil? Well, probably -- sorry.
>>:
I said you probably don't really need to tell us, but you can --
>> Chris Rossbach: Probably don't need to tell you. So I will briefly breeze
through this -- these points. They deadlock, they livelock, they compose
poorly. They have poor complexity and performance tradeoffs and ultimately
it's just hard to get right.
Now ostensibly, transactions, and particularly as supported by transactional
memory, are free of these problems. And as sort of that more concrete
illustration of what we'd like to be able to do with transactional memory, this
is a 50-line comment from the Linux 2.6 memory manager's filemap.c. And this
describes the lock ordering discipline for the locks in that file. So if
you're going to write code in this file or alter the code, you need to
understand this. And I submit to you that this is a lot of complexity.
What we'd like to be able to do with TM, after we make it shimmer and shake, is
sweep all this complexity away.
All right. So I'm going to devote two slides to background on transactional
memory for those who do not spend a lot of time thinking about it or don't
spend as much time as I spend thinking about it. There are a few key ideas and
key abstractions you need to understand. And the key idea, the key idea really
is that critical sections are going to execute speculatively and concurrently.
So this is in strict contrast to locks which enforce mutual exclusion which
cause critical sections to serialize.
What we want to be able to do with transactional memory is say, hey, everybody
go for it. If a sharing pattern occurs that violates correctness or endangers
correctness, we'll detect that dynamically using the TM hardware and use
checkpoint and rollback mechanisms to retry until we know we can get a correct
answer.
Now, the abstractions that you need to support this, they're really -- I'm
showing you six primitives here. The ones in blue are ones that you can expect
in any transactional memory implementation. Primitives to begin, end, and
retry transactions. The ones in red are machine-level instructions that I'm
going to convince you hopefully in the next several slides are additional
primitives you need if you're going to support transactional memory in an
operating system.
Now, a conflict -- and I'll define these in greater detail subsequently as well.
You just need the red instructions. [Laughter].
A conflict between two transactions occurs when there's a non-null
intersection between the write set of one and the read and write set of
another. More informally, if two transactions access the same datum and at least
one is a write, you have a conflict, at which point you will invoke this thing
called a contention manager, which comes in and enforces some hopefully
performance-preserving policy about which transactions need to restart and
which can continue to execute.
And there's been a big body of research that has shown we need flexible policy
in this area in order to improve performance.
So that was kind of the verbose take on it. Here's a more visual take, and I'm
going to show you how hardware transactional memory can allow two critical
sections to execute concurrently on CPU 0 and CPU 1. These CPUs are simplified
so they have a program counter and a read and write set. And what we see is
that as these CPUs step into the critical section and begin reading variables,
the variables appear in the read set that is maintained on the processor.
Now when we get to line three, and CPU 0 reads C but CPU 1 writes it, at this
point we have a conflict under the definition on the previous slide. So we'll
invoke the contention manager which comes in and makes a decision about which
of these CPUs gets to continue executing and which has to roll back. So for
the sake of the argument, let's say that the contention manager decides in
favor of CPU 1. This means that CPU 0 has to roll back, and its read and write
set are cleared. CPU 1 gets to commit and continue.
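To make that conflict rule concrete, here is a minimal C sketch of the read/write-set bookkeeping just walked through. All of the types and names (`tx_state`, `tx_conflict`, and so on) are my own invention for illustration; they are not MetaTM's actual interface.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_SET 16

/* Simplified per-CPU transaction state: a read set and a write set
 * of addresses, as in the slide's walkthrough. */
struct tx_state {
    const void *read_set[MAX_SET];  size_t n_reads;
    const void *write_set[MAX_SET]; size_t n_writes;
};

static bool set_contains(const void *const *set, size_t n, const void *addr)
{
    for (size_t i = 0; i < n; i++)
        if (set[i] == addr)
            return true;
    return false;
}

/* Reads and writes inside a transaction land in the tracked sets. */
static void tx_read(struct tx_state *tx, const void *addr)
{
    tx->read_set[tx->n_reads++] = addr;
}

static void tx_write(struct tx_state *tx, const void *addr)
{
    tx->write_set[tx->n_writes++] = addr;
}

/* Conflict: the write set of one transaction intersects the read or
 * write set of the other -- both accessed the same datum and at least
 * one access was a write. */
static bool tx_conflict(const struct tx_state *a, const struct tx_state *b)
{
    for (size_t i = 0; i < a->n_writes; i++) {
        const void *addr = a->write_set[i];
        if (set_contains(b->read_set, b->n_reads, addr) ||
            set_contains(b->write_set, b->n_writes, addr))
            return true;
    }
    for (size_t i = 0; i < b->n_writes; i++) {
        if (set_contains(a->read_set, a->n_reads, b->write_set[i]))
            return true;
    }
    return false;
}
```

In the slide's example, CPU 0 reading C while CPU 1 writes C is exactly the write-against-read intersection this check detects; the contention manager then decides who rolls back.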
And by the way, I do encourage you to stop me and ask questions because I
suspect you don't need a lot of encouragement on this front, but . . .
So I also want to stop and --

>>: [Indiscernible] [laughter].

>> Chris Rossbach: What?

>>: [Indiscernible] previous slide.

>> Chris Rossbach: Sure.
>>: In this case, if seven was empty in there, you'd have a conflict by your
definition, but either serialization would be correct. Is that right?
>> Chris Rossbach:
If seven was empty?
>>: If there's nothing in line seven there, if you just did the read or write
and then you did XM.
>> Chris Rossbach: Oh, so what you're suggesting is that there is actually an
ordering which could yield the right answer in this case?
>>: I'm suggesting that any of the orderings that you could come up with for
those things produces a serializable result for that particular --

>> Chris Rossbach: Well --

>>: In other words, it's not really a conflict.

>> Chris Rossbach: Well, actually, I'm not sure -- I'm not sure I buy that.

>>: It's a conflict but [indiscernible] --
>> Chris Rossbach: I think it's a conflict -- I think it's a conflict depending
on what order you commit these transactions in. So the order I -- right?
So --

>>: [Indiscernible].

>> Chris Rossbach: Well, because it might be -- if you can serialize all of
the writes of one after all of the reads of another, you do get a correct
result regardless, right? So are you asking about if we leave this
interleaving without serializing these critical sections? Or are you --

>>: What I'm saying is that the only thing that -- the only interaction
between those two is C, right?
>> Chris Rossbach: Yes.

>>: A and B are read only.

>> Chris Rossbach: Mm-hmm.

>>: Okay. Because I'm positive that --

>> Chris Rossbach: So you're talking about a blind -- blind -- right --

>>: Okay. So what I'm saying is that --

>> Chris Rossbach: Yes, I agree.

>>: -- no matter what you do with this, if you actually -- there's no real
conflict here.

>> Chris Rossbach: Yes, I agree, because there's a blind write in one of
these --

>>: Yes.

>> Chris Rossbach: I totally agree, yes. So that's kind of an artifact of a
simplified example.

>>: You asked --
>> Chris Rossbach: No, no, you -- [laughter]. And I concede that you're
right. [Laughter]. I should probably -- it would probably be a good idea to
insert a read of C here in both and then that would get rid of the problem
you're talking about.
>>: So the bigger question, the more interesting question is: Does this kind
of thing happen in practice? Is the definition that you're using actually
generating conflicts in practice that aren't really necessary?
>> Chris Rossbach: Yeah. So --

>>: [Indiscernible] --
>> Chris Rossbach: So the answer is absolutely. And you know, I really wish
that I had chosen to talk about a different paper because that would be a
perfect [laughter] setup.
>>: Well, guess what. You get an hour this afternoon to talk about
[indiscernible]. [Laughter].
>> Chris Rossbach: So yeah. It happens. It's really workload dependent. So
there are a lot of sharing patterns where essentially any time you have
variables that are -- have a lot of write sharing like counters, this
definition is too conservative. It's definitely possible that the interleaving
that you execute that has a conflict under this definition can still produce
the correct result. Okay.
And it's possible also to use a technique that I call dependence awareness that
allows you to essentially keep speculating in the presence of that kind of
conflict and then sort it out at a later time: you check whether it was
safe to tolerate the conflict, and if so, then commit, and otherwise roll back.
And in some cases -- if you have things like shared counters; environments
like garbage collection, with a lot of statistics, or linked lists, have these
kinds of patterns -- if your workload features a lot of that, you can get
significant speedups. What we found is that in two of the STAMP benchmarks,
we got speedups of up to 30 percent.
However, in a lot of cases, you don't. You know. If you don't have blind
writes, if you don't have things that are essentially, you know, single points
of write sharing, then you wind up investing a lot of complexity in
accelerating that particular scenario.
Great question.
Okay. So up until this year, it has been sort of de rigueur at a TM talk to
say, well, this is free of livelock and deadlock; therefore, it's
easier and we need it and everybody should build it. And I can claim to have
benefitted actually from that blind acceptance of that dogma, but I do want to
dedicate a slide to talking about a paper I had in this year's PPoPP where we
actually questioned this assumption and tried to bring some empirical evidence
to bear on the problem.
And what we did was we took five semesters' worth of UT's undergrad OS students
and had them write the same programs using fine- and coarse-grain locks,
condition variables, monitors, and transactional memory. And essentially
what we did is say okay, you're going to write the same programs nine different
times and we're going to control the order in which you write them so that we
can normalize for experience and what you learn in the last case. And we
surveyed them afterwards and said, well, you know, was it easier and then I
read all their code and classified the synchronization errors.
And what we found -- [laughter] -- yeah, that was a good time.
[Laughter].
What we found is that, you know, the survey, you know, there were 36 questions
in the survey, so this is maybe a little bit unfair to try to condense it into
a single partial order, but ultimately this is kind of the take-away.
If we asked them, did you find it easier, we didn't really get the stark, you
know, flag-waving support we might have hoped for. What we found was that
coarse-grain locks were actually the easiest to use. Condition variables,
nobody liked them. No one could get them right, and it was kind of a tie
between fine-grain locks and TM.
However, there was a very different story that was revealed if we go and look
at the synchronization error rates across all 1300 programs. And what I found
is that if -- you know, for all the programs that use coarse-grain locks,
27 percent had at least one synchronization error. A whopping 62 percent had
errors for fine-grain locks, which might be kind of an indictment of our
pedagogy, but this is in fact what we found.
And then TM had error rates on average of less than ten percent.
So you know, what I think this is ultimately saying is that if your goal is to
write correct programs, which I hope it is, TM is in fact easier.
And with that, I'm going to move on and talk about TM support in the OS. And
what I believe is that the operating system really is the killer application
for transactional memory. And among the reasons I believe this is that
operating systems are among the most complex parallel programs around. And they
can really benefit from simplification both in terms of reducing the number of
locks and getting rid of the lock ordering disciplines.
We need our operating systems to be correct, right? And hopefully I've given
you some reason to believe that we can get things correct more easily with TM.
And of course we need our operating systems to be fast. And by using
optimistic concurrency, there's a good reason to believe we can get better
performance. A lot of applications spend a lot of time in the kernel; if we
can make the kernel faster, everybody benefits.

So you should use TM in an operating system. And the question remains how to
go about it.
And when we started looking at this problem, sort of the conventional
wisdom about how to take a lock-based program and use transactions in it is you
find some locking primitive like spinlocks, and you map acquires and releases
to transaction begins and transaction ends. And this is in fact exactly what
we did when we first started trying to use transactions in Linux 2.6. We
posited this primitive called an xspinlock which essentially changed the
implementation of spin_lock and spin_unlock to do exactly what I'm showing you
up here.
And we were able to change nine subsystems in Linux this way to manage about
30 percent of the dynamic lock calls over the workloads we looked at. And it
took six of us a year to do this, which is, you know, it was kind of a long and
painful year.
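As a sketch, the xspinlock idea amounts to something like the following. Here `xbegin()`/`xend()` stand in for the MetaTM hardware primitives; the counting stubs below are assumptions added so the example is self-contained, since the real primitives are machine instructions.

```c
/* Sketch of the xspinlock: spin_lock/spin_unlock remapped to transaction
 * begin/end, so the critical section runs speculatively instead of under
 * mutual exclusion. The stubs only track nesting; real hardware would
 * also do conflict detection and rollback. */
static int active_tx;               /* would be per-CPU state in a kernel */

static void xbegin(void) { active_tx++; }
static void xend(void)   { active_tx--; }

/* No lock word is needed: nothing is mutually excluded anymore. */
struct xspinlock { int unused; };

static void xspin_lock(struct xspinlock *l)
{
    (void)l;
    xbegin();   /* enter the critical section speculatively */
}

static void xspin_unlock(struct xspinlock *l)
{
    (void)l;
    xend();     /* commit the transaction */
}
```

The appeal is that the call sites do not change at all, which is why remapping the spinlock API looked like the obvious conversion strategy before the IO and idiosyncratic-locking problems described next surfaced.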
So why did it take us this long? Well, we ran into problems -- some of them we
ought to have foreseen; some we did not. IO is a big
one. We of course want to be able to do IO in an operating system. Kind of
important. And it's well known that IO is not always a good fit for
transactions.
Also idiosyncratic locking, so we saw a lot of cases where people were kind of
creative with how they used locks. A great example is the spinlock that protects
the scheduler's runqueue in Linux 2.6.16: it's acquired in one process context
and released in another on a context switch, which is kind of a tricky place to
use a transaction.
So what this eventually led us to is the conclusion that if you're going to use
transactions, locks and transactions need to be able to cooperate. There's
other great arguments for this. There's a vast body of lock-based code out
there that you don't want to have to throw away just because you've drunk the
transactional Kool-Aid and you've decided you want to -- yes?
>>: [Indiscernible] you kind of just vastly increased the complexity of the
operating system. Now you have not only the different [indiscernible] but a
transactional memory discipline as well that you have to incorporate. You
have to reason about both of those together rather than saying what -- if --

>> Chris Rossbach: Right.
>>: -- transactional memory is safer and easier [indiscernible], then
[indiscernible].
>> Chris Rossbach: Sure. So I'm tempted to -- I'm tempted to respond to that
as if you know what's in the coming slides. Is that true? No. Okay. Maybe
you should hold off on that question because if you are making the point that
by adding yet another synchronization technique we've kind of increased the
diversity in the ecosystem in a way that is likely to increase complexity,
that's one thing. In the coming slides, you're going to be even more
convinced. And I want to wait until you've seen that before I answer that
question.
>>:
[Indiscernible].
>> Chris Rossbach:
Yeah.
>>: [Indiscernible] literally not allowed to read.

>> Chris Rossbach: Well, that's not quite true, but it's close. [Laughter].

>>: Let's put it that I haven't read it. You're allowed to do IO operations
while you're holding spinlocks and [indiscernible]?
>> Chris Rossbach:
Yes.
Yes.
>>: Okay. And you can take page faults while you're holding spinlocks
[indiscernible]?
>> Chris Rossbach:
Yes.
>>: Wow, okay. That's very foreign [indiscernible]. I'm not going to
[indiscernible] design now. I just wanted to clarify that because it seems
weird.
>> Chris Rossbach: Yes, it does. And you can conditionally do IO with the
spinlock held, which is even weirder and essentially is what this point is
about. So --

>>: [Indiscernible].
>> Chris Rossbach: So just because you have acquired a spinlock and you're
going to release it in that critical section, you might not always do the IO in
the critical section. It might be something [indiscernible] --

>>: [Indiscernible] -- if some condition does IO, otherwise don't, right?

>> Chris Rossbach: Right.

>>: [Indiscernible].
>> Chris Rossbach: And so, I mean, ultimately, you know, what I'm getting at
with this argument is the idea that, well -- why do you need a special
tool to handle IO, Chris? Why can't you just look at the critical sections and
say, oh, this one does IO, this one does not; use a lock with the ones that do
IO -- and you can't, because these conditional IO things nest at very deep
levels and you can take page faults --

>>: [Indiscernible] small step after spinning a while blocked on IO. But
okay.
>>: So kind of a related question is so Linux kernel [indiscernible]; is that
correct?
>> Chris Rossbach:
I believe that is correct.
>>: Okay. So they're less likely to have this problem, but they could for
instance take a spinlock on block and copy it. [Indiscernible].
>> Chris Rossbach: Thank you. [Laughter]. The --

>>: [Indiscernible].
>> Chris Rossbach: Finally, the last argument for this really is that
transactions perform well when -- I mean, the whole idea here is the common
case is no contention. That's where we're going to get our performance
benefits. That's when you want programmers to use this primitive.
If there is contention, locks might be a better fit, and we want people to
still have the ability to choose the right primitive for the right environment.
And the abstraction that we eventually chose to address this situation is
called the cxspinlock, which stands for cooperative transactional spinlock. And the
idea is that we're going to dynamically choose between a lock or a transaction,
depending on what's going on in the system.
So most critical sections will optimistically try to execute a transaction, and
if the hardware detects an IO attempt -- I mean, it's hardware, right? You can
definitely see the IO happening there if you can't see it anyplace else.
If the IO attempt is detected, we're going to roll back and acquire a lock.
And the trick here is involving the contention manager in the lock acquisition,
because what this does introduce is the need to be able to serialize
transactional execution against lock-based execution.
And this is what we need these additional instructions for that I showed you in
the previous slide. We need an xtest, which is a transactional version of a
test instruction. Xcas, a transactional version of compare-and-swap. And those
of you who spend a lot of time looking at spinlock implementations will
recognize these as the basic building blocks of the spinlock. And then this
extra instruction called xgettxid, which allows a thread to query for the
existence of an active transaction on that CPU.
And the final addition is we're going to let this xbegin instruction return
a reason for the retry, so that if it rolls back because it detected IO, we get
a return code that says IO occurred, and then we can decide what to do on the
rollback from there.
One nice thing about using cxspinlocks instead of bare transactions is that,
using preprocessor macros, we were able to convert the entire kernel in a
month with this abstraction instead of -- yes?
>>:
What do you classify as IO?
>> Chris Rossbach: Yes. So this is at the CPU level, right?
>>:
So [indiscernible] and writes to memory --
>> Chris Rossbach: Right, DMA.

>>: -- [indiscernible].
>> Chris Rossbach: Okay. So this is the cxspinlock API. Consists of three
functions. The first is a cx_optimistic. And this is the version that is
going to optimistically attempt a transaction. And what it will do is if the
xbegin returns this flag that says we need to -- we need mutual exclusion to
execute this critical section, then it will use the cx_exclusive API to try to
get a lock on the critical section instead.
Now, the cx_exclusive, as I've said, acquires a lock, but it's going to use
this transactional cas instruction, which makes the writes and reads that it
does to a lock variable visible to the transaction subsystem.
And then the last one is cx_end, which releases a critical section, whether
it's protected by a transaction or by a lock. So this is where we use that
xgettxid instruction. If we're in a transaction, end it. Otherwise, write the
lock variable to release it.
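Here is a rough software sketch of the three-function API as described. The `x*` primitives are MetaTM hardware instructions; the single-threaded stand-ins below only model their return values so the control flow can be shown, and all of the stub behavior (the retry-reason codes, the IO flag) is an assumption made for illustration.

```c
#include <stdbool.h>

enum { XBEGIN_OK = 0, XBEGIN_NEED_EXCL = 1 };   /* retry reason codes */

static int cur_txid;          /* 0 means no active transaction */
static bool io_attempted;     /* would be detected by the hardware */

static int  xbegin(void)   { if (io_attempted) return XBEGIN_NEED_EXCL;
                             cur_txid = 1; return XBEGIN_OK; }
static void xend(void)     { cur_txid = 0; }
static int  xgettxid(void) { return cur_txid; }

/* xcas/xtest: like cas/test, but the lock word becomes visible to the
 * transaction subsystem, so transactions and lock-holders conflict. */
static bool xcas(int *lock, int old, int new_val)
{
    if (*lock != old)
        return false;
    *lock = new_val;
    return true;
}

static bool xtest(int *lock, int val) { return *lock == val; }

/* Acquire the lock outright; the contention manager arbitrates between
 * this xcas and any concurrent transactions reading the lock word. */
void cx_exclusive(int *lock)
{
    while (!xcas(lock, 1, 0))   /* 1 = unlocked, 0 = locked, as in the talk */
        ;                       /* spin */
}

/* Optimistically run transactionally; fall back to the lock if the
 * hardware says the section needs mutual exclusion (e.g. it did IO). */
void cx_optimistic(int *lock)
{
    if (xbegin() == XBEGIN_NEED_EXCL) {
        cx_exclusive(lock);
        return;
    }
    while (!xtest(lock, 1))     /* wait for any lock-holder; on success the */
        ;                       /* lock word enters our read set */
}

/* End the critical section, whichever way it was protected. */
void cx_end(int *lock)
{
    if (xgettxid())
        xend();                 /* transactional: commit */
    else
        *lock = 1;              /* lock-based: release */
}
```

The walkthrough that follows in the talk is exactly this interplay: a transactional CPU's `xtest` puts the lock word in its read set, so a lock acquirer's `xcas` write to that word raises a conflict and the contention manager decides who proceeds.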
So I want to show you how we can use this primitive to serialize access to a
critical section between transactional execution and lock-based execution.
So CPU 0 on the left is going to try to use cx_optimistic to enter into a
transaction. CPU 1 is going to use cx_exclusive. And initially we have a lock
which in the Linux parlance has a value of one. That means it's unlocked.
Okay.
So in the interleaving we're going to consider, CPU 0 enters and starts the
transaction. This is reflected by a new txid of one on the processor. A txid
of 0, by the way, means no active transaction.
Now, that's of course relevant when we watch what CPU 1 does. It's going to
execute this xgettxid instruction, which says, oh, I don't have an active
transaction, so I'm going to come down into this while loop where I'm going to
start spinning and trying to use a transactional cas to acquire the lock.
Now, because no IO has occurred, the status check for exclusivity here fails in
cx_optimistic, and we come down and use an xtest instruction to read the value
of the lock variable.
Now, xtest has the same semantics as test. It succeeds if the value in the lock
matches the other parameter, but it has the additional semantics that if it
succeeds, that variable is entered into the read set for that CPU's
transaction.
Now what this enables is that when CPU 1 subsequently goes and tries to write
the lock variable in a transaction -- you can think of xcas as essentially a
mini transaction, a one-instruction transaction -- this makes its writes visible
and we have a read/write conflict under the definition that we've been
considering in this talk. And this allows us to invoke the contention manager
which can then arbitrate and decide who gets to keep going.
So let's assume that the contention manager decides CPU 1 wins. This means
that CPU 0 rolls all the way back to its xbegin and keeps retrying. The xcas
instruction succeeds and the lock variable gets written and is now locked. And
you can see that the test instruction down here will subsequently fail until
we're out of the critical section under lock-based --

>>: [Indiscernible].
>>: [Indiscernible] roll back?

>> Chris Rossbach: No, it doesn't. That's exactly the -- if in fact -- it
doesn't have to roll back.

>>: So it has to [indiscernible].

>> Chris Rossbach: That's right. Well, let me show you. [Laughter]. So --

>>: [Indiscernible].
>> Chris Rossbach: Essentially, yeah. So if the contention manager decides,
you know what, CPU 0 is -- we failed the xcas instruction, this guy enters a
spin that looks a lot like a test-and-cas loop and winds up waiting.
>>: Xtest. How is that any different from a read in this case -- oh, you said
[indiscernible] puts it as a read [indiscernible]?
>> Chris Rossbach: Oh, so, if we did not have it, we can also use it outside
of the transaction.
>>:
Oh, okay.
In this example --
>> Chris Rossbach: In the other -- but the more important part is that if it
fails, it doesn't enter it in the read set.
>>:
Oh, okay.
>> Chris Rossbach: So if this failed and we entered it into the read set --

>>: Then it would --

>> Chris Rossbach: -- this guy would always have a conflict and we would wind
up [indiscernible] --

>>: Now I'm getting it.
>> Chris Rossbach: Great. Makes much more sense that way. Any more
cxspinlock questions? No? No? Going once -- okay.

So let me tell you a little bit about our evaluation of TxLinux. So we started
with MetaTM, which is our hardware transactional memory implementation and
essentially extends the x86 instruction set with these instructions that I've been
talking about. There's a paper about MetaTM in ISCA 2007.
We used the Simics simulation environment with these hopefully convincing
machine parameters: 16 and 32 CPUs. And the benchmarks we looked at were
parallel make, bonnie++, which is a file system stress test, the Modified
Andrew benchmark, parallel configure, and parallel find.
Now, the thing I want to stress about these is that they're user mode
benchmarks. They do not use transactions directly, right? They're exercising
the kernel which is using transactions internally for its own synchronization.
What we found in our initial work starting with 2.6.16, is that -- oh, also
recall we -- you know, we sort of went through this exercise twice, right? The
xspinlock version was using bare transactions and the cxspinlock version where
we actually converted all the subsystems to use transactions --

>>: [Indiscernible].
>> Chris Rossbach: Right, exactly. So there's nine subsystems. All the
highest contending subsystems. And then here, absolutely every spinlock in the
kernel is converted.
And what we found is that with 16 CPUs, we had a two percent slowdown and a
two percent speedup on 32 with xspinlocks. And then using cxspinlocks, we had
2.5 percent and 1 percent speedups on 16 and 32 --

>>: [Indiscernible]. Are spinlocks the main lock that you're using
[indiscernible]?
>> Chris Rossbach: They're certainly -- in terms of the numbers, they vastly
outnumber the others, but I mean, what does main really mean, right?
>>:
So there are other locking [indiscernible]?
>> Chris Rossbach: Oh, yeah. There's 9 or 10 of them. Mutexes,
semaphores, read-copy-update, sequence locks. I mean, there's definitely --
And you're not touching any of those historically.
>> Chris Rossbach: Well, so any of them that are built on spinlocks we do
touch. Okay. So reader-writer spinlocks, there are spinlocks used in RCU,
there are spinlocks used in sequence locks. We do not touch mutexes and
semaphores because those are blocking. Right? So transactions, as we
emphasize --

>>: Don't want to block.
>> Chris Rossbach: You don't want to block. I mean, you can argue that there
are ways to do it, and in our ASPLOS paper in 2009, we came up with a
transactional variant on the mutex. I'm not sure whether I
believe you want to block, even after -- great, we got a paper.
But you know, I think you --

>>: [Indiscernible].
>> Chris Rossbach: That's great. Anyway, the point of all -- yeah?

>>: So are you ready to come back to [indiscernible] yet or --

>> Chris Rossbach: Oh, yeah, yeah, yeah. Thank you. Thank you. I completely
got [indiscernible] and forgot.
So the question is, isn't -- you know, not just is this introducing another
primitive making the world more complex. Suddenly I'm telling you, well, we
want to get rid of locks because of lock ordering and deadlock and so on, and
yet now you're telling me that your transactions are going to roll back and get
a lock. Right? And so isn't that the worst of both worlds?
And you know, the answer is no. And the reason is that you can drastically
reduce the number of locks with this. Even if you still have to use lock
ordering for things that might do IO. So an example of how you might do this
is you might take every critical section in the kernel that conditionally does
IO, and map it to one lock. So you get rid of -- you know, I think our
estimates at the time were --

>>: [Indiscernible].
>> Chris Rossbach: Yes, exactly. Coarser locking. And also, you can
separate -- you can -- under the current regime, a lock is really tightly
coupled with the data structure it protects. And this is -- this makes sense
with locks, right? With transactions -- [laughter]. However, you don't need
this with transactions, right, because you're speculating. Particularly if
you -- you know, if it's not the common case that you need to roll back and
acquire a lock, it's totally acceptable to share a lock among data structures
that might not be related.
>>:
[Indiscernible]?
>> Chris Rossbach:
Hmm?
>>:
[Indiscernible].
>> Chris Rossbach: Effectively. Effectively. Except, you know, the problem
with the BKL is that it's highly contended, right? So we don't want to return
to like one highly contended lock that can -- you know, cause even more --

>>: [Indiscernible] we do. You said go to -- [indiscernible] coarser
direction. That's the benefit of this. We can go toward coarser locks, but not
all the way back [indiscernible]? Is that the -- where's the sweet spot
[indiscernible]?
>> Chris Rossbach: That coarse, but maybe not that lock. That lock is highly
contended. So what you really want to do is find critical sections that
conditionally do IO and map all of those to one coarse lock. Right?
What you don't want to do is take a lock -- mm-hmm?
>>: Aren't those the ones that you're likely to be at the longest and hence
the ones most likely to be contended? [Indiscernible]. I mean unless
conditionally means almost never.
>> Chris Rossbach: I mean, that really is the point.

>>: Right. It leaves that --

>> Chris Rossbach: I mean, if in fact you're always -- if the common case is
that you are going to do IO and roll back, you're essentially -- it's a waste.
It's not just -- it's not that you just don't get the performance benefits.
You actually make things worse because you're going to speculate --

>>: [Indiscernible] take fine-grain locks and turn them into coarse-grain
locks that are contended. Right? Which can --

>> Chris Rossbach: No, no. I'm saying take fine-grain locks and turn them
into coarse-grain locks that are not contended. Okay. So --

>>: So how are they not [indiscernible]?
>> Chris Rossbach: Okay.

>>: I mean, unless they --

>> Chris Rossbach: Consider the following. So we have a [indiscernible]
section --

>>: [Indiscernible] just checking.

>> Chris Rossbach: -- that does -- let's just broadly call read and write
sharing, let's just have an instruction, share. Okay. A. And then if maybe,
do some IO. Okay. Now, we have another one which I'll let you kind of fill
in. Share something else completely unrelated. Check some other condition,
if maybe two. You know, do some other IO.
The point here is that in the common case we don't have contention: even
though we're using the same lock variable to synchronize disjoint data
structures, in the common case you get no conflict. Of course, you only want
to do this if it's also the common case that maybe and maybe two are not
true. Okay.
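The scheme described verbally above might be modeled by the following toy sketch. This is not the TxLinux API; all names here are invented for illustration. Two critical sections over unrelated data map to one coarse lock variable; in the common case they execute speculatively (hardware would detect conflicts on the data they actually touch, not on the lock variable), and only the rare conditional-IO path falls back to acquiring the coarse lock.

```python
import threading

# One coarse lock shared by many unrelated critical sections that only
# *conditionally* do IO. In the common case (no IO, disjoint data) they
# speculate and never take the lock; the rare IO path acquires it.
coarse_lock = threading.Lock()
stats = {"speculated": 0, "fell_back": 0}

def run_critical_section(work, does_io):
    if not does_io:
        # Common case: execute speculatively. A hardware TM would detect
        # conflicts on the shared data itself, not on the lock variable.
        work()
        stats["speculated"] += 1
    else:
        # Rare case: IO can't be rolled back, so take the coarse lock.
        with coarse_lock:
            work()
        stats["fell_back"] += 1

a, b = [], []
run_critical_section(lambda: a.append(1), does_io=False)  # "share A; if maybe: IO"
run_critical_section(lambda: b.append(2), does_io=False)  # "share B; if maybe2: IO"
run_critical_section(lambda: a.append(3), does_io=True)   # the rare IO path
print(stats)  # -> {'speculated': 2, 'fell_back': 1}
```

The design point is exactly the one in the talk: this only pays off if the IO branch is rarely taken, which is why you need to understand the IO profile of each critical section.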
>>: So how often are -- I don't have any sense for how often conditional IO
and [indiscernible] happens. So, I mean, literally none. And --

>> Chris Rossbach: Yeah. So it's -- I don't believe I have the number.
Intuitively, I want to tell you that it's between 50 and 70 percent of the time
for some locks. But it depends which lock, right? It depends which lock,
which data structure.
So it's not -- you do lose the property that you can go and say -- you can
blindly choose the primitive. You do need to understand the IO profile of the
critical section that you're doing this with.
>>: [Indiscernible]. I thought what you said was that roughly -- you're
going to take anything that does this conditional IO, and if it's actually
doing the conditional IO you're going to fall back on one [indiscernible].
>> Chris Rossbach:
That's right.
>>:
Only if the [indiscernible].
>>:
[Indiscernible] so they're not going to conflict with [indiscernible].
>>:
You won't [indiscernible].
>> Chris Rossbach:
The common case is because these -- so I used to have --
>>:
No, no, no.
>>:
Now I don't get it.
>>:
You always -- you always have to grab the lock if you're doing --
>> Chris Rossbach: If you're actually going to do --

>>: You always have to if you're going to do IO.

>> Chris Rossbach: But if the common case is that you don't --

>>: Yes.
>> Chris Rossbach: -- and, you know, you don't touch the same memory, you
have essentially created a situation where you don't need as many lock
variables.
>>: Sure. If you don't do the IO and you don't touch the same memory, then
you clearly win.
>> Chris Rossbach:
Yes.
>>: But I thought you just told me that between 50 and 75 percent of the time
you do the [indiscernible].
>> Chris Rossbach: Right. So just like what I'm telling you I guess is that
it depends on the profile of the particular lock [indiscernible]. Some things
do IO rarely; some things do IO most of the time.
>>:
So you can go [indiscernible].
>> Chris Rossbach:
>>:
That's exactly what I'm saying.
Okay.
>> Chris Rossbach:
You do need to go and understand the profile of --
>>:
Nothing is free.
>> Chris Rossbach:
Nothing is free, unfortunately.
Yeah?
>>: [Indiscernible] so that you can [indiscernible] locks that you use
[indiscernible], have you actually done that or you just [indiscernible]?
>> Chris Rossbach: Yes. What we did do -- you know, what I wanted to say in
that last slide is that I wasn't expecting anyone to, you know, cheer my
2.5 percent speedups and what I really didn't want people to come away from
this talk with was the idea that in fact we -- you know, the performance
benefits of TM were not [indiscernible]. And so we did repeat this exercise
with Linux 2.4 for ASPLOS last year. And what I'm showing you here is a
snapshot from this paper. And the reason I moved on to this is of course 2.4
has the big kernel lock, much coarser-grain synchronization.
And what we were able to do is go and essentially replace coarser
synchronization with transactions, and what I'm showing is the scalability.
The top line is 2.6 unmodified. Bottom line is 2.4 unmodified. And the
middle is TxLinux 2.4. We were transactionalizing not just spinlocks, but
also mutexes -- cxmutexes, the new primitive in that paper.
And what we find is that where there actually is synchronization overhead to
eliminate, we can eliminate it. And we were able to make up a significant
fraction of the performance benefits that took kernel developers years to
achieve in the 2.4 to 2.6 transition. Yes?
>>: And in the 2.4 case, did you follow what you've been saying and actually
replace the fewer [indiscernible] spinlocks, or are you -- [indiscernible].
>> Chris Rossbach: No, we did not rewrite [indiscernible].

>>: See what happened [indiscernible] locks.

>> Chris Rossbach: Yes.

>>: [Indiscernible] them in the kernel and said [indiscernible]?
>> Chris Rossbach: Yes. I mean, you know, ultimately, you know, one thing
that I would really -- like I said at the beginning, I think that this -- you
know, no matter where you come down on this TM hype that we've enjoyed over the
last 4 or 5 years or not enjoyed, I do believe that the operating system is the
only place that is likely to really be able to use a hardware-based primitive
like this. And I would love to see, or do, some work where we actually were
able to go and restructure significant portions of it to take advantage of
that. And that's something that like, you know, me in my cube at UT having
already sort of demonstrated, you know, the idea, doesn't make sense. Might
make more sense to do it here. Yes?
>>: I'm really surprised that the make doesn't speed up [indiscernible].
2.4 shows nothing, and you're getting a 2x. Because the file systems
community has largely abandoned it, because we all decided that it's
essentially just measuring the compiler running, which isn't a very great
file system benchmark, but it ought to parallelize wonderfully. So any
intuition for why it's so cruddy?
>> Chris Rossbach: Why it's so cruddy? Why in general does the benchmark
not --

>>: Why is it not getting much closer to 1 to 1 [indiscernible]?

>>: It's not [indiscernible], is it?

>>: Oh, maybe that's it.

>>: [Indiscernible].

>>: It depends on -- I mean, when we ran it, [indiscernible]. Parallelized?
>> Chris Rossbach: [Indiscernible]. You're asking did we actually run an
entire separate benchmark [indiscernible]?
>>:
Instances [indiscernible]?
>> Chris Rossbach: No, no, no. I mean, what are you parallelizing?

>>: How does it parallelize it?

>>: Okay. Now I understand [indiscernible].

[Laughter].
>>:
[Indiscernible].
>>:
That's why it speeds up at all.
[Laughter].
>> Chris Rossbach: You know, the real reason we did -- the reason we can see
this much is that in a simulation environment, we can make [indiscernible] 0.
All right. So at this point I want to spend just a little more time talking
about transactional memory, and I want to talk about how we can use it to
eliminate priority inversion.
Now, in this audience, I probably don't need to spend a lot of time going
over background on priority inversion, so I'm going to kind of skip through
this slide. Essentially it occurs if we have some sharing -- we have a
high-priority process sharing a resource with a low-priority process. We
wind up with the high-priority process descheduled. It's a drag if we get a
medium-priority process coming in doing totally unrelated work; everybody
winds up asleep. And then there's of course the additional watchdog-timer
kind of problem, which can come in and say, oh, is nobody making any
progress? Reset the system at times when you don't want it. And this is a
real and expensive problem.
Sorry to put you through this slide, but I got to get through it.
This happened with Mars Pathfinder in '97. Fortunately they found it before
they launched it. So they got to keep their $150 million. But the real point
here is that existing solutions such as priority inheritance are fundamentally
Band-Aids because they do not make the problem go away. All they do is enforce
an upper bound on how much time you can spend executing under priority
inversion.
Now, here's what I really wanted to get to, which is this dogma that TM gets
rid of all the problems of locks. Priority inversion is not one of those
problems that it makes go away. Okay. In fact, priority inversion can happen
with transactions and it boils down to the policy that the contention manager
chooses.
And so to convince you of this, I'm essentially showing you the same scenario
that I showed you in the previous slide where we have a low-priority process
that is older, having a conflict with a higher-priority process that is
younger.
Now, the -- one of the findings of the body of research on contention
management in transactional memory is that you want the older transaction to
win. This is a nice thing because it usually provides good performance and
it's also free of livelock for obvious reasons. Imposes a total order.
Unfortunately, in this case, it causes priority inversion. And in fact, what
we -- yes?
>>: [Indiscernible]? Tell me why this thing [indiscernible] -- why not have
the sorting order be a major sort on the priority and a minor sort on the
[indiscernible]?
>> Chris Rossbach: That is exactly what we're getting to. But before we did
this work, no one had thought of this. I mean, it's just profoundly obvious,
right? [Laughter].
I mean, I remember walking into my advisor's office and saying, this is so
obvious, we can't publish this, right? Like --

>>: But nobody else did.

>> Chris Rossbach: But no one else --

>>: [Indiscernible].
>> Chris Rossbach: So in fact we could, and we did. But absolutely. You
jumped straight to the punch line, which I will show you after I convince you
that this actually does happen in our benchmarks. And across all the
benchmarks we looked at, almost ten percent had this problem.
And the way we solved it was by providing a register that is writable in
kernel mode only, that the OS can use to say this current thread has this
priority. And that way when the hardware that actually decides
[indiscernible] a conflict gets invoked, it can say, okay, the
higher-priority process wins unless we've got a tie, in which case we're
going to default to some other livelock-free policy like timestamp. And what
this does in our benchmarks is eliminate a hundred percent of the priority
inversion.
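The decision rule just described -- major sort on OS priority, minor sort on age -- can be sketched as follows. This is a toy model, not the hardware contention manager; the record fields and function name are invented for illustration.

```python
from collections import namedtuple

# A conflicting pair of transactions, as the contention manager sees them.
# Lower start_time means the transaction is older.
Tx = namedtuple("Tx", ["name", "priority", "start_time"])

def conflict_winner(tx1, tx2):
    # Major sort: OS-assigned priority -- higher priority wins outright.
    if tx1.priority != tx2.priority:
        return tx1 if tx1.priority > tx2.priority else tx2
    # Minor sort (tie-break): older transaction wins. Timestamps impose a
    # total order, which keeps the policy free of livelock.
    return tx1 if tx1.start_time < tx2.start_time else tx2

low_old  = Tx("low_old",  priority=1, start_time=10)  # older, low priority
high_new = Tx("high_new", priority=5, start_time=20)  # younger, high priority

# A pure "older wins" timestamp policy would pick low_old and invert
# priorities; the priority-aware policy picks the high-priority transaction.
print(conflict_winner(low_old, high_new).name)  # -> high_new
```

As the talk notes, the priority the hardware sees is a snapshot taken at transaction begin, so a dynamic-priority change mid-transaction is not reflected here either.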
Now, the reason I have the caveat "in our benchmarks" is that because the
priority that the hardware uses is a snapshot at the time the transaction
begins -- we essentially provide an instruction that says, you know, write
the priority here, and then start speculating -- it's totally possible that
the kernel will change the dynamic priority while we're in a transaction.

So it's truth in advertising. You know, it is possible that priority
inversion can still occur. But we didn't see it. And negligible performance
cost, which is great.
So I want to sort of briefly wrap up the basic lessons learned with TxLinux.
Obviously, as I've just told you, priority inversion can be eliminated with TM.
This is a good thing, right? We want that.
Locks and transactions need to be able to cooperate if you're going to use
transactions in an operating system. I think this is true even if you're
going to use transactions in a user-mode program. Fortunately, this new
abstraction, the cxspinlock, makes it possible to do that and makes it
possible to handle IO sort of more gracefully.
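The cooperative behavior behind the cxspinlock could be modeled roughly like this. This is a minimal sketch of the idea, not the TxLinux interface; the class and exception names are invented, and the toy makes no attempt to actually roll back side effects the way hardware speculation would.

```python
import threading

class NeedsExclusive(Exception):
    """Raised when speculative execution hits an operation (IO) that
    cannot be rolled back."""

class CxLock:
    """Toy cooperative lock: try the critical section speculatively
    first; on an un-rollbackable operation, retry holding the lock."""
    def __init__(self):
        self._lock = threading.Lock()

    def run(self, section):
        try:
            # Attempt 1: speculative (transactional) execution.
            return section(speculative=True)
        except NeedsExclusive:
            # Roll back and re-execute exclusively. Transactional and
            # lock-based threads can thus share the same critical section.
            with self._lock:
                return section(speculative=False)

def section(speculative):
    if speculative:
        raise NeedsExclusive()  # pretend we hit the conditional IO path
    return "ran exclusively"

cx = CxLock()
print(cx.run(section))  # -> ran exclusively
```

The design choice this illustrates: the decision to speculate or to hold the lock is made inside the acquire path, so callers don't have to know in advance whether a given execution of the critical section will do IO.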
Transactions can reduce synchronization overhead, but only if it's there to
begin with.
Now, these are the conclusions that you might see in a paper. I want to sort
of step back and tell you what I actually really think you should take away
from the TxLinux work.
So most importantly, TxLinux, I think, remains the most realistic benchmark
for transactional memory to date. So at the time we started this work, most
TM research was based on microbenchmarks like RB trees, hash tables,
SPLASH-2, things like this. And being able to actually go and apply this to
a real system exposed a lot of myopic designs, both in terms of how people
thought about how we should use transactions and in terms of what a
transactional subsystem should support.
So for example, the contention management policy, there's papers about, hey,
you know, here's my new whiz-bang policy that is a blend of these other
policies and gives us speedups under such and such conditions. Not one of them
thought about, you know, what -- integration with an operating system.
Additionally, you know, we needed new primitives from the hardware to be able
to use this in an OS that none of the existing designs thought about. And
ultimately, you know, you really need to be able to do research that crosses
layers in the technology stack to come to these kinds of conclusions.
So I'm going to change channels here and start talking about my preliminary
work, which is -- shares the theme of OS abstractions but is no longer going to
be about transactional memory. At all.
So I want to give you an opportunity if you want to dig in about TM before I
move on. This might be a more tasteful place for me to give you that
opportunity.
>>: One more question. So the designs I've seen for hardware TM support
tend to work by having sort of limited [indiscernible] sets, and then when
they spill over, you fall back into some sort of horrible [indiscernible]
[laughter] software transactional memory. Did your simulator consider that?
It sounds like these Linux [indiscernible] touch a lot of stuff.
>> Chris Rossbach: Yeah. So great question. And our simulator does handle
this. So you know, essentially what I had to do was -- there's a lot of
different ways to build an HTM. One of them is to use the L1 cache as
essentially your write buffer. So if you write a line, you mark it in the
cache, and this way when you abandon the speculation, you can just invalidate
those lines. So that's the kind of design we looked at.
Other approaches use a store buffer, which gives a smaller write set, but at
the end of the day you're limited by the size of your cache, so by the number
of writes and reads you can do.
So in this system, I built a cache model that essentially does exactly this and
then asserts a line to the CPU when any line that has been touched
transactionally falls out.
So this can happen not just if you touch so much data that it doesn't fit in
the cache. This can also happen when you have an associativity eviction. And,
yeah, we did, you know, all our experiments model this. And it's also a big
motivation behind the cxspinlock.
If in fact the data that your transaction touches, you know, are such that
you can't actually make a transaction succeed, because it's always going to
have an associativity conflict with something like this, you need to be able
to roll back and acquire a lock.
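The overflow behavior described here -- the L1 cache doubling as the transactional write buffer, with eviction of any marked line aborting the transaction -- might be modeled like this. A toy under stated assumptions: the geometry and FIFO eviction are illustrative, not the simulator's actual cache model.

```python
class TxCache:
    """Toy set-associative cache where lines touched transactionally are
    marked; evicting a marked line (capacity OR associativity conflict)
    aborts the transaction."""
    def __init__(self, num_sets=4, ways=2, line_size=64):
        self.num_sets, self.ways, self.line_size = num_sets, ways, line_size
        self.sets = [[] for _ in range(num_sets)]  # entries: (tag, tx_marked)
        self.aborted = False

    def access(self, addr, transactional):
        line = addr // self.line_size
        idx, tag = line % self.num_sets, line // self.num_sets
        s = self.sets[idx]
        for i, (t, marked) in enumerate(s):
            if t == tag:                       # hit: keep/raise the mark
                s[i] = (t, marked or transactional)
                return
        if len(s) == self.ways:                # miss in a full set: evict
            _, victim_marked = s.pop(0)        # oldest entry first
            if victim_marked:
                self.aborted = True            # transactional line fell out
        s.append((tag, transactional))

cache = TxCache(num_sets=4, ways=2)
# Three distinct lines that all map to the same set overflow its 2 ways,
# even though the cache as a whole is nearly empty: an associativity
# eviction of a marked line, which aborts the transaction.
for n in range(3):
    cache.access(n * 4 * 64, transactional=True)
print(cache.aborted)  # -> True
```

This is the point made above: an abort can fire long before the write set exceeds total cache capacity, purely because of how addresses map to sets.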
>>: Okay. So the [indiscernible] you would fall back and take a lock if you
[indiscernible].

>> Chris Rossbach: That's [indiscernible].

>>: However --
>> Chris Rossbach: Well, but you know, in fairness, in that ASPLOS paper we
also looked at other strategies like rolling -- you know, defaulting to an
STM. Turns out to be a nightmare to get right. We looked at simplified
strategies of, well, you know, can we do something like allow just one
software transaction with any number of concurrent hardware transactions.
And it turns out yes, there are tricks you can play with per-CPU variables
and things like this, and simple commit protocols that allow you to do this.

But again, it kind of turns out to be a nightmare to get right. And it
doesn't always give you the performance benefits that you want.
>>: [Indiscernible] transaction is aborted [indiscernible] running out of
cache lines?
>> Chris Rossbach: It really depends. With the kernel benchmarks, it's very
rare. Three percent tops, because these are spinlocks.
>>:
[Indiscernible].
>> Chris Rossbach: Okay. So you know, especially like when you start with
2.6.16, the average size of a critical section is a hundred instructions. So
you're almost never going to overflow [indiscernible] cache.
When we got to 2.4 where we're dealing with much bigger read/write sets, then
we're looking at more in the 5 to 10 percent range. And then the de facto
standard for TM research, user-mode benchmarks -- I don't know if you're
familiar with STAMP; that's kind of what everybody uses. And that suite has
a whole range of profiles of critical section size. Some things are
essentially designed to stress this part of the design and others are not.
>>: [Indiscernible] you run STAMP on top of TxLinux?
>> Chris Rossbach: Yes, not in the work I'm talking about here, but
definitely [indiscernible] in the MICRO paper where we tried to use -- the
question you asked earlier about do all [indiscernible].
>>: So you have user mode transactions, hardware transactions have to be
[indiscernible].
>> Chris Rossbach:
>>:
Yes.
And dealt with the conflicts that come out of that.
>> Chris Rossbach: Yeah. I mean, it's very rare for a user-mode transaction
to conflict with a kernel -- I mean, they're not supposed to share their
memory, right? And additionally, let me say, the only time you might
actually have that kind of sharing is in a system call like the
[indiscernible] to user, but a system call is effectively IO. So --

>>: [Indiscernible].
>> Chris Rossbach: That's right. Yeah. I mean, unless you have got Don's
stuff, you might try to do that.
>>: So how can your performance [indiscernible] measure -- I guess any
hardware simulation would be kind of free, cost little. How would you measure
the performance of [indiscernible]?
>> Chris Rossbach: Same way. I mean, it's execution-driven simulation. So
you know --

>>: Actually you really measure --
>> Chris Rossbach: Absolutely. I built a cache model that aborts a hardware
transaction in the simulator when it overflows, and then the operating system
reacts by coming back and acquiring [indiscernible] --
>>: You gave that XC some reasonable, you know --

>> Chris Rossbach: Reasonable --

>>: -- [indiscernible] like aborting the cache. You have --

>> Chris Rossbach: Yeah, well, so, okay --

>>: Now add some [indiscernible].
>> Chris Rossbach: You do, although it's cache clearing [indiscernible].
Right? So to abort, you essentially need to make every line invalid, which
winds up being -- you can essentially flash-clear. So you can do that in
some very small number of cycles. That's not the main cost.

The main cost is in the fact that you suddenly have a cold cache for
everything that you've just thrown away, which, by the way, turns out to be
kind of a major neglected cost in using hardware transactions --

>>: [Indiscernible] build a machine, it would probably perform the same.
>> Chris Rossbach: I believe that, yes.

>>: This Linux CPU is pretty -- it's in order, right?
>> Chris Rossbach: Yeah. No, we did not -- we did not do out of order,
anything like that.

>>: Yeah. Okay. But you [indiscernible] cache flow.
>> Chris Rossbach: Yes. So I mean, I rewrote, you know --

>>: [Indiscernible].

>> Chris Rossbach: Oh, okay.
>>:
I had one more question.
>> Chris Rossbach:
Yes.
>>: So apart from the priority [indiscernible], did you see other progress
problems, like in cases where --

>> Chris Rossbach: Oh, all the time. Not so much with contention, but more
often in cases where the implicit, you know, I guess I want to say protocol,
but you know, the way you're supposed to use the lock was not obvious to us
until we, you know, broke it by using a transaction.
So one great case is sequence locks, which have this assumption that they
can actually do sharing on the same stack, but a reader always has to be
higher on the stack than the -- than a writer, or you deadlock. We didn't
know that. And so when we replaced it with something that could roll back
and acquire a lock, it caused the whole system to freeze.
So you'd like to hear about GPUs?
>>:
Yes.
[Indiscernible].
>> Chris Rossbach: I'm going to run out of time. So I'm going to breeze
through this then. So it's good that I can breeze through it, because it's
preliminary work.
So the motivation for this work, really, is what I believe from my
experience of trying to leverage GPUs for certain workloads: that OS-level
abstractions are what's limiting GPU applicability in certain application
domains. So there are certain application domains where it's definitely a
good fit. One of them, and the most obvious, is the graphics and gaming
community, right? These people had better have the right programming tools;
the devices were built for that. The tools these people have are things like
shader languages, HLSL, and graphics libraries like DirectX and OpenGL.
And then there's this other community called the GP-GPU, general-purpose
GPU, community. These people are largely focused on parallelizing
high-latency scientific algorithms like protein folding, and there are some
great tools for them, like CUDA and OpenCL, which are user-mode
parallelization frameworks that provide a C-like interface.
However, I believe that the application ecosystem is considerably more
diverse than this, and what I'm going to try to convince you of in the next
few slides is that lack of good OS-level abstractions is what's kind of
sequestering GPU usage into these two domains.
And to sort of focus the argument I'm making, I want to look at what I'm
calling interactive applications. So examples of this are gestural interface,
waving your hands at your computer hoping for a result, brain-computer
interface, which my advisor is a huge believer in, by the way. He went and got
one of these helmets. Spatial audio and real time image recognition.
So what these share with the applications on the previous slide is the high
level of data parallelism, making them a good fit for a GPU model of
computation. What makes them different is that in most cases, they're
fundamentally processing user input, which means that while they need
concurrency to get good performance, they also need low latency.
And because they're acting as a logical device, they need to be multiplexed
by the OS across applications. And I want to focus in particular on the
problem of gestural interface, because I've spent a lot of time looking at
this particular problem. Examples of which, of course, you guys have.
At a high level, I'm showing you, in the upper left of the screen, a basic
decomposition of this problem. I'm interested in the problem when you
implement the system with cameras. And so, you know, a basic system to build
this would capture some number of images from cameras, which might support
ranging, like a distance to objects in the field.
You need to be able to do some image filtering and geometric transformation so
this is to transform the data that you've captured from the perspective of the
camera to the perspective of the user or the screen.
And the result of this step is some cloud of points, which you can feed to a
gesture recognition algorithm that looks for hands and eventually feeds that
to the OS as HID input. So it can turn into, you know, gesture messages or
mouse messages and so on.
Now, what characterizes this workload is of course high data rates -- if you
have multiple cameras producing large images at, you know, 60 to a hundred
hertz -- and, because we want to be able to use commodity hardware, we might
have noisy input. And to convince you of that, I'm showing you this blue
blob in the lower right of the screen. And if you kind of squint and tilt
your head to the side, you can see this is my hand in front of my screen.
This is actually captured from the prototype I've been building. Very, very
noisy. So we need some pretty heavyweight algorithms to be able to denoise
this.
Now, here's how I wish I could build this system. What I wish I could do is
write four separate programs and connect them with POSIX pipes. It would be
nice, modular. The four programs would be catusb -- cat /dev/usb -- to
capture images from the device; xform, which transforms image data and does
noise filtering, puts it in the perspective of the screen; detect, which
takes a point cloud and looks for hands; and then hidinput, which takes the
output of detect and sends it to the OS.
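The wished-for design -- "catusb | xform | detect | hidinput" over pipes -- can be sketched as four composable stages. In this toy, each generator stands in for a separate process reading from and writing to a pipe; the stage names come from the talk, but their bodies are placeholders, not the real image-processing code.

```python
def catusb():
    # Stand-in for "cat /dev/usb": capture raw frames from the cameras.
    yield from [0, 1, 2]

def xform(frames):
    # Stand-in for the geometric transform + noise filtering step
    # (the data-parallel stage that really wants to run on the GPU).
    for f in frames:
        yield f * 10

def detect(points):
    # Stand-in for looking for hands in the point cloud.
    for p in points:
        yield ("hand", p)

def hidinput(events):
    # Stand-in for handing gesture events to the OS as HID input.
    return list(events)

# Equivalent in spirit to the shell pipeline:
#   catusb | xform | detect | hidinput
result = hidinput(detect(xform(catusb())))
print(result)  # -> [('hand', 0), ('hand', 10), ('hand', 20)]
```

The appeal is exactly the modularity being described: each stage only knows about its input and output streams, so any stage can be swapped out independently.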
Now, some observations. Capturing data from a camera, sending mouse events
to an OS -- these are inherently sequential. Noise filtering, geometric
transformations, and potentially detection are inherently data parallel.
Now, with the OS abstractions I'm illustrating here, mainly pipes, I could do
this on a -- if I were going to implement this with essentially just CPUs.
However, what I want to convince you is that I don't want to do that. In fact,
I can't really do that. And what I'm showing you here is the performance in
terms of frame rate for the xform step in the prototype that I've been
building. I built it two ways. One using fork-join parallelism and
running -- the numbers here are captured on a four-CPU -- or four-core CMP --
machine. And then the blue bars are for a GPU implementation that is using a
256-core video card.
Now, the idea here is that there's a difficult tradeoff between the quality of
noise filtering and the amount of computation that you're willing to invest in
it. And the ultimate point is that for any acceptable level of quality of
noise filtering, we wind up with frame rates on the CPU implementation that are
below one per second. So the system winds up being unusable.
And the ultimate point is that not only do we want to use the GPU, in this
case we need to.
John, is there a concern?
>>:
[Indiscernible] things get worse.
>> Chris Rossbach: Sorry. We can dig in. Essentially I'm using --

>>: Oh, okay. Thank you.

[Laughter].

>> Chris Rossbach: That was much more succinct than -- [laughter]. I was
going to dig into the algorithm.
Great. You might say to me, okay, Chris, fine, use the GPU. What's the big
deal? You can still use your pipes. Why don't you just rewrite your xform
and your detect programs so that, you know, after reading input from the
pipe, you pack it up, send it to the GPU, do your computation there -- and
hey, have you heard of these new GP-GPU frameworks you can use? They make it
simpler to write.
Now, before I convince you that this is going to fall down, I need to give you
some background on how GPUs execute a shader program.
The first and probably most important property is that in general -- and
there are exceptions like Larrabee, of course -- a GPU can't run an OS. This
is because it has a different instruction set. It lacks features like
interrupts that are critical for implementing -- for running an operating
system. Typically, GPUs have a disjoint memory space, and that memory space
is not coherent with main memory. Right? And so ultimately we wind up with
a situation where some process on the CPU has to orchestrate the execution on
the GPU. And in the current regime, user-mode applications have to implement
this per application in a sort of ad hoc, application-dependent way.
Now, let's take a look at the way I decomposed this problem from the
technology stack point of view with all of this in mind. And what I want to
show you is how data is going to move back and forth across the kernel
boundary. I've got some cameras connected to a USB port, and I've got all my
different components in the design up above the user boundary. And the first
thing that happens is we capture some image data from a camera. We read it
into user space. What do we do? We write it back into kernel space to send
it into a pipe.
Now, the next step in the program is this xform step. And we need to run
that on a GPU. So we're going to read data out of a pipe, write it back into
the kernel through all these parallelization frameworks. We run the -- we
run the shader program on the device, and then repeat the exercise in reverse
to write it into the pipe for the next step.
Hopefully you can all see where this is going. You don't need me to like play
it out in great detail. We wind up with this kind of tennis match of data that
is going back and forth across the user kernel boundary that we would really
like to avoid. And in this naive design, we wind up with 12 kernel
crossings, and six copies to and six from, for fairly large image buffers. And
hopefully I'll convince you also in subsequent slides that there are
performance tradeoffs introduced by using these additional layers that we'd
like to be able to get rid of.
Now. You might say to me, okay, so you don't want to cross the user kernel
boundary. What's your problem? Why don't you just do this?
So in fairness, I will admit that this is actually the design that I started
with when I thought about how I want to build this. But it really is kind of
a nonstarter because there are no high-level abstractions. This is not where
you want to be -- you don't want to be writing code here if you can avoid
it. If you're Microsoft or nVidia, this might be tenable because you can
actually get the documentation you need to write this. But if you're me in
my cube at UT, this is definitely, you know, a challenge.

But ultimately, the solution winds up being specialized. Right? You lose
the modularity.
But even under this design, if you're willing to accept all of that, there's
still a data migration problem. To convince you of that, I'm showing you
kind of the hardware view. We have a GPU connected on a PCI Express bus,
main memory, south bridge, north bridge, CPU. And again, regardless of how
data moves across the user kernel boundary, we wind up with some sort of
unattractive communication patterns. So when we capture data from the USB
from the cameras, we wind up writing into main memory. To execute on the
GPU, we then need to copy it across into GPU memory space. After the shader
runs, we copy it back. And the rest of course is left as an exercise for the
reader, right?
What we wind up with is data traversing a labyrinth through the system.
These are big buffers, and they're moving at high frequency.
It's avoidable, and -- or I believe it's avoidable.
It's not avoidable yet.
But why do we want to avoid it? It wastes bandwidth. It wastes power. PCI
transfers have to be coherent with memory on the CPU, so we can even cause
cache pollution here.
What we really want is to be able to simplify the data path. We'd like to be
able to read straight out of, you know, straight out of the south bridge, go
straight into the GPU memory where we can run two steps without any additional
memory transfers and only move the memory -- only move buffers at the very end
to main memory when we want to perform the final step.
And, you know, the ultimate point here is that I think the machine can do this,
but the OS interfaces to make it possible are not there.
>>:
[Indiscernible]?
>> Chris Rossbach: You could, and then you lose the modularity. But yes,
absolutely. Two passes on the shader.
Okay. So what I'm proposing, and what I've just started to build as
extensions to Linux, are some new OS abstractions to address this problem.
The first is a ptask, which stands for parallel task. And it's like a
process or thread, but it has this additional stipulation that it can exist
without a dedicated user-mode process managing it.
It has a list of mappable input and output resources that you can think of as
analogous to standard in, standard out, standard error.
The next abstraction is an endpoint, which is an object in the kernel name
space that you should think of as a data source or a sink. So examples are a
USB bus, or buffers in GPU or CPU memory.
And then a channel which allows us to connect endpoints and orchestrate how
data moves through the system. And essentially it's an IPC analog that is
similar to a pipe and has also these properties of being able to have 1-to-1,
1-to-many kinds of relationships.
Now, at the end of the day, I want to make it clear that I'm fundamentally proposing that we expand the system call interface to have analogs for IPC and for execution of processes on a GPU, and even scheduling, so that we can bring the GPU under the domain of the OS scheduler.
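As a sketch of how these three abstractions might compose, here is a small user-space mock in Python. None of these class names or this wiring correspond to a real Linux interface; they just model the proposed ptask, endpoint, and channel relationships:

```python
# User-space mock of the proposed abstractions; the names are
# hypothetical, not a real kernel API.

class Endpoint:
    """An object in the kernel namespace: a data source or sink,
    e.g. a USB bus or a buffer in GPU or CPU memory."""
    def __init__(self, name):
        self.name = name

class Channel:
    """An IPC analog, similar to a pipe, connecting a source endpoint
    to one or more sink endpoints (1-to-1 or 1-to-many)."""
    def __init__(self, src, sinks):
        self.src, self.sinks = src, list(sinks)

class PTask:
    """A parallel task: like a process or thread, but able to exist
    without a dedicated user mode process managing it. Its mappable
    inputs and outputs are analogous to stdin/stdout/stderr."""
    def __init__(self, name, inputs, outputs):
        self.name, self.inputs, self.outputs = name, list(inputs), list(outputs)

def triggered(ptasks, endpoint):
    """Data arriving at an endpoint triggers the ptasks that read it,
    with no user mode orchestration in between."""
    return [t.name for t in ptasks if endpoint in t.inputs]

# Wire up the pipeline from the talk.
usb_src = Endpoint("usb_source")
raw_img = Endpoint("raw_image")
cooked  = Endpoint("transformed_image")

xform  = PTask("xform",  inputs=[raw_img], outputs=[cooked])
detect = PTask("detect", inputs=[cooked],  outputs=[])

graph = [Channel(usb_src, [raw_img])]  # straight from USB to the GPU input

print(triggered([xform, detect], raw_img))  # a new frame wakes up xform
```

The design point is that the kernel, not a user mode process, decides when a ptask runs: data arrival on an input endpoint is the trigger.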
Now, if we revisit this problem with these new abstractions, what we wind up with is essentially a graph. To sort of hopefully very quickly talk you through this, recall that catusb and hidinput are sequential programs, so I've left them as traditional processes. But I've introduced ptasks for the data parallel processes, the xform and the detect. I have endpoints for each of the fundamental sources and sinks of data in the system, so the USB source, the raw image input, and I'm connecting them using these channel abstractions.
Now, obviously I'm not the first guy to say that a graph is a good way of
thinking about concurrency. There's a lot of people who have come before me
saying the same thing. So why is this a better way to solve this problem?
Well, first of all, we can eliminate unnecessary communication by implementing
our channels such that they avoid transfers to and from main memory where they
can be avoided.
>>: [Indiscernible]?

>> Chris Rossbach: This one, absolutely.

>>: This --

>> Chris Rossbach: The one not sure.

>>: (Inaudible) possible?

>> Chris Rossbach: Yeah, I don't think it is yet either. I'm willing to -- I'm willing to say we should have it. Okay.
And of course we can eliminate unnecessary user-kernel crossings, and eliminate the need to have all these dedicated user mode processes that orchestrate, by having the arrival of new data at a particular endpoint trigger the computation on the ptask it's connected to.
Now, as I said, I have only just started implementing this in Linux, but I did spend a good deal of time prototyping, using this little camera here and Windows 7, to see whether this is in fact a reasonable research direction. So I want to show you a snapshot of what I learned doing that. And again, what I'm showing you is the performance of my xform program, and I'm comparing a simple implementation on top of CUDA against what I'll call a ptask analog.
So what's a ptask analog? Obviously I can't modify Windows 7, and I can't modify the drivers that NVIDIA supplies. But what I could do is build a kernel mode driver that deals with the USB and maps memory that is shared with a user mode driver whose only task is to call the CopyResource API from DirectX. So essentially I'm minimizing the migration across the user-kernel boundary and I'm minimizing the user mode work.
And even in cases where I have to migrate data back and forth to the hardware on every frame, I can see significant speedups this way. In cases where I can eliminate communication from the host to the device, which I consider kind of representative of this case of being able to transfer straight in from USB, we can see speedups of up to ten percent -- sorry, ten percent? 10x. A lot more than ten percent.
All right. So a brief note about related work. There's obviously a lot of it, and much of it done by people in this room. Helios in particular is very much related, although because you guys were largely looking at Larrabee-like GPUs, I think it's kind of a different problem domain. Also I think there are essentially synergistic ideas there. Okay.
Graph-based programming models: Synthesis is a big inspiration, the IO model there. Dryad, StreamIt. You know, DirectShow. This looks a lot like DirectShow, which has been around for quite a while now.
And anyway, with that, I'll try to wrap things up and move on.
A brief word about future work. I don't want to say too much because I've just
spent the last ten minutes really talking about future work.
I do think there are a lot of interesting problems still open in transactional
memory. You know, I just think that that's something that we're going to want
in some form or another going forward. One problem I'm very interested in is
integration with hypervisors, virtual machine monitors. I'd like to be able to
expand my horizons and look a little more at distributed parallelism. I spent
a lot of time in the late '90s building dot-com kinds of things and would like
to be able to leverage some of that experience again.
And finally, I think as we come to expect more and more from how we interact
with our machines, we're going to need to essentially be able to parallelize
more and more complex algorithms. You know, that's -- you can sort of read this
bullet as code for that. I think there's a lot of interesting opportunities
for research there.
So, oh yes, the de rigueur selected publications slide. Please take away from this that I have broad interests. I've published in operating systems, architecture, programming languages, and, you know, I'm essentially interested in anything provided it's cool.
So in conclusion, I do believe parallelism is the way forward. And it's hard.
It's probably going to remain hard for quite a while. There's going to be
plenty of interesting research opportunities there. And in order to take
advantage of concurrency, we really need the right abstractions. And this
requires being able to do research that looks at multiple layers of the
technology stack.
So thanks for listening.