>> Shaz Qadeer: All right. Welcome to this talk. It's my pleasure to
introduce Joe Devietti to you. He's visiting us from across the lake from the
University of Washington, where he has been a graduate student for the past
five years. His work lies at the intersection of computer architecture,
programming languages and software engineering and he's been working on making
programs, parallel programs in particular, deterministic. And he'll tell us
about that work today.
>> Joseph Devietti: Great. Thanks, Shaz. So thanks, everyone, for coming to
my talk. Really appreciate it. So today, I'm going to try and convince you
that there is no such thing as luck, at least when it comes to parallel
programming, and that we can radically improve the way we write parallel
programs with determinism.
So as you all know, we live in an age of parallel hardware, right, every
hardware designer has shipped a multicore design at this point. Most of the
time, it's actually difficult to buy a single core design, even if you want
one. In contrast to this surfeit of parallel hardware, though, there are
relatively few parallel programmers who can write this sort of code that
efficiently utilizes all of these parallel resources.
So why is parallel programming so difficult? Well, first of all, it's
difficult because it's programming, right? This is far from a solved problem.
All of the single thread issues that you have show up again in the context of
parallelism.
On top of that, if you're writing a parallel program, you generally care a lot
about performance. That's one of the main reasons to parallelize and getting a
performant parallel program means you have to deal with things like false
sharing, lock contention, scalability and other complicated issues.
On top of that, parallelism introduces its own strange concurrency issues that
can affect correctness. So there's a whole new class of correctness bugs out
there with scary names like atomicity violations and data races and these only
show up in the context of parallelism.
And then finally, I would argue that, as if this weren't bad enough, there's an
overarching layer of non-determinism that really exacerbates the single- and
multi-threaded issues because it makes execution non-repeatable. So
non-determinism really plagues us throughout the software development cycle. If
we want to debug our programs, we can't reproduce these so-called Heisenbugs;
when we're testing our parallel code, it's very difficult to guarantee testing
coverage because of non-determinism. And then, of course, there are issues
that will inevitably seep out into deployment and it's difficult to reproduce
those back in a development environment.
So non-determinism makes life difficult at every step of this cycle.
So I've also done work on correctness issues in parallel programs. But today,
I'm going to focus on non-determinism, and I'm going to show you that
non-determinism is not a fundamental aspect of parallel computation, even
though all of the multiprocessors and multichip processors that we've been
building for decades now have been nondeterministic. There's no reason they
have to be that way, and I'll show you some designs today executed in hardware,
in software and also even with small modifications to the programming language
that show that we can get deterministic execution for arbitrary parallel
programs with relatively low overhead.
So here's the outline of the rest of my talk. First, I'll be much clearer
about what I mean by determinism, give you some additional motivation for why
it's valuable. After that, I'll describe this technique of execution-level
determinism, which is a set of techniques we developed at UW that provide
deterministic execution for an arbitrary parallel program. Even programs that
have bugs like data races in them.
So I'll show you some algorithms for providing deterministic execution, argue
why those algorithms are correct, and then I'll describe how to efficiently
implement these algorithms in hardware and in software and then show you some
results from the systems that I've built.
After that, I'll talk about some current work that I'm doing which has some
lightweight language extensions to provide higher performance determinism in
the context of a software only system. And then finally, I'll share some of my
ideas for future work.
All right. So let's get started. What is determinism? So determinism at a
kind of intuitive level means that there's one possible output for each input.
So with a current, nondeterministic system, like a conventional multiprocessor
we have today, there are many possible outputs for any input. You run your
program, you don't really know which one of these you're going to get. With
determinism, of course, there would be only one output for each of those
inputs. And crucially, the output that you get for any particular input is
somewhat arbitrary. It's always going to be the same, but we're going to use
this flexibility of not really specifying in advance which of these outputs you
get in order to provide higher performance, as you'll see later.
So just to make this a little bit more concrete, and since it's an election
year, we have this example of a multithreaded voting machine. So we've got
these two threads, a blue thread and a red thread. So what they're doing is
they're subtracting and incrementing from a shared counter. So at the end of
the day, we just look at the value in the counter. If it's positive or
negative, we can tell who won the election.
So if the code from these two threads, the instructions, interleaves at a
relatively coarse granularity, then that's okay. We get the expected result
that the election is a tie. But if we have them interleaving at a finer
granularity, we have a classic lost update problem. And we're going to drop
one of the votes. Now, depending on whether it's a red vote or a blue vote
that's been dropped, you may or may not consider this to be a bug. Maybe a
feature. But in the spirit of bipartisanship, let's say that this is some sort
of issue we want to fix, right.
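
To make the lost-update scenario concrete, here is a deliberately racy C++ sketch of the voting machine. The thread counts and names are just for illustration; because the tally is updated without synchronization, the final value can differ from run to run on a conventional, nondeterministic machine.

    #include <iostream>
    #include <thread>

    int tally = 0;  // > 0: blue wins, < 0: red wins, 0: tie

    void blue_votes(int n) {
        for (int i = 0; i < n; ++i) {
            int t = tally;   // read the shared counter
            tally = t + 1;   // write it back; the other thread may have run in between
        }
    }

    void red_votes(int n) {
        for (int i = 0; i < n; ++i) {
            int t = tally;
            tally = t - 1;
        }
    }

    int main() {
        std::thread blue(blue_votes, 100000);
        std::thread red(red_votes, 100000);
        blue.join();
        red.join();
        // The expected result is a tie (0), but because updates can be lost,
        // the printed value varies from run to run on a conventional machine.
        std::cout << "tally = " << tally << "\n";
    }
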
So why can these different interleavings arise in the first place? Well,
there's lots of things that can affect an execution, from the single threaded
world, we're familiar with notions of input, the bytes that you read from a
file or the buttons that a user clicks on. These obviously can affect what
your program does.
In the multithreaded world, the number of threads that your program is using
can also affect the computation that it performs. But unfortunately, on our
current, nondeterministic parallel systems, there are many other things that
can also affect what execution you get when you run your program.
So whether your program has data races in it can definitely affect what
happens. Scheduling decisions made by the operating system or even by the
hardware can affect things. Even low-level phenomena like temperature can
affect the speed with which circuits are switching and ultimately that can
bubble up into something that's visible in the program's output.
And those of you who have done a lot of parallel programming before will no
doubt be familiar with the idea that the phase of the moon is actually a crucial
aspect of getting a parallel program to compute correctly, right.
So ultimately, we could maybe hope to control a lot of these phenomena, right.
Maybe if we didn't believe too much in quantum mechanics, we could control
things down to the very lowest physical level. But ultimately, as a
programmer, it's very difficult to control a lot of these very low level things
that can affect an execution. Programmers have enough trouble just keeping
data races out of their program, let alone worrying about things like
temperature.
So ultimately, what we want is we want a higher level, more useful notion of
input like these explicit inputs here and we want a useful notion of
determinism that works for arbitrary multithreaded programs, abstracting away
all these low-level details that can affect an execution and leaving execution
strictly a function of these high-level inputs that a programmer can easily
identify and control.
So that's exactly the kind of scheme that I'll describe to you today.
So just to talk a bit about some of the benefits that determinism could provide
if we imagine a world in the future where deterministic execution was deployed
throughout our computing infrastructure, so what I have here is I have a little
cartoon of the software development cycle. So we do some amount of iterating
between testing and debugging. After that, we go ahead and deploy something
that we're reasonably confident in. Of course, deployment just reveals new
issues and the cycle starts all over again, right.
So if we had determinism available throughout our deployed computer
infrastructure, we'd be able to use reverse debugging techniques while we're
debugging our code, testing results would be reproducible, and this would
eliminate the need to stress-test parallel programs, right. We don't have to
run them over and over again, hoping to trigger interesting thread
interleavings. We just run it once, you get the output that you get, and then
you move on to the next input.
Because we would have a guarantee that programs would behave the same way from
testing on to deployment, testing would ultimately be more valuable and so we
would be deploying code that would be more robust, that we'd have more faith
in. And then finally for the issues that did leak out into production, we'd be
able to reproduce those back in a development environment.
Yes, question?
>>: So your definition of determinism was kind of end to end. Got an input,
got an output, some end state, and those are in a one-to-one relationship.
Talking about debugging, seems like you might be concerned with intermediate
states where your definition would allow there to be multiple different
intermediate states leading to the same final state.
>> Joseph Devietti: Correct.
>>: Are you going to talk about that?
>> Joseph Devietti: Yeah, so let me just repeat it for the recording. So the
question is there is a notion of sort of just input/output determinism that
I've described that maybe during production all we care about is we provide the
inputs, we get the same output. But during debugging, we would also care about
maybe the intermediate states and the computations that were used to derive
that final output. Yes, that's a great point. So I will touch on that a
little bit later. As you'll see, as I describe the systems and how they
actually work, that we do provide that series of deterministic intermediate
states as well, which I do think is crucial for debugging.
So many others have identified these as useful properties for systems research,
so there's been a lot of related work that provides these same benefits but in
a more limited fashion.
So, for example, there's been a large amount of work on record and replay
systems, both in hardware and also in software and also hybrid
hardware/software systems. Some of this technology is actually transferred
over to the commercial realm. So these days, it's not uncommon for hypervisors
and also debuggers to allow you to execute programs in reverse or allow
you to replay an execution. Currently, the commercial products only allow
record and replay for single threaded programs. So multithreaded record and
replay is still considered an open research problem.
On the testing side of things, no doubt many of you will be familiar with the
work that comes up with smarter testing strategies, so ways to examine all of
the schedules that are possible for a given input, and ways to intelligently
explore that space.
There's also been work on what's called constrained execution. So the idea
here is that we test our program, right, and we explore through testing a
certain part of the space of possible schedules, and then when we go and deploy
our program, we might like to sort of limit its execution to that part of the
space that we've tested before.
So constrained execution is a mechanism for, in a best effort way, trying to
limit the code to doing what we've tested it doing.
So in contrast to these point solutions, determinism provides a unified
mechanism, a single mechanism that provides all these benefits. There have been
other approaches to determinism as well, however.
So one of the most popular has been this idea of language-level determinism.
So the basic idea here, this is a new programming language, something like
Deterministic Parallel Java or NESL, if you're familiar with those. And the
basic idea is let's have a new programming language which is constrained in the
kind of parallelism that it can express. Often there's a heavy emphasis on
static checking as well.
What we will require, what we will do is we'll design the language such that
all programs that are expressed in such a language are deterministic by
construction. So that's the trade-off that you make. You have to rewrite your
program in this new language. It only supports restricted forms of
parallelism, but it gives you determinism by construction.
Another interesting, somewhat bizarre use case of determinism came up in, of
all places, the Halo game engine. So Halo is, as you at Microsoft well
know, a very popular first-person shooter for the Xbox. And it turns out that
in speaking with the Halo developers, the Halo engine has been explicitly
architected to run in a deterministic way, since back in 2001. And so Halo is
multi-threaded, running on the Xbox as well as the GPU. And the reason that
the Halo developers did this was for all of the good software engineering
reasons that I described previously. But also, determinism enables some new
use cases in the context of their game engine.
So, for example, when you are recording a game session in Halo, you can
efficiently do that recording by just logging the explicit network and
controller inputs, right. You don't have to record anything else about the
execution. And then later on, when you watch that game session again, you
aren't just seeing a movie. You're actually experiencing a deterministic
reexecution of that previous session. And so this allows you as a user to
interact with the game world in certain non-intrusive ways. For example, you
can move the camera around and you can also re-render that session at higher
resolution if you want.
And because the underlying platform is deterministic, you have a guarantee that
when you re-execute it, even with this non-intrusive instrumentation, no
divergence will occur. And if there were a -- if Halo was a nondeterministic
execution platform, obviously, all sorts of things could go wrong during the
replay process.
So in contrast to these two approaches that we've seen, so the Halo developers
were just using manual best effort approaches. They stuck to deterministic
parallelism patterns. They tried to avoid data races in their C++ code, and
they hoped for the best. So that's the sort of manual approach. We
could also write all of our code in these deterministic languages, but they
have various restrictions as well.
So in contrast to these two previous approaches, what I'll show you today is
this idea of execution level determinism. And so execution level determinism
is a completely automated approach. There's no need for a programmer to do
anything to their code for it to run on this platform in a deterministic way.
All right. So that's the first bit. So now let's move on. How do we actually
provide determinism? Well, let's go back to this idea of our multithreaded
voting machine. So we have these different interleavings of instructions here.
And we saw some examples of how they could arise. But in order to control the
program's output, we need to control this interleaving in some way. And so the
basic idea is we're going to isolate the instructions from each thread from all
of the other threads. So we're going to isolate them in terms of the memory
that they read and write. So it's a shared memory program, but we isolate each
thread from all the other threads. Basically, what this is doing is it's
taking a multithreaded program and converting it into a collection of
single-threaded programs. So each of these single-threaded programs is going
to naturally execute in a deterministic way and so that's how we can get
determinism from a multithreaded code.
Now, obviously, threads want to communicate with each other. That was the
whole point of writing a multithreaded program in the first place. So we need
to handle these updates. What we need to do is merge updates
in a deterministic way and at deterministic times or points in the execution.
And then to execute an arbitrary parallel program, we can just lather, rinse,
repeat this process until termination. So this is the determinism recipe that
we're going to follow today. We need to isolate threads' updates from each
other. We need to merge updates in a deterministic way and at deterministic
times.
And as you'll see, we have a series of optimizations, but we can come back to
this recipe each time and see how it provides determinism and how we can change
the mechanisms to provide different --
>>: [indiscernible]. You might have to, well, if you are interested in
providing sequential consistency, just merging [indiscernible].
>> Joseph Devietti: Right. So we'll see the first scheme I'll talk about uses
transactional memory. It does provide sequential consistency. We'll show how
to handle these conflicts. And then later on, I will argue that maybe we
should give up on sequential consistency, move to more relaxed consistency, but
we don't have to give up on determinism in that context.
All right. So let's start with the simplest possible thing, right.
Serializing the program. So this is an execution timeline diagram. I'm going
to show you a bunch of these throughout the talk today. We've got time on the
X-axis and then threads are on the Y-axis. And what we're going to do is we're
going to execute one thread at a time, switching between them in a round robin
order.
Now, there are a couple of things we have to enforce in order to ensure
determinism. Each thread executes a set of instructions; we call that a quantum. And
then the amount of time it takes for each thread to execute a quantum, we call
a quantum round. And so to enforce determinism here, we need to ensure the quanta
are of a deterministic size; they're typically on the order of 10,000
instructions. And then we have to have a deterministic scheduling policy for
switching from one thread to the other. So round robin is a very simple but
perfectly fine example.
If we have these two properties, then we can enforce deterministic execution.
To put this in terms of the recipe that I showed you previously, we need to
isolate threads from each other. We do that by running one thread at a time.
Very trivial form of isolation. We don't actually need to merge updates,
because, well, it's only one thread running. And then in order to switch
between threads at deterministic times, we're going to count out a fixed
number of instructions.
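
Here is a minimal C++ sketch of this serialization scheme, with each thread modeled as a callable that performs one unit of work. The DetThread type and the instruction-counting loop are simplifications; a real system counts retired instructions in hardware rather than calling a function per "instruction".

    #include <functional>
    #include <vector>

    constexpr long QUANTUM = 10000;  // deterministic quantum size, in "instructions"

    // Each thread is modeled as a step() function that does one unit of work
    // and returns false once the thread has finished.
    struct DetThread {
        std::function<bool()> step;
        bool finished = false;
    };

    // Run one thread at a time for a fixed quantum, switching in round-robin
    // order. Both the quantum size and the schedule are deterministic, so the
    // whole execution is.
    void deterministic_serial_schedule(std::vector<DetThread>& threads) {
        bool any_live = true;
        while (any_live) {                    // one iteration = one quantum round
            any_live = false;
            for (auto& t : threads) {         // fixed round-robin order
                if (t.finished) continue;
                for (long i = 0; i < QUANTUM && !t.finished; ++i)
                    t.finished = !t.step();   // deterministic quantum boundary
                any_live = true;
            }
        }
    }
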
So this is great. This allows us to enforce deterministic execution for an
arbitrary parallel program. This could be the end of my talk, right? Yes,
question.
>>: That's a fixed number of instructions or unless the thread doesn't
[indiscernible].
>> Joseph Devietti: Yes, so there are some things you have to do if you want
to handle -- like if it blocks into the operating system, say, where threads
are created, threads exit, yes. But in the common case where threads are just
running, this is fine.
Great. I should also point out it's straightforward to handle those other
cases deterministically as well, because they arrive at deterministic points in
each thread's execution. Great.
So what's wrong here, right? There's only one simple thing that we left out,
and that is, of course, all of the parallelism in the original program. We
wrote a parallel program in the first place that was something, obviously, that
was important to us. So what we're doing or what we would like to do is we'd
like to be able to somehow, you know, magically overlap the execution of these
quanta, because we started with a parallel program. Parallelizing parallel
code should be relatively straightforward, right? Much easier than
parallelizing sequential code.
So as an architect, whenever I hear that we need some kind of magic mechanism to
overlap these quanta, I think, okay, that's pretty straightforward. We'll just
go ahead and speculate. And the particular speculation mechanism that we're
going to use is called hardware transactional memory. So many of you are no
doubt familiar with transactional memory and it's a very sophisticated and
active research area. And I don't have time to go into all the details today,
but I will tell you the three things you need to know about transactions to
follow the rest of this talk.
So the first thing is that programmers place code inside transactions. The
second thing is that the semantics of these transactions are such that the code
running inside a transaction is under the illusion that it's executing
completely serially. It's like it's the only code running in the entire system.
Now, in reality, there may be other transactions running concurrently, but this
is the illusion that you as a programmer can rely upon.
The third thing we need to know about transactions is that they're implemented
using speculation. Particularly, like in the context of hardware. And so what
this means is that what the run time system will do is it will potentially
launch a bunch of transactions, all running concurrently, will check for
interference between these transactions as they run. If there's no
interference by the end of a transaction, then that transaction can go ahead
and commit.
However, if there is some interference that's detected, then that transaction
will roll back, it will be as if it never executed in the first place. All
right.
So in terms of our recipe, how does this all fit together? Well, we're going
to isolate threads from each other by taking these quanta and putting them
inside transactions. In order to merge updates in a deterministic way, we are
going to enforce that these quanta commit in a deterministic order. And then
in order to switch between threads at deterministic times, we'll just count
instructions like before. So let's see. Yes. I should also point out that
there are actually three criteria that determine when a thread's quantum ends.
One is hitting a fixed instruction count. The others deal with synchronization
and also with exhausting finite hardware resources, and I'll get to those later
in the talk.
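
Here is a hedged C++ sketch of just the deterministic commit ordering. The speculation itself (conflict detection and rollback) is assumed to come from the hardware TM and is not modeled here, and the CommitToken class is an illustration, not the actual mechanism.

    #include <condition_variable>
    #include <mutex>

    // Quanta run concurrently and speculatively; commits are serialized in a
    // fixed round-robin thread order by passing a commit token.
    class CommitToken {
        std::mutex m_;
        std::condition_variable cv_;
        int next_tid_ = 0;      // deterministic commit order: 0, 1, 2, ..., 0, ...
        int num_threads_;
    public:
        explicit CommitToken(int n) : num_threads_(n) {}

        void acquire(int tid) {  // block until it is this thread's turn to commit
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [&] { return next_tid_ == tid; });
        }
        void release() {         // pass the token to the next thread in the order
            std::lock_guard<std::mutex> lk(m_);
            next_tid_ = (next_tid_ + 1) % num_threads_;
            cv_.notify_all();
        }
    };

In this sketch, each thread would execute its quantum inside a transaction, call acquire with its thread id, attempt to commit, and then call release; on a conflict it rolls back and re-executes while still holding the token, which corresponds to commit mode.
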
All right. So let's see an example of how this all fits together. So we're
going to execute quanta in parallel, right, because they're implicitly running
inside these transactions. After the quanta are done executing, we go ahead
and commit them in some sort of fixed deterministic order. So that's what
commit mode does. So there's two things that are important about commit mode.
One is that it's enforcing the illusion of serial execution, right. So this is
exactly equivalent to the straightforward serialization that we saw previously.
And that's because we're forcing a commit in a particular order.
The second thing that's important about commit mode I'll get to in just a
second, but that fixed order can actually help us optimize a bunch of the
conflict detection mechanism as well.
So what happens if we have one of these conflicts? Let's say we have a
conflict between thread one and thread two. What we're going to do is we will
detect that conflict, we will roll back in this case the transaction running on
thread two, re-execute it during commit mode, and then continue executing from
there. So again, this maintains that illusion of serial order that we started
with.
So let's dig into conflict detection a little bit here, right. So as you're
probably familiar with, conflicts are defined as two operations from, say, two
different threads or two different cores, accessing the same memory location
and at least one of those operations is a write. So that's a standard
definition of a conflict from the transactional memory literature, and that's
in this table here.
Now, when we get to our deterministic transactional memory system, it turns out
that we can be much more relaxed in detecting conflicts. So how does this
work? Well, the basic idea is that for these two rows that I've highlighted
here, we have a write operation as the second operation. So this write
operation is always going to produce the same value for a location, because
we're executing deterministically.
The write operation will always be ordered as the second thing, again, because
of our deterministic commit order. We could roll back and re-execute in these
cases, but we'd always end up with the same value, same final value anyway. So
there's no point in doing that. So let's just not detect a conflict there.
Of course, there's still this other case where we have actual data flow between
operations and that does require a conflict, and let me just dig into that a
little bit. So let's say thread two is writing location X. Thread three is
reading location X. Thread two's value is locked up inside its transaction.
So thread three is going to see some previous value for X from somewhere else
in the system.
Then if we detect a conflict, roll back and then re-execute the transaction on
thread three, we can see that the commit here from thread two has made that new
value of X visible. So this re-executed version here is what conforms
the illusion of serial execution that we wanted. We want the illusion that
thread three executed after thread two. And so because re-execution generates a
different result, we have to detect a conflict and roll back.
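
Here is a small C++ sketch contrasting the standard conflict rule with the relaxed rule that a deterministic commit order allows; the Access enum and the function names are purely illustrative.

    enum class Access { Read, Write };

    // Standard TM rule: any pair of accesses to the same location conflicts
    // if at least one of them is a write.
    bool standard_conflict(Access first, Access second) {
        return first == Access::Write || second == Access::Write;
    }

    // Relaxed rule under a deterministic commit order: only true data flow,
    // an earlier-ordered write observed by a later-ordered read, forces a
    // rollback; a later-ordered write produces the same final value either way.
    bool deterministic_conflict(Access first, Access second) {
        return first == Access::Write && second == Access::Read;
    }
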
Well, we don't actually have to do that all the time. So let's just take a
step back and see how we can optimize this a little more. So thread two is
producing this new value for X. And it's trapped inside the transaction, but
in some sense, we don't want it to be trapped inside anymore because that's
precisely the value that thread three needs to see. So if there was a way that
we could sort of poke a hole in that isolation, let the value leak out, then we
could potentially optimize away a roll back in this case. So the general idea
is that let's break isolation, this will help us avoid rollbacks. This doesn't
work in all cases. It could be the case, for example, that thread two later
writes another value to X and so we would have to detect that we had forwarded
the wrong value and still roll back and re-execute in that case.
But ultimately, this means that many of the situations in which we would
normally have to detect a conflict, we no longer have to detect a conflict.
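
Here is a hedged sketch of the bookkeeping this forwarding needs, covering only the rewrite hazard just described; the ForwardTracker type is an illustration, not the hardware's actual structure.

    #include <cstdint>
    #include <map>

    // Bookkeeping for one consumer quantum that received forwarded values.
    struct ForwardTracker {
        std::map<std::uintptr_t, std::int64_t> forwarded;  // location -> forwarded value
        bool must_rollback = false;

        // An uncommitted value from an earlier-ordered quantum was forwarded here.
        void on_forward(std::uintptr_t addr, std::int64_t value) {
            forwarded[addr] = value;
        }

        // The producer wrote the location again before committing: if the value
        // changed, the consumer saw a premature value and must roll back.
        void on_producer_rewrite(std::uintptr_t addr, std::int64_t new_value) {
            auto it = forwarded.find(addr);
            if (it != forwarded.end() && it->second != new_value)
                must_rollback = true;
        }
    };
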
Yes?
>>: In that case, you would also have to maybe handle seg faults in
thread three, even if they [indiscernible].
>> Joseph Devietti: Yes, that's correct, that execution or that the control
flow could diverge arbitrarily due to reading the wrong value. Yes?
>>: So I guess I'm not sure. How do you know in terms of when you break this
isolation where the visibility of the write is in the third transaction?
Doesn't it matter where it shows up in terms of the ordering of the
instructions within the transaction? Which gets you back to [indiscernible].
>> Joseph Devietti: Correct, so maybe, if I understand your question
correctly, you're asking about when exactly do we do the forwarding and can't
that affect the, I guess the -- so if we had multiple reads going on in thread
three, which of them will see the value. That's correct.
So what we're doing here is we're still enforcing the illusion of serial
execution of these quanta. So the way that -- I didn't go into details, but
the way that we would detect conflicts and the need to roll back even in this
more advanced speculative forwarding scheme would be to ensure that we always
had the illusion it was thread one's quantum, then thread two, then thread
three.
>>: So no reads to X occurred before the value that was written?
>> Joseph Devietti: Exactly, yes.
>>: So if you recorded the happens before graph, it would say that there was a
dependence to T3, but that was okay because it was consistent with the order I
would have imposed for determinism.
>> Joseph Devietti: Correct.
>>: But if you saw, you know, T2 ordered before T3 and T3 before T2 --
>> Joseph Devietti: Then you have a cycle.
>>: You'd have a cycle.
>> Joseph Devietti: That's not serializable.
>>: So are you piggybacking it on the conflict detection to do that? How does
that work? Do you extend the HTM to sort of do this?
>> Joseph Devietti: So the question is how does this more sophisticated
conflict detection work. So yes, you can extend hardware transactional
memories, conflict detection to detect these kinds of errors. Not these
errors, but these kinds of essentially serializability violations as well.
>>: But do you have to know in advance the set of threads and the order you
want to resolve conflicts in to do this?
>> Joseph Devietti: Yes, knowing the order in advance is crucial, because that
lets us know that, okay, forwarding from, you know, this thread to
this thread is okay or that it should have been the other way and we need to
roll back. Great.
So just to summarize how this transactional memory scheme works, I've shown you
how we can recover parallelism from a parallel program using hardware
transactional memory. I've shown how we can avoid rollbacks in many cases by
optimizing the conflict detection process, and then the final thing is really
something that would be nice to have that we don't have, which is that it would
be really nice if we could avoid this need for constant speculation. We're
putting every instruction in the program inside a transaction, constantly
speculating, and as you're well aware, speculation has many costs, right.
There's the time and the energy that you lose whenever you mis-speculate, but
then in some sense, even worse is just the complexity of implementing the
speculative machinery in the first place.
And what do we do about irrevocable actions? Because everything is running
inside a transaction, we need some other out-of-band mechanism to handle things
like I/O.
So if we go back to this recipe for determinism using transactions, we can see
that it gave us everything that we needed in order to provide determinism. But
it also gave us something that we didn't need, right. It gave us this serial
execution, which was enforced through speculation. And we really don't need
this illusion of serial execution. In some sense, transactions are really
overkill. So they give us this nice serial order. But if all we care about is
determinism, it would be totally okay to have a more sophisticated interleaving
of instructions, as long as we maintain determinism.
So the hope here is that we can come up with a scheme that provides the
performance of transactions in terms of still expressing or harnessing the
underlying parallelism of the application, but that it avoids the downsides of
speculation.
Now, of course, it's important to maintain two properties. One is we still
have to be deterministic. That's the whole point here. Of course, the second
thing is that we have to maintain program semantics. So reordering
instructions in very sophisticated ways may break existing code and so we want
to make sure that we still support existing parallel programs.
So I'll talk about these two points in turn.
So the first thing I'd like to explain, though, is this notion of a store
buffer. It's a bit of a detour into how processors are implemented. And we'll
see that this is important for understanding how we can enforce determinism in
a nonspeculative way.
All right. So store buffers arise because we have the situation, say, where a
processor is doing a store and that store misses in the memory hierarchy. So
there's kind of two options we have here. The very simple thing to do would be
to say, all right, we'll just sit around and wait. And as soon as the cache
line that we need percolates into the L1, then we will allow that store to
complete and then keep on executing.
But if you think about what a store is doing, it's just producing a value,
sending it to the memory system. There's nothing that we need from the memory
system in order to continue executing. So why not have a little hardware
structure that sits here, buffers the operation coming out of the processor,
and then lets us keep going. We have to remember that value so that way,
subsequent loads can still see it. But really, there's no need to stop and
block the entire core just because we missed on a store.
So for our purposes, store buffers are really nice because they provide
isolation without any kind of speculation. So there's nothing speculative
about using a store buffer, you have a store, you just put it into the store
buffer, subsequent loads check that store buffer, but you never detect
conflicts, you never roll back.
I should also point out the store buffers are private to each processor. So P2
can see its own stores, but P1 has its own separate store buffer and doesn't
see what P2 has done. Great. So in terms of our determinism recipe, now if we
want to eliminate speculation, we're going to use store buffers to isolate
threads' updates from each other.
In order to merge updates in a deterministic way, we came up with a parallel
merge algorithm, which it turns out is exactly the same as the Z buffer
algorithm that's very common in graphics hardware.
And then, of course, we'll count instructions like before.
Now, I don't have time to go into this parallel merge algorithm, be happy to
talk with you more about that offline, but I will go into store buffers and how
we use those to buffer updates.
Okay. So we're executing in parallel again, we're using store buffers to
isolate threads from each other. I should also point out that we have a way of
performing synchronization in parallel mode as well. This is due to a nice
algorithm from some folks at MIT called Kendo, and I won't have time to explain
that today, but I can show you the details offline.
So how does the store buffer mechanism work? Well, let's say thread one is
writing to location A, and thread two is reading from it. Thread two is going to get
some previous value of A. It's not going to see thread one's update. Thread
one will see its own updates, of course, otherwise things would get pretty
crazy.
And if thread two also wants to update location A, that's fine. It gets its
own private copy, can do whatever it wants with that. After we're done
executing in parallel, we have commit mode, which deterministically publishes
the contents of each thread's store buffer, making it visible to the rest of
the system, using that Z-buffer algorithm that I mentioned previously.
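
Here is a hedged, serial C++ sketch of the store-buffer semantics: stores stay private, loads check the local buffer first, and commit mode merges the buffers in a fixed thread order so that the winner for any contended location is deterministic. The real system performs this merge in parallel with the Z-buffer-style algorithm mentioned above; the types and the serial merge loop here are simplifications.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    using Addr = std::uintptr_t;
    using Word = std::uint64_t;
    using StoreBuffer = std::unordered_map<Addr, Word>;   // one per thread
    using Memory = std::unordered_map<Addr, Word>;        // last committed values

    // Stores stay private to the thread's buffer during parallel mode.
    void buffered_store(StoreBuffer& sb, Addr a, Word v) { sb[a] = v; }

    // Loads see the thread's own buffered stores first, else committed memory.
    Word buffered_load(const StoreBuffer& sb, const Memory& mem, Addr a) {
        auto it = sb.find(a);
        if (it != sb.end()) return it->second;
        auto m = mem.find(a);
        return m == mem.end() ? 0 : m->second;
    }

    // Commit mode: publish the buffers in a fixed thread order, so if two
    // threads wrote the same location, the winner is decided by thread index,
    // never by timing.
    void commit_round(std::vector<StoreBuffer>& buffers, Memory& mem) {
        for (auto& sb : buffers) {
            for (const auto& [a, v] : sb) mem[a] = v;
            sb.clear();
        }
    }
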
And then we just continue executing. So one thing, maybe some of you are
curious about at this point, is there a question? Yes.
>>: So you still have this quanta of 10,000 instructions?
>> Joseph Devietti: Yes, that's right.
>>: So what are you assuming about the size of the store buffers?
>> Joseph Devietti: So it depends on whether you're implementing them in
hardware and software, but in terms of hardware, we use the private levels of
the cache hierarchy, and so private L1, private L2 if available and so that
gives us on the order of 100, 200K space.
>>: So you basically extend the notion of the store buffer?
>> Joseph Devietti: Yes.
>>: In your algorithm into the cache?
>> Joseph Devietti: Right. Yes. We couldn't --
>>: The store buffer is just a little cache and you make it big, extend the
notion of the local?
>> Joseph Devietti: Right. So the term store buffer as applied to current
processor design is generally a very small, maybe eight entry structure. And
we need something that's much larger. But it's the same idea.
Okay, great. So one thing maybe that you're wondering about is what we've done
to this program is we've taken all of its updates and we've hidden them inside
the store buffer so no one else can see them, and is this okay. Obviously, we
need to make sure that these updates are visible at some point, and so really
what we're talking about now is the memory consistency model of the programming
language the program was written in and also the architecture for implementing
this in hardware.
So for our purposes, the memory consistency model says when these store buffers
have to be made visible. When their contents need to be published. Just a bit
of terminology. Publishing less often means that we have a more relaxed
consistency model. Publishing more frequently means we have a stricter
consistency model.
So in terms of the consistency ramifications of the scheme I just described to
you, it really comes down to commit mode. So commit mode is what publishes the
contents of the store buffer. It's kind of like a fence in, say, a current
nondeterministic system. Unfortunately, commit mode is somewhat expensive,
because we have this global barrier. We have to wait until everyone has
finished their quantum until we can go ahead and commit.
So because commit mode is so expensive, what we ultimately want is we want a
really relaxed consistency model, because the more relaxed a consistency model
is, the fewer restrictions it places on the program in terms of when updates
need to be visible. We want as much flexibility there as possible. So we want
the most relaxed consistency model we could possibly get away with. Because we
want it published as rarely as possible. Just an interesting side note. You'd
think that as an academic, I might be interested in publishing as frequently as
possible. But it turns out in this particular context, publishing rarely is
exactly the right idea.
On a more serious note, deterministic synchronization, because it has this
global visibility aspect to it, is more expensive than nondeterministic
synchronization. So as hopefully you'll take away from this later, relaxing
the memory consistency model is something that definitely makes sense in the
nondeterministic context, but it makes even more sense in the context of
determinism, because synchronization is more expensive.
All right. So which consistency model are we going to choose? Well, there's a
bunch of them out there. I've just listed a handful on this slide. As I said,
we want the most relaxed consistency model, as opposed to the most uptight
possible, so we're going to choose what's called data-race-free-0. It's
essentially the same consistency model as in the C++ memory model or Java.
So let me just give you a quick gloss on how the DRF0 model works. Say
we've got these two threads, and they're executing these instructions. DRF0 says
that visibility has to be propagated along these happens-before edges. So a
happens-before edge is something that matches a release
operation of a lock to the subsequent acquire. And so DRF0 says that
updates have to be visible along these edges, and at any other time you are
free to keep those updates buffered.
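
Here is one hedged way to model the resulting visibility check in C++: track the quantum round in which each lock was last released, and an acquire needs no extra publication if a commit round has already intervened since then. The Drf0Tracker type and its fields are assumptions of this sketch, not the actual runtime's bookkeeping.

    #include <unordered_map>

    struct Drf0Tracker {
        long current_round = 0;                        // completed commit rounds
        std::unordered_map<int, long> release_round;   // lock id -> round of release

        void on_release(int lock_id) { release_round[lock_id] = current_round; }

        // True if a commit round has already intervened since the matching
        // release, so the acquire needs no extra publication.
        bool acquire_is_satisfied(int lock_id) const {
            auto it = release_round.find(lock_id);
            return it == release_round.end() || it->second < current_round;
        }

        void on_commit_round_finished() { ++current_round; }
    };
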
So now let's put this all together in terms of an example. So we're executing
in parallel. We can do synchronization operations in parallel as well, thanks
to the Kendo algorithm. Then we have a commit mode. We continue executing.
Let's say that thread two unlocked this lock and then thread three subsequently
locked it. So we've got a happens before edge here from thread two to thread
three. Our consistency model says that we have to ensure that updates are
visible along one of these happens before edges.
Fortunately, for my example, we have a commit mode that exactly intervenes and
so this commit mode satisfies the constraints of the consistency model.
So it turns out that this is a relatively common pattern in the benchmarks
that we looked at. And ultimately, it means that we are rarely forcing a
commit mode, which is expensive, due to synchronization constraints. Our
consistency model is relaxed enough that that rarely comes up.
And so what this allows us to do is it allows us to execute larger and larger
quanta without having one of these commit modes intervene. And so because we
can execute larger quanta, we can better amortize the cost of these barriers at
the end of each quantum round, and this ultimately gives us higher performance.
So just to summarize this relaxed consistency approach to determinism, I've
shown you how we can still execute programs in parallel in a deterministic way
without relying on speculation. The key to that was adopting a relaxed memory
consistency model. And I hope I've also convinced you that relaxed consistency
is a very natural optimization for determinism because deterministic
synchronization is so much more expensive than its nondeterministic variant.
All right. So that does it for the algorithm section. Now let's talk about
how do you actually build this stuff, all right, and then I'll show you some
graphs at the end of this from some of the systems that I've built.
So if we go back to transactional memory implementation, what we started with
was a conventional multicore processor, and the first thing we needed was the
best effort hardware transactional memory system, very similar to what Intel
has said is coming out in their [indiscernible] processors next year.
And then in addition to this best effort HTM machinery, we also need a little
bit of extra stuff. We need quantum building inside the processor core, we
need to count instructions. We need to force a deterministic commit order of
transactions, and then finally, we also need to ensure that if we overflow the
resources of the deterministic transactional hardware, which is typically
implemented in the private cache, we need to ensure that that's a deterministic
event so that way we can use that as a signal for when a thread should stop its
quantum.
And so ultimately, this boils down to modifying the cache eviction policy a
little bit.
So when we moved from the transactional memory implementation to the relaxed
consistency implementation, we realized that, well, first of all, we don't need
hardware transactional memory anymore, and because that was the majority of our
hardware support, we realized we had an opportunity to kind of strip down the
other mechanisms that we needed in hardware as well.
So at the end of the day, we got it down to just these three mechanisms. So
the first is instruction counting. I think instruction counting makes a lot of
sense in hardware. It's really cheap there, and it has a non-negligible cost in
software.
And the second mechanism we require is a store buffer,
trying to leverage the private caches on the system. We want to keep threads
isolated from each other. Private caches are isolated at the hardware level so
it makes a lot of sense to reuse that physical isolation at a higher level of
the stack.
Question?
>>: This is totally off the subject, but there's speculation going on inside
the hardware itself, right?
>> Joseph Devietti: That's correct, sure.
>>: Depending on the state of the cache and things like that, you know, the
same instruction sequence may or may not take more cycles, I guess. But at the
level of the individual instruction you're saying it's completely
deterministic? Is that right?
>> Joseph Devietti: Right. So as you point out, there's a lot of speculation
going on under the covers. And the decisions used to inform that speculation
are not necessarily deterministic, right.
So ultimately, we have a deterministic outcome for a program, but we don't
necessarily have deterministic timing for that program. Yes, and so I will
gesture toward some ideas in the future work section about how we might go
about building a processor that has deterministic timing and from that, you
would also get a deterministic outcome as well.
So the third mechanism that we need is the parallel merge algorithm that I
didn't describe. It has a lot of fine-grained parallelism, it's the kind of
thing that's implemented in hardware in GPUs, and it makes a lot of sense to
move that into hardware for our purposes as well.
On the software side of things, we tried to leave as many policy decisions as
possible up to software. So issues about how big quanta should be, maybe
whether they should adaptively grow and shrink over the course of an execution,
that's all up to software. The same goes for issues about whether to place a
particular store value into the store buffer or not, because you can leverage
static analysis to prove that for certain locations you don't need to do that.
And then finally, issues about when exactly the contents of the store buffer
should be made visible -- basically, the question of enforcing the consistency
model -- because it's a relatively rare occurrence, we also leave that up to
software.
So now, I'll show you some results, first from a hardware simulator that we
built to simulate this processor design. There's a lot of details here.
Source code is available on our group website. But ultimately, you care about
the results, right? So we've got benchmarks on the X axis, and the Y axis is
performance overhead that's normalized to nondeterministic execution. So the
baseline is let's take a nondeterministic program, so just basic P-threads plus
locks, run it on our simulator with, say, four threads. That's the baseline.
How much does determinism cost you on top of that. And so here are the results
from our hardware simulator. As you can see, overhead varies substantially by
benchmark, but even when we're looking at a 16 processor system, the worst case
overhead is in the 50 to 60 percent range, which we think is relatively
tolerable.
These same ideas also lend themselves to a compiler implementation. So we have
a C/C++ compiler built around LLVM. In order to isolate threads from
each other, we use hash tables attached to each thread. So on a store, a
thread will put an entry in the hash table. And then on a load, it
checks the hash table to see if there's some value. If there's nothing in the
hash table, then it goes out to the read-only version of main memory.
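
Here is a hedged C++ sketch of what the inserted instrumentation boils down to; det_load and det_store are illustrative names, not the actual runtime calls emitted by the LLVM pass.

    #include <cstdint>
    #include <unordered_map>

    // Per-thread table holding this thread's buffered stores for the current quantum.
    thread_local std::unordered_map<std::uintptr_t, std::uint64_t> store_table;

    inline void det_store(std::uint64_t* addr, std::uint64_t value) {
        store_table[reinterpret_cast<std::uintptr_t>(addr)] = value;  // stays private
    }

    inline std::uint64_t det_load(const std::uint64_t* addr) {
        auto it = store_table.find(reinterpret_cast<std::uintptr_t>(addr));
        if (it != store_table.end()) return it->second;  // my own buffered store
        return *addr;  // otherwise, the read-only view from the last commit
    }
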
And then we've implemented the commit mode algorithm as well in software. So
it's the same setup as before. Performance is normalized to nondeterministic
execution because this is all pure software, these are experiments running on a
real machine in our lab. As you can see by the scale of the Y axis, the
performance overheads are not quite as rosy. And, indeed, there are some
benchmarks that show roughly an order of magnitude slowdown for enforcing
deterministic execution.
>>: You didn't say anything about how scalability is affected. Looks like in
some cases you actually get higher overhead with [indiscernible] more
processors. Is that generally the case? I'm curious.
>> Joseph Devietti: I think the scalability is an interesting point. So
there's sort of two answers there. One is that the store buffer
instrumentation that we add to the program is relatively scalable. Because
that's something that each thread does in isolation. And so one thing you
notice on the hardware results is that for some benchmarks, the overhead
reduces as we increase the number of threads. And that's because we're
normalizing to a nondeterministic baseline and the nondeterministic program
doesn't scale perfectly. We add a bunch of overhead to the program, some of
which does scale quite well.
>>: But that's kind of hiding the fact that there's an underlying non-scalable
aspect to the program, right? The fact that it went down is really just more
to the fact that this program won't scale on [indiscernible] processor no
matter what you do, right?
>> Joseph Devietti: Right. Yes. And so ultimately, you would want to
re-architect the fundamental program to scale better.
>>:
Scale up to 16, you could have an even higher overhead with this approach?
>> Joseph Devietti: Correct. So the thing where scalability really starts to
affect us is in terms of the global barriers at the end of each quantum round.
Obviously, stopping all threads and forcing them to synchronize is not the way
to scalability. So I do have some ideas about a more scalable deterministic
algorithm, which I haven't gotten around to implementing yet, but which
eliminates this need for global barriers.
All right. So I've shown you that hardware determinism can be quite fast. But
obviously, it doesn't exist unless you have a billion dollars or so lying
around that you just want to give to me; I'd be happy to go off and fab this for
you. Software determinism does exist, but as we can see, it's quite slow. And
so ultimately, we would want some way of improving the software performance,
right, and so that's what I'm going to talk about next.
And I'll make this relatively quick, because I know I'm running a little bit
tight on time here. So the current thing I'm working on is this project called
MELD. So MELD stands for merging execution and language level determinism. So
as I gestured to at the beginning of the talk, there are these sort of two
camps to providing deterministic execution. There is the language level
approaches, which have existed for many years and then there's the
execution-level approach, which I described to you earlier today.
So these two approaches have, in many ways, a classic static versus dynamic
analysis kind of trade-off. So if we do things statically, there's no run time
overhead, but things are going to be more restricted. And so a dynamic
approach can handle the general case, but we have to do everything in run time
so it's more expensive.
Another interesting point of distinction between these two approaches is that
the restrictions that deterministic languages enforce allow them to have this
property known as sequential semantics. So sequential semantics, there's this
really nice property where your program is going to have the same exact
behavior if you run it with one thread or ten threads or a thousand threads.
And that really comes about because of the restrictions in the language. If we
think about a dynamic approach, where we are supporting arbitrary multithreaded
programs, there's no way to get that same kind of agnosticism to the number of
threads the program is using. Because fundamentally, you could have a program
that does something completely different with eight threads versus seven
threads. So there's no way to really abstract away the number of threads
you're using.
So MELD is a hybrid approach which tries to combine the best of both worlds in
terms of the static and dynamic techniques so we end up with low run time
overhead, as I'll show you, general support for arbitrary parallel programs.
But because of that generality, we are not going to be able to provide
sequential semantics.
All right. So the motivation behind MELD is this so-called 90/10 rule, right?
90 percent of the time is spent in 10 percent of the code. For a lot of parallel
programs we looked at, it turns out that this 10 percent of the code is often
data parallel in a nicely structured way.
So here's a little cartoon of a parallel program. So there's lots of
complicated stuff going on, maybe, locks, flag-based synchronization. But
there's this little kernel of regular data parallel computation buried
underneath all of this complexity. And fortunately, for our purposes, this
regular data parallel kernel is often where a lot of the time is spent.
Perhaps because this is fundamentally, or at least typically, a more scalable
kind of parallelism than something a little bit less structured.
So the basic idea behind MELD is let's use execution-level determinism to
handle all this complicated stuff, right, and then let's use a deterministic
language for encoding this very regular computation. Which seems pretty
straightforward.
Of course, the tricky part is what's the interface between these two
approaches? Because there's many things that could go wrong if we don't think
about this carefully. So let's say we have some kind of parallel merge sort
algorithm, standard divide and conquer parallelism. We could write this in a
deterministic language, like deterministic parallel java. It would be very
straightforward. And we could verify it in isolation. But as soon as we take
that isolated function and plug it into a larger program, all sorts of
interesting issues come up.
Are there concurrent calls to merge sort? What aliases our input here? Can
other threads modify the array while we are modifying it? The whole point was to
statically verify this as deterministic, remove all instrumentation, and in
order to answer these questions, we would naively add back a lot of
instrumentation.
So what do we do with MELD? Well, we propose a lightweight type qualifier
system, which essentially segregates the data in the program into the part
that's operated on by the deterministic language and the rest of it. And then
we have a simple set of typing rules which enforce certain aliasing properties
so that you can't just take a pointer from one side of the world and then
publish it on to the other side.
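
Here is a hedged C++ analogue of the kind of segregation those qualifiers enforce, using wrapper types in place of MELD's actual syntax; the KernelPtr and GeneralPtr names are purely illustrative.

    // Data on the statically checked (kernel) side of the world.
    template <typename T>
    class KernelPtr {
        T* p_;
    public:
        explicit KernelPtr(T* p) : p_(p) {}
        T& operator*() const { return *p_; }
        // Deliberately no conversion to or from GeneralPtr<T>: a pointer cannot
        // cross from one side of the world to the other.
    };

    // Data on the general, dynamically instrumented side of the world.
    template <typename T>
    class GeneralPtr {
        T* p_;
    public:
        explicit GeneralPtr(T* p) : p_(p) {}
        T& operator*() const { return *p_; }
    };
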
So there's a quick diagram of how the MELD compiler works. So we've got our
program. It's got its little data parallel kernel, and we've got the type
qualifier system to keep them ring-fenced. So we take the complicated parts of
the code, run that through our standard deterministic compiler, which adds the
store buffer instrumentation to all the memory accesses. And for the data parallel
kernel, we do two things. First, it's written in a deterministic language so
what we do is take that code, along with what we call an abstraction.
Currently, this is done manually by yours truly, but we're looking into
automating this going forward.
The abstraction is basically just a conservative summary of what else is going
on in parallel with this data parallel computation. So it may be the case that
there's nothing else going on. May be the case that there's an I/O thread
running in the background, something like that. Basically, what we do is we
take this abstraction and the kernel. We express them both in a deterministic
language, and then if we can get that code to compile, then we have proof that
it's safe for those two components to run in parallel without
instrumentation.
>>:
[indiscernible] usually the thing that is running in parallel with the
kernel thing is completely isolated? They're operating on different pieces of
memory and whatnot? Is that not true?
>> Joseph Devietti: Yes, which is fortunate, because we need to be able to
express that isolation in a static way. So that limits the amount of reasoning
we can do. But yes, that is true.
And then for performance reasons, we found that it's good to count instructions
everywhere. We're still building quanta, because we're doing the standard
execution level determinism approach. And because we don't -- for load balance
purposes, it's hard to know a priori whether we'll have all the threads running
something that is in a deterministic language or not so we just count
instructions everywhere and that turns out to be the best approach for
performance.
So now I'll show you some results. Again, benchmarks on the X axis here.
We're also normalizing overhead to non-determinism like before. And these are
experiments run on a real machine. So these are the blue bars I showed you
previously for our deterministic compiler. And as you can see, in the green
here, that if we are able to leverage these well-structured kernels of data
parallel computation, we can remove a dramatic amount of the run time
instrumentation that we're doing.
So I think this is a really promising approach, moving forward.
>>:
But in these programs, do they even have any of the other kind of stuff?
>> Joseph Devietti: Yes, they do. So we have a workshop paper about this, if
you're interested in more details. I can also talk with you more offline. But
yes, so we have a table there showing that these programs do acquire locks, use
condition variables, and so forth. And so obviously, parts of the program can
be expressed very cleanly in a deterministic language, but it would be
nontrivial to take the entire code and port it over.
All right. So just to summarize the contributions I've talked about today,
first of all, at UW, we pioneered a set of techniques that provide determinism
for arbitrary parallel programs. So even programs that have data races in
them. I've also shown you that relaxed memory consistency is a very natural
optimization for determinism because of the cost of deterministic
synchronization.
I showed you how to implement determinism across the stack, right, which parts
are really amenable to hardware support, which should live in the compiler, and
which maybe even have a good case for modifying the language.
So now I'd like to talk about future work. Since I work in the area of
determinism, you might suspect that all of my future work is strictly defined
in terms of things that I've already done, and you would be somewhat correct.
I am very interested in using determinism to improve program safety and
performance. But looking more broadly, I'm also really excited about this
notion I'm calling safe memory, which is an idea to battle some of the
upcoming programmability challenges I see coming with emerging processor
architectures.
So let's start with determinism. One of the things I'm excited about is new
use cases for determinism. One of the things that's nice about determinism is
that it lets you run multiple copies of a multithreaded program, essentially
doing replication, and you can keep those replicas in sync very easily by
broadcasting the inputs, because you know the replicas won't diverge of their
own accord, thanks to determinism.
So I think this is going to be useful in many ways. One particular way is that
if we are trying to analyze some multithreaded execution, let's say we want to
detect data races, track information flow, or turn on expensive contract
checking, something like that, typically what we'd have to do is layer all of
those analyses into a single execution. That was the only way we could get a
sound union of all those guarantees.
Now that we can replicate a program using determinism, we can run one analysis
in each replica, so we only pay the overhead of, say, the slowest analysis.
It's possible to go even further: we could take a particular analysis and
essentially stripe it across multiple replicas, and if we're smart about how
we do the sampling within each replica, we could construct a scheme where we
end up with the same full coverage guarantees but don't pay the full execution
overhead in any particular replica.
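Here is a hedged sketch of that replication idea: identical inputs are handed
to several replicas, each running a different analysis. The analyses are
placeholder functions, and plain threads stand in for what would really be
full deterministic executions of the program driven by broadcast inputs.

```cpp
// Hedged sketch of one-analysis-per-replica. The analyses are placeholder
// functions, and threads stand in for full deterministic replicas.
#include <cstdio>
#include <functional>
#include <string>
#include <thread>
#include <vector>

using Analysis = std::function<void(const std::string&)>;

// Stand-ins for heavyweight analyses that would otherwise be layered
// onto a single execution.
void race_detection(const std::string& in)    { std::printf("race detection saw %zu bytes\n", in.size()); }
void taint_tracking(const std::string& in)    { std::printf("taint tracking saw %zu bytes\n", in.size()); }
void contract_checking(const std::string& in) { std::printf("contract checking saw %zu bytes\n", in.size()); }

int main() {
  // The same input is broadcast to every replica; determinism is what
  // guarantees they all observe the same execution.
  const std::string input = "broadcast the same input to every replica";
  std::vector<Analysis> analyses = {race_detection, taint_tracking, contract_checking};

  std::vector<std::thread> replicas;
  for (auto& a : analyses) replicas.emplace_back(a, input);
  for (auto& r : replicas) r.join();
  return 0;
}
```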
I also think it's useful to think about re-leveraging the isolation that we
are using to enforce determinism in other parts of the programming stack.
Inside the runtime system, one can imagine using this isolation to maybe elide
synchronization or enable new compiler optimizations. It's possible this could
even be useful at the programming model level, so possibly as part of the
programming language as well.
On the other end of the stack, getting back to Ben's question earlier, all of
the systems I've talked to you about today provide deterministic output, but
they don't provide deterministic timing, so that would be an interesting
avenue for future research. There are many people, particularly in the
real-time systems community, who care a lot about deterministic performance,
and I think there's a large overlap in the mechanisms that can be used to
provide deterministic timing; they could also, with slight modification, be
turned down a little bit to provide just a deterministic outcome, if you
wanted to provide both sets of guarantees.
Another thing I think is interesting is that there's also a point of
connection with information flow security. The non-determinism that people
are concerned about eliminating in information flow security, where you want
to stop timing channels, is very similar to the kinds of non-determinism that
the systems I've described today are eliminating.
So in some sense, you could think of the execution level determinism stuff I've
described to you as providing a very low level of information flow security.
It's preventing certain flows of information because we fix the flow of data
within the program. And so that could eliminate certain channels.
And then information flow security, you know, eliminating timing channels and
eliminating propagation from trusted to untrusted data, is a generalization of
that idea.
So on to safe memory. We're all familiar with Moore's law, right,
exponentially more transistors over time. Historically, this has led to
increased single thread performance. We got faster processors. More recently,
this has shifted into providing more cores on a particular chip, and in the
future, we're going to see multiple accelerators, special purpose accelerators
living on the same chip along with multiple cores.
And fundamentally, these phase shifts have been driven by the power demands of
processors. Because we couldn't scale frequency anymore, we shifted from a
single core to multiple cores, and increasingly we're seeing that even though
we get exponentially more transistors, the power used by each transistor is
relatively constant even though they're getting smaller.
This means that if we kept all the transistors on all the time, we would have
exponentially increasing power per chip, which is obviously not particularly
sustainable. And so that's leading to a trend towards specialization.
So we've got special-purpose accelerators that we use some of the time, and
most of the time they are simply off. And so this allows the entire chip to
stay within its power budget.
So what do these accelerators look like? Well, there are going to be
multicores, obviously; I don't think those are going away. GPUs are now
integrated on chip, and in the future, who knows what those crazy architects
will come up with, right? Crypto, reconfigurable logic. There are many
possibilities.
Unfortunately, for a variety of reasons, I think these accelerators are going
to be more or less black boxes. There's an intellectual property argument
here, but also, because these accelerators are designed for low-power
operation, it doesn't really make sense to add a lot of extra stuff so that
you can see what's going on inside them. So these are black boxes of
functionality. And because there's no visibility here, I think there's going
to be a great programming challenge, similar to the multicore programming
challenge we faced but in some sense much worse, because we can't instrument
the code that's running on these accelerators. Yes?
>>: Can you contrast this kind of thing with a system on a chip? How is that
kind of chip not exactly the same story?
>> Joseph Devietti: It is exactly the same story.
>>: So you can buy phones today that have hundreds of millions of transistors
that don't have the programmability.
>> Joseph Devietti: Right. I think that, basically, system on a chip
definitely exhibits some of these trends. I think that going forward, as we
have more and more transistors that we need to burn, I think that the sort of
scope of each particular accelerator will become much more specialized. And so
because these things will become very specialized, maybe you would use it to
accelerate, you know, part of the garbage collection algorithm or part of the
data structure operation. So they'll be woven into the program in much more
complicated ways than they are currently.
>>: [inaudible] GPU eventually context switchable and it's going to read
actual memory, so get away from the isolated [indiscernible].
>> Joseph Devietti: Right.
>>: [inaudible].
>> Joseph Devietti: Yeah, so the question is about AMD's plans to bring GPUs
on chip. So shared memory is already here with some of the Fusion
coprocessors. And then, yes, you know, certainly the OS people want to
virtualize things too.
>>: Transistors, we really want to develop these programs efficiently. I
guess it goes back to [indiscernible]. We are doing that in this
[indiscernible]; still, with the new types of chips we don't have all the
debuggability extensions and everything.
>> Joseph Devietti: Right, so the question is how much debuggability logic or
budget we should add to each one of these accelerators. And I think that
answer is very little. I think historically, that's been the case too. But I
have a new idea, perhaps, about where we should spend some of that complexity.
So just to flip quickly through the problems you're certainly familiar with:
we've got data races, atomicity violations, safety issues. I think we'll have
new kinds of bugs around resource brokering: how do you ensure that only one
person is using an accelerator at a time? How do you do debugging across code
that runs partially on the CPU and then steps over into one of these
accelerators?
>>:
What is resource brokering?
>> Joseph Devietti: Just ensuring that if you have reconfigurable logic,
right, only one person is using it at a time, that they don't have conflicting
configurations, and so forth.
There's also setup: often these accelerators require a little bit of
configuration before you can run the computation that you want, so they are
configurable in a limited way. And I've certainly noticed in the GPU
programming I've done that a lot of issues can come up with speaking the right
protocol to the accelerator. I think there's a lot of opportunity for
providing richer support for speaking these protocols correctly.
So one option we could have that we've sort of been talking about already is
that we can kind of turn these black boxes into gray boxes. So we can peek
inside a little bit, see what's going on. Maybe that will help us figure out
some of these issues. I think a better approach, though, is to instead
think in terms of the communication fabric that we use to talk to these
accelerators. So for various reasons I think this is going to be shared
memory, though I'm definitely up for a good shared memory versus message
passing debate if anyone wants to join me later.
Ultimately, shared memory has lots of problems. Hopefully I've convinced you
of that over the first part of this talk. So what I think we need to do is we
need to build something on top of shared memory, I'm calling safe memory, that
provides much more safety than shared memory does. So I think there's all
kinds of specific safety properties we could build on top of shared memory as
opposed to modifying each of these accelerators underneath.
So we could look at ensuring that an accelerator executes code in an atomic and
isolated fashion. We're definitely going to want sandboxing for security
purposes. We can think about implementing watch points, which would be helpful
with debugging. We can think about ways to use the memory interface to debug
what's going on inside an accelerator, so we can do the debugging from the
outside as opposed to from the inside. There are performance isolation issues.
There's
lots of things that we could provide strictly in terms of this communication
interface.
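To illustrate the flavor of this, here is a speculative sketch of checks that
could live in the communication fabric: a sandbox bound plus byte-granularity
watchpoints applied to every accelerator access. The SafeMemory class and its
policy are invented for this example, not a proposed design.

```cpp
// Speculative sketch: safety checks living in the communication fabric
// rather than inside each accelerator. The SafeMemory class and its policy
// are invented for this example.
#include <cstdint>
#include <cstdio>
#include <set>
#include <vector>

class SafeMemory {
  std::vector<uint8_t> backing;
  std::set<std::size_t> watchpoints;   // byte-granularity watchpoints
  std::size_t sandbox_lo, sandbox_hi;  // the accelerator's allowed region
 public:
  SafeMemory(std::size_t bytes, std::size_t lo, std::size_t hi)
      : backing(bytes), sandbox_lo(lo), sandbox_hi(hi) {}
  void watch(std::size_t addr) { watchpoints.insert(addr); }

  // Every accelerator access goes through the fabric, so the checks are
  // accelerator-agnostic: sandboxing plus watchpoint reporting.
  bool write(std::size_t addr, uint8_t value) {
    if (addr < sandbox_lo || addr >= sandbox_hi) {
      std::printf("blocked: write outside sandbox at %zu\n", addr);
      return false;
    }
    if (watchpoints.count(addr))
      std::printf("watchpoint hit: write of %u at %zu\n", (unsigned)value, addr);
    backing[addr] = value;
    return true;
  }
};

int main() {
  SafeMemory mem(4096, /*lo=*/1024, /*hi=*/2048);
  mem.watch(1500);
  mem.write(1500, 42);  // allowed, and reported via the watchpoint
  mem.write(3000, 7);   // rejected by the sandbox check
  return 0;
}
```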
And I think that the opportunity here is that if we can do that efficiently,
then we can provide these safety guarantees in a really agnostic way. You can
take any accelerator that you want and plug it in; as long as it knows how to
read and write shared memory, you're going to be fine, because the
communication fabric itself is providing you with certain safety and
programmability guarantees.
So here are just some first steps I would like to take in implementing this
project. At the ISA level, obviously, it's important to define what exactly
the communication model is going to be. How rich is that model in terms of how
accelerators communicate with each other? And then defining interesting safety
properties from there.
At the hardware level, we need to investigate efficient checking circuits both
in terms of the logic and then the storage that's required to enforce these
safety properties. I think there's also a large opportunity to specialize the
way that the network is built because these checking circuits will have very
particular traffic patterns and I think we can heavily optimize for the kind of
work that they're doing.
At the software level, there's verifying the control protocols that we use to
set up a computation on an accelerator, kick it off, and wait for the results.
I think there's a lot of opportunity to use types or other kinds of static
verification to ensure that we are speaking to these accelerators in the right
way.
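As a rough example of what that could look like, here is a sketch that encodes
a configure-launch-wait protocol in C++ types, so that calling the steps out
of order fails to compile. The Accelerator API here is hypothetical; a real
system might use a richer typestate or session-type discipline.

```cpp
// Sketch of encoding a configure -> launch -> wait protocol in types so
// that out-of-order use does not compile. The Accelerator API is
// hypothetical; a real design might use a richer typestate discipline
// (and private constructors to close the remaining loopholes).
#include <cstdio>
#include <vector>

struct Results { std::vector<float> data; };

class Launched {
 public:
  // The only operation on a launched computation is waiting for it.
  Results wait() && { return Results{{1.0f, 2.0f, 3.0f}}; }
};

class Configured {
 public:
  // Launching consumes the configured handle, so you cannot launch twice.
  Launched launch() && { return Launched{}; }
};

class Accelerator {
 public:
  Configured configure(int kernel_id) {
    std::printf("configured kernel %d\n", kernel_id);
    return Configured{};
  }
};

int main() {
  Accelerator acc;
  // The rvalue-qualified methods force the steps to happen in order.
  Results r = acc.configure(7).launch().wait();
  std::printf("got %zu results\n", r.data.size());
  return 0;
}
```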
So thanks again for your attention. So to conclude, I've shown you today how
we can provide determinism for arbitrary parallel programs. I've shown you how
we can trade strong memory consistency for higher performance in the context of
determinism, and I've also shown you how we can implement determinism across
the stack from hardware all the way up to programming language. Finally, I've
shared some of my thoughts on this new idea for safe memory which I think is
going to be important to address the upcoming programmability challenges of
emerging processor architectures.
I definitely didn't do all this work by myself. I'd like to thank all of my
collaborators I've had over at UW and also here. And now I would be happy to
take any additional questions that you have. Thanks again.
>>: So with few exceptions, most of the accelerators that you were pointing
out, GPUs, FPGAs, many of the SoC kinds of things, get a lot of performance
out of managing separate memories, things that aren't in this big pool of
shared memory that you deal with. How do you propose addressing safe memory
when a bunch of that memory is not even controlled or used in the same way?
>> Joseph Devietti: So maybe a way of rephrasing that is: how does safe memory
deal with caching, right? Accelerators that want to cache things for locality,
for power. And yeah, I think incorporating that into the model is very
important. Certainly you can get all sorts of weird consistency problems, and
this is possible even with current GPUs, if you don't do the synchronization
quite right. So I think probably the most straightforward
thing to do there would be to sort of hand off a chunk of data to the
accelerator and then ensure that no one else tries to read or write that data
until the accelerator is done with it.
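As an illustration of that handoff discipline, here is a small sketch that
uses move-only ownership as a stand-in for hardware-enforced exclusivity: the
host gives up access to the chunk while the accelerator has it and regains it
afterward. The names are invented for the example; this is not the proposed
hardware mechanism.

```cpp
// Illustrative sketch (not the proposed hardware mechanism): move-only
// ownership stands in for exclusive handoff of a chunk to an accelerator,
// so the host cannot touch the data while the accelerator holds it.
#include <cstdio>
#include <memory>
#include <vector>

using Chunk = std::vector<float>;

// The "accelerator" takes unique ownership for the duration of the
// computation and hands it back when done.
std::unique_ptr<Chunk> accelerate(std::unique_ptr<Chunk> chunk) {
  for (float& x : *chunk) x *= 2.0f;  // the offloaded computation
  return chunk;
}

int main() {
  auto data = std::make_unique<Chunk>(Chunk{1.0f, 2.0f, 3.0f});
  data = accelerate(std::move(data));  // host gives up access, then regains it
  std::printf("data[0] = %f\n", (*data)[0]);
  return 0;
}
```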
>>:
Do you have any ideas about what that granularity is?
>> Joseph Devietti: I think ultimately, it would need to be in terms of bytes,
because I think having fine-grained protection is going to be the key to
eliminating false positives, but also to supporting arbitrary accelerators
that maybe do something complicated to an XML string in memory. You don't want
to have to protect the entire page that string lives on.
>>: So do you think the notion of accelerators is going to bubble its way up
into programming language abstractions?
>> Joseph Devietti: Okay. So is the notion of the accelerator going to make it
up to the PL level? Yes, I think it will. Just as sort of an observer of the
architecture community, it seems clear that a lot of the concerns which were
typically kept under the covers of the ISA, concerns like energy and, maybe
going forward, even circuit reliability, are bubbling up, and some of them are
best handled at the programming language level. So I think
something like verifying the control protocol that you use to speak to an
accelerator, you could have some runtime mechanism for that. But it's
something that you don't do very frequently, and when you get it wrong, it has
very bizarre effects. So I think it makes a lot of sense to try to reify that
in the programming language and statically check it, for example.
>> Shaz Qadeer: Any other questions? Okay.
>> Joseph Devietti: Thanks again.