>> Doug Burger: Okay. It's my pleasure to introduce Professor Karu
Sankaralingam. Boy, I haven't said that in --
>> Karu Sankaralingam: I know.
[laughter]
>> Doug Burger: [inaudible] but I can spell it hands down -- who's visiting
us from the University of Wisconsin. Karu is -- I co-advised Karu for his
Ph.D. He was the lead student on the TRIPS Project hardware which built a
really exciting prototype, and he did a ton of work.
Since then he's gone on to glory at the University of Wisconsin on the
faculty there, has done some really, really interesting work in accelerators,
controversial papers about whether simulators are good, an influential paper
saying that ARM and x86 actually aren't that different in power and
performance at the level of the fundamentals of the ISA, which was actually very
influential within Microsoft. A lot of people read that paper and have
talked about it.
So I'm really just proud of the work he's done since going to Wisconsin. And
the thing I love about Karu's work is that he digs really deep and tries to
get at fundamentals. So you won't be hearing a fluffy talk about the latest
flavor of the day. He'll be trying to solve the problem. And he's currently
in search for a universal accelerator, which may be an oxymoron, but we'll
find out. So thanks very much for coming to visit.
>> Karu Sankaralingam: Great. Thank you for the introduction, Doug, and
thank you all for coming. And it's great to have this opportunity to present
some of the work that's been going on in my group here.
So as Doug mentioned, I'm going to be talking about our search for a universal
accelerator. I hope to give you some evidence that we are nearing success of
some sort. So please feel free to interrupt with questions at any time. You
don't need to withhold your questions till the very end.
>>:
You've caught a glimpse of the unicorn in the woods.
>> Karu Sankaralingam:
I may have.
I may have.
>>:
You know what happens to unicorn [inaudible].
>>:
Yes.
>>:
[inaudible]
Yes, I do.
>> Karu Sankaralingam: So a lot of this work -- or most of the work was done
by my students. A lot of it was led by Tony Nowatzki, who is graduating
soon; Vinay, who is a second-year or third-year student; Chen-Han and Venka
who have since graduated based on some of the early work they've done on all
these projects.
So we have to begin most all architecture talks with some kind of
quantitative graph, so I figured I would go with this cartoon pie chart, which
is the technology motivation for a lot of the work that happens in
architecture: the energy expended on
execution is a very small portion compared to the energy expended on everything else,
which is like fetching instructions, register renaming, all kinds of stuff.
So this is where today's state-of-the-art high-end processors are. And how
did we get here? How did we get into the mess? So this is kind of history,
very, very quick history of 30, 40 years of processors. So up through the
2000s the entire field was focused on building great high-performance
uniprocessors. Power wasn't considered such a big deal, so it was okay, until we
got to a certain thermal point where it was not okay. That then led to the
multicore hype, which lasted not as long as some people would have liked, but
that's probably a good thing. And we figured out just putting more cores
isn't going to solve the problem.
Right now there is an ongoing specialization era of trying to build various
kinds of things on our chip and have them each do various things. Let me drill
down a little bit and --
>>: Can I ask a quick question?
>> Karu Sankaralingam: Yes.
>>: Why do you end the multicore era in 2010?
>> Karu Sankaralingam: Well, I -- sure. I should continue it for longer. I
wanted to have a small time period so I can say it didn't last for very long,
but that's not quite the right time period. It is something ongoing and
maybe the hype ended in 2010.
>>:
Okay.
>> Karu Sankaralingam: [inaudible] so let me drill down into looking at what
I'm calling domain specialized accelerators. There's a lot of focus on this,
people saying, oh, I'm going to build an accelerator for regular expressions,
graph traversal, this, that, and bake it all in silicon and then hope someone
is actually going to build this and this all is going to make sense.
I'm going to argue this doesn't make a lot of sense in multiple ways, but the
good thing about all this is just like people are writing these papers saying
100X performance, 1000X [inaudible], a thousand times better performance than some
baseline. So it's like you read these and wonder what the hell is going on
here. How can it be a thousand times better -- yes.
>>:
If the work is still about 8 or 10 percent of the work --
>> Karu Sankaralingam: Well, that is that [inaudible] also. But there's one
good thing in all of this in like one sense, which is like, okay, great, if
your technique is providing a thousand times or 50 times better performance, a
few percent of simulation error doesn't matter. That's the only good thing
about this entire paradigm or work I would argue. I wanted to work that in
somehow.
But so what is the problem with all this, right? One, if you're just going
to -- if everyone is going to build an accelerator for this and that,
then you're totally ignoring how we provide general purpose performance
for workloads that don't fit into one of these domains. Second, it's like
how the hell do I program this? In fact, almost all of these papers don't
have the word programming in them. It's like here's my blob of silicon, now
go.
And this design complexity, how do I integrate so many of these things, what
memory do they talk to, how do you actually feed this. And I coined this term,
obsoletion-prone. Basically you say, okay, once stencil computation
goes out of favor, you just have the silicon there. That's not doing
anything for you. And there's also something I call the Snow White trap, which
is how you decide which of these accelerators to build. Is mine the
best-looking accelerator? Then build it.
And there's this economics argument: okay, let's find out which domains are
important, we'll build accelerators for them, but still, will they continue to
be important? There's this big problem of which ones do I build; once I've
put 16 of them, what is the 17th accelerator I should put there. What
[inaudible] what constraints should it meet.
And [inaudible] just plain boring. What is the point of taking a workload
and putting something in digital [inaudible] like what do I learn by doing
it, what do I learn by writing this paper? It's okay if I don't write these
papers, but I don't want to read these papers for sure, because I don't learn
anything from them. Right?
So I stole this picture from Joel Emer. He calls these domain-specialized
accelerators. And I'm paraphrasing him. He calls them white swans as
opposed to black swans. So I just stuck in these notes over there. A black
swan event is this [inaudible] event that comes as a surprise to the
observer, and the event has a major effect. So
domain-specialized accelerators are white swans. There's no surprise at all.
You build the stencil accelerator exactly like you would expect to do it, and
there's no major effect. No one's going to build this thing. There's no use
of this thing.
So what we really need to be doing definitely in research is looking for this
black swan of having general purpose performance, easy programmability, and
high efficiency, and you can call this [inaudible] accelerator that can go after
various different workloads at very high efficiency. So, I mean, there's got
to be some surprise, maybe not, but -- then you will have a
major effect in that you can accelerate many types of workloads and have a large
impact rather than going after one narrow thing and hoping for
the best. Okay.
So I'm going to propose we're calling this thing the ExoCore processor. The
word exo originates from linguistics -- it means originating from
outside the core in some sense. There's also a biology root word for this,
especially for a disease that is caused externally rather than internally.
But the idea is, the principle is, we're going to infect a company from the
outside. But no. The technical idea is that principles of dataflow can
actually be used to take some kind of conventional processor or a basic
processor and build a hybrid von Neumann dataflow execution engine that can
provide high performance for general purpose workloads so we don't give up on
those. Okay.
And then you can take these very same principles of dataflow, combine it with
concurrency, and build one substrate onto which you can map many of these
acceleratable workloads that can give you these big integer factor
performance improvements. Okay.
And a summary of the results, based on some simulation studies of the
design -- we're looking at some prototype implementations right now:
compared to a high-end [inaudible] processor we can get higher performance
and half the power by executing in this hybrid mode of doing von Neumann
execution in some phases of the program and switching into a dataflow mode
during some other phases.
And for these other acceleratable workloads like deep learning, really highly
concurrent workloads, you can use this general fabric. And we will show you
can sustain 50X to 100X lower power than this big [inaudible] core, rather,
this thing that we designed.
And we'll also show that we can match the power efficiency of these white
swans that I just described before, these purpose-built domain-specialized
accelerators. So that's a summary of what we can achieve with some of the
architecture ideas I'm going to talk about. Okay.
So I'm going to focus on the fundamental principles that we believe we have
discovered. They are relatively simple yet very capable. Okay. So the
two principles that most all of this work is built around: first,
building an execution model that can seamlessly allow you to execute
as a von Neumann processor while quickly switching to a dataflow execution
model -- having this happen at very, very fine granularity and at a low cycle cost
to switch -- is really useful for executing programs that have these regions that
prefer von Neumann execution versus dataflow execution.
So this hybrid execution, surprisingly, has just not been looked at in
the dataflow literature.
>>:
What about Arvind [phonetic] 30 years ago?
>> Karu Sankaralingam: No, Arvind -- in fact, I didn't know if Arvind was
going to be there. [inaudible], my student, said Arvind is there, be
prepared for some weird question from him.
And so he got up and said something like, this is the only dataflow paper from
which I've learned anything. Or something to that effect. And he actually
went -- because we wrote this claim thinking, like, surely someone must have looked
at this. And people had not -- it's nuts. So it's actually that you need to have this hybrid,
and it needs to be at a fine enough grain that you can switch from one to the
other.
We'll get into some reasons on why you need this, and then we can talk about
whether you really need to do all this in the architecture, can you do it all
in the microarchitecture and so on.
>>: Okay. So now this is the question that you obviously came prepared for
coming here --
>> Karu Sankaralingam: Maybe.
>>: -- why do you need, for No. 1, to split them? Why not just have a
dataflow core that replaces the von Neumann core?
>> Karu Sankaralingam: You can. Correct. Depending on what the
microarchitecture of the dataflow core is, the dataflow core alone may
suffice.
>>: Okay. So this is more of a hedge to be more compatible with industry --
>> Karu Sankaralingam: Sure.
>>: -- where they currently are?
>> Karu Sankaralingam: Yes. And also for them -- if I have a core, am I
really willing to throw all of that away and start with a brand-new
design? I'm probably more likely to say, all right, I'm going to take my
core, I'm going to take your idea, integrate it, and I'll see if it works. If it
doesn't work, I at least have my core, I can still sell it. Did that answer
your question?
>>: Yeah.
>>: [inaudible].
>> Karu Sankaralingam: Thank you. Apparently still that's my defense. Who
are the remaining [inaudible] so big?
So the second idea is -- second principle, rather, is that you can combine
concurrency with dataflow. It might seem a little redundant. The dataflow
is inherently concurrent, right? But you can partition programs up into
individual dataflow regions, and you can have multiple of those going
concurrently.
It's kind of like what you guys have been doing in Catapult and Bing. And
that gives you a lot of the benefits of what all these domain-specialized
accelerators are actually exploiting. Inherently they're going off of
concurrency, and you can build it in a flexible way rather than doing it all
in a domain-specialized way.
So these are really the two fundamental principles that allow us to go after
what we're calling these general purpose workloads and to go after these
highly acceleratable workloads that have a lot of concurrency inside them.
Okay. And yes.
>>: So you seem to be suggesting the dataflow is enough to cover all of
these different types of accelerators. Is that really -- like how can you
actually argue that? So for sure, right?
>> Karu Sankaralingam: Yes. So I will -- I'm coming very close to that
statement. I'm going to -- let me refine that a little more since you
brought it up. To go after these workloads, we need a little bit more than
dataflow. You actually need to do a little bit of communication specialization
where if you have operands being sent a certain way, you need to specialize
for that.
We need to do a little bit of data reuse specialization in terms of putting
in a scratchpad or something. So those are needed also. These two give you
the biggest bang for buck. You do need these other two and they get combined
synergistically. So when we talk about the second part, that will become
more clear, the role of those other two.
>>: So for the potential accelerators, you'd want that -- those different
features would be enough?
>> Karu Sankaralingam: Correct. And we've also looked at some scenarios
where this fabric is definitely not enough. And one great example of it
was actually some of the FPGA work here on looking at compression, where
basically you're just banging on this memory in arbitrary ways. The memory
isn't that big. You just have this [inaudible] hash table that is doing like
[inaudible] compression, so you just need to build a hash table, and it
doesn't matter what compute fabric you put around it, you need that weird
multiported thing. So that's one scenario.
Regular expressions and things like that, where you're doing irregular
memory accesses, you just need something else. So right now we have some work that's
looking at how can we actually generalize it and then can that actually
become another fundamental principle. Okay.
All right. So I'll very briefly talk about some tools which were
instrumental for us to be able to get this far along in this research. One is
this: for about, again, 30 years people had been looking at various types of spatial
architecture schedulers. So I looked at a couple in my research as part of
TRIPS, and there were lots of really clever techniques, like doing
simulated annealing and various other techniques, to push that forward.
And Tony, my student, came up with this idea that you could actually specify
the original scheduling problem for any dataflow fabric as an integer linear
program. And in the past 30, 40 years, that literature has made huge
leaps and bounds in the kinds of problems they can solve and how fast they
can solve them.
So based on that, we came up with a technique that can provide the capability
to be a universal spatial architecture scheduler. So one input to the
scheduler is the architecture and the other input is the code you're trying
to schedule, and it will emit the scheduled code for that architecture. And the
architecture itself is specified as a graph.
So this gave us the capability to go and look at various parts of the design
space without constantly having to build some kind of scheduler and be
worrying whether or not our performance was gated by the scheduler that we
had. And one of the fascinating results of this was because we were using
commercial state-of-the-art ILP solvers, we were able to beat the performance
of published specialized schedulers for various different architectures.
It was also very easy for us to write this paper because all the TRIPS
coauthors were conflicted with this paper, so I could say whatever I want, how
about that, and no one would bother trying to kill this paper, right?
But they still did. It still took us four times to get this paper in.
And I'm like, oh, my God.
But, anyway --
>>: So have you been able to compile very large programs with this?
>> Karu Sankaralingam: Yeah, yeah. So we compile all the TRIPS benchmarks
that were part of the TRIPS papers. We --
>>: Very large [inaudible].
[laughter]
>> Karu Sankaralingam: Aaron is a very, very self-critical reviewer, but --
>>: I remember it took you four times to get it accepted.
>> Karu Sankaralingam: No, and then it's like bizarre we got this -- I'm not
even a PLDI guy. We got a PLDI Best Paper award for this paper.
But getting to your question, yes, we can -- we have done a lot of these big
DySER programs, which is this functional-unit-like thing we built in our group, which
have 100, 200 instructions. We fed it even bigger ones as part of some
sensitivity studies we're looking at right now.
So, okay, a million lines of code is eventually going to get broken up into
regions. If we had like a huge [inaudible] FPGA that had like a billion
LUTs, then you could try and map the entire million lines of code, but
you're typically going to break it down. Did that actually answer your
question?
>>: Yeah.
>>: Integer [inaudible] is known to be NP-hard. You know how to get around
that problem, or is it somebody else's problem?
>> Karu Sankaralingam: Exactly. That's exactly right. It is somebody
else's problem. And the great thing is -- and that -- you hit the nail on
the head. I wasn't planning to spend so much time on this slide, but that is
exactly why people have been avoiding this problem for 30 years.
They're like they have the introduction for all these papers, it's like, oh,
we have a scheduling problem, oh, we have one graph, another graph, it's
NP-hard, let's go solve some other approximate problem because we cannot
solve this. It's like, sure, this is an NP-hard problem, but this is what
these guys do. They've been solving it for 40 years.
But they go and have heuristics and they automatically determine the
heuristics that apply for the instance of the problem that you're presented
with, and that can be totally tractably solved and solved optimally. It's
actually totally kick ass. So basically we used all this literature that's
been around for a long time. And that's not to say we didn't do anything clever
here. The clever thing we did was to specify the scheduling problem as an ILP,
which is a bit of work and requires some creativity.
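As a concrete illustration of what writing a spatial scheduling problem as an ILP looks like, here is a minimal sketch. It is not the formulation from their paper: the toy dataflow graph, the hop-distance routing cost, and the use of the PuLP/CBC solver are all assumptions made for this example.

```python
# Minimal sketch: spatial scheduling as an ILP (illustrative, not the paper's model).
# Inputs: a dataflow graph (compute nodes + edges) and a hardware graph
# (functional-unit slots + a hop-distance matrix). Requires: pip install pulp
import pulp

# Toy dataflow graph: c = a + b; d = c * a
df_nodes = ["a", "b", "add", "mul"]
df_edges = [("a", "add"), ("b", "add"), ("add", "mul"), ("a", "mul")]

# Toy 2x2 spatial fabric; dist[m][n] = routing hops between hardware slots
hw_nodes = [0, 1, 2, 3]
dist = {m: {n: abs(m % 2 - n % 2) + abs(m // 2 - n // 2) for n in hw_nodes}
        for m in hw_nodes}

prob = pulp.LpProblem("spatial_schedule", pulp.LpMinimize)

# x[v][n] = 1 if dataflow node v is placed on hardware slot n
x = {v: {n: pulp.LpVariable(f"x_{v}_{n}", cat="Binary") for n in hw_nodes}
     for v in df_nodes}
# y[e][m][n] = 1 if edge e's producer sits on slot m and its consumer on slot n
y = {e: {m: {n: pulp.LpVariable(f"y_{e[0]}_{e[1]}_{m}_{n}", cat="Binary")
             for n in hw_nodes} for m in hw_nodes} for e in df_edges}

# Each dataflow node goes to exactly one slot; each slot hosts at most one node
for v in df_nodes:
    prob += pulp.lpSum(x[v][n] for n in hw_nodes) == 1
for n in hw_nodes:
    prob += pulp.lpSum(x[v][n] for v in df_nodes) <= 1

# Linearization: y[e][m][n] >= x[u][m] + x[v][n] - 1 forces y to track placements
for (u, v) in df_edges:
    for m in hw_nodes:
        for n in hw_nodes:
            prob += y[(u, v)][m][n] >= x[u][m] + x[v][n] - 1

# Objective: minimize total routing distance of all dataflow edges
prob += pulp.lpSum(dist[m][n] * y[e][m][n]
                   for e in df_edges for m in hw_nodes for n in hw_nodes)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
placement = {v: next(n for n in hw_nodes if x[v][n].value() == 1) for v in df_nodes}
print(placement)
```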
So the other thing is we came up with this simulation mechanism that can
model the role of the architecture and the compiler in a single framework we've
been calling the transformable dependence graph, which allows us to look at
various parts of the design space without having to first implement a
complete compiler and a simulator for it, which is all multiple years of work.
We can then use a very rapid design space exploration methodology,
which is -- will be appearing in this year's ASPLOS and was also recently
published as a CAL paper.
So some of the results I will be presenting will be looking at a very, very
vast design space. So if you're wondering how we were able to even capture
all of that, it's part of this technology we built about two years ago that
allows us to do this very rapidly.
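A toy sketch of this dependence-graph style of modeling (a stand-in for the kind of framework described, not their actual tool): build a weighted graph of dynamic events, and estimate execution time as the critical path through it; changing edge weights stands in for modeling a different microarchitecture or compiler transformation.

```python
# Illustrative sketch of dependence-graph-based performance modeling
# (a toy stand-in for the kind of framework described, not their tool).
# Nodes are dynamic events; weighted edges are dependences with latencies.
# Estimated cycles = longest (critical) path through the DAG.
from collections import defaultdict

def critical_path(edges):
    """edges: list of (src, dst, latency). Returns the longest-path length."""
    succs = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v, w in edges:
        succs[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))
    ready = [n for n in nodes if indeg[n] == 0]   # Kahn's topological order
    dist = {n: 0 for n in nodes}
    while ready:
        u = ready.pop()
        for v, w in succs[u]:
            dist[v] = max(dist[v], dist[u] + w)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return max(dist.values())

# Toy trace: fetch->execute->commit per instruction, plus a data dependence.
# Modeling a different architecture amounts to changing these edge latencies
# (e.g., zeroing fetch/decode edges to approximate a dataflow engine).
edges = [
    ("f0", "x0", 1), ("x0", "c0", 1),   # instruction 0 pipeline
    ("f1", "x1", 1), ("x1", "c1", 1),   # instruction 1 pipeline
    ("x0", "x1", 3),                    # data dependence: 3-cycle producer latency
]
print(critical_path(edges))  # -> 5
```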
Okay. All right. So getting back to this graph. So why does dataflow
actually allow us to do that. So the simple answer is dataflow principles
can eliminate these overheads by getting rid of having to fetch and decode
and retire instructions every cycle and allow you to focus on just the
execution part. Okay. That's really the main high-level bit.
So very briefly, the history of how we got here: from about 2008 to '13
or so, we built this architecture called DySER, which was a way of having a
specialized functional unit that we were going to integrate into a processor
pipeline and have the processor feed this thing with the load slice executing
on the main processor and just the computation mapped onto this thing. And
this itself would execute in some kind of dataflow fashion.
So we built a prototype, we built a compiler. This kind of drove a lot of
these problems that I talked about in terms of the compiler and simulation
infrastructure that we developed. So then after we built this, we did our
prototype, we did lots of evaluation, and we learned a lot of lessons.
Thankfully it wasn't like, oh, we built this, let's move on.
So one thing is, well, if you build this thing and then you try to feed it,
then the thing that's feeding it becomes the power bottleneck. Sometimes
even becomes a performance bottleneck. We then showed how we can use
dataflow principles to go and come up with better ways to feed this thing.
I'm going to talk briefly about the compiler, but what is relevant for this talk
is we observed from all of that work that program heterogeneity is abundant in
the execution of our programs, and you can come up with various types of
dataflow techniques to go after various regions and fine-tune the technique
itself to make the execution way better than executing on a conventional
processor.
Okay. So this was the main takeaway from building all of this and figuring
out what it is you're actually trying to do. So that's what I'm going to be
focusing on, this first principle of how can I have a hybrid von Neumann and
dataflow execution to get the benefits of both of these things. Okay. We'll
sidestep a little bit the question that Doug brought up, well, why do you
need the von Neumann core, but we're going to assume you need it just for
compatibility and so on.
>>:
How do you get memory [inaudible] processor?
>> Karu Sankaralingam: You can deal with it in various ways. Once I talk
about the microarchitecture, if I don't answer it, you can ask me again.
Okay? All right. So here is the -- just to make sure we're all on the same
page. The architecture is going to consist of a von Neumann core. You're
going to then have an explicit dataflow processor which will have its execution
encoded in some kind of dataflow ISA.
Okay. And we'll talk about why even von Neumann can complement dataflow
architectures, talk about some program properties we can leverage to have
this hybrid execution model, talk about our microarchitecture proposal here,
which is essentially a synthesis of many existing ideas; there's nothing super
novel here. And I'll talk briefly about some results on how much better you can
do.
And the overall execution model is going to be you execute on this core when
you enter one of these regions that are best suited for executing in dataflow
mode. You'll switch to this thing, you'll power gate or turn this thing off
so you get a lot of power efficiency, and this will execute at much higher
performance and lower power than having to execute it on this thing. Okay.
That's the overall execution model we're going to have.
All right. So I'll go over this very quickly. Lots of you are likely
familiar with dataflow and so on. But the basic high-level idea is if I
compare von Neumann execution to explicit dataflow execution, if you need
speculation because you're highly control dependent and there's some kind of
data-dependent graphs that are correlating [inaudible] very efficiently, you
are likely going to need something like this to get high performance because
predication alone may not get you all the benefits, okay, as opposed to if
you have lots of operand-level parallelism and there's local and non-local ILP,
as in faraway ILP, then you want to be executing in some kind of dataflow
substrate so you can get all of the concurrency without much of the
overheads. Okay.
And this example is just a cartoon that shows how something like this can
happen on very simple straight line dependence with some control flow
happening based on the load I just did. So no matter what you do with the dataflow
processor, you're going to be much better off on a control flow processor doing
something like this. Okay? And here you have lots of operand-level
concurrency, which could be obtained even with a SIMD processor, but that's
not so important. You can do well on a dataflow processor with this because
the loads are not on the critical path of whatever it is you're trying to do.
Okay?
So here is a more general characterization of what is actually going on. If
I look at control regularity and memory regularity along two axes, if both
are highly regular, right, GPU-friendly codes, then you do it on a SIMD
engine, you do it on GPU. You don't really bother with those kinds of
workloads on a general purpose processor. Okay.
If you have very low memory regularity or you're highly irregular, then it's
just going to be waiting for memory all the time. So you want to have a
processor that's really efficient in waiting around, it doesn't have a lot of
structures to maintain all this [inaudible], so you're likely going to do
well on a dataflow processor here because you don't have all these overhead
structures. Okay?
And if your memory is highly -- becomes highly regular but you have some control
irregularity, then again it's not clear. Depending on exactly what
the code behavior is, you could be better off on a dataflow processor. And
if you're somewhere in this middle region, right, where you have some amount
of control regularity and the memory behavior varies between regular and
irregular, then the low overheads of the dataflow processor are going to allow
you to get more power-efficient execution, and the higher concurrency inside
it will allow you to get higher performance. Right?
And for the rest of the regions, where control behavior dominates
what's happening, you will need some kind of mechanism that allows you to
execute those regions efficiently, which is why you'll need some kind of
hybrid execution -- which you could also address by taking some of these mechanisms and
embedding them inside your dataflow core itself. Yes.
>>: [inaudible] one more time what you think the mixed -- or is good at this
[inaudible] because it's speculation.
>> Karu Sankaralingam: It's basically that the code requires some kind of control
speculation because you don't have that much operand-level parallelism, so
your dataflow isn't doing anything, it's just stalled a lot. But your control
flow core will be able to run ahead much better than the dataflow core.
Okay?
>>:
Are these regions detectable dynamically during execution?
>> Karu Sankaralingam: So we'll get to that in the next slide, whether or
not we are proposing you can actually do this statically and get away with
how you can use these cores.
>>: Isn't most of the energy in the out-of-order core because of the out-of-order
[inaudible]?
>> Karu Sankaralingam:
Sure.
>>: So by combining a dataflow accelerator, I mean, you're going to have to
offload a lot of compute to the accelerator.
>> Karu Sankaralingam: Yes. You will. That actually goes back to the
previous question. So why do we offload actually becomes the big question,
right, what can we offload to this thing. And so we'll be spending --
presenting some quantitative data here. So for this we particularly
[inaudible] only -- this is kind of subsetting to make it look bad for us.
So we looked only at this back-end workload and irregular workloads from
Mediabench, so we looked at only these hard workloads, right? If you throw
SPECfp in here, our numbers would look even better.
Okay? All right. So we're going to look at these workloads and figure out
what program properties can we use to actually exploit this graph I just
showed before. Okay?
So the first property is this thing we call affinity phase behavior, or
underlying phase behavior in the program. So this was time and this was
different applications. Blue stands for preferring von Neumann execution and
orange stood for data parallel and the red stood for either one, either one
or the other.
The main thing is that these programs, or most programs, they have these
phases happening at relatively fine granularity. It's not like you execute
for a billion instructions, then offload one phase that will execute for another
billion instructions. These happen at thousands, tens of thousands of
instructions' granularity.
So programs are constantly moving from one phase to another, okay, in fact,
so much so that if I had an ideal dataflow processor to which I could offload
instantaneously, only 25 percent of the programs wanted to be in this thing
all the time. Most of the time you really wanted to be in some kind of
hybrid execution mode where you're constantly moving from the von Neumann
core to the dataflow core because they have these program properties
intermingling within the entire program's execution. Okay.
So that's going to motivate us to have this kind of processor organization.
We're going to take an explicit dataflow engine and integrate it within the
private L1 cache hierarchy. So you could migrate some register values very
quickly, execute here for some time using the same private level 1 cache,
and then switch back here once you're done with this execution. So if the
granularity was much larger, you could do other things.
Okay.
All right.
So then we need to actually come up with some --
>>: So I understand that, are you switching within the speculative pipeline,
or is it post commit?
>> Karu Sankaralingam: This is post commit. Yeah. So what program regions
could we do? We could take arbitrary code, right, functions, whatever. If
you do that, then the mechanisms -- goes back to Aaron's question, the
mechanisms you need inside the dataflow processor are high because you need
to cover all kinds of program behavior that ends up causing some [inaudible]
power overhead. Depending on how you design this, you could lower this, but
I'll call it relatively high. Okay.
If you did only inner loops of programs, then you could come up with a very
simple processor organization, but the coverage as measured from these
programs I just showed before is you're getting only 61 percent of all the
instructions in the program. If you did only traces, you get only 41
percent. And depending on if I had longer duration regions only, this comes
down even further. So neither of these is a good idea.
If you look at what we call nested loops, then you could still come up with a
mechanism that is relatively low in area and power, low design complexity, and
you get pretty good coverage. So 74 percent of the entire program's
execution can be offloaded cumulatively to the
dataflow engine, leaving only 26 percent on average on the high-power von
Neumann core. And if you restricted yourself to only long duration regions,
the number still is pretty high. More than 2/3 of the program can be
offloaded to this low-power thing. Okay.
So this is our motivation, and we're going to go after a mechanism that will
allow us to offload nested loops. Okay. And the whole nested loops idea is
kind of -- it's a little bit important to understand because it means you
need to have some kind of control flow support inside this core because it's
not just one outer loop that's been added on to this thing. Okay.
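For example, a toy nested-loop region of the kind being described (this specific code is hypothetical, purely for illustration): the inner trip count is tiny, so offloading only the inner loop would bounce back to the von Neumann core every few iterations, while offloading the whole nest requires the fabric to sequence both loops without fully unrolling them.

```python
# Toy example of a nested-loop region (hypothetical, for illustration only).
# The inner loop's trip count is small (K = 4), so offloading just the inner
# loop means returning to the von Neumann core every few iterations.
# Offloading the whole nest keeps execution on the dataflow engine, but the
# fabric then needs enough control flow support to sequence both loops
# without fully unrolling them.
def nested_region(a, b, K=4):
    total = 0
    for i in range(len(a)):          # outer loop: large trip count
        acc = 0
        for k in range(K):           # inner loop: tiny trip count
            acc += a[i][k] * b[k]
        if acc > 0:                  # a bit of control inside the nest
            total += acc
    return total
```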
All right. I will go over this relatively briefly. But the main idea is I
want to -- yes.
>>: Sorry. Can you go back one slide. So I'm trying to understand the
difference between nested loops and inner loops. Because it seems like if I
had a nested loop, the inner loop would dominate. But here there's a 13
percent difference between the two. The one on the right is the subset of
the one on the left.
>> Karu Sankaralingam: So, well, this diagram isn't quite as expressive as
it probably could be. So the trip count of this is small when you have
multiple of these happening and you don't want to unroll all of this because
the dataflow fabric usually has some spatial constraints, then you really
wouldn't be able to fit the whole thing. Right? And if you tried to --
>>: Why is your coverage higher with nested loops? Because don't you have
the same issue with nested loops?
>> Karu Sankaralingam: So if you come up -- yeah. If you come up with an
encoding that allowed you to express this control flow in some way without
unrolling the whole thing, then you can come up with a smaller spatial fabric
that can take the entire program's execution.
>>: But if you -- but if you come up with a mechanism, why doesn't it apply to
inner loops?
>> Karu Sankaralingam: You could. That's what I said, which is depending on
how you actually design your fabric, you could move from one to the other.
>>:
Okay.
>> Karu Sankaralingam:
Okay?
Did I actually answer your question, or --
>>: Yeah, well, the -- the -- the -- if you take 13 percent, it's like -- I
just don't -- it doesn't seem right to me that the difference between nested
loops and inner loops should be a fifth, right, of the inner loop.
>> Karu Sankaralingam:
Oh.
>>: On the time you spend in the inner loops, the -- I would expect inner
loops to dominate nested loops as well. In fact, dominate them even more.
>> Karu Sankaralingam: Inner loops to be more than nested loops?
>>: Well, what do you mean by coverage? Maybe that's the [inaudible].
>> Karu Sankaralingam: This is the total cumulative program execution you
could offload.
>>: [inaudible] instructions. So really what you're saying is the -- within
nested loop the code that is not -- that is in nested loop but is not in
inner loops is 13 percent of the program.
>> Karu Sankaralingam:
Oh, I see.
>>: And -- and the inner loop is 61 percent of the program, and that just
seems really high to me given that the inner loop is going to be running a
bunch. Right? Because you're -- because you're going to -- you're going to
iterate some -- that inner loop some number of times, and then there's a
bunch of additional code. But that additional code gets executed once per
iteration.
>> Karu Sankaralingam: Yeah.
>>: Okay. So that -- it just doesn't add up.
>> Karu Sankaralingam: So this has -- okay, I should --
>>: I would actually point to the long duration, that that's even more of an
issue there.
>> Karu Sankaralingam: So there are some implicit heuristics we're also
using here in terms of when I offload. I think the objective
function we were using is I want to get a better energy-delay product
executing on this offloaded thing, so it has some built-in mechanisms --
built-in heuristics -- on how efficient the mechanism we are using for each of
these is.
>>:
Okay.
>> Karu Sankaralingam: So, for example, if you're doing a trace, we were
saying you would have these sequences of instructions or compound function
units you would have for this. For the inner loops we were saying I had a
little bit of control flow mechanism. So there are some heuristics on how
much energy benefit you can get, and only then we allow the offload decision.
>>:
Okay.
>> Karu Sankaralingam: This is just not a pure I-can-offload-anytime
breakdown.
>>: So it could be a lot closer --
>> Karu Sankaralingam: Yes.
>>: -- and you're discounting a bunch of stuff --
>> Karu Sankaralingam: Correct. Correct.
So why is it that you can -- not to beat on this graph forever --
>> Karu Sankaralingam:
Yes.
>>: -- why do you get such better coverage when you don't look at the trace?
Because the trace wouldn't encompass the dynamic path through the loop,
right?
>> Karu Sankaralingam: So, well, okay. So here --
>>: You'd also get function calls, which would not get --
>> Karu Sankaralingam: So the -- okay. It depends on your code
behavior, right? So a trace is defined as something where you enter
and then you exit a single path of control. There's no predication, no
hyperblock, nothing. And any time you deviate, you need to get back to the start
of the execution and start over from scratch. So if you have a little bit
of irregular control, most of the time you enter these traces, or rather
many traces, you end up basically throwing it away and restarting from
scratch.
>>: So you're saying you have a lot of side exits from the trace.
>> Karu Sankaralingam: Yes.
>>: Early exits from the trace.
>> Karu Sankaralingam: Yes.
>>: So that's why. So maybe the trace -- well, I don't know how you formed
your traces. Maybe that just needs to --
>> Karu Sankaralingam: So this, again, like I said, this graph embeds
a lot of heuristics on how we could implement all of these. And some of the
design decisions will come back again when I talk about the actual dataflow
fabric we are using for these slides.
All right. So the high-level overview, just so that we can get to how the
architecture actually works, is you would take a program, and you have -- we have
a compiler pass that will detect these nested loops, because nested loops are
a program property, and you can have some heuristics on trip count to figure out
which ones you actually offload.
And then you embed an explicit dataflow encoding in the binary for
these things. And then when you run the program and you enter one of these
regions, you transfer live values -- whoops -- live values from the main core
into this dataflow fabric. You execute on it for some time. When you're
done, you transfer back all the architectural state and you continue on in the
main core.
And since you're running on this core for long enough, you will power gate the
out-of-order core. That will give you some power gating benefits. Plus this thing
will run at lower power for the code region anyway. Okay.
And so the main thing would be you would transfer from von Neumann to dataflow
at various regions, and you will need some intelligent scheduling technique, in
that you actually want to schedule or stream in the code that runs on this
dataflow fabric before the region actually starts, which you can do with some
very simple prediction technique that tells you I am going to enter one of
these regions, by starting the configuration well before you actually enter
the program region. Okay.
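Here is a hedged sketch of that execution flow in Python-flavored pseudocode. The configuration-streaming and live-value-transfer steps follow the description above, but the controller structure, names, and data layout are my own illustration, not the actual hardware or ISA.

```python
# Illustrative sketch of the hybrid von Neumann / dataflow execution flow
# (my paraphrase of the talk, not the real hardware or ISA).
# A "program" is a list of regions; regions marked offloadable are run on the
# SEED engine after streaming their configuration and transferring live values.

def stream_config(region):
    # In hardware this loads the dataflow fabric's instruction store ahead of
    # the region (the configuration step); here it is just a stand-in.
    print(f"configuring SEED for {region['name']}")

def run_hybrid(program, registers):
    for region in program:
        if region["offloadable"]:
            # Prediction in the real design starts this well before entry;
            # in this sketch we simply configure right before executing.
            stream_config(region)
            live_in = {r: registers[r] for r in region["live_in"]}   # transfer in
            live_out = region["body"](live_in)                       # dataflow execution
            registers.update(live_out)                               # transfer back
        else:
            registers.update(region["body"](registers))              # von Neumann core
    return registers

# Toy usage: one von Neumann region, one offloaded nested-loop region.
program = [
    {"name": "setup", "offloadable": False, "live_in": [],
     "body": lambda regs: {"sum": 0, "n": 4}},
    {"name": "loop_nest", "offloadable": True, "live_in": ["sum", "n"],
     "body": lambda regs: {"sum": regs["sum"] + sum(range(regs["n"]))}},
]
print(run_hybrid(program, {}))   # -> {'sum': 6, 'n': 4}
```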
And so that's this whole SEED config instruction that we introduce that
[inaudible] all of the state that needs to be transferred and the
instruction state that needs to be put into this fabric, and the begin
instruction actually tells you now is when this region actually starts -- yes.
>>: So over there in the system architecture, are you representing -- so
where you say it's a core, it's an actual core, not a hyperthread?
>> Karu Sankaralingam: Yes.
>>: -- companion typically?
>> Karu Sankaralingam: Yes.
>>: So each core would have a SEED --
>> Karu Sankaralingam: This -- yeah. Yes. Yes.
>>: And how does the threading of the core relate to the SEED?
>> Karu Sankaralingam: So we -- for now let's ignore hyperthreading or
multiple threads. Each one executes SEED. For now let's assume that at any
point of time only one thread can be using the SEED engine. So when --
>>: So if you had multiple hyperthreads for the core, there would still be
only one SEED and you just control the entrance to the --
>> Karu Sankaralingam: Yes. Correct.
>>: [inaudible] happens when you [inaudible] decides to schedule another
thread after you've sent SEED config, do you have some like regions that kind
of become atomic?
>> Karu Sankaralingam: We can -- we'll get to how we can get precise
exceptions at some point once we talk about the architecture
design. Okay? Yes.
>>: What about the state for host component? Say if it's a short enough
time like DRAM, stay alive and keep the state? So can you power down these
two?
>> Karu Sankaralingam: Ah. You want to power down more than the -- just the
core itself?
>>: You mentioned it's a nonstate, so I'm saying the stateful could be
powered down for about as long as a DRAM cell is.
>> Karu Sankaralingam: You could. Right now we are -- for these programs,
at least the types of programs we are looking at, they have relatively good
L2 cache behavior and we have some decent L1 cache behavior, so we want to
keep this on. We want to play nice with the virtual memory, so we're
actually going to keep the TLBs on so we can do the translation.
So the main power -- we get some leakage power savings by turning this off,
but the main power savings we are going to get really is that this thing is more
power efficient at executing the code than this thing. So turning this off
is just a little bit of leakage power savings that we're going after.
>>: So jumping ahead a little bit, just I know you'll get there, but when
you run the native code on the OOO versus, let's say, a region
you've offloaded onto SEED, what is the factor difference in energy efficiency?
>> Karu Sankaralingam:
Energy efficiency, uh --
>>: How many joules do I burn running code on the OOO core versus the
equivalent loop on SEED?
>> Karu Sankaralingam: Okay. I don't -- let me see if I can jump ahead to a
performance --
>>:
[inaudible] might help.
>> Karu Sankaralingam: I don't. Because I know it in terms of performance, and I
know the overall energy. I don't have the individual region-by-region energy
breakdown off the top of my head.
>>: It'd be really interesting to understand what factor [inaudible] like
just kind of what -- you know, you started this off with the pie chart.
>> Karu Sankaralingam: Yeah. So we have the numbers; I don't know it off
the top of my head. I don't want to give you some bullshit answer. Okay.
>>:
Wouldn't be any different.
>> Karu Sankaralingam: I know.
>>: That's a good one --
>> Karu Sankaralingam: I don't want to give you -- I don't want to give you
a bullshit answer that seems too obviously bullshit.
>>:
[inaudible] your defense.
[laughter]
>>:
Okay.
>>:
I have a question.
>> Karu Sankaralingam:
Yes.
>>: A similar question. So what is the -- is there some notion of a minimum
amount of work that needs to happen with the SEED for it to be worth whatever
the overhead of the transfer is?
>> Karu Sankaralingam: Right. So for these results, we have an implicit
cutoff that you need to be executing at least 10,000 instructions. And the
results I will be presenting will be based on an oracle scheduler that also,
after the fact, goes and checks if the objective function, which was
energy delay, actually improved.
We also have a heuristic that will have some mechanisms to detect this. We
call it the [inaudible] scheduler, which will do this, and it's been optimized
for a dual-issue core, and that ends up being pretty accurate in terms of
offloading only when necessary. Did that answer your question?
>>:
Yeah, yeah, yeah.
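A minimal sketch of that kind of offload filter, using the two criteria just mentioned: a roughly 10,000-instruction cutoff and an energy-delay improvement check. The structure and the example numbers are illustrative assumptions, not the actual scheduler.

```python
# Illustrative offload filter (toy version of the criteria described above,
# not the actual scheduler): a region is offloaded to SEED only if it runs
# long enough and improves energy-delay product versus the OoO core.

MIN_INSNS = 10_000   # implicit cutoff mentioned in the talk

def energy_delay(energy_j, time_s):
    return energy_j * time_s

def should_offload(region):
    """region: dict with estimated dynamic instructions, plus (time, energy)
    estimates for both the OoO core and the SEED engine (all assumed inputs)."""
    if region["insns"] < MIN_INSNS:
        return False
    edp_ooo = energy_delay(region["ooo_energy"], region["ooo_time"])
    edp_seed = energy_delay(region["seed_energy"], region["seed_time"])
    return edp_seed < edp_ooo

# Example: a 50k-instruction nested loop that is a bit slower on SEED but
# much lower energy still wins on energy-delay product.
region = {"insns": 50_000,
          "ooo_time": 1.0e-5, "ooo_energy": 2.0e-5,
          "seed_time": 1.2e-5, "seed_energy": 0.8e-5}
print(should_offload(region))   # -> True  (EDP 9.6e-11 < 2.0e-10)
```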
>> Karu Sankaralingam: Okay. All right. So I have about 15 minutes. Let
me see -- I'll describe the architecture briefly, and then we'll get to some
results at least. Okay?
All right. So this is just a brief slide that talks about a rich history of
dataflow work that starts from my dissertation. No. But --
[laughter]
>>:
[inaudible]
[laughter]
>> Karu Sankaralingam: Okay. All right. So we have a version of this slide
which goes all the way back to 1977. Okay? I'm not going to get to this,
but the point of this is there's a lot of dataflow mechanisms that are going
after various different design decisions in terms of how they handing control
flow, how they do the dataflow firing rules, what the execution units itself
there are.
And some of these are actually dataflow processors, but they don't call it
dataflow processor like the BERET work. It's basically a very tiny dataflow
engine that's doing compound function units. Right? So we can combine
judiciously a subset of these to come up with one microarchitecture that
covers as efficient execution on nested loops. This is not the only way to
do it. This is one way to do it. It's kind of think of this as a way to
come up with some quantitative numbers that give you a good sense of the
potential for something like this.
So what we ended up doing is our criterion was we wanted to have low area and
power, because we're going to take one of these and attach it to every core. If
we said I want to have one of these for my entire chip, I would do totally
different design decisions. So I want this to be small. I want it to go
after certain types of these things. So that's why we did these things.
We also wanted to complement the capabilities of the von Neumann core. So we
don't want to go and reinvent a bunch of load-store queue mechanisms, memory
[inaudible] mechanisms. So if you're going to do purely nonspeculative
dataflow execution, you will just serialize when you get to some regions, and
that gets embedded into the heuristics which you'll use for offloading code
to this thing. Okay?
So basically we integrate within the level 1 cache hierarchy. We have some relatively
simple mechanisms for doing control flow with predicated execution. We
absolutely don't allow any control speculation. So this is purely
nonspeculative dataflow. And we'll use some very simple dataflow-based
firing rules. And to minimize how much -- how little -- instruction
fetch and decode we need to do, we're going to have execution units that
allow compound function units by combining multiple types of primitive
operations into a single big block.
So you combine all of this. You get a microarchitecture that looks like
this. We have eight compound functional units that within them allow the
capability to do two to five primitive RISC-like instructions' worth of work.
We will have a simple instruction storage resembling the classic token store
for the original dataflow machines. We will then have a simple bus which
will allow us to transmit two values every cycle.
So this is, again, a heuristic engineering design decision we came up with
that this was sufficient. That will then end up triggering more dataflow
execution on this entire thing. Okay. We'll have a store buffer to
serialize stores going out to the memory system. We'll have a transfer block
whose job is to take the architectural values from the main processor and move
them into this thing. We have a configuration unit whose job will be to put
instructions into the instruction store and do a little bit of
configuration of these compound function [inaudible]. Okay. So that's the
main high-level architecture.
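To illustrate the firing rule in a token-store-style design like this (a toy model, not the actual microarchitecture): an instruction slot fires only once all of its operands have arrived, and the result is forwarded directly to the consumers' operand slots.

```python
# Toy dataflow firing model for a token-store-style instruction storage
# (illustration only; it ignores the compound-function-unit grouping, the
# two-values-per-cycle bus limit, and the store buffer described above).

class Slot:
    def __init__(self, op, n_operands, dests):
        self.op = op                  # function applied when the slot fires
        self.operands = [None] * n_operands
        self.dests = dests            # list of (slot_index, operand_index)
        self.fired = False

    def ready(self):
        return not self.fired and all(v is not None for v in self.operands)

def run(slots, live_in):
    # live_in: list of (slot_index, operand_index, value) from the main core
    for s, o, v in live_in:
        slots[s].operands[o] = v
    fired_any = True
    while fired_any:                  # keep firing until nothing is ready
        fired_any = False
        for slot in slots:
            if slot.ready():
                result = slot.op(*slot.operands)
                slot.fired = True
                fired_any = True
                for d_slot, d_opnd in slot.dests:   # forward result point to point
                    slots[d_slot].operands[d_opnd] = result
    return slots

# (a + b) * (a - b): slots 0 and 1 feed slot 2, which produces the output.
slots = [
    Slot(lambda x, y: x + y, 2, dests=[(2, 0)]),
    Slot(lambda x, y: x - y, 2, dests=[(2, 1)]),
    Slot(lambda x, y: x * y, 2, dests=[]),
]
run(slots, live_in=[(0, 0, 7), (0, 1, 3), (1, 0, 7), (1, 1, 3)])
print(slots[2].operands)   # -> [10, 4]; slot 2 fires, producing 40
```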
And in terms of why this actually performs much better than an out-of-order core:
in parentheses you can see the theoretical maximum IPC we can get.
Because we have these compound function units, we have eight of them, we
could get as high as 16. And I'll make some time to go over some performance
numbers. You'll see that we get some really high speedups in some program
regions.
And the instruction window ends up effectively much larger because it's
distributed, plus each instruction is doing much more work than a primitive
RISC-like instruction even on a [inaudible] machine. Okay. And we got rid
of speculative control because if you get to those regions, they will just run
on the von Neumann core. Okay? All right.
So let's talk about some results. So this is based on a relatively good
scheduler -- I mean simulator. And we've configured it so we're not
lowballing our baseline core. So I have very high confidence in
these results. Okay.
All right. So all of these graphs are performance versus energy, geometric
mean across all the benchmarks that I described. And we're going to take an
[inaudible] core or two-way or [inaudible] and integrate a SEED engine into
them. Okay. So going lower on this graph is basically better.
So we also added just a BERET core to it, or a BERET engine to it. We added
conservation cores, which basically harden basic blocks and attach them to a core.
We added in-place loop execution, and we added a sort of big.LITTLE
[inaudible] what it would get. Okay.
So as you can see, these existing techniques give you relatively small
performance improvement. They're not big game changers in terms of how you
can push the envelope here. Okay. So with SEED you can get significantly
higher benefits. You can lower this curve by quite a bit, getting 40 percent,
50 percent higher performance, and do all this at lower power. Okay. So --
>>: Can you go back, because these numbers are important. So really you're
looking at roughly equivalent performance at about half the energy.
>> Karu Sankaralingam: Yes. This is only about 1/4 of what we can achieve
by adding more dataflow techniques, which I'll get to in a couple of minutes.
Okay. Is that okay?
>>:
Yeah.
>> Karu Sankaralingam: All right. So this slide actually is probably the
most important part of this talk. This breaks down why this thing is
actually doing what it's able to do. So this is representative of various
different benchmarks we have. The names of the benchmarks are not important.
But what I'm showing here is their individual regions; each of these ticks
represents one of those regions. It doesn't mean they all contributed equal
time to each program or anything like that. They're just regions which
naturally execute on the out-of-order core or on the dataflow core.
There are many regions where we do like 5 times, 10 times worse because it's
just control dominated, so you don't want to put it on a dataflow core or
build a dataflow core that can handle these regions very well. Okay. There
are some regions where we get similar performance, and these would still be
better offloaded because they will have much higher energy efficiency because
we don't have much lower power. Okay. This goes back to the question that
Doug was asking. I don't have a breakdown right now.
Then there are other regions, and these are the regions where there are
basically fewer [inaudible] and there's not a lot of -- whole lot of stacks
building happening for x86 versions of these cores. And then there are these
regions where we get really high speedup because it's just a lot of high
instruction level parallelism and we can get like 2X speedup compared to like
a big [inaudible] core. It's a big deal. Right?
And then other regions where you have a lot of high memory level parallelism,
this thing is going after faraway regions in the program and you do like 3
times and even 4 times better than your baseline core. And so by having this
hybrid execution of doing [inaudible] when you're in this region and doing
dataflow when you're in that region, overall you're able to get performance
improvement and get power reduction.
>>:
Are you simulating the memory system faithfully here or --
>> Karu Sankaralingam: This is -- yeah. This is like a DRAMSim-like thing
plugged into [inaudible] 5, which we then extract out using this TDG model.
So the point of all this is, with the SEED thing that I've talked about, we've
only scratched the surface of looking at this vast design space of program
behavior, most of which we are missing -- saying if I have non-data-parallel
code and it has high ILP and control isn't on the critical path, then
do this type of dataflow.
There's actually a lot more here. You can have data parallel code which has
low control, then you would use a SIMD-like engine, which actually can be folded
into a dataflow substrate. If you had a little bit of control and you could
separate the data path into an access slice and a computation slice, then you
would do what I was doing five years ago with our DySER project, which is
where we started, which led us to this entire thing. Right?
And you can populate the design space and look at various different parts of
the space. Going back to what Aaron was asking, if I had very, very
stable control and a tight hot loop that was dominating the
execution, you would just build a trace processor that didn't even have any
of these mechanisms to do different instructions and such. It would just
be compound function units, one after the other, boom, good to go. Okay?
And so then you can flesh out various parts of the design space. And really
you want multiple of these -- we can call them accelerators. You can call them region
specialization units. You can call them whatever you want. We come up with
these simple mechanisms, and you integrate them into an [inaudible] core.
And, in fact, people have papers going after each and every one of these
mechanisms. But one of the -- some of them I did, others other people have done,
and they've all claimed that these are all very simple to design and can be
integrated into an existing core, and they all have some evidence that they
could be compiled for. Right?
So you could take a general purpose core, this could be an optimized dataflow
core, it doesn't matter, you can integrate these region specialization units
that go after these various different mechanisms, using some well-defined
principles, and each engine is itself very efficient. Okay?
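As a rough sketch of that region-to-engine mapping, here is a toy classifier following the taxonomy laid out above; the property names and thresholds are made-up simplifications, not the actual design-space model.

```python
# Toy region classifier mapping program behavior to a region specialization
# unit, following the taxonomy sketched above (the properties and categories
# are illustrative assumptions, not the actual design-space model).

def pick_engine(region):
    """region: dict with rough boolean properties of a program region."""
    if region["data_parallel"] and region["control_regular"]:
        return "SIMD/CGRA engine"                    # regular data-parallel code
    if region["data_parallel"] and region["separable_access_slice"]:
        return "access-execute (DySER-like) engine"
    if region["control_regular"] and region["hot_tight_loop"]:
        return "trace/compound-FU (BERET-like) engine"
    if region["high_ilp"] or region["high_mlp"]:
        return "nonspeculative dataflow (SEED) engine"
    return "von Neumann OoO core"                    # control-dominated remainder

example = {"data_parallel": False, "control_regular": False,
           "separable_access_slice": False, "hot_tight_loop": False,
           "high_ilp": True, "high_mlp": True}
print(pick_engine(example))   # -> nonspeculative dataflow (SEED) engine
```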
So this, you can combine all of these, and this is the same baseline curve
that I was showing before. The range of the axes is different because we
get higher performance improvements. And if I just add SIMD or CGRA to my
existing core, then you move down a little bit on this curve. Okay? If I --
>>: And this was -- this was the curve today, because we do have SIMD.
>> Karu Sankaralingam: Correct. Correct. So our SIMD implementation is way
more generous than what you can get out of GCC. But you're right. This
is -- the red one is the baseline we should be looking at. And so by adding
various other things to it, you get a little bit of performance improvement.
And if you add all of these mechanisms, you add a trace processor, you add
BERET, and you add some SEED-like engine, you make a big shift of this curve
down. Okay?
To summarize all of the data here, the thing to take away here is this is
where we are. This is Intel's highest [inaudible] core. You can take a
two-way [inaudible] core, enhance it with these dataflow mechanisms and beat
it a little bit at less than half the power.
And if you feel you need every ounce of performance you can get, you can
enhance this thing as well by augmenting with these high ILP techniques that
are power efficient when you get into certain program regions, and you can
increase performance and lower the power consumption of this red thing. Yes.
>>: So how do you scale this into the future?
is it --
Is this a one-trick thing, or
>> Karu Sankaralingam: Yeah. So that's a great question. I think we have
barely scratched the surface here because we've gone after only three or four
types of program regions. We're covering about 67 percent of the entire execution.
And I think you can get more. You can make each of these individually more
power efficient. You can make the techniques themselves higher performance.
Right. So I think there's a lot more to be gained here by applying these
principles and individually tuning the microarchitecture itself.
It probably wasn't a great, very specific answer, but it's kind of -- that's
where I think we're going.
>>: Yeah, I'm thinking sort of going into the future and having issues with,
say, Moore's law or --
>> Karu Sankaralingam: Sure. So, in fact, again, all -- none of this -- all
of these are designed to have a relatively small area footprint. So they're not
going after like big techniques, so I don't think we require these -- the
design philosophy does not require a lot of benefit from process scaling,
or actually any benefit from process scaling.
So my personal view is this is kind of the most promising thing, right, which
is I build something simple and then I can keep improving it in a relatively
nondisruptive way and get lots of performance improvements. Okay.
So the summary of this part is that we can take these dataflow principles,
add them to a von Neumann core, and for a modest-sized core that gives us big
performance improvements. And so can this help avoid the need for
application-specific accelerators for some more time, because it can improve
general purpose performance some? That depends on exactly how much benefit
you need for these -- excuse me, specific applications that you are
targeting. Okay.
>>:
So -- so --
>> Karu Sankaralingam:
Yes.
>>: -- maybe -- maybe you'll get there, but I -- you know, if you take an
Intel, a modern Intel Xeon, you spend about 30 picojoules to do a floating-point
multiply and about 10 nanojoules to do that whole instruction, so it's
a factor of 300, so your pie chart is way too fat, the --
>> Karu Sankaralingam: Sure.
>>: [inaudible] part, right? .3 percent roughly. And then you're getting
about a 3X gain in energy. So you're still only pulling that up to 1
percent. If you believe those numbers.
>> Karu Sankaralingam:
Yes.
>>: So I -- I mean, maybe you're going to jump ahead and show how you're
going to get much more efficient.
>> Karu Sankaralingam:
I will try to do that in two minutes if that's okay.
>>:
Yeah.
>> Doug Burger:
We're here until 3:00.
>> Karu Sankaralingam: Okay. So I think -- let me paraphrase Doug's
question. So your question is: for these workloads, the execution is still
pretty overhead dominant because we did not change how much energy really goes
into execution compared to how much goes into overhead. So if you need
really, really high performance, these techniques alone are not going to be
sufficient. Right?
So you're right. Absolutely right. In fact, there are workloads for which
we need to get 50X to 100X performance improvements because that's -- only
then do they become meaningful for end users to use. This could be various
types of like datacenter workloads, data mining, speech recognition, all
these things where you need big improvements before things become realistic
for people to use. Right?
So then if you wanted to make something 50X, 100X better, you can put more
cores. That doesn't change the energy equation. You really need to do it
more energy efficiently. Okay. So that's what the second piece is: apply
concurrency and dataflow to go after these acceleratable workloads where
there's a little bit more structure and you can get at the concurrency
without having to change too much of the algorithm, right?
So, as a very brief overview -- I'll go through this very quickly -- we will
first look at what domain-specific accelerators are doing for these
workloads. I look at four. And I will be most critical of one that Doug
is involved with. Which one?
>>:
[inaudible].
>> Karu Sankaralingam:
Shit.
I should have thought of that, too.
[laughter]
>> Karu Sankaralingam: No, this is the NPU work. No, I -- and then we will
[inaudible] identify some common principles and show how you can unify all of
this and try to come up with something more general and unifying. Okay. So
the principles -- which may seem obvious to you once you kind of look at it
this way, but they were not obvious when we started, and they're probably not
obvious to all of the people who have designed these specialized things, right?
So for all accelerators, what people are explicitly or implicitly doing is
they're matching the hardware concurrency to that of the algorithm. If the
algorithm was hundred-way parallel, I'm going to build some kind of engine
that's hundred-way parallel. And there's going to be some way to do explicit
communication instead of doing it through a register file, which is probably
the most energy inefficient way of doing communication. Put stuff in a
centralized place and then someone will read it later.
It could be great when your communication happens across long time scales,
but if that doesn't happen, why put it in a centralized structure, just send
it point to point. Okay. So there's that explicit thing. There are some
problem-specific computation units that are used and there's some
specialization of the particular type of data reuse that is happening in that
workload. Or you can think of it as locality, but data reuse is a more general
way of thinking about locality.
And finally there is some coordination of all these four other things that's
happening which is done through some explicitly hardened state machine of
sorts or a scheduler or something. But its job is to make sure these four
mechanisms work together properly. So these are principles. I will look at
four different examples which are really diverse to show how these same
principles are just implemented in different ways across these different
techniques. So first let's start with NPU, which is -- I mean, actually I
really like the work --
>>:
You can slam it; it's fine.
>> Karu Sankaralingam: No, I wish I could, but -- it helps me make my point,
so I don't want to. So this is work that was done by Doug and Hadi
[phonetic] and others at U-Dub. And basically you run a small neural network
accelerator to go after general purpose workloads with a co-designed,
embedded profiling compiler.
And so this is the architecture diagram for it. And all of these slides I'm
going to use this color code. So this is light green, this is light yellow.
They kind of show up similar, right? And this is the -- our redrawing of
their architecture, this general purpose core. There are eight processing
elements. That's where you get the concurrency from. You're doing something
eight wide compared to the baseline processor. That's the first way to get
the concurrency.
Second, in terms of the specialization of the compute, the only real
specialization here is there's a sigmoid unit that does either 16- or 64-bit
sigmoids using I think eight cycles or so, compared to if you're using a
baseline processor you have to execute some [inaudible] functions, do all
kinds of weird stuff. That takes a lot of time. Okay.
And in terms of communication, there is this FIFO buffer and an output buffer
whose job is to coordinate the communication among all of these processing
elements to do big networks, which will then be virtualized to run on the
eight small PEs. Okay.
And the final type of communication specialization is these all have a shared
bus that allows them to rapidly communicate values with each other, which is
the way you would send values from one neuron to all of the neurons that
can belong to one of these eight other PEs. Okay.
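To make that computation concrete, here is a minimal C sketch -- my own
illustration with made-up names, not the NPU's actual interface -- of the
per-neuron work each processing element performs: a weighted sum followed by
the sigmoid that the hardened unit provides.

    #include <math.h>

    /* Minimal sketch (illustrative names, not the NPU interface): the work
     * one processing element does for a single neuron -- a weighted sum of
     * its inputs followed by a sigmoid.  In the NPU the sigmoid is a
     * dedicated unit, and eight such PEs run in parallel, fed through
     * input/output FIFOs and a shared bus. */
    static double sigmoid(double x)
    {
        return 1.0 / (1.0 + exp(-x));
    }

    double neuron(const double *weights, const double *inputs, int n)
    {
        double acc = 0.0;
        for (int i = 0; i < n; i++)
            acc += weights[i] * inputs[i];   /* multiply-accumulate */
        return sigmoid(acc);                 /* hardened unit in the NPU */
    }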
>>: I think the only thing probably that I took issue with, and I think
you'll agree, is that we didn't pick eight units because that was the
concurrency of the program.
>> Karu Sankaralingam:
Yes.
>>: You know, doesn't matter for energy, right, you can do it in space or
time. We just picked eight because they kind of got the lion's share of this
more area.
>> Karu Sankaralingam:
Sure.
Sure.
Yes.
>>: So if we had 16 [inaudible] decided it wouldn't have been any more or
less energy efficient.
>> Karu Sankaralingam: Yep, I do. Yeah, I misspoke. So the concurrency is
matched a little bit to other factors. It's based on engineering heuristics
on how you could then make that architecture work.
>>:
[inaudible]
>> Karu Sankaralingam: Exactly. Okay. All right. So that's what is there.
Okay. And so -- and finally there is this controller which does two kind of
related but unrelated things. One is within each PE it kind of figures out
when to push stuff here and figures out when stuff arrives in these output
buffers. And it also very carefully schedules stuff on this bus so that
there are no conflicts and things don't get lost. And this is hard coded as
a state machine which is generated by the compiler that will take a neural
network and create all this fancy logic and embed it onto this thing. Okay.
So that's NPU, and then there are a couple of slides that talk about various
other architectures. I'll briefly talk about the convolution engine and spend
a little bit of time on DianNao. So this is work from Stanford, and you can
see some of these mechanisms show up again here. There's a high-level control
state machine whose job is to push stuff into specialized 1D shift registers
or a 2D coefficient register, which is a way of doing data reuse
because that's what you want to do for convolution, right?
So either 1D or 2D convolution, and some specialized storage for output. And
the computation itself is done basically through these relatively wide SIMD
units. They do some amount of computation specialization. They should also
be colored in little bit of red because they do arbitrary -- semi-arbitrary
precision computation over here. And there's some fusion operation that
happens for combining many of these operations before you write them back
out. Okay.
And within this there's a lot of concurrency inside every one of these
things, and they go into a reduction tree before you get written out. Okay.
So that's the -- you can see these mechanisms show up on this convolution
engine design as well.
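For concreteness, here is a minimal 1D convolution sketch in C -- my own, not
the Stanford design's code -- showing the coefficient reuse the engine's
registers capture and the inner multiply/add work that would map onto its
SIMD units and reduction tree.

    /* Illustrative 1D convolution: the small coefficient array is reused
     * for every output element -- that reuse is what the coefficient
     * registers capture -- while the inner multiply/add maps onto the
     * wide SIMD units and the reduction tree. */
    void conv1d(const short *in, int n, const short *coef, int k, int *out)
    {
        for (int i = 0; i + k <= n; i++) {
            int acc = 0;
            for (int j = 0; j < k; j++)
                acc += in[i + j] * coef[j];  /* coef[] reused every iteration */
            out[i] = acc;
        }
    }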
And Q100 I'm going to skip. It needs a lot of knowledge of TPC-H queries,
which I'm not going to have time to cover to get this across. DianNao, some
of you may be familiar with it. It's this [inaudible] work that's been coming
out from Olivier Temam and his collaborators at ICT. Thank you. And it's
fascinating work where essentially they take these big networks and stream
some set of weights into a scratchpad. They call this the synapse buffer. And
then the actual inputs and outputs come into two other dedicated buffers and
stuff is just constantly streaming in after weights have been locked down.
Okay.
And the actual compute is basically a big array of multiply accumulates,
feeding one sigmoid nonlinear transformation at the very end. This actually
does very little for DianNao. If we didn't have this at all and we spent 40
cycles doing this, you still get almost all the performance you would get
from the specialized chip. Okay? Yes.
>>: [inaudible] I don't know what [inaudible] but we use just a lookup table
for that --
>> Karu Sankaralingam:
Oh, I see.
Okay.
>>: [inaudible] that was fine, which means that it's already verging on
general purpose.
>> Karu Sankaralingam: Yeah. Yeah. So I don't know what -- so this is -- I
know the 16 bit, but I don't know exactly how it's implemented. They have
some cycle numbers in their paper on what its performance expectation is.
Right. And the communication special --
>>:
[inaudible] he was also a coauthor of the MPU work.
>> Karu Sankaralingam:
Yes, I know.
I did --
>>:
[inaudible]
>> Karu Sankaralingam:
Yeah.
I know.
>>:
I forgot -- I didn't hear you say [inaudible].
>> Karu Sankaralingam: And you can see one of the very nice things about the
communication specialization here are these add and multiplies. They just
communicate with each other, getting rid of all of this [inaudible] traffic
the general purpose processor would have to do because that's the only way
you can do it. Compared to a dataflow processor, for example, in which
you've probably explicitly designed it to do point-to-point communication and
do it efficiently. Okay.
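As a rough illustration of the access pattern DianNao exploits -- again my
own sketch, not the chip's actual dataflow -- think of a blocked layer
evaluation where a tile of weights stays resident (the synapse buffer) while
inputs stream past it.

    /* Illustrative only: a block of weights stays resident (the "synapse
     * buffer") while inputs stream through dedicated buffers; the
     * multiply-accumulate array does the inner work and one nonlinearity
     * is applied at the end.  TILE is a made-up parameter. */
    #define TILE 16

    void layer(const float *w, const float *in, float *out, int n_out, int n_in)
    {
        for (int o = 0; o < n_out; o++)
            out[o] = 0.0f;

        for (int ib = 0; ib < n_in; ib += TILE) {     /* lock down a weight block */
            for (int o = 0; o < n_out; o++) {
                float acc = out[o];
                for (int i = ib; i < ib + TILE && i < n_in; i++)
                    acc += w[o * n_in + i] * in[i];   /* MAC array */
                out[o] = acc;
            }
        }
        /* a single sigmoid/nonlinearity would be applied to out[] here */
    }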
So combining all of this, what we conclude is that all accelerators -- since
we looked at four, we can now say all -- essentially employ these five common
principles. I'll briefly talk about some examples to prove the negative.
Okay.
And you can then, as an architect, come up with mechanisms to implement these
principles in a somewhat general purpose or more universal way. Okay.
To get concurrency, you come up with multiple tiles. Then, for what each tile
should comprise: to get the communication and computation specialization, you
come up with an efficient communication mechanism by using a spatial fabric
that does efficient point-to-point communication, and that thing will execute
in dataflow fashion, avoiding the overheads of going through some centralized
structure and so on.
Okay. And for data reuse, oddly, a single-ported scratchpad is enough for
these workloads. Okay. I'm not saying it's enough for any accelerator
workload, but essentially when you can get these 500X, 100X, 50X speedups,
there is some inherent simplicity in your algorithm that your algorithm
provider has already provided. So you can get away with probably a single
scratchpad, logically partitioned among the different data structures you
need to put onto it. Okay.
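As a minimal sketch of what that logical partitioning might look like from
software -- the names and sizes here are hypothetical, not LSSD parameters:

    /* Hypothetical layout: one physical, single-ported scratchpad is
     * logically carved into regions for the different data structures a
     * kernel needs.  Sizes and names are illustrative only. */
    #define SCRATCH_BYTES (32 * 1024)

    enum {
        WEIGHTS_BASE = 0,            /* e.g., filter weights / synapses */
        INPUT_BASE   = 16 * 1024,    /* streaming input tile            */
        OUTPUT_BASE  = 24 * 1024     /* accumulated outputs             */
    };

    static unsigned char scratchpad[SCRATCH_BYTES];

    /* Each region is just an offset into the same physical array; the
     * low-power control core decides what lives where for each kernel. */
    static inline void *scratch_region(int base) { return &scratchpad[base]; }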
And for coordination, you need to make these things do different things. You
can bake it in using some hard-coded state machine, or, if you can get away
with a little bit of [inaudible] inefficiency, I say just put a very, very
simple three-stage processor there and just write C programs for it. Okay.
So you can combine all of this with this high-level microarchitecture where
you will have some set of tiles which [inaudible] embed some computation
specialization, attach a simple, efficient router to all of these so they can
go do efficient point-to-point communication and do a little bit of routing.
Have a mechanism to feed this with values and then have a scratchpad that can
send values into this fabric, attach a simple low-power core whose job is to
coordinate what goes into the scratchpad and sequence values going from the
scratchpad into the fabric itself. Okay.
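To ground that coordination role, here is a hedged C sketch of the kind of
loop the simple low-power core could run -- all of the function names are
hypothetical placeholders, not an actual LSSD API.

    /* Hypothetical primitives the control core would expose (placeholders,
     * not a real LSSD API). */
    void scratch_load(const float *src, int len);
    void fabric_stream_in(int len);
    void fabric_wait(void);
    void fabric_stream_out(float *dst, int len);

    /* Sketch of the control core's job: move data into the scratchpad,
     * sequence values from the scratchpad into the dataflow fabric, and
     * drain results.  It does no heavy compute itself. */
    void run_kernel(const float *src, float *dst, int n, int tile)
    {
        for (int base = 0; base < n; base += tile) {
            int len = (n - base < tile) ? (n - base) : tile;

            scratch_load(src + base, len);      /* DMA tile into scratchpad  */
            fabric_stream_in(len);              /* feed values to the fabric */
            fabric_wait();                      /* dataflow tiles compute    */
            fabric_stream_out(dst + base, len); /* drain results back out    */
        }
    }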
And you stamp out multiple copies of these and you'll get a concurrent fabric
that can do various types of workloads and can be programmed using a
high-level language without having to generate specialized RTL that gets
baked into a [inaudible]. Okay.
So very briefly, why I think this is really cool is you can write C programs
for this with some simple pragmas to map different data structures to
different hardware structures. And you can actually come up with compilers
that will take C code and generate assembly code that can run on this fabric.
Okay.
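As a hedged illustration of that programming model -- the pragma spellings
below are invented for exposition, not the real compiler's directives --
mapping might look like annotating the data structures and the loop to
offload.

    /* Illustrative only: the pragma names are invented; the actual
     * toolchain's directives may differ.  The idea is the programmer maps
     * data structures to hardware structures and marks the loop to be
     * offloaded to the spatial dataflow fabric. */
    #define N 1024

    void dot_offload(const float *a, const float *b, float *result)
    {
        #pragma lssd scratchpad(a, b)   /* hypothetical: pin arrays in scratchpad */
        float acc = 0.0f;

        #pragma lssd fabric             /* hypothetical: map loop onto the fabric */
        for (int i = 0; i < N; i++)
            acc += a[i] * b[i];

        *result = acc;
    }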
And at this year's [inaudible] there was some work from David August's
[phonetic] group and my student Tony and Matt Watkins [phonetic]. They've
done some great work in basically saying all of the compiler work we did for
DySER is not necessary, by coming up with a dynamic binary translation
technique that will just profile running code and generate code for spatial
fabrics dynamically. Okay.
So you can -- with a little bit of work, you can have a static compiler that
will map code onto this. There is also evidence that some part of this can be
done dynamically today. There's more work that needs to be done to manage
the scratchpad and so on.
So you can program this thing -- that's the main point I want to make -- by
just writing C programs. There's also a hardware design component of how many
units do I need, what do I put in every one of these. So that's a design
process you do based on what your workloads of interest are and so on. Okay.
Not going to spend too much time on this because I want to at least briefly
talk about results. I'm happy to come back to this.
So to give you some context for the results I'm going to talk about, we
looked at these four different accelerators. We looked at their published
numbers, and we compare to their published numbers. We got some more
details about the actual numbers from a subset of them. For DianNao, I'm
still talking to Olivier for him to double-check exactly what we are
reporting.
Okay.
So you'll see graphs where we have their published numbers compared to what
we get out of LSSD. Okay. All right. So this talks about the four
different configurations we have. Each is tuned differently to match exactly
one thing. We have in our paper details of coming up with one fabric and how
well it matches if I just build that one thing and run various different
workloads on it. Okay. The main difference is what goes into the compute fabric
itself and the total number of tiles. Okay.
All right. So for these results, what we do is we provision each of these
fabrics to match the performance of the specialized domain-specific
accelerators. So that's why you'll see this blue bar is by design matched to
come up close to the domain-specialized accelerators.
Q100 I'll come to at the end. Okay. It's a surprising result, but we'll come
to it. And the main takeaway from here is the overhead you pay in power and
area is about 2X. You might be wondering why am I here talking about a
technique that gives you more overhead, but this is a single fabric
programmable in C to which you can unleash a compiler, run various different
workloads that you've not thought of while designing the thing. And we're
comparing that to domain-specialized things that have been baked into
silicon. Okay.
So one way to paraphrase this result is: for all of these techniques, the
true benefit of specialization is only 2X in area, only 2X in power.
>>:
So how are you doing the sigmoid in the NPU case?
>> Karu Sankaralingam: So for this we said the [inaudible] we have a sigmoid
unit, that one sigmoid will be embedded inside the tiles --
>>:
You have a sigmoid [inaudible].
>> Karu Sankaralingam: Yes. And there are actually some of the ones
[inaudible] for some -- depending on the size of the tiles we have and the
fabric, we can beat NPU because of [inaudible] and so on, right?
>>:
I'm not worried about that [inaudible].
>> Karu Sankaralingam:
Yeah.
>>: How do your 2X [inaudible] compare with the [inaudible] you pay for FPGA
versus [inaudible]?
>> Karu Sankaralingam: All right. That's a great question. I don't have
the quantitative numbers, but I can answer it qualitatively. Okay. Which
is -- okay. It's in a backup slide. Let me now try and find it. But the
high-level bit is FPGAs give you some other capabilities. They are lacking
in a couple of ways. One is right now the frequency of FPGAs is not very
high. So the performance will be limited.
Second, depending on if you're going to synthesize all of these functional
units using the [inaudible] instead of putting them as specialized DSP
slices, that's going to cost you a lot. Third, if you're going to -- when
you have this much operand level concurrency and actual operations are done
on the DSP slices, you're going to effectively have to synthesize a big
multi-ported register file for these things to talk to each other. That ends
up becoming really inefficient. So I would argue you're actually better off
with a CGRA than using an FPGA with DSP slices and block RAMs to do the
communication --
>>:
[inaudible] the CGRA [inaudible].
>> Karu Sankaralingam:
Oh, coarse-grained reconfigurable.
>>:
[inaudible].
>> Karu Sankaralingam:
Yes.
>>:
-- more pipelining, networks on chip, lots of hard [inaudible].
>> Karu Sankaralingam:
So --
>>:
I think that's a false choice because the FPGAs are moving towards --
>> Karu Sankaralingam:
Correct.
Yes.
>>:
They already have the [inaudible].
>> Karu Sankaralingam:
Yeah.
>>: [inaudible] blocks. So really the question is what could you add to an
FPGA to get more of these benefits?
>> Karu Sankaralingam:
Absolutely.
So --
>>:
And close the [inaudible].
>> Karu Sankaralingam: [inaudible] so, in fact, I think you can view this --
this is on my conclusion slide -- as the natural evolutionary point that
GPUs, FPGAs, all of these will get to by taking these mechanisms and
embedding them in the most microarchitecture-friendly way for their native
execution model.
So I don't see this as like a competition with FPGAs. I think this points to
somewhat of a natural evolutionary trend in where they're going. And some of
this stuff is already there -- some of these things that I talked about are
clearly already on the roadmap. They're hardening the DSP slices, they put
these transfer and latches everywhere, [inaudible] frequency to a gigahertz
and things like that.
So in terms of where area goes, this is a breakdown of individually how much
goes into the SRAM, the dataflow fabric, and the [inaudible] connection
network. And this is the ratio of what happens compared to the specialized
accelerator itself.
So very briefly, so DianNao -- Q100 is this thing that's been designed to go
after TPC-H queries. Basically they designed an architecture that's actually
not very balanced. So the queries are sort heavy. Q100 is not good at doing
sort actually. So we can just do sort with a simple core we've put in. We
are way better at sort. That's basically why we end up looking better than a
domain-specialized accelerator. So there's nothing magic going on here.
It's just, in my opinion, a slightly imbalanced design. But it's -- I mean, I
don't mean to criticize what it is. That's kind of what we came up with in
our analysis. Okay.
So let me very briefly -- well, let me skip this slide. So in terms of
performance analysis with our framework -- it's in our paper, and we can talk
about it all offline -- we are able to break down the performance improvement
that comes from each and every one of these mechanisms: just having a simple
core, multiple of these cores, adding the dataflow fabric, adding the data
reuse, adding the communication specialization.
And so this is why I was saying before that most of the benefit for these
workloads ends up coming from the concurrency and the dataflow execution.
The compute specialization isn't buying you that much. Okay. And for some of
it you can say, is it data reuse or is it communication specialization --
you can label it either way. Right.
So, again, I'm not saying this is true for all accelerators, but for these
workloads, which actually represent a pretty wide spectrum, this is what is
happening and these are the sources of improvement. And our paper has
details for each and every workload domain and goes into this breakdown. And
you can see for different workloads the breakdowns of where the improvements
come from change. I'm happy to talk about this offline. And if there's any interest
in the paper, please let me know. I'm happy to send it to you.
Okay. So very briefly, when will this not work? There are many cases when
this will not work. For sure, I know one case where it will not work: when
the workload you're trying to run is highly memory dominated, specifically
when it's very irregular, in that you're getting some data, you go to
memory, or maybe even a smallish memory, you do some bit mangling on it, and
you go to memory again. Okay.
So then it's like, okay, I don't need to do any coordination, I don't need a
dataflow fabric, I just need memory and a little bit of arbitrary binary
logic, Boolean logic attached to that memory. That's all I need. And I want
that to be really efficient, and what we have right now is not very efficient
at it. We know it's not efficient. Okay. So there are two concrete
examples of this.
One, it actually goes back to the deflate work that you guys have done here
on doing lossless compression on streaming data using FPGAs. And so there
the biggest benefit or one of the big sources of benefit is this super funky
hash table into which you have eight accesses every cycle, and that is
implemented using the block RAMs, and it's like awesome, and we don't have a
mechanism to match that.
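To make that memory-plus-bit-mangling pattern concrete, here is a rough C
sketch of a deflate-style match-finder inner loop -- the hash and constants
are illustrative, not the FPGA design.

    #include <stdint.h>

    /* Rough sketch of the irregular pattern in deflate-style compression:
     * hash a few bytes, probe a table of previous positions, compare,
     * update, move on.  Almost all the work is table accesses plus a
     * little bit-mangling; the FPGA design does several such probes per
     * cycle out of block RAMs.  Constants and the hash are illustrative;
     * head[] is assumed initialized to -1, and bounds checks are omitted. */
    #define HASH_BITS 13
    #define HASH_SIZE (1 << HASH_BITS)

    static int32_t head[HASH_SIZE];   /* last position seen for each hash */

    static uint32_t hash3(const uint8_t *p)
    {
        return ((p[0] << 10) ^ (p[1] << 5) ^ p[2]) & (HASH_SIZE - 1);
    }

    int find_match(const uint8_t *buf, int pos)
    {
        uint32_t h = hash3(buf + pos);    /* bit mangling              */
        int32_t prev = head[h];           /* memory access             */
        head[h] = pos;                    /* memory access             */
        if (prev >= 0 && buf[prev] == buf[pos] &&
            buf[prev + 1] == buf[pos + 1] && buf[prev + 2] == buf[pos + 2])
            return prev;                  /* candidate match position  */
        return -1;
    }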
It will be nice to come up with a single principle that will expose SRAMs or
block RAMs in some kind of ISA [inaudible] so that anyone can access them.
And IBM PowerEN is basically a regular expression IDE sorting engine. The
main thing it does is it takes the [inaudible], does some intelligent
software, breaks it up into nice pieces, locks it up into a scratchpad, and
they just access this thing constantly [inaudible] build an entire chip for
it.
The only real specialization happening there is this memory which you can
access with arbitrary transformations of inputs coming in -- access the
memory, do some more transformations. So what we are looking at right now is
whether or not there is a general principle here to come up with memory
sitting exposed that can be addressed in arbitrary ways and can then become a
building block that you can then embed into an FPGA. Because I've got to
believe building the massive cool hash table using the block RAMs isn't the
most efficient way of putting it in silicon.
Yes.
>>: What about the cloud workloads, right? They're the -- they have very
large memory footprints.
>> Karu Sankaralingam:
Sure.
So once you get beyond the -- yes.
>>:
If you -- [inaudible] little computation to do, right, little execution --
>> Karu Sankaralingam: Yeah. If you go out of the DRAM capacity, then you
might be better off with like some offload engine that is running close to
the thing. To me the question is what should that offload engine be. Can it
be an LSSD? I have not looked at those workloads enough to answer that either
way. We have looked at some of them and --
>>: The problem is irregularity, right. It's -- they might all fit in memory,
they all might be designed to fit in memory, but they'll be irregular and
there won't be that much computation to do per byte for a lot of them.
>> Karu Sankaralingam:
Yes.
>>:
They run from memory, right?
>> Karu Sankaralingam:
Right.
>>:
So -- [inaudible] just have a host.
>> Karu Sankaralingam:
Yeah.
I would argue --
>>: No, then the -- the thing that I was trying to get to is maybe you can
put your [inaudible] not in the CPU but in the [inaudible] memory hierarchy
or something like that.
>> Karu Sankaralingam: Yes. Absolutely. In fact, one of the things we're
looking at in my group is how I can take the pieces of this LSSD thing and go
after problems that are basically memory dominated -- what is the unit that I
need. And our current results -- my bias -- push us to believe that you
basically get rid of this fabric, have a very, very simple, primitive
processing core, and just that brute force approach gets you really far.
And people [inaudible] like funky prefetch engines, this, that. Just this
brute force approach of a power-efficient core that's very efficient at being
idle can get you very, very far. So we have some results. We can talk more offline.
All right.
So I have these results that show how much worse we are compared to the
[inaudible] work. We are 5X worse when you [inaudible] on this thing, after
we make fancy assumptions about the hash table. I won't get into that.
So very briefly, can we build this thing today? There are design and
architectural questions. Are the RSUs simple enough, and can these compilers
actually work? We have built some of this in our research. I think the idea
is, [inaudible] compiler experts -- lots of you are -- there's every
[inaudible] these things can be built, these things can be promising.
And in terms of ISA compatibility, there's a lot of ongoing work from my
group and others that say how you can embed all of this in some dynamic
binary translation engine and pull off many of these benefits. So I am very
excited about all of these results, and I think they're really promising and
point to a direction in which we can push future processors forward,
providing high performance and general programmability.
So in my group, we are looking at basically building some kind of prototype
of this thing and seeing, well, let's run it against a big [inaudible], see
what happens. And I am confident we can beat many of these emerging ARMv8
cores in power and performance using many of these techniques here.
And right now we are in the midst of building this -- not a specialized
chip -- going after the DianNao specialized chip with a prototype
implementation of the LSSD idea that I just talked about.
That's all I had. I'm happy to take any more questions. And thank you all
for inviting me again and listening. I am sorry I ran over by 28 minutes.
[laughter]
[applause]