>> Juan Vargas: The last session of this marathon workshop is going to be on
architectures. I will have Josep Torrellas from the University of
Illinois presenting the bulk architecture, followed by Krste Asanovic
who is going to talk about 21st century architecture research, and Tim
Mattson on a topic that is a mystery.
>> : Because you never want to talk about it.
>> Juan Vargas: Right. Immediately I feel bad. We are going to move
some chairs around to start the panel, and the panel is going to be,
"Can Industry and Academia Collaborations be Effective?" And we are
going to have David Patterson, Burton Smith, Jim Larus, Tim Mattson...
>> : [Inaudible] we're getting started. Can you [inaudible]?
>> Juan Vargas: ...Josep Torrellas, and we hope to have a very good
discussion to end the workshop in fire, you know, having real fun. So,
Josep?
>> Josep Torrellas: All right. How do I get this guy out of my way?
>> : [Inaudible].
>> Josep Torrellas: Good. Okay. Welcome to the last session, the
architecture session. In this talk I'll give an overview of the work
that we've done in the last four years on the bulk multicore
architecture. I apologize if some of you guys have heard some of these
topics before. But what I'm going to do is I'm going to focus on the
vision and on the most recent work as well. Okay?
So when we started this project about four years ago we were thinking
about what is it that makes a programmable architecture? And that was
the goal at the very beginning, "Let's make a programmable
architecture." And what we've been able to think about is a
programmable architecture is one that attains high efficiency without a
lot of work from the programmer, without the programmer having to do
low-level tasks like placing the data or worrying about the coherence.
And at the same time one that [inaudible] it can minimize the chance of
parallel programming errors.
So we came up with this bulk multicore, and this is a general purpose
multiprocessor architecture which is cache-coherent. And it uses two primitives for cache coherence. One is signatures, which are essentially Bloom filters, and the other one is chunks, or blocks of atomic instructions.
And with this we build the whole cache-coherence. And because of that
it relieves the programmer and the runtime from having to manage the
shared data. At the same time because we work on big chunks of
instructions, it provides high-performance sequential memory
consistency, and that provides a software-friendly environment for a
bunch of tools -- for debuggers, for watchpoints and whatnot.
And then we augmented the architecture with a set of hardware
primitives for low-overhead program development and debugging. So
features such as race detection, deterministic replay, address
disambiguation are embedded as part of this architecture through
special purpose hardware. And this we claim helps reduce the chance of
parallel programming errors, and it has an overhead low enough to be
"on" during production runs.
So that's the high-level view. I need to just go over a couple more
slides to give the basic concepts. The bulk multicore is an example of
a blocked-execution architecture, an architecture where processors
continuously execute and commit chunks of instructions at a time. So we are kind of abstracting away individual instructions. We don't do this anymore; we don't worry about updating the architectural state after every instruction. Instead processors will execute a thousand, ten thousand instructions and then commit. And for that they need to have buffering. They buffer the state during these thousand instructions,
and then they commit. And when they commit, it's not that they write
back streams of data, all they do is they send the signature and that
should be enough to make sure that the state is coherent.
So this is the example that we have here. So we have a bunch of blocks,
and they commit in this form. You can either have the processor doing
the chunking automatically driven by the hardware itself. This is like
-- It's not unlike transactional memory, but it's implicit transactions,
the processor doing this in the background because it likes to do that.
Right? Or you can have the software that itself cuts the -- provides
hints of where you want to cut the code. And with this you have higher
performance because instructions inside these chunks are reordered by
the hardware and also the compiler can aggressively optimize the code
inside doing things that would otherwise be illegal. Okay? And just the
last slide here on this high-level discussion is that a big problem of
this architecture is the squashes. Whenever you are executing this work and you find that somebody else has changed the state that you rely on -- you have somebody writing data that you read -- you have to squash. And with big chunks a squash is a real problem. For that we use a lazy approach: at the end of the chunk we have to check that there are no conflicts with anybody, and we use the signatures for this.
Okay, so what we do is suppose you have this processor executing this
chunk, this one executing this chunk. No state goes out in the
meantime. This one wants to commit. When it commits all it needs to do
is it needs to send the signature, it doesn't have to send any data.
And the signature is checked against the signature of this other one,
and if there's a collision, like in this case when this processor read something that this one wrote, basically [inaudible] you squash this one. And a chunk squash is quite expensive, so we're going to try to avoid this.
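To make the signature idea concrete, here is a minimal sketch of how a Bloom-filter-style signature could encode the addresses touched by a chunk and be intersected at commit time. This is not the Bulk hardware encoding; the signature width, hash function, and function names are all illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative signature: a 256-bit Bloom filter over addresses.
 * Real Bulk signatures are hardware registers with fixed hash
 * functions; the width and hash here are made up for clarity. */
#define SIG_WORDS 4
typedef struct { uint64_t bits[SIG_WORDS]; } signature_t;

void sig_clear(signature_t *s) {
    for (int i = 0; i < SIG_WORDS; i++) s->bits[i] = 0;
}

/* Hash an address into one of 256 bit positions (toy hash). */
unsigned sig_hash(uintptr_t addr) {
    addr ^= addr >> 17;
    return (unsigned)((addr * 0x9E3779B97F4A7C15ULL) >> 56); /* 0..255 */
}

/* Record an address accessed by the current chunk. */
void sig_insert(signature_t *s, uintptr_t addr) {
    unsigned h = sig_hash(addr);
    s->bits[h / 64] |= 1ULL << (h % 64);
}

/* Conflict test at commit: does the committing chunk's write signature
 * possibly overlap another chunk's read or write signature?  A Bloom
 * filter has false positives but no false negatives, so an empty
 * intersection guarantees no conflict; a non-empty one conservatively
 * forces a squash of the other chunk. */
bool sigs_intersect(const signature_t *a, const signature_t *b) {
    for (int i = 0; i < SIG_WORDS; i++)
        if (a->bits[i] & b->bits[i]) return true;
    return false;
}
```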
So throughout these four years what we tried to do is to build the
whole ecosystem around these chunks. We started with the architecture,
the hardware here. We started with a Communications of the ACM 2009 paper that explains the basic architecture. Since then we have been looking at different aspects of the architecture. We also worked on feeding this architecture with blocked code. Okay? We have a dynamic compiler that takes code and generates these chunks that
are used by the hardware. And we're working on a static compiler. And
then we can also have a profiling pass that runs the code and figures
out what are the communications, and based on that it gives hints on
how to chunk the code.
And there is also another part of the work, which is this additional hardware that does all the [inaudible] detection and the [inaudible] violation detection and so on, using signatures and hashing hardware.
So the interesting thing here is that you start with unmodified
source code, so you don't start with code that the user has
instrumented with transactions. Instead you start with locks, barriers,
flags. You have this compiler. You pass it through here and then this
hardware executes this code efficiently. Okay?
So what I would like to focus -- And so these ideas have been out and
we've basically told Intel about this several times, so if they are
interested these are the ideas and they can take them basically.
So what I want to focus on is give you some idea on what we've been
looking at recently which is some of these architecture issues in HPCA
and some of the compiler passes here.
So this is the recent accomplishments on the architecture side. One of
the papers we published a couple of months ago is on allowing
simultaneous multi-threaded processors, meaning hyper-threaded
processors, to work in this mode of chunks, so all the threads inside
an SMT processor working in chunks. We also extended the architecture
to support determinism, deterministic execution. And we have some work
on an architecture to record the program as it's running and
deterministically replay it later on. On the compiler side we had the
recent paper last year on how do you -- what is the interaction of
chunks with synchronization. What is the optimal place to cut the
chunks? And we're working on using alias speculation, another
optimization [inaudible] using these atomic blocks.
Okay? So I'll give just a hint of a couple of things. So on the
architecture side what we've done is we have used this concept of
executing chunks inside an SMT processor, a simultaneous multithreaded
processor. The reason is that many, many processors are simultaneous
multithreaded these days and you use the hardware better with those. So
as you may know, Intel has recently announced that Haswell has support
for transactions and it also works as far as they understand on an SMT
processor. But what we do here is, because you have multiple threads running on the same core, you can tolerate dependent threads. Meaning even if you have a collision between two chunks that are running on
two different threads, we're going to tolerate this and execute and
continue executing.
Okay, this is the advantage of having this processor with the thread so
close that you can keep state nearby. Okay? So that's the main idea
here. So we claim this is the first hyper-threaded design that supports
atomic chunks. And we analyze what happens when you have conflict
between the different threads that run on the same SMT processor. You
can either squash, you can stall, or you can order and continue. Okay?
And also we designed a way of having a many-core, a multicore, where
each of them is an SMT processor and all processors run in this mode of
chunks. And this is obviously more cost effective than running a single
thread per processor. You have higher performance for the same core
count and reasonable performance for a quarter of the hardware.
So here's the -- kind of the crux of the problem is that even in an SMT
processor all the threads share the first-level cache. Okay? And as a
result if I have this processor executing a transaction or a block and
writing to the cache then suppose that another thread reads from this
location, now we have a dependence. Okay? Traditionally what you would
do is you would squash one of them. Okay? Now we want to support -- let
them continue. And we take advantage of the fact that it is so close,
the two threads, that they can keep state. So I can squash, stall or
order. Okay? So this is a pictorial representation of this.
So suppose I have two threads. This guy is executing this dark block.
This one reads the cache. We find the dependence through the tags of
the cache, right, or through the log [inaudible]. And then what you
observe here is you can either squash this one and restart -- It would
take this long -- or we can stall the processor after the [inaudible],
stall. And then when this one commits since we know when it happens,
okay, then we can continue this thread. And this has sped up this
program from here to here. Or we can actually record that there is a
dependence and keep a small table that says, "This thread now depends
on this one. I cannot let this one commit. Instead I'm going to have
this one commit first and when this one commits have this one commit."
Okay? So that's the idea. And the significance is that you can run
dependent threads in parallel.
Now I won't bother you with some of the hardware that is needed. These
are the different types of mechanisms that we use. You can do it in
hardware and software; this is a hardware implementation and this is
[inaudible]. Let me move on quickly to the code generation. So the big
picture of how you generate code for a block architecture is very
simple. So what you need to do is you need to have some software
entity, compiler, profiler, whatever, insert chunk boundaries at
strategic points in the code, okay, and then perform aggressive
optimizations in between these two boundaries knowing that the hardware
will guarantee that no matter what it's going to execute atomically.
Okay? And since a chunk may repeatedly fail because somebody keeps
messing with you with your state, you need to create a safe version of
the same chunk somewhere. So that if you fail many times then you can
go off and execute the plain code with locks and what not. All right?
That's what we call the safe version. And there are two key ideas, only
two ideas: the first one is maximize your gains. Second, minimize your
losses. How do you maximize your gains? You want to form the chunks
around the code that you think the compiler can do the most. Okay? So
what are these places? For example, these are pointer intensive areas
where you know that the compiler cannot do anything in a safe manner.
That's where you want to put the blocks, do a lot of transformations
and then do a check before the block finishes and either scrap
everything or continue.
Or in areas of the code that there are many branches but you know there
is typically a path that is highly likely. So you optimize for this
path and then the exits you squash. Or whenever you have many critical
sections that are not contended frequently, rather than having them in
different blocks, you put them together and you remove the locks. And
then the second, to minimize the losses, is to make sure you cut at points where there is likely communication between the threads. So we can allow threads to communicate. What we cannot allow is two blocks
that are concurrently executing to communicate; that's what we're
trying to avoid.
So here's an example of the first optimization, trying to group things
where the compiler can do little. Okay, so look at this thing. This is
a Barnes piece of code where I have a while loop and I have lots of
pointers, if statements and so on. And in each iteration, I have two
critical sections that grab the same lock but the lock is different in
each iteration. Okay? But there is lots of -- This is a very
complicated thing. You have to [inaudible] a couple of pointers at
offsets and so on so lots of instructions here.
We wonder: if we put this in a chunk, can we hoist code out of the loop? Okay? So the first thing we're going to do is lock elision. Lock elision is an optimization that removes the lock and unlock and replaces them with a plain read in a while loop -- a plain load of the lock to see if the lock is free. If the lock is not free, it keeps spinning; otherwise, it continues. This is
called lock elision. Why do I have to have that and not just remove the
lock and unlock? If all the threads were executing a block then I could
remove the lock and unlock because they either fully succeed or they
fail. Right? But because I'm going to have this safe version of the
code and sometimes some of them will fail and go and execute the safe
version, in the safe version one of them will grab the lock, so I need
to make sure that they don't execute this while somebody has the lock.
So I'm going to check. If somebody has the lock, I'm going to spin here. When that guy releases the lock, I'm going to get squashed and restart. Okay? This is a common optimization people do.
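As a rough illustration of what that transformation produces, here is an elided critical section in C. The chunk_begin/chunk_end intrinsics and the lock helpers are hypothetical stand-ins -- in the bulk multicore the hardware forms chunks implicitly -- so this is a sketch of the idea, not the compiler's actual output.

```c
/* Hypothetical intrinsics and lock primitives, declared only so the
 * sketch is self-contained; they are not a real API from the talk. */
extern void chunk_begin(void);
extern void chunk_end(void);
extern void acquire_lock(volatile int *lock);
extern void release_lock(volatile int *lock);

typedef volatile int lock_t;          /* 0 = free, nonzero = held */

/* Speculative version: the lock is elided inside an atomic chunk. */
void update_node(lock_t *lock, int *node)
{
    chunk_begin();                    /* hypothetical: start atomic chunk */

    /* Elided lock: instead of acquiring it, just load it and check that
     * it is free.  The load puts the lock word in the chunk's read set,
     * so if any other thread writes the lock (acquiring or releasing it
     * in the safe version), this chunk is squashed and re-executed. */
    while (*lock != 0)
        ;                             /* spin until the lock looks free */

    *node += 1;                       /* critical-section work, no lock taken */

    chunk_end();                      /* hypothetical: commit atomically */
}

/* Safe version used after repeated squashes: really take the lock. */
void update_node_safe(lock_t *lock, int *node)
{
    acquire_lock(lock);
    *node += 1;
    release_lock(lock);
}
```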
The bottom line is that by removing the locks, now we give this to the
compiler and the compiler says, "Whoa, there's no locks here. I can
start doing common subexpression elimination." And the first thing it sees is that these are different locks at every iteration, so I cannot optimize this thing. However, look at this thing: I need to compute this multiple times inside the same iteration and across iterations. So let me do loop-invariant code motion here. So all the stuff that was in here is
computed just once, at the top of the loop and put in a register. And
then based on that, it just uses the register to access this thing.
Okay? So notice that I have removed a lot of code. In fact, the dynamic
total number of instructions per lock moves from 9,000 to 7,700. Okay,
and some of them are loads and stores. And what happens if somebody
collides with me -- because I have this thing here, if somebody grabs the lock I'm going to get squashed and restart. So that's an example of how this thing works.
Large chunks are beneficial. So that's one thing. Now how do I cut the
chunks to minimize squashes? That's the second part of the
optimization. We need a profiling pass to figure out what are what we
call squash hazards. Okay? So squash hazards is something equivalent to
[inaudible] hazard, but we call them squash hazards because it's something that -- given a synchronization access, you don't know if this is a highly contended synchronization. If it's not, it's not a squash hazard. Squash hazards are operations that frequently cause squashes.
Typically it's the first communication in a code region with multiple
communications. Okay? So typically highly contended synchronizations,
data races, not shared data accesses because those are protected by
locks. So once you find these things, the sync hazards -- or the squash
hazards rather, you transform the code with tailored squash-removing
algorithms. So for certain types of locks you want a certain type of
algorithm that minimizes the squashes. Okay? The goal is prevent two
concurrently executing chunks from communicating on the fly and
typically means cutting the chunk before the hazard. Okay? So
[inaudible] explains this thing. So this is all about the base
technology for chunks. What we have been building for several years now, actually, is a hardware prototype for deterministic record and replay within Intel Labs. And the idea here is to augment the cache hierarchy of a simple multicore with a memory record and replay engine. This is
small -- It's a small thing but what it does is as the programs execute
it basically cuts the execution of the code whenever there is
communication between threads. And then it stores information about the execution -- the number of instructions between communications. Okay? And then after a while it dumps it into a log in memory. Okay? So we collect chunks of instructions, but not speculative chunks -- chunks of instructions between communications -- and then we can replay them. So it's a primitive to recreate past states in a computer. It records chunks of instructions until the next communication. And we hope that this will be the basis of a lot of tools, debugging tools. Because once you have the execution between communications, you can build the tools, say, to detect [inaudible] violations.
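As a sketch of what such a log might contain, here is a hypothetical chunk-log entry. The prototype's real record format is not given in the talk; the fields below are just the information the description implies -- which thread ran, how many instructions it executed before the next inter-thread communication, and an ordering edge to the chunk it depends on.

```c
#include <stdint.h>

/* Hypothetical entry in the memory record-and-replay log.  Each entry
 * says: thread `tid` executed `insn_count` instructions as one chunk,
 * and that chunk must be replayed after chunk `pred_chunk` of thread
 * `pred_tid` (the thread it communicated with).  Field names and
 * widths are illustrative, not the prototype's actual format. */
typedef struct {
    uint16_t tid;         /* thread that executed the chunk            */
    uint32_t insn_count;  /* instructions until the next communication */
    uint16_t pred_tid;    /* thread this chunk depends on              */
    uint32_t pred_chunk;  /* that thread's chunk sequence number       */
} replay_log_entry_t;

/* Deterministic replay then amounts to re-executing chunks in an order
 * that respects every (pred_tid, pred_chunk) edge, e.g.:
 *
 *   for each entry e in log order:
 *       wait until thread e.pred_tid has replayed chunk e.pred_chunk;
 *       run e.insn_count instructions of thread e.tid;
 */
```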
Okay, so I want to just -- Let's ignore this thing. I want to just talk
about a couple more projects, architecture projects that we have at
UIUC as part of UPCRC. One is DeNovo; this is my colleague Sarita Adve's project. What it does is hardware for disciplined parallelism. So the idea is, if I knew that my code followed determinism -- you know, the simple code that DPJ, Deterministic Parallel Java, produces -- then I could tailor my hardware to be much more efficient. Right? So if I know that I have structured parallel control and explicit effects, then I can simplify the hardware. I can simplify the cache-coherence protocol. I don't need to worry about invalidations to the other processors, because only one task can be writing a given piece of data at a time. All those transient states are removed. I can optimize the protocol better and easier. And then many of the
things that currently cache-coherence has, invalidations,
acknowledgements, indirection through directory, even false sharing is
unnecessary because you know at all times that the data that you want
nobody else wants. Okay? So you cannot have false sharing. And the
second project is Rigel. Rigel is a project that my colleague, Sanjay Patel, has been working on, also for UPCRC. And the idea is to build a thousand-core chip for visual computing. So these are very simple cores and they don't have cache coherence. And because the applications are so regular, you're able to optimize the transfer of the data and it becomes very efficient. Okay. So that's a summary of the work on architecture at UPCRC. Thank you.
>> Juan Vargas: Thank you.
[ Audience applause ]
>> Juan Vargas: Do we have questions for Josep? Well then thank you
very much, Josep. And let's move to the next speaker, Krste Asanovic.
He's going to talk about 21st century computer architecture research.
>> Krste Asanovic: Thank you.
[ Silence ]
>> Krste Asanovic: Okay. Is the microphone on? Can you hear me? Okay,
good. All right, so -- Because I only have 20 minutes I thought I'd
just pick one fun thing to talk about. So this is a like a mini-rant
about architecture research and looking ahead at what we'll need to do
in architecture research, so what's the next big challenge? Well,
technology's slowing down/stopping; everybody knows this. There's no
device technology out there that's going to save you. You know, sorry.
You know, end of technology scaling. Parallelism is a great idea but
it's a one-time gain. Right? Basically there are two tricks we've played to use parallelism to save energy. You either use simpler cores, but that's pretty limited by how small you can make the core; you can only make it so simple. The other one is running at a different operating point -- lower Vdd, lower frequency -- and making it up through parallelism. But that's limited by Vdd/Vt scaling and just the errors you get as you lower the Vdd. So that was the one-time gain from parallelism. That's
great. Now what?
Everybody seems to agree that we're going to improve efficiency with
more specialized hardware. And I agree with that as well. But when we did Par Lab, we spent a lot of time working on the software stack. You know, actually most of the money went to the software stack, unlike most parallel computing projects. But now really attention is
focused back on the architecture because we're going to have to change
the architecture substantially if we're going to get more energy
efficient processors.
So we've got to ask these architects to go do research in, you know,
very efficient specialized architecture. So how is research done today?
This is a mini-rant. A little while ago I went and looked up one of the
ISCA proceedings; this is actually 2010. This is our top conference in computer architecture. And I analyzed, you know, how do people evaluate their architecture ideas? And I kind of broke it down; this was just me doing it, so the caveat is I could be doing this wrong. But I went through all the papers, and for about two-thirds of them the way they were evaluating I thought sort of made sense for the ideas. In some cases the papers had no numbers in them at all, which surprised me -- reading ISCA you usually see lots of bar graphs, but a few papers had no numbers in them. A bunch of them actually used some real machines. Some are
working on some new device technologies. It's very hard to build models
at all, so I gave those guys a free pass. You know? It's very hard to
do work in those areas. Some papers were about outer memory system or
traces and whatever. It seemed to work fine, and I didn't have a
problem with them. But the ones I was really focusing on was, you know,
"We're going to build more specialized architecture. There's going to
be new kinds of pipelines, new kinds of caches." So I looked at those
papers. And about a third of them were in that kind of area, like
inner-workings of the core. And of those I took a look at them and,
you know, only two of them actually had RTL down to the level of do you
know where all your bits are, do you know where all the wires are.
Right? Only two of the papers had that.
One of them was a Stanford paper which used Tensilica. They didn't actually use the RTL, but they used a model that was built from the RTL Tensilica has. And they were happy that it was within roughly 30% -- having done the RTL, they back-created a model, and it was within 30% or so is what they claimed.
There's one industry datacenter paper based on a product, so that was
okay. So the other 16 were academic C simulations without real design
and a lot of improvements in the kind of 20% range.
So do you see a problem here? You know, doing cycle counts. The cycle
counts they got may have been somewhat representative. But remember,
these are the papers that are focusing on pipeline [inaudible] cache
design. These are really focusing on that part of the thing and they're
doing C simulations of that. And remember the guys who had the RTL and
built the model, they're only within 30%. Right? So, you know, what's
going on here. And cycle-time/area/energy? Some of the guys, you know, the way they modeled area was to take some die photos, which are usually incorrectly labeled, from ISSCC proceedings or whatever, and then use that to build their area model. Right, just, you know, completely bogus.
So my take on this was most of these papers the evaluations were
completely bogus. Actually they were a waste of paper. Not to say the
ideas were bad; the ideas could've been great. So I'm not commenting on
the ideas just on the evaluation methodology. Right? So again the ideas
could've been fantastic. I don't want to upset my colleagues. It's just
-- I know why they did this but the evaluations are probably completely
bogus.
And this is a bit worrying because now we move into this era where it's
all about energy efficiency and building the most specialized cores. So
this is a, you know, famous painting. You know, this is not a pipe.
It's a picture of a pipe. And, you know, those models are not really processors. They're not pipelines. Right. They're just models of them.
So we have to get a bit more real. Now it's hard. Like new computer
architecture models are really hard to create. Cycle counts: you need a
microarchitecture. You need to actually have a microarchitecture and
long runs. Cycle time/area/energy: lots of interactions, how you design
the thing, process technology, design. Another thing is design space is
really important. Why do you think your one design that you put up as a
candidate is actually a good point in the design space of that family
of architectures?
Now industry has a big advantage. They keep doing the same designs over
and over again. So they get very good modeling because they're
basically -- they have all the software running on last generation's chips, with counters and everything. They can really tune their new models to match it well. And they have real layout. They have real designs. And usually -- you know, sorry Intel, but your cores are basically tweaks. You know? They've looked the same since, you know, Pentium Pro days almost. You know how to build that stuff so you can build
pretty good accurate models. But doing far out research in academia
where you don't have that experience, how are you going to get anything
reasonable as a model? You don't have that luxury. Well, basically I can't see any other way -- I'd love to know if there's a better way -- you have to do real designs; you have to actually go design it for real.
So my claims as to why you have to do this: you can't really create or
even use an existing model correctly if you haven't built processors.
And, you know, undergrad hardware class projects are a good start, you
know, so we start educating architects but you have to keep doing it to
get good. And you won't get this experience by modifying sort of C-based models.
Another claim is that only bad models are actually easier to build than
actual designs. Good models are harder to build than real designs, all
right, because you have to build lots of actual designs to build a good
model. Right? So you kind of have to do real designs.
Another thing I'll put up is this slide. A little experience I loved
from when I was at MIT. We taught this class with [Inaudible], a simple
hardware design class. So we gave the students a little project. One of
the labs was do a 2-stage RISC pipeline. You know? Design the RTL,
synthesize it out. And we thought it was such a simple task -- We even
gave them bits of the code. We thought it was such a simple task all
the students would have basically the same answer coming out of this
design lab. Right? This was the result. Right, so this axis is clock period in nanoseconds. The designs ranged from 3 nanoseconds to 12 nanoseconds cycle time. Right? That's a factor of 4. Area, this is going from ninety thousand square microns up to a hundred and fifty thousand, so, you know, a bit less than a factor of two difference in area. This is for just a 2-stage RISC pipeline where we'd
given them a lot of the code already. All right? So first thing this
tells you, you know, what's the best design on here?
>> : [Inaudible].
>> Krste Asanovic: Right. So what are bad designs on here?
>> : [Inaudible].
>> Krste Asanovic: Anything that's [inaudible], you know, off the pareto-optimal curve. So there's a lot of contrasting design points. Right? So you have this pareto-optimal curve. So now go back and think about people doing these architecture papers. And your student sits there and does a prototype. So this is the pipeline modeling his core. How good do you think your student is? Which of those points did they put in [inaudible] for the baseline or for their candidate work, right?
Right? How would you know? How would you know how well they've done?
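To pin down what "pareto-optimal" means for those student designs, here is a small sketch that filters a set of (cycle time, area) points down to the non-dominated ones. The numbers are made up, just roughly spanning the 3-12 ns and 90,000-150,000 square-micron ranges quoted above; they are not the class's real data.

```c
#include <stdio.h>
#include <stdbool.h>

/* A design point from the class project: cycle time and area. */
typedef struct { double period_ns; double area_um2; } design_t;

/* Design a dominates design b if it is no worse in both metrics and
 * strictly better in at least one. */
static bool dominates(design_t a, design_t b) {
    return a.period_ns <= b.period_ns && a.area_um2 <= b.area_um2 &&
           (a.period_ns < b.period_ns || a.area_um2 < b.area_um2);
}

int main(void) {
    /* Illustrative student designs (invented numbers). */
    design_t d[] = {
        {3.0, 150000}, {4.5, 120000}, {6.0, 95000},
        {8.0, 140000}, {12.0, 90000}, {10.0, 130000},
    };
    int n = sizeof d / sizeof d[0];

    /* A point is on the pareto-optimal curve if no other point
     * dominates it; everything else is simply a worse design. */
    for (int i = 0; i < n; i++) {
        bool on_front = true;
        for (int j = 0; j < n; j++)
            if (j != i && dominates(d[j], d[i])) on_front = false;
        printf("%5.1f ns, %8.0f um^2 : %s\n", d[i].period_ns, d[i].area_um2,
               on_front ? "pareto-optimal" : "dominated");
    }
    return 0;
}
```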
The other thing is, there is a big design space so any simple -- This
is a very simple processor. Look at the big design space we got. The
more complicated ones, specialized architectures, you've just amplified
design space by orders of magnitude. Right? So getting a good design
point, you have to do design space exploration. So, you know, what
we've been working on at Berkeley is can we make it easy to develop
lots of real designs to do the design space exploration? Construct real
RTL: if you have the actual RTL, actually know where all the wires are,
know where the bits are, you can get accurate cycle counts. And if you
can actually go through the layout -- In our case we're just using
synthesized layout -- you get cycle time, area, energy. Right? You need
to have real world physics pushing back at you when you make a design
decision. Another thing to do to get cycle counts for long-running
programs, we generate FPGA models automatically as well so we can
actually run things for longer. But the big thing is by doing real
designs you educate students to be the next generation of architects,
actually understand how to build things.
A little unfortunate now, sometimes I talk to people and it's clear
they've never built something and their advisor has never built
something. And maybe their grand-advisor never built anything. All right? So there are some very simple concepts that just don't -- they don't understand why this is bad or why this is good. So that's unfortunate. So we want to actually train people.
So to help us do all this, we've built this language called CHISEL. It
stands for constructing hardware in a Scala-embedded language. This is an embedded DSL. It's kind of [inaudible] but embedded in Scala. We built a hardware-description language, and basically hardware is just a data structure in Scala. And the really great thing about this is we can use the full power of Scala, which is a nice language; it has all these nice language features to write generators, but also we use it to layer higher levels of language description on top of this base level. So we can do some very powerful things that build generators. So Jonathan
Bachrach at UCB, he's the main developer of Chisel. So what does Chisel
look like? Well if you have a design description in Chisel, the Chisel
compiler can then output any one of these from the same description. We
can output very fast C++ code and get a cycle simulator out of that. Or
we can generate FPGA emulation, going through the standard FPGA tools to generate FPGAs. Or we can generate Verilog. We can synthesize to get
real layout that we can go fabricate chips. And we also use that layout
to extract various cycle time energy numbers from designs we did.
Okay, so one of the -- You know, isn't this too hard? Like building
this stuff for real too expensive? Well, first thing, you have to be
able to design. You've just got to be able to -- If you can't design microprocessors, I don't know why you're trying to tell other people how to do theirs better. Right? It's just, I'm sorry, but you have to be able to design microprocessors. Right? Advances in tools make this more
tractable especially in the synthesis tool. So we've been -- Also we've
been working on better libraries and tools and [inaudible] released it
open source. I should have said Chisel's actually out there. You can go
on the website. You can download it. Go to the GitHub project and get
a hold of it. We've been putting up a lot more stuff over the summer,
more cores and things so that people can download and use this.
And the other thing is you don't actually have to fabricate every
design. The point is you design it to the point you almost could
fabricate it. And the tools are pretty good, once you have the layout
you can extract and get very good numbers out of it. All right? But
what I've learned over the years is building chips is actually fun. And
the main reason you build chips these days is morale. It's actually the
number one reason. And credibility. People really like building chips.
So I started a long time ago building chips. This was my thesis chip.
It was a vector chip. I was involved with this vector VIRAM a little
here at Berkeley; I was on the vector chip. You can detect a trend here. So at MIT we built SCALE; it was a vector chip. Notice the gap between these. There's like quite a few years in between these chips. More recently we taped out something at 28 nanometers, another vector-style core we're experimenting with. We just [inaudible] this other one at 45 nanometers. There's another one coming
out in August. What you might notice is there's a lot more chips. All
right? And the same few students are doing all of these chips. All
right, so how are we turning out all these chips? One advantage is all
of them are similar but not the same. They're being used in different
projects. All right? We've been trying to push a more agile hardware
development methodology which basically consists of going through the
entire process automatically and really automating as much of that flow
as possible. So like, you know, there's no such thing as an RTL freeze in our development methodology. We just go, "We'll wade through this stack continuously and keep pushing through [inaudible]." Well, and you might notice these are all different process technologies, which is actually the biggest hurdle. So for building chips, the RTL's the really easy part. That's actually trivial. Physical design is the big challenge, and actually getting these things finished and working and fabbable.
But so it's fun to build stuff. But you don't need to fab to do most of the research. I think this is just -- Actually these are being fabbed for other purposes, like they're doing work on low-voltage resiliency with [inaudible] DC converters, and the other ones with [inaudible] integrated. So the [inaudible] are just kind of an afterthought, but they're needed to run the rest of the experiment, and we get the fabbing as a result.
Someone commented about this, you know, "Is synthesizable ASIC design close enough?" Well, for the handhelds/SoC kind of space, that is how people do designs in industry. There is the Intel class -- you know, Intel, sort of IBM, the stalwarts keep doing this custom design, you know, five or six years of careful engineering, lots of detailed synthesis. Those glory days are kind of passing. Right? The glory days of custom circuit design are over, for many reasons on the slide; I won't go through them. But I think as an architecture researcher it's close enough; you get the insight you need from doing this level of design. It actually matches what you would do in industry for those kinds of cores, the kinds of cores we're going to see coming out.
Anyways, that was a mini-rant on, you know, people should build stuff.
It's a lot easier to build stuff than it's ever been, and we're a lot closer to industry probably than we've ever been when we do this stuff in academia. So people should be doing that.
So I just want to say a little about a bigger project that we've sort
of been starting off based on -- starting from [inaudible], the new
project ASPIRE. So basically starting with applications. You know, how
are we going to -- What's our story on this specialized hardware?
Everybody is working in this area. We're going to leverage all the work we've been doing [inaudible] on patterns and the pattern-based software decomposition. So we started with applications. We break them down into patterns. And we've been sort of pushing this idea of, you know, the way to do heterogeneous processing is to have a central processor with a sort of satellite array of specialized engines. So it's more a co-processor model, rather than the idea that you have a sea of GPU stuff over on the other side of the chip. So augment -- It's kind of like you
have the [inaudible] units on a regular processor. We go even further
with more kinds of engines stacked around the central core. We think
there's a lot of good reasons to build it this way. We call this ESP,
Ensembles of Specialized Processors. The idea is this ensemble of
specialized engines can execute any kind of app you throw at it with
greater efficiency. So you might imagine having a, you know, standard core, an LP engine, and then the idea is these side co-processors are actually targeted for a given pattern. So the, you know, [inaudible] keeps saying these patterns are the things that recur. They're the common operations. So we'll build the engines to match those patterns.
So that's our idea of building something that is more efficient than
regular cores but still has the coverage and flexibility and
programmability to cover the application space.
And what's nice is we already know how to program this, because we built these specializers for each pattern. And building a specializer for a pattern to target an engine designed to execute that pattern, we think, should be pretty straightforward, and we won't have to change the application code, right, whether you have that accelerator there or not. This sort of pattern-specific accelerator is kind of what we've been working on.
So we're going to generate [inaudible] together. And what's nice is we
can actually take these designs and push them all the way down through
either emulation or down to ASIC. And then the [inaudible] important
part is doing this design space exploration, not just around the
hardware but around the whole stack as well. So on the software level
you're autotuning for a given architecture design point, figuring out
the numbers you get there and then iterating the whole loop, so two
levels of design space exploration. So this is kind of what we're
exploring in this next project. How do we drive up the efficiency of
these specialized engines?
Okay. So in summary: you know, we've got to focus on efficiency for 21st century architecture. There's a huge range of specialized architectures. And really I can't see any other way to evaluate them except actually having real layout. So how are we going to produce that? I told you our approach. One of the good things is we're trying to open source all this under BSD, and I think the important thing is to have
people actually study your RTL and say, "Well, your core is, you know,
this really dumb idea in this part of the core. Why don't you do this?"
And you say, "Sure, we'll do that." We're not claiming we know how to
build the best cores, but we'll put them out and host them and have
other people contribute. And hopefully we'll drive towards a commonly
agreed good baseline design from one of these design points.
Yeah, and this last thing: it'd be great if, you know, people did
papers that actually improved fabricate-able designs. That would be
really good. Okay, that's it.
[ Audience applause ]
>> Juan Vargas: Do you have questions for Krste? Yes, John?
>> : It's more a comment. One of the things that I really like about
Chisel which you didn't actually bring up was there is a common description of your hardware in Chisel, in the Scala language, that allows
you to push it in any of those directions. And so for the design space
exploration, I can [inaudible] on the C++ side and I know that that
same design [inaudible] in RTL or [inaudible]. So that's a real
powerful thing [inaudible] different set of tools for each level
[inaudible]. And this really unifies [inaudible].
>> Krste Asanovic: Yeah, that's actually one of the driving irritations
of previous stuff we did. Robert?
>> : My understanding of most of your methodology is that it deals
toward exploring microarchitectures, but it also lends itself to
exploring [inaudible].
>> Krste Asanovic: Oh, no definitely. So these specialized architecture
-- Those are completely new [inaudible], so very rich in [inaudible].
Like graph engines, [inaudible] graphs and things like that.
>> : [Inaudible] exploring [inaudible] to do is write code in the new
proposed instructions [inaudible] compilers can generate [inaudible]...
>> Krste Asanovic: But that's where our specializers come in. So we're
targeting them through the specializers.
>> : I think you have another advantage going on here. You know,
there's lots of different specializations you showed there. For
example, sparse and graph. Maybe they're not different. How do you find
out? Well, you design them both I suppose but then you've got to
program them both. And you've got to take the same problem and target
both of those architectures to decide. For example, we don't need any
dynamic type discrimination support in the architecture. We can do it
all in RISC. I remember [inaudible]. Right? So --. But you have an
advantage here with the programming technology you've got. You could
write SILO's that would go to both targets and compare them from the
same source code to both of the [inaudible] without nearly as much
trouble as you would have inventing new languages and recoding things
and other [inaudible]. I think [inaudible] pretty interesting.
>> Krste Asanovic: Okay. All right.
>> Juan Vargas: The last speaker today is Tim Mattson from Intel. So Tim is going to end the session with fireworks, talking about the many-core processors at Intel. He's going to have message passing for the future. And
after this we will get some chairs and start the panel. And thank you
very much for staying with us for so long. And it has been a very long
day but I hope you enjoyed it and I hope you had a lot of good
insights. So, Tim?
>> Tim Mattson: Okay. Thank you. So this is really what I would like to
talk to you about is kayaking and kayak surfing. But, no.
>> : [Inaudible] for that?
>> Tim Mattson: Yeah. But we won't. Have to have this disclosure,
though. You know, I work for Intel but these are my views not Intel's
views necessarily. You will learn absolutely nothing about any Intel
product from what I have to say. This is a team effort but if I say
anything stupid, I own it not my teammates. And I want to emphasize,
it's my job to challenge dogma and to explore alternate realities. So
don't for a second think that I'm telling you about any kind of future
Intel product. Period. All right.
So this is my favorite slide. And I apologize, there are some of you
here from Berkeley who've heard me give this talk many times. You guys
can go ahead and lay your head down, take a nap. I won't be offended.
But this is a slide I pulled out of an Intel executive's deck from
2006. But I just love it because it's talking about this great vision
of many-core and how we're going to have -- You know, we had the
single-core. Now we've got dual-core, and now in the future we're going
to have these lots and lots of cores with a shared cache and a local
cache and how cool this is.
But notice the implicit assumption, just the automatic assumption that
of course there's a cache-coherent shared address space. Now I want you
to think back about the talks you heard today. We heard about heroic
research to try and find races. We heard about bizarre tool chains to
try and prove that you could do lock elision. We heard about this
weird bulk multicore architecture which could do rollback if you had a
memory conflict. All this complexity, this insane complexity, so that
we could have a shared address space. I think a shared address space
perhaps is just a big mistake. The fact of the matter is that if you
get expert parallel programmers -- And I've been there; I know what I'm
talking about here. You get expert parallel programmers who have been
doing it for decades and ask them to explain the relaxed memory models
they work on. Ninety-nine percent of the time they will get it wrong.
All right? So we're going to build our future on a model the experts
can't understand? How smart is that? I think that's pretty stupid. Also
the only programming models that we know for a fact will scale to hundreds or thousands of cores for non-trivial -- I mean non-embarrassingly-parallel -- workloads, the only proof points we have, are based on distributed-memory, message-passing-style programming. And furthermore -- I know people tell me I'm wrong. I know architects who are very, very clever tell me I'm wrong. That's fine. They have that right, but I'm the one up here speaking now. As you add the circuitry and the chip area and the power to manage that shared cache-coherent state, that's not scalable. Amdahl's Law is a real law. It's going to eat up
some of your overhead. It's going to become expensive.
So at some point I think we're going to have to get rid of it anyway.
So I want to ask the question, maybe, just maybe, we should bite the
bullet and recognize that cache-coherent shared address space was a
stupid idea and the sooner we get rid of it the better.
So we had a research program at Intel where we actually said, "Okay,
let's build some chips that are arbitrarily scalable, meaning we can
scale them as far as our process technology will take us, and do not
have cache-coherent shared address spaces." So we built two of these.
We have an 80-core FLOPS monster, also called the Polaris chip, and a 48-core chip, the Single-chip Cloud Computer, probably the worst-named chip in history, but that's what it was called.
But this was created as a software research platform. So let me say a
little bit about these chips, and I'm being very conscious of the time.
And of course I have to credit my collaborators. In particularly Rob
van der Wijngaart who I worked very closely with on all of these
projects. But also the hardware team, you know, Jason, Sriram and
Saurabh have been just delightful to work with. So let me tell you a
little bit about this 80-core terascale processor. The goal of this
project was could we get a Teraflop for under 100 watts, and we
basically did that.
It was a 65-nanometer process which, back when we built this in 2006, 2007, was the leading process technology; 8 by 10 tiles, mesochronous clock. Offline I can go into all sorts of low-level details there, but in the
ten minutes I have left I can't do that. But the interesting part to me
was what the cores looked like on this chip. Now this is not a general
purpose processor. Of course not. But from a point of view of an old
HPC hacker, I love this chip. It's marvelous.
Okay? Two floating-point multiply-accumulate units, a five-port router so I could do XY routing between tiles, and I could go directly into the data memory or the instruction memory. So I could write things into the memory, whether it's instruction or data, without interrupting the core. If you want to build highly, highly scalable architectures, that's a wonderful feature. And if we went through all the little tiny itty-bitty numbers here and added them all up, what you would find is that this is a perfectly balanced processor. Meaning, I can move stuff from the instruction memory fast enough to drive the chip at peak speed. I can pull things out of the data memory fast enough to keep those floating-point units fully occupied. So it's a very well balanced chip. There's
no division. There's no integer unit because, you know, those are for
weenies. If you're a real programmer, all you need is a floating-point
multiply add. The interesting thing is there are 256 96-bit
instructions per core. This is not cache; it's memory. And I could hold
a just stellar 512 single precision words per core. So this isn't a
general purpose chip. All right? We know that.
So we wrote some kernels for it. I won't call them applications but,
you know, we got a stencil kernel going. We got a matrix multiply
going. And we said it was impossible to do 2D FFT's so, of course,
Michael Frumkin working with us had to do a 2D FFT just to show us how
wrong we were. But we got this stencil code running at a TeraFLOP for
97 watts. We're cool. We rock.
For me this was really cool, because I was involved on the first TeraFLOP supercomputer in 1997, where we had one megawatt of electricity and 1600 square feet of floor space. And ten years later we're at 97 watts in 275 square millimeters. That's pretty cool in one person's career in ten years. I think it's pretty cool.
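Just to put numbers on that comparison (taking the quoted figures at face value, and noting the 1997 number was for a full machine running applications while this is a single kernel on one chip), the efficiency jump is roughly four orders of magnitude:

$$
\frac{10^{12}\ \mathrm{FLOP/s}}{10^{6}\ \mathrm{W}} = 10^{6}\ \mathrm{FLOP/s\ per\ W}
\qquad \text{vs.} \qquad
\frac{10^{12}\ \mathrm{FLOP/s}}{97\ \mathrm{W}} \approx 1.0\times 10^{10}\ \mathrm{FLOP/s\ per\ W},
$$

that is, about a 10,000x improvement in FLOP/s per watt over those ten years.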
But the problem with it, of course, is that it was a stunt. We know that. It was a stunt. You know? No one could do any real serious software with a chip like that. So the next one in the family, the 48-core chip, was built as a software research vehicle. Now, there was the talk where Krste Asanovic talked about Intel with its hand layout and all that stuff. That's indeed what we do with our products. But for this we
wanted to go directly synthesizable off the RTL. And there are only a
few cores we had that you could directly synthesize from the RTL when
we went to tape out this chip. And the one core we had to do that with was the P54C. The P54C is the original Pentium. Ancient core, we know that. But the idea is we wanted something we could grab off the shelf, drop in there, x86 so people could write real software, and it's a message-passing architecture. Now here's the interesting thing about this chip: we have 2 cores per tile, 24 tiles. They have their regular cache architecture for the individual core. Then there's a message-passing buffer, which is a scratch -- it's a terrible name, message-passing buffer, because what it is, is a scratch space. It's scratchpad memory. So to the programmer this is what the chip looks like.
I have 48 cores with an L1 and L2 and Private DRAM. Then I have this
message-passing buffer which is a high speed, on die, on chip scratch
memory space. And then I have this off-chip shared DRAM but there's
absolutely no cache coherence to that off-chip DRAM. So what we're
trying to do with this chip is explore an alternate design where you
have some shared memory. But remember a lot of the problems with shared
memory comes not from sharing the memory, it comes from the shared
address space with accidental sharing where you can accidentally
stumble over each other's addresses. Here all of that is managed at
software. There is no non-scalable cache coherency protocol. This is us
trying to have our cake and eat it too. There's just a little bit of
shared memory but without hopefully the bad parts. So we built this
chip. And another thing interesting about this chip is we have explicit control of the power. We have voltage control of -- I'm sorry -- frequency control at the tile level. You can individually vary the voltage on the interconnect. And then you have these blocks of 8 cores per voltage domain, so I can vary the voltage on these voltage domains. And all of that is exposed to the programmer. So we can do research on
people experimenting with explicit control of the voltage and the
frequency. Marvelous research platform. And in a long version of this
talk I would go through and go through research results and talk about
it. Invite me to come back some time and I can talk to you a whole
bunch about that.
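To give a feel for how software-managed sharing on such a chip looks, here is a minimal sketch of a flag-based handoff through an on-die scratchpad. The mpb_put/mpb_get/mpb_flag_* calls are hypothetical placeholders, not the SCC's actual communication library; the point is only that all sharing and synchronization is explicit in software, with no coherence protocol underneath.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scratchpad (message-passing buffer) API, declared only
 * so the sketch is self-contained.  These are placeholders, not real
 * SCC library calls.  The hardware keeps this memory non-coherent, so
 * ordering and visibility are entirely the software's job. */
extern void     mpb_put(int dest_core, size_t offset, const void *src, size_t len);
extern void     mpb_get(void *dst, int src_core, size_t offset, size_t len);
extern void     mpb_flag_set(int core, size_t offset, uint32_t value);
extern uint32_t mpb_flag_read(int core, size_t offset);

#define DATA_OFF 0     /* where the payload lives in the receiver's buffer */
#define FLAG_OFF 4096  /* where the "data ready" flag lives                */

/* Sender: copy the payload into the receiver's scratchpad, then raise
 * a flag.  The handoff is explicit; nothing is shared by accident. */
void send_block(int receiver, const double *payload, size_t n)
{
    mpb_put(receiver, DATA_OFF, payload, n * sizeof(double));
    mpb_flag_set(receiver, FLAG_OFF, 1);     /* "your data is ready" */
}

/* Receiver: spin on the flag in its own scratchpad, then copy the data
 * into private memory and clear the flag for the next round. */
void recv_block(int self, double *out, size_t n)
{
    while (mpb_flag_read(self, FLAG_OFF) == 0)
        ;                                    /* explicit software wait */
    mpb_get(out, self, DATA_OFF, n * sizeof(double));
    mpb_flag_set(self, FLAG_OFF, 0);
}
```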
But in the five minutes I have left, I want to talk to you about
something else. So I hear again and again, shared memory is so much
better than message passing. Shared memory is just the only way to go.
And the people who say that I submit haven't written much if any
message passing code and shared memory code. Or they base that
conclusion on the following: they take a matrix multiply code, matrix
multiply kernel, and they run it. Or they take some other trivial
little toy program. All right? What I want to point out is I've done
professional software development of products that use both message
passing and shared address space, and you measure from when you started the project to when you deliver optimized, validated code. Not little toy demonstration code, I mean real applications. And what you
find is with message passing, by plotting effort over time, you indeed
have a brutal learning curve. And if you never do the optimization
validations, you're just looking at this beginning piece of the curve,
then indeed you're going to conclude that shared address space
programming is much easier, because you can just sort of, you know, add a directive here or there or do a couple of fork-joins. It's real easy to sort of sneak in that concurrency control. Whereas in the message
passing, I have to break up my data structures, decide how they're
going to be distributed. I got to do a lot of work up front. But the
point is, and here's the key thing, when you look at real code that you're going to deliver [inaudible] an application that must be optimized and validated, then after you've done that initial parallelization you have to go through and optimize and validate. And I'll tell
you, once you've broken up those data structures into chunks, because
so much of optimization is managing data locality, that job's pretty
much -- you've done the hard part of it already.
What we find with shared address space programming is you don't do that
work up front but by the time you get it optimized you've done that
work. You have to cache block. You have to break things up. So I
submit, look, just do it up front. In the long run you're better off,
and you don't have race conditions. If you're disciplined in how you
use the message passing -- And I could tell you exactly what I mean
about that, it's basically don't use wild cards. If you're disciplined
in your use of message passing, you write code that is almost
automatically race-free. Whereas of course with shared address space programming, proving you're race-free is NP-complete, and then you change the data set and the program that your tool told you was race-free is all of a sudden full of races.
So I submit that someday we as a community will wake up and go, "I
can't believe we were insane enough to ever push this shared address
space cache coherence," and we will all change to a message passing
road.
And I want to -- Okay, so I want to emphasize two things and then I'll
be done. And I'm watching the clock; I'm going to be done exactly at
four-thirty. All right.
>> : You're always going to be last.
>> Tim Mattson: I know. That's what I get for that. A lot of people
criticize message passing because they say, "Well, gosh, no one wants
to put all these sends and receives in their programs." Well, if you
look at a serious message passing program there aren't very many sends
and receives. And I have an example project here in the slides I can talk to you about, where they build a linear algebra library. But what they do is they break their job down and decompose it into panels of matrices. And they're doing dense linear algebra, so they have domain-specific collective communications: all-gathers, reduce-scatters. They called it send-receive; really,
exchange is a better name.
The point is that in the real application you have collective communications that map to a domain-specific library. There are very, very few instances of MPI send, MPI receive -- few to almost none. And so I'm telling you, message passing, if it's done right, is usually structured -- like in the ScaLAPACK world, they have the BLACS, the Basic Linear Algebra Communication Subprograms. Same thing there. Experienced
message passing programmers do not write a lot of sends-receives. They
do these global, these collective communication operations. It's
actually not as hard as many of you think. Give it a try. You might
find you like it.
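To make the point about collectives concrete, here is a minimal MPI sketch in the spirit of those panel exchanges: each rank owns a panel, the panels are gathered with one collective call, and partial results are combined with a reduce-scatter -- no point-to-point sends or receives, and no wildcards. The panel size and the computation itself are illustrative, not taken from the library described in the talk.

```c
/* Minimal MPI example: collectives instead of explicit send/receive.
 * Compile with mpicc; the "panel" computation is made up. */
#include <mpi.h>
#include <stdlib.h>

#define PANEL 256                       /* illustrative panel size */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *my_panel   = malloc(PANEL * sizeof(double));
    double *all_panels = malloc((size_t)nprocs * PANEL * sizeof(double));
    double *partial    = malloc((size_t)nprocs * PANEL * sizeof(double));
    double *my_result  = malloc(PANEL * sizeof(double));
    int    *recvcounts = malloc((size_t)nprocs * sizeof(int));

    for (int i = 0; i < PANEL; i++) my_panel[i] = rank + i * 0.001;
    for (int p = 0; p < nprocs; p++) recvcounts[p] = PANEL;

    /* Everybody gets everybody's panel -- one collective, no sends. */
    MPI_Allgather(my_panel, PANEL, MPI_DOUBLE,
                  all_panels, PANEL, MPI_DOUBLE, MPI_COMM_WORLD);

    /* Each rank computes a partial contribution for every panel
     * (a stand-in for the local dense linear algebra work). */
    for (int p = 0; p < nprocs; p++)
        for (int i = 0; i < PANEL; i++)
            partial[p * PANEL + i] = 0.5 * all_panels[p * PANEL + i];

    /* Sum the partials and hand each rank back just its own panel. */
    MPI_Reduce_scatter(partial, my_result, recvcounts,
                       MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    free(my_panel); free(all_panels); free(partial);
    free(my_result); free(recvcounts);
    MPI_Finalize();
    return 0;
}
```

Note there is no MPI_ANY_SOURCE anywhere: with explicit collectives and fixed partners, the communication pattern is deterministic, which is the disciplined, almost automatically race-free style the talk is advocating.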
And then the other thing I close with is Barrelfish. I love Barrelfish.
All right, why do I love Barrelfish? Because they take the concept of
an operating system that's split between a host and devices, and it's a message-based operating system that gives you a consistent view across both host and device. It's fundamentally based on message
passing not shared address space underneath with cache coherence and,
therefore, it maps onto these modern heterogeneous platforms. I love
it. And I hope to spend a lot more time with it. It's four-thirty. I
have to finish. Thank you.
[ Audience applause ]
>> Juan Vargas: Well, that was a very nice way to wake up again.
[ Inaudible audience talking and laughter ]
>> Tim Mattson: We have a panel.
>> : That's okay. Let's [inaudible]...
>> : It's good that you don't speak for Intel because all of the Intel processors are cache coherent.
>> Tim Mattson: You notice how many times I said at the beginning I
don't speak for Intel.
>> : So I'm sorry, Tim, but I have to [inaudible] the following: so
I've gone on the record many times saying, "Shared memory is the
world's best compiler target and the world's worst programming model."
>> Tim Mattson: Okay.
>> : Okay? Don't confuse the two. Don't say, "Well, we have to build
our architectures to pass messages because we need the handcuffs to
keep us from killing ourselves with data races and things like that."
How about this? Don't use coherence for everything. Don't use shared
memory for everything. Instead, write in a functional or mostly functional [inaudible] with as little updating [inaudible] as possible, and do everything with messages at the programming level or maybe even at a higher-level map.
But don't blame it on the fact that the hardware underneath is shared
memory. The great thing about shared memory is that it allows load
balancing and it allows dynamic localization. And if you use it for
that, it'll be your friend and you won't kill yourself with cache
coherence. If you use it the way you were using it today, I agree with
you. But we have to stop doing that at least.
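(A small illustration, not part of the discussion, of that point about load balancing: an OpenMP loop in C where the runtime hands out unevenly sized iterations dynamically over the shared address space, and each iteration writes only its own output slot, so there is nothing to race on. The work function and sizes are made up.)

    /* Dynamic load balancing over shared memory, with no shared updates. */
    #include <math.h>
    #include <stdio.h>

    #define N 1000

    /* Hypothetical uneven work: cost grows with i. */
    static double work(int i)
    {
        double s = 0.0;
        for (int k = 0; k < i; k++)
            s += sin((double)k);
        return s;
    }

    int main(void)
    {
        static double result[N];   /* shared output, one slot per iteration */

        /* Idle threads grab whichever chunk of iterations comes next;
         * no thread ever writes another thread's slot. */
        #pragma omp parallel for schedule(dynamic, 16)
        for (int i = 0; i < N; i++)
            result[i] = work(i);

        printf("result[%d] = %f\n", N - 1, result[N - 1]);
        return 0;
    }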
>> Tim Mattson: Excellent point. You may very well be right, though I'm
not ready to necessarily concede it. But I think where you and I do
agree is I'm a software guy. I'm not a hardware guy. What I'm really
talking about is...
>> : Yeah, the software.
>> Tim Mattson: ...the people writing the software should write MPI. They should write message passing. That's what they should write. And if underneath, you hardware guys who are much smarter than me -- I know that -- figure out that the best way to support it is with a shared address space, you go for it. But I think programmers are just not going to get the shared address space straight.
>> Juan Vargas: Okay, so there are more questions.
>> : Sir, I think you maybe [inaudible] examples in the beginning, but
aren't patterns supposed to make -- hide a lot of these details from
most programmers?
>> Tim Mattson: So the question was, won't patterns hide most of these details from programmers? That's the ultimate dream. But let me be really clear on software development and the patterns world, because I'm kind of one of the people pushing that really hard. When I go deeper and I talk about the whole software stack, I also stress the importance of supporting opportunistic refinement, so that the developers on your teams can drop lower. So even though that top-level domain specialist programmer probably never will go below the patterns, there will be plenty of people on the team who need to go all the way down to the lowest level. And so when you look at the whole software development stack, it's not true that people are not going to have to pay attention to the low-level details. And, therefore, it is important what programming model we expose to those people, the efficiency programmers.
>> Juan Vargas: John also had a question.
>> : Oh no, it was [Inaudible].
>> : Yeah, so I mean I think the [inaudible] are kind of like the driver's manual. And what we're really looking for is guardrails. And message passing model can [inaudible] a lot of those guardrails, right? It stops you as a programmer doing the things that are going to really hurt you in the long run with, say, shared memory just...
>> Tim Mattson: Here's the -- It does that. But here's the other thing it does: it forces you to make the hard decisions up front. And what I'm saying is, by the time you optimize a highly scalable code, you're going to make those decisions anyway. You know? I've written and optimized more OpenMP code than I could ever count. And I'm telling you, by the time you're done, you have figured out how the data is going to block and what the locality is. Same thing with Pthreads code. You know? You're going to have to figure all that stuff out anyway. Kind of what I'm saying is, "Okay, do it up front at the design phase when you're starting out. That's the best place to do it. In the long run you're going to be better off."
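(A minimal sketch, not from the discussion, of the kind of cache blocking Tim says you end up doing anyway: a tiled OpenMP loop nest in C. The matrix and block sizes are arbitrary assumptions.)

    /* Blocked transpose: the same data decomposition a message passing
     * version would force you to choose up front. */
    #include <stdio.h>

    #define N     1024
    #define BLOCK 64      /* must divide N evenly in this sketch */

    static double a[N][N], b[N][N];

    int main(void)
    {
        #pragma omp parallel for collapse(2)
        for (int ii = 0; ii < N; ii += BLOCK)
            for (int jj = 0; jj < N; jj += BLOCK)
                /* Each tile stays resident in cache while it is worked
                 * on, and different tiles write disjoint parts of b. */
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int j = jj; j < jj + BLOCK; j++)
                        b[j][i] = a[i][j];

        printf("b[0][0] = %f\n", b[0][0]);
        return 0;
    }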
>> : So I agree with that statement but then you make the conclusion
that I should be doing it in MPI.
>> Tim Mattson: Well, okay, fair enough. Fair enough. My conclusion is
you should do it with a programming model that does not expose a shared
address space.
[ Multiple audience voices ]
>> Tim Mattson: You're right.
>> : Suppose it was a single-assignment language, Tim. Okay? Suppose it
was shared address space, but you couldn't overwrite anything. All you
could do is collect it when it was dead.
>> Tim Mattson: So...
>> : SISAL.
>> : Like SISAL, yeah.
>> Tim Mattson: Like SISAL.
>> : Like SISAL.
>> Tim Mattson: And you know, SISAL was such a successful language. I
mean how many people are writing SISAL right now? Huh? Really?
Interesting. You know, I was involved with PARLOG. How many people are
writing PARLOG right now?
You know, the fact of the matter is we're not going to be able to dictate what language people are going to use, so we have to come up with a collection of abstractions that work for the families of languages people actually use.
>> : Yeah, but if you look at languages that are emerging and are successful and so on -- like Scala, for example. Scala is a very good language for a lot of reasons, but one of them is it has a very powerful functional subset. You don't have to write it in this imperative style if you don't want to.
>> Tim Mattson: Right.
>> : And there are others like that. So I [inaudible].
>> Juan Vargas: So John [inaudible] had another question.
>> : So [inaudible] brought this up before, but it seems to me that there should be a separation between shared memory and cache coherence, and you link the two. And I say that from the standpoint that I've worked on projects where I'm explicitly using message passing to pass tokens to data, but I'm leveraging the shared memory so that I don't have to copy stuff. I think you're misleading people by saying shared memory's evil when it's really shared memory plus cache coherency.
>> Tim Mattson: Right. I try to be careful and I do slip up. So let me be clear: what I'm attacking is a cache-coherent shared address space. The most productive programming model I've ever worked in is Global Arrays. Now granted, that's my application domain, so it's -- But that's a shared memory model.
>> : Right. Right. So that's why I'm saying it's very interesting...
>> Tim Mattson: Yeah, you are absolutely right. Let's be clear. I'm
attacking a shared address space programming model.
>> : [Inaudible] is good.
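(A minimal sketch, not from the discussion, of the pattern the questioner describes: a pointer-sized token is passed through a message-style mailbox built on Pthreads, while the payload itself stays in shared memory so nothing is copied. All names and sizes are hypothetical.)

    /* Pass a token to the data; never copy the data itself. */
    #include <pthread.h>
    #include <stdio.h>

    #define PAYLOAD_LEN 1024

    static double payload[PAYLOAD_LEN];    /* big data, never copied */

    /* One-slot mailbox: the "message" is just a pointer. */
    static struct {
        pthread_mutex_t lock;
        pthread_cond_t  ready;
        double         *token;
    } mailbox = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL };

    static void *consumer(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&mailbox.lock);
        while (mailbox.token == NULL)              /* wait for the token */
            pthread_cond_wait(&mailbox.ready, &mailbox.lock);
        double *data = mailbox.token;
        pthread_mutex_unlock(&mailbox.lock);

        /* Ownership has been handed over; read the shared data in place. */
        printf("consumer sees %f\n", data[0]);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, consumer, NULL);

        payload[0] = 42.0;                  /* producer fills the data,  */

        pthread_mutex_lock(&mailbox.lock);  /* then sends only the token */
        mailbox.token = payload;
        pthread_cond_signal(&mailbox.ready);
        pthread_mutex_unlock(&mailbox.lock);

        pthread_join(t, NULL);
        return 0;
    }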
>> Juan Vargas: So I think we have to stop here. And I learned something from listening to Tim: instead of having breaks, we can just have him talking during the breaks.
>> Tim Mattson: So no one has to listen to me. That's good.
[ Multiple audience voices ]
>> Tim Mattson: Yeah.
>> Juan Vargas: Okay. So now we go to a panel, and we need to put some
chairs so that we can start with the [inaudible].
[ Multiple audience voices ]
[ Silence ]