>> Doug Burger: Good morning. It's my delight today to introduce Kermin E.
Fleming, who goes by Elliott. The E. is for Elliott, and he's visiting us from
MIT as an FTE hiring candidate who is, I think, very intellectually aligned
with a lot of the work going on here at Microsoft, and he's done some really
tremendous work as part of his dissertation, both on systems, FPGAs and
compilation to FPGAs.
So really excited to hear your talk.
And thank you for visiting us.
>> Elliott Fleming: Pleasure to be here. So my name is Elliott, and today I'm
going to talk about how we can scale programs to multiple FPGAs. So before I
get started, I'd like to thank everybody that I've worked with. So basically,
my advisors were Arvind and Joel Emer, and all of these folks here were
involved in the papers covered in this talk.
And here's a bibliography. So the compiler work is categorized under LEAP,
Airblue is a wireless transceiver project, and there are a few other designs
presented.
So let's get started. So basically, FPGAs have traditionally been used as ASIC
prototyping tools. They're drop-in replacements for ASICs. However, recently,
FPGAs have gotten quite large and also much easier to integrate into systems
with PCIe, Ethernet, various I/Os, and so now we can talk about designing big
systems with FPGAs in them as first order compute elements, with a goal to
accelerate some algorithm.
So we have some algorithm that we were running in software. Software is not
fast enough or maybe burns too much power and so we want to run it on an FPGA.
Okay.
The goal here is time to answer. So time to answer has two components. One
is, of course, accelerating the algorithm so it runs faster. But the other is
also to reduce the amount of time that it takes an engineer to build an
implementation. Okay?
The second goal is functional correctness. So here, unlike in traditional FPGA
flows, where we really cared about preserving the behavior of the ASIC that
we're going to produce and we have to make sure it works right, otherwise we've
wasted a lot of money on a mask set, here we only care about functional
correctness. That is, that whatever answer we wanted to compute was computed
correctly. And, of course, as fast as possible.
Okay. So here are a couple of examples of this kind of program. So one is
HAsim, which is a simulator for processors, and the other is Airblue, which is
a framework for building wireless transceivers, which are actually compatible
with commodity hardware so you can talk to, you know, your [indiscernible] base
station.
Okay. And again, the goal here is functional correctness for both of these
codes, as long as we produce the correct answer within some high order time
bound, we're good.
So now that we're writing programs on the FPGA, we can ask the question, what
happens if our program is too large. Remember, the FPGAs, of course, are
structural things so we can, unlike in a general purpose processor, express a
program that is too big to fit on to the substrate.
So here, we're laying down CPUs and eventually, we have too many to fit on the
single FPGA, so what are you going to do? Right? So one thing we can do is
optimize so we can try to make our design smaller. That works to first order,
or we can go out and buy the biggest FPGA we can. Again, you know, these are
patches. But at some point, we have to use multiple FPGAs.
So what does that entail? One, we're going to have to partition our design. So
here, you know, there's quite an obvious partition, right. We just put the CPU
on the other FPGA. Two, we're going to need to map our partitioned design down
onto multiple FPGAs. And finally, we'll have to synthesize some network between
them. And, of course, we can always do this manually. So we can take our
engineers and have them implement this whole thing and they'll run it and it
will probably run quite fast.
However, this can be tedious and error prone, particularly if one is exploring
a design space and needs to change the implementation. And the question is, can
we do this automatically. So the remainder of this talk is going to discuss
how we can achieve this goal automatically.
>>: Automatic is non-tedious and error prone?
>> Elliott Fleming: Yes, non-tedious and not error prone. Right. Probably error
prone is the most important one there. So before we get started on how the
compiler actually works, let's talk about what we should expect when we map a
design to multiple FPGAs.
So more FPGAs mean more resources and just like in a software program when we
throw another core, a better cache hierarchy at a problem, we should expect
more performance.
So one thing in FPGAs, one metric of performance is the problem size that we
can implement and so what I'm going to show is one of the examples actually can
be ten times larger when it fits on multiple FPGAs. So on a single FPGA, we
can fit a 16 core model and on multiple FPGAs, two to be precise, we can fit a
model that can model up to 121 cores. So that's a 10X problem scaling for this
particular problem.
Also, when we give more resources to a problem, just like in software, we
should expect it to run faster. This can happen for a number of reasons. For
example, you get more DRAM banks on multiple FPGAs, but also since you're
asking the tools to solve a simpler problem, when you partition a design,
sometimes they can come up with frequency scaling as well. What I'll show you
is that one of our examples can actually achieve a super linear speed-up when
mapped to multiple FPGAs. So this is performance normalized to single FPGA.
And up is good.
Okay. So in summary, what can we expect? Design scaling, so we can get bigger
designs when we have more resources, more FPGAs. We can get faster run times.
And then finally, although I'm not going to discuss this, we can also get
reduced compile times, because again we're asking the highly nonlinear tools to
solve simpler problems.
Okay. So the good news is, so again our goal is to sort of produce these
implementations automatically, and the good news is that multi-FPGA compilers
exist commercially. So if they were good, we could stop, right? And they
operate on arbitrary RTL, which is also good. The problem, though, is that
they have to preserve cycle accuracy. So what is cycle accuracy?
Cycle accuracy is a bijective mapping between the original RTL description,
which was clocked, right, and whatever implementation we put on the FPGA. So
in the FPGA implementation, there's a precise way to resolve the original
behavior of the RTL. And so kind of what you can see here is that the model
clock, which represents the original behavior of the RTL, is ticked
infrequently. The FPGA clock, of course, is running very fast, and then
between model clocks, we're doing some kind of communication between the chips
in order to preserve this idea of cycle accuracy.
Oh, and feel free to stop me at any time if you have questions.
Okay.
So of course, you can see how this would be useful in ASIC verification because
we want to preserve the behavior of the RTL, because if we make any mistake in
that translation or if our RTL is in any way buggy, we could break our chip.
The problem, though, of course, is the cycle accuracy gives us low performance
so what you can see here is that the FPGA wants to be fast. It wants to run
fast. But because we're having to preserve this cycle accuracy, we're actually
going at a very low speed relative to what we could get out of the FPGA. And
again, this comes from the need for distributed coordination. It also comes
from the fact that there are very poor semantics here.
Here, in maintaining cycle accuracy, we have to transport every bit, just in
case some logic might behave funny, even if a bit is invalid when it was
transported, right. So you can imagine here that if this data is invalid, so
the control here is invalid, right, we still have to transport all the data in
case some point in our circuit might misbehave. So we could have some random
data vector here, right.
But that might cause a bug in our design. We don't know. And if our objective
is verification, of course, we need to preserve that behavior so we can fix it.
Yeah?
>>: Sorry if this sounds like a moderately hostile question, but it seems to
me like you're setting up a little bit of a straw man here. You know, you're
saying, well, I want to synthesize RTL to a large logical FPGA, but then I'm
going to partition it to multiple ones and I've got, you know, comparatively
[indiscernible] slow communication and low bandwidth between them, and so that
won't work well unless I partition the design.
>> Elliott Fleming:
Well, so --
>>: And it really does seem like a straw man, because there's no hope of
getting to that magical point where you can partition any design and have it
run at your FPGA or your --
>> Elliott Fleming: Precisely, and that's why we're not going to partition a
design. We're, in fact, going to restrict designs in a way that leads to good
partitioning. So basically, we're going to give programmers a new primitive,
which we'll talk about in the next couple slides that will enable them to
describe designs in a way that we can easily map.
>>: Okay.
>> Elliott Fleming: So we'll see how that works. Okay. So again here we're
preserving cycle accuracy, but remember what I said. The goal of this new use
case for the FPGA is functional correctness. As long as we get the right
answer, we're happy. So the question is, do we actually need to preserve all of
this cycle accuracy. The answer, of course, is no.
So what I'm going to advocate is this new style of design called latency
insensitive design. The basic idea here is that inter-module communication
occurs only over latency insensitive channels. The idea is to decouple the
behavior of different pieces of the design from one another so that we can
change their implementation. Okay? Changing the timing behavior of a module
then does not affect the functional correctness of the design, right, as long
as the data flows between the modules and that data flow is preserved, then the
behavior of the design will be the same. The functional behavior, right. Of
course, the timing behavior will definitely change.
Many hardware designs already use this methodology, so most hardware designs
are described in terms of these FIFOs explicitly for the obvious reason that
there are many unpredictable latencies in hardware designs. Of course, you
know, hardware designers also want to do design space exploration. Why?
Again, improve modularity, improve design space exploration.
And today, what we do is we simply insert FIFOs, guarded FIFOs, between the
modules in the design, and we don't enqueue data into the FIFO unless there's
room to enqueue data, we don't dequeue unless there's actually data in the
FIFO, and we express our design in those terms. This is a very simple model,
okay?
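To make the guarded-FIFO discipline concrete, here is a minimal behavioral sketch in Python (not the HDL used in the talk); the class name and the default depth are illustrative assumptions:

    from collections import deque

    class GuardedFIFO:
        """Behavioral model of a guarded (latency insensitive) channel.

        The producer may enqueue only when the FIFO is not full, and the
        consumer may dequeue only when it is not empty. Nothing else about
        the FIFO -- its depth or its latency -- is visible to either side.
        """

        def __init__(self, depth=2):      # depth is an implementation choice
            self.depth = depth
            self.buf = deque()

        def can_enq(self):
            return len(self.buf) < self.depth

        def can_deq(self):
            return len(self.buf) > 0

        def enq(self, value):
            assert self.can_enq(), "producer must respect back pressure"
            self.buf.append(value)

        def deq(self):
            assert self.can_deq(), "consumer must wait for data"
            return self.buf.popleft()

A design written only against can_enq/enq/can_deq/deq keeps its functional behavior if a deeper FIFO, or one that crosses an FPGA boundary, is substituted; only the timing changes.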
So let's think about that a little bit. So what I said was we could change the
behavior inside of any module in any way we wanted to while preserving the
functional correctness. But what this implies is we can also change the
behavior of the channels themselves. So if I can change the behavior of the
channel and I have a design described this way on an FPGA, mapping to two FPGAs
is straightforward, I simply stretch the channels between the boundaries.
And, of course, logically, these are still FIFOs, okay? But there's a problem.
I can have lots of FIFOs in the design and not all of them can have this
property, because remember that a compiler, an RTL compiler, sees only wires
and registers. It probably can't even tell that there's a FIFO here.
So semantically, it may see some wires and registers with some logic, but it's
very difficult to even determine that there's a FIFO. And additionally, of
course, thinking about cycle accuracy is difficult. So it's very hard in these
things even to decide whether or not it's safe to add an extra pipeline stage
in a FIFO. But the programmer knows about this property, this latency
insensitive property. He expressed his design this way to sort of get the
benefit of modularity, right.
So what are we going to do? We'll just give the programmer a syntax to
describe these kind of latency insensitive FIFOs. Yeah?
>>: So can you give me a more precise description or semantics of what you
mean by latency insensitive?
>> Elliott Fleming: So what I mean is first, let me clarify that latency
insensitive does not mean that we don't care about latency. So this is a
common problem with when we use the latency insensitive, right. What it means
simply is that we're free to change the behavior of the FIFO. This will come
up. So we'll get to it in a couple clicks, but basically, we're free to change
the behavior of the FIFO, and the programmer is asserting that they've
described the rest of their design in a way that permits us to make this
change.
So, for example, they won't try to enqueue data into the FIFO if the FIFO is
full. So they're leveraging the back pressure on both ends of the FIFO. Again,
this is not the only way to write a FIFO in your design. You're always free to
use the register and wire FIFO.
>>: So is latency insensitive then defined in terms of this particular
implementation technology of FIFOs. Is that the only way to characterize it?
>> Elliott Fleming: I don't think so, but it's hard for me to imagine any
other way of characterizing it.
>>: Asynchronous logic?
>> Elliott Fleming: What's that?
>>: Asynchronous logic?
>> Elliott Fleming: Yeah, you could think of it that way, perhaps. So that's
fair. You could think of it as asynchronous logic and that whole field.
That's sort of what we're doing here. Again, compute is happening on data flow
tokens and we're decoupling the notion of clock from compute. I mean, that's
the fundamental difficulty, right, is in cycle accuracy, you know, clock is the
first order thing. Here we're trying to remove the notion of clock so that we
can perturb the design in ways that are beneficial to the programmer. Yeah?
>>: So indeed, if you do have these enqueue commands on one end and the data
valid [indiscernible], I think you were saying, how would you have a latency
insensitive channel that you can't -- that you can't stretch?
>> Elliott Fleming: So the question is why aren't all FIFOs latency
insensitive?
>>: Exactly.
>> Elliott Fleming: The simple reason is that you may make assumptions about
the particular implementation of a FIFO. For example, you may make the
assumption that this FIFO has a depth of one. That is, when I enqueue something
into it, it will be full, and maybe you'll use that control logic to determine
some other things in your pipeline.
So, for example, you may say if this thing is full, I will issue some other
request.
>>: So FIFOs don't have proper flow control?
>> Elliott Fleming: No, the FIFO may have proper flow control, but you may
make some assumption about the buffering, for example. You could also make an
assumption about the latency. But I think the more common case, at least in
the designs that I've worked on, is you make some assumption about the depth of
the FIFO. For example, that it has one or two buffer slots, and you write
other logic to expect that.
And so actually adding more buffer slots than perhaps one or two would break
your design. Because that assumption that you baked into the logic is no
longer true, right.
I mean, you could imagine having a single entry FIFO with flow control, you
know, not full, not empty, right? And then using that assumption that it's got
a single buffer slot in it to actually implement some other logic. I mean,
people do that.
The basic issue is if you allow people to leverage that assumption, then it
makes it very difficult to make the kind of changes that I'm going to propose
in the next set of slides.
>>: Seems to me that the thing that you want is to find large regions of code
that have no recurrences.
>> Elliott Fleming: Large regions of code that have no recurrences.
>>: In other words, if -- think of it like a pipeline, you know. You're
partitioning and the degree of latency insensitive is the amount of compute you
can do that decoupled through a FIFO before you have to go back and close the
loop.
>> Elliott Fleming: That's right, so feedback. That's right.
>>: And I know this is a pipeline for decades, and so if your tools can
analyze your design, find the partitions, talk back to ear regions
[indiscernible] partitioning, that's how you can actually map this. Are you
taking an approach like that?
>> Elliott Fleming: So we haven't studied mapping, right. So mapping hasn't
become a problem for us yet. I'll talk a little bit about how we do mapping,
but it's quite naive. But you can imagine some approach like that being
necessary as a refinement to this.
However, I'll also point out that generally speaking, even if there is feedback
in a pipeline, very often hardware pipelines will have a pipeline depth that is
sufficient to cover the latency of inter-FPGA communication. This is certainly
true in DSP algorithms. I think it will be true in others. So, for example,
HAsim, because of the way it's implemented, kind of a time multiplex pipeline
will actually have enormous potential to hide the latency with useful work.
The latency of communications.
>>: I'm just thinking about arbitrary applications.
>> Elliott Fleming: Sure.
>>: And general underlying approach.
>> Elliott Fleming: Yeah, so generally speaking, yes, you would want to try
not to partition across feedback paths too often, I think. But we don't have
any way of doing that automatically now.
Okay. So anyway, here's the syntax. Basically, one frames one's design in terms
of sends and receives. At compile time, the compiler will choose an
implementation. One example of an implementation is just the vanilla FIFO that
you could have written anyway.
On the other hand, you may choose to synthesize a complicated network. Again,
depending on placement and other design goals. And again, here we have an
explicit programmer contract. When the programmer writes down the send/receive
channel, right, he's willing to accept unspecified buffering and unspecified
latency. And he's guaranteeing that he's written his design in a way that will
admit that choice by the compiler.
So, of course, you know, the programmer can write a buggy design and the
compiler will happily generate a buggy implementation. It's more of a
programming tool in that sense, okay? However, generally speaking, we found
that this primitive is pretty easy to use. Often, you can take FIFOs in your
design and just substitute them out. And it can be a simple substitution, and
this has been our experience for most FIFOs, except the ones which I was
describing to you in the back, where you're using the depth of the FIFO for
control.
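One way to picture the send/receive contract (a rough sketch, not the actual LEAP syntax; all names here are hypothetical) is as an interface with interchangeable bindings that the compiler selects at build time, a plain FIFO when both endpoints land on one FPGA or a network endpoint when they don't:

    class Channel:
        """Latency insensitive channel: unspecified buffering, unspecified latency."""
        def try_send(self, value):        # returns False when back-pressured
            raise NotImplementedError
        def try_recv(self):               # returns None when no data is ready
            raise NotImplementedError

    class LocalFIFOChannel(Channel):
        """Binding used when both endpoints are placed on the same FPGA."""
        def __init__(self, depth=2):
            self.buf, self.depth = [], depth
        def try_send(self, value):
            if len(self.buf) >= self.depth:
                return False
            self.buf.append(value)
            return True
        def try_recv(self):
            return self.buf.pop(0) if self.buf else None

    class NetworkChannel(Channel):
        """Binding used when the endpoints are placed on different FPGAs.

        Traffic goes through a marshaller and a router, so latency is longer
        and buffering deeper, but the interface the design sees is identical.
        """
        def __init__(self, router, channel_id):
            self.router, self.channel_id = router, channel_id
        def try_send(self, value):
            return self.router.enqueue(self.channel_id, value)
        def try_recv(self):
            return self.router.dequeue(self.channel_id)

    ch = LocalFIFOChannel(depth=1)
    assert ch.try_send(7) and not ch.try_send(8)   # back pressure on the second send
    assert ch.try_recv() == 7

A design that respects the try_send/try_recv guards cannot tell which binding it got, which is exactly the freedom the compiler exploits.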
Okay. So now we've talked about a syntax for describing latency insensitive
designs, and we've seen that latency insensitive designs can at least in theory
be mapped to multiple FPGAs. So now let's discuss a compilation flow to do
that. Afterwards, we'll talk about how we synthesize networks between FPGAs,
okay? So what we're going
to do is we're going to start out with an arbitrary RTL augmented with latency
insensitive channels. So what we've got here are little state machines that
can be any RTL, or they could be software, for that matter. It doesn't
necessarily even have to be RTL. They're connected by latency insensitive
channels, shown by the dotted lines, so we just have some graph, okay?
Again, I mentioned that to produce a multiple FPGA implementation, we need
three phases. First, we have to build a graphical representation of this which we
can partition. Then we're going to have to take that partitioning and map it
down into some network of FPGAs. Finally, we're going to have to synthesize a
communications network carrying the -- yeah, sorry.
>>: Who's responsible for establishing that the RTL behavior doesn't change
with the latency of the channels?
>> Elliott Fleming: The RTL behavior may very well change, absolutely. The
point is that you're asserting that those behavioral changes are not going to
impact functional correctness, okay. Again, the RTL behavior will absolutely
change. And much the same way that your RTL behavior would change if you
interposed a level of cache hierarchy, right. It will change. But it's the
programmer's job to ensure that these changes don't perturb the functional
behavior of their design.
In practice, this is not a very difficult thing to do.
>>: When you say RTL behavior, are you talking about timing events?
>> Elliott Fleming: Timing, yes.
>>: It's correct.
>> Elliott Fleming: Oh, yeah, correctness could be --
>>: [indiscernible].
>> Elliott Fleming: Absolutely. If you wrote a bad design or at least a bad
design in terms of, you know, this property, right, you could very well get an
incorrect implementation. Although, of course, I'll ask the question if our
design was too big to fit to begin with, how would we have implemented a
transform to preserve that correctness? We would have just -- we can do it,
right. There are tools that do it. But you pay with your performance. So you
lose an order of magnitude of performance to preserve the property. So this
is the trade-off that we're making here is instead of preserving that exact
timing correctness of the original RTL, we're giving freedom to the designer to
express points at which that behavior may safely be changed, and we're going to
leverage that.
>>: [indiscernible].
>> Elliott Fleming: This is a [indiscernible]. So for full disclosure, we
actually implemented [indiscernible] for a number of design choices, mainly
because [indiscernible] is easier to augment with compiler-like features just
like Haskell, its predecessor. But you can imagine these RTLs being
[indiscernible] also. So you can think of this as just putting these send and
receive points into [indiscernible], that's certainly admissible.
Okay. And then finally, what we're going to do is, given this sort of
implementation, we'll produce an RTL for each FPGA, and you can run it through
the back end tools to produce an implementation. Okay?
All right. So first thing we do is we're going to construct a graphical
representation that we can partition. Of course, remember that the only thing
we know how to modify are these latency insensitive channels and that's going
to kind of give us this graph structure over here, where we have blobs of RTL
connected by latency insensitive channels.
We call these latency insensitive modules. Although I've shown RTL here,
again, it is possible to put whatever kind of computation you'd like in there,
including software, as long as it adheres to the latency insensitive channel
communication model.
And then what we're going to do is chop up the design in this way and map it
down on to a set of FPGAs. And again, the vertices [indiscernible] and the
edges are latency insensitive channels. And here's an example of the syntax
here, so we have some channel A, and it induces the edge here.
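As a minimal sketch of the graph the partitioner works on (the data layout is assumed, not the compiler's actual internal representation): vertices are latency insensitive modules, edges are named channels, and a module-to-FPGA mapping determines which channels must become inter-FPGA links:

    # Vertices are latency insensitive modules; edges are named channels.
    modules = ["A", "B", "C", "D"]
    channels = {           # name -> (source module, destination module, width in bits)
        "chanA": ("A", "B", 32),
        "chanB": ("B", "C", 8),
        "chanC": ("C", "D", 128),
        "chanD": ("D", "A", 8),
    }

    def cut_channels(mapping):
        """Channels whose endpoints land on different FPGAs; these need routers."""
        return [name for name, (src, dst, _) in channels.items()
                if mapping[src] != mapping[dst]]

    mapping = {"A": 0, "B": 0, "C": 1, "D": 1}    # placement of modules onto FPGAs
    print(cut_channels(mapping))                   # ['chanB', 'chanD']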
So now that we've got that representation, our next objective is to place it
down on to a network of FPGAs. So the first thing we need to know is actually
what the topology of -- yeah?
>>: So you mentioned previously [indiscernible] this borders between devices
is that I'll have one cycle data out to data in.
>> Elliott Fleming: No. Not at all.
>>: Every cycle, I can produce data in such that the bandwidth between each of
the devices is full then.
>> Elliott Fleming: No, not necessarily. Again, these are just queues, right?
So if you don't put anything into the queue, there's no communication at all.
>>: Fine if I push something into the queue every cycle, then the interfaces
between, say, the block A and the block B is insufficient to run at that speed.
So, for example, the [indiscernible] that you described very early in one of
your very early slides is [indiscernible] because the bandwidth externally is
so much smaller than the bandwidth internally.
>> Elliott Fleming: That's true, and we'll do [indiscernible] in this approach
too, but we'll also have back pressure, okay. So if the bandwidth -- oops,
sorry. I hit the wrong button. If the bandwidth on C is insufficient to carry
all the traffic between A and B, then A will stall. A will get back pressure
and A will stall.
The hope is, of course, that as things scale up, you'll get more and more
bandwidth. But it is a problem. If C is enormous, if C is 10,000 bits, then
yeah, yeah.
>>: So how do you know how many pins to allocate?
>> Elliott Fleming: We'll talk a little bit about that when we talk about
compiler optimization. But generally speaking, first of all, the conception
that these are pins carrying traffic between FPGAs is a little bit mistaken. So
actually, they turn into these high speed transceivers, right. So actually,
there aren't pins at all in some sense. But we will talk about how exactly we
allocate the bandwidth of the transceiver, you know, in a sort of intelligent
manner, perhaps in ten slides, or maybe less than that.
Okay, did that answer your question? We'll talk about bandwidth allocation at
some point in the future.
All right. So anyway, we need to know what the physical topology of FPGAs is.
This is a little syntax for describing that. So basically, we have two FPGAs,
FPGA zero and FPGA one, and they're connected by bidirectional channels. And,
of course, you could scale this up to have whatever system topology you would
like, even though I'm only showing a short example here.
And so this is the physical system where we have two FPGAs and they're
connected by some high speed transceivers, okay? And then what we'll do next
is we'll map the modules, based on area. So, of course, you have to have a
feasible implementation, but also you want to minimize communication between
the FPGAs; at least ideally we would have some algorithm that did this
automatically.
So you get some mapping like this, where A and B are on one FPGA and C and D
are on another, but currently this requires user input. The user's going to
have to tell us which module goes where.
Now, of course, an important piece of future work is doing that automatically.
I'd like to point out at this time that this configuration file is the only
thing in the design that differentiates a single FPGA from a multiple FPGA
implementation, or even a three or four or however many FPGA implementation.
This is the only part of the input to the compiler which is changing. The
program itself is fixed. Which is, of course, an important property.
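The exact configuration syntax isn't shown in the talk, so here is a hypothetical stand-in for the idea: one file describes the FPGAs and their links, another assigns modules to FPGAs, and only these files change between a one-, two-, or four-FPGA build:

    # Hypothetical platform description: two FPGAs joined by bidirectional
    # high speed transceiver links.
    platform = {
        "fpgas": ["fpga0", "fpga1"],
        "links": [("fpga0", "fpga1"), ("fpga1", "fpga0")],
    }

    # Hypothetical module-to-FPGA mapping; the program source itself is unchanged.
    mapping = {"A": "fpga0", "B": "fpga0", "C": "fpga1", "D": "fpga1"}

    # A single-FPGA build differs only in this mapping:
    single_fpga_mapping = {m: "fpga0" for m in ["A", "B", "C", "D"]}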
All right. So now that we've done this mapping, we have to synthesize the
network. So basically, this entails choosing an implementation for each of the
channels. Local communication, of course, just turns back into vanilla FIFOs,
so they're just this sort of ideal high bandwidth interconnect.
However, remote communications will actually go through some kind of network
hierarchy. Which we will synthesize based on the program. We'll talk about
that in the next set of slides.
So all of those channels will be tied down to some router, which will manage
the FPGA interconnect. Okay. And, of course, this link will appear as a FIFO,
but the routers themselves will be quite complicated. So we've seen a flow of
how we can get from RTL to a multiple FPGA implementation, and now we'll talk
about specifically how we build the routers. So here's a cartoon of the
network architecture, right. So basically, the program is seeing FIFOs with
back pressure, okay. And these FIFOs are going to be multiplexed onto the
router infrastructure, okay.
So this programming model is quite simple. The hardware to support it is
actually quite sophisticated. So basically, we have this automatically
synthesized layer of network hierarchy, so the first layer is marshalling, so
we have to be able to handle some wide links and convert them into a fixed
packet size.
Then we have to have some virtual channel buffering to ensure that channels
don't block each other. And then finally, to improve the parallelism, we'll
actually run multiple lanes across the link in order to try to soak up as much
inter-FPGA bandwidth as we can.
Is this clear? So the first thing we do is channel marshalling. So in the
original user program, of course, users can describe whatever data types they'd
like to be carried between FPGAs. But the network width is fixed, so we have to
introduce some layer to packetize the data types.
So for very wide data types, of course, you just do the shift register. But
for narrow data types, we'll actually just pack everything into a single
network word. And we will do this based on the links, so this will be
automatically chosen by the compiler.
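Here is a sketch of the marshalling layer under an assumed 32-bit network word: wide user types are chunked across several network words (the shift register case), while narrow ones fit in a single word; in the real flow the compiler picks these parameters per channel:

    NETWORK_WORD_BITS = 32        # assumed fixed network payload width

    def marshal(value, width_bits):
        """Split one user-level message of width_bits bits into network words."""
        n_words = max(1, -(-width_bits // NETWORK_WORD_BITS))   # ceiling division
        mask = (1 << NETWORK_WORD_BITS) - 1
        return [(value >> (i * NETWORK_WORD_BITS)) & mask for i in range(n_words)]

    def demarshal(words):
        """Reassemble the original value from its network words."""
        value = 0
        for i, w in enumerate(words):
            value |= w << (i * NETWORK_WORD_BITS)
        return value

    assert demarshal(marshal(0xDEADBEEFCAFE, 48)) == 0xDEADBEEFCAFE  # wide channel: two words
    assert marshal(0x5, 3) == [0x5]                                  # narrow channel: one word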
And this is actually important, because remember we're working on hardware
designs. And, of course, hardware designers are always trying to economize on
bits, and it turns out in many hardware designs, the width of the channels is
actually quite narrow.
This is an example of HAsim, and what we see here is basically that the
overwhelming number of channels are narrow. Okay?
So the next layer is channel multiplexing, right. So the good news is most
channels actually don't have a lot of activity, and remember, we're only
carrying data between FPGAs when data is explicitly enqueued, so if there's no
activity, then there's no bandwidth consumed.
The bad news is we don't control message creation and consumption, and this can
lead to deadlocks, because we have a shared network infrastructure. To see how
that can happen, here's an example. So now we need both A and B to do the star
operator. A sends a value and, of course, A is going to send again, right.
And you know how this works.
So now B is going to send something that we actually need to proceed and, of
course, it's got head of line blocking, so we're deadlocked. How do we solve
head of line blocking? Well, so one option is we could try to compute the
dependencies and do something intelligent with virtual channels. But in
reality, we'll just give every channel its own virtual channel. Okay?
How do virtual channels work? Oh, so this is going to be deadlock-free via the
[indiscernible] theorem, because, of course, we've broken all the channel
dependencies; since each channel has its own virtual circuit, then we can't
have a deadlock.
So how does this work? Well, now A sends but A doesn't have any more flow
control credits so it can't send again. So B will send and, of course, it's
now out of flow control. Now the operation can proceed and we'll send flow
control back and A can proceed again. This is very simple. It's kind of how
flow control works in general.
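A minimal Python sketch of that credit scheme, with an assumed credit count: a sender spends one credit per message and stalls at zero, and the receiver returns a credit when it drains a message, so a blocked channel cannot stall the others:

    class VirtualChannel:
        """Credit-based flow control for one latency insensitive channel."""

        def __init__(self, credits=8):    # credits == buffer slots reserved at the receiver
            self.credits = credits
            self.in_flight = []           # models the link plus the receive buffer

        def try_send(self, msg):
            if self.credits == 0:         # no guaranteed buffer space at the receiver
                return False              # this sender stalls; other channels keep going
            self.credits -= 1
            self.in_flight.append(msg)
            return True

        def receive(self):
            """Receiver drains one message; a credit flows back on the reverse link."""
            if not self.in_flight:
                return None
            self.credits += 1
            return self.in_flight.pop(0)

    a_to_b = VirtualChannel(credits=1)
    assert a_to_b.try_send("x")           # A sends, spending its only credit
    assert not a_to_b.try_send("y")       # A waits; B's own channel is unaffected
    assert a_to_b.receive() == "x"        # B drains, the credit returns
    assert a_to_b.try_send("y")           # A proceeds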
Now, of course, we have some options in implementing this. One option, of
course, is to use very small buffers. Small buffers are inexpensive. Of course,
there's a problem because inter-FPGA latencies can be quite long and so if we
have small buffers, then if there's a hot path between the FPGAs, then we can
stall.
On the other hand, large buffers are expensive. So if we just give a large
buffer so say we give eight registers per channel, then we end up using most of
the area of the FPGA. And, of course, this is problematic because what we want
is the user program to have most of the area of the FPGA for its own
implementation.
So what are we going to do? Well, observe that the channel connecting the
FPGAs is actually serial. So what that means is we're basically getting one
data word per cycle. What that implies is that the store for all of our
virtual channels can also produce data at one word per cycle and will satisfy
the full through-put via Little's law. What this means in practice is we can
use a serial structure, specifically BRAM, to store all of these virtual
channel buffers.
What that means is because, you know, BRAMs are quite dense, we can actually
have an enormous buffer per virtual channel, and we'll still be deadlock-free
because the virtual channels don't block each other in the shared structure.
Okay? No, yes, maybe?
So basically, right, what we'll do is because this is serial and the SRAM is
serial, we won't lose any through-put, but we will have a very deep buffer per
channel, which will also cover the latency of the FPGA links.
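A back-of-the-envelope version of that sizing argument, with made-up numbers: the link delivers one word per cycle, so by Little's law each channel needs roughly a round trip's worth of slots to stream at full rate, and a single BRAM-backed store holds that for every channel at once:

    # Illustrative numbers only.
    ROUND_TRIP_CYCLES = 64     # credit round trip across the inter-FPGA link, in cycles
    WORDS_PER_CYCLE = 1        # the link is serial: one network word per cycle
    NUM_CHANNELS = 100
    BRAM_WORDS = 16 * 1024     # capacity of the shared BRAM store, in network words

    slots_per_channel = ROUND_TRIP_CYCLES * WORDS_PER_CYCLE    # 64 slots keeps a hot channel busy
    total_slots = slots_per_channel * NUM_CHANNELS             # 6400 slots for all channels
    print(total_slots <= BRAM_WORDS)                           # True: it fits in one shared store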
>>: Are you saying you use a [indiscernible]?
>> Elliott Fleming: Yes. Single write port, that's right. Single write port,
single [indiscernible] port, so you get full [indiscernible]. A little bit of
latency, maybe, but we get full through-put.
So this is what the multiplexor micro architecture looks like. So basically,
like I said, we have all the virtual channels mapped down on to the SRAM and
then we have some bookkeeping bits out to the side also mapped into
[indiscernible]. What's that? Oh, okay. So data comes in, it's stored in the
BRAM and we have some arbiter that selects which virtual channel we're reading
out of based on the bookkeeping bits.
And the great news here is that even if we give enormously deep buffers, more
than 100 buffers per channel, we use only a small percentage of the FPGA for
typical designs. And this allows us to scale the size of our implementations
so we can actually, you know, instead of -- we can have connections between
several different FPGA devices without overwhelming our area usage.
>>: [indiscernible] using this architecture, previous architecture based on
how -- allocation.
>> Elliott Fleming: What do you mean?
>>: This has, of course, the issue of you are kind of doing [indiscernible].
>> Elliott Fleming: That's right.
>>: How do you know when you can afford a [indiscernible] architecture
versus --
>> Elliott Fleming: We'll get to that in a couple slides, I think. Okay, so
the last level, so at this point in time, we'll get to it in this slide. So at
this point in time, we have a fully functional router so we could lay this down
and we'd have a fully working multi-FPGA implementation. The question is can
we do better, and the answer, as you alluded to, is yes, we can do better.
So in order to do better, let's kind of look at the properties of user designs.
So specifically, what the widths of the channels look like and what their
traffic looks like. What we see here is, of course, as I already mentioned,
channels are narrow and also that these narrow channels can have some high
occupancy right. So ideally, what we want is to sort of service these channels
as best we can.
So user designs have pretty low clock frequencies and narrow channels, whereas
the inter-FPGA physical layer is very fast and is hundreds of bits wide as a
result. So, of course, you have to do this clock frequency sort of gear
boxing, right. So if the user design is running at 50 megahertz, and the
inter-FPGA link is running at hundreds of megahertz, then we have to sort of
multiply up its width. And you end up with a few hundred bits per cycle of
data that you need to stuff into the PHY in order to get full bandwidth.
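To put rough, assumed numbers on that gear boxing: if the physical layer moves 64 bits at 300 megahertz while the user design runs at 50 megahertz, the user side has to hand over a few hundred bits every one of its cycles to keep the link full:

    # Illustrative numbers; the real PHY rate and width depend on the board.
    phy_bits_per_cycle = 64
    phy_mhz = 300.0
    user_mhz = 50.0

    bits_per_user_cycle = phy_bits_per_cycle * (phy_mhz / user_mhz)
    print(bits_per_user_cycle)    # 384.0 bits of payload needed per user-design cycle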
Okay. And what this is telling us is basically that in the presence of all
these narrow channels, single channel at a time is very wasteful. So if we
just do a naive time multiplexed approach, we're going to waste a lot of
bandwidth, okay? So how do we do better? Well, we'll have multiple lanes,
okay, and they will share the bandwidth.
So how does this work? Basically what we'll do is we'll instantiate several
multiplexors on top of the wide PHY, okay, forming lanes. So here we have one
multiplexor, two multiplexors, three multiplexors, all on top of the same wide
physical layer. And these can all go in parallel so we can recover some of the
parallelism of the system.
So we have some time multiplexing. Of course, these remain time multiplexed
but they can all transmit data in parallel.
So now that we have the capability of adding these lanes, we have to ask the
question, how many lanes should we have, and how do we allocate channels to
lanes, right. So these are free parameters in the router architecture.
So we could look at the dynamic behavior. Of course, ideally what you would not
do is allocate two channels that are constantly being enqueued at the same time
to the same lane, because then, of course, they're fighting each other for
bandwidth.
We can't really reason about that behavior at this point in time, although
maybe with some better analysis techniques, we could. But what we can do is
observe aggregate channel loads. So what we can do is instrument the design
and look at the traffic across each of the channels in the design and try to do
something with that.
The idea being that what we'll do is we'll minimize the maximum load on a given
lane. So we take that maximum load as kind of a measure of how fast our
program is running, assuming that it's communication bound and we'll try to
make that as small as possible. Okay?
Unfortunately, this is a processor scheduling problem, or at least it turns
into a processor scheduling problem that's NP-complete, but there is a good
heuristic: longest job first.
So how does longest job first work? Okay. So what we have here is a set of
channels and a program that we're going to route through two FPGAs. The height
of the bars represents the loading. That is the absolute amount of traffic
across the channel and the width represents the physical width of the channel.
So you may have some channels which are wide and some channels which are
narrow.
They produce more or less traffic. So the first thing we'll do is sort
according to load, and we're going to try to make the situation the best for
these heavily loaded channels, okay? Because, of course -- what's that?
>>: Is the right way to think about this that the width represents the packet
size and the height represents rate?
>> Elliott Fleming: The height represents total traffic, right, which could
represent rate, although, you know --
>>: If you've already taken a packet size, it's total traffic [indiscernible]
packet size the same as rate, packet rate?
>> Elliott Fleming: Yes, across the run of the program, right. But I guess
what I'm trying to say is there needs to be a distinction drawn between the
aggregate behavior across an entire run and dynamic behavior, right.
>>: Is that distinction just [indiscernible]?
>> Elliott Fleming: Yes, burstiness, right.
>>: So how are you capturing burstiness?
>> Elliott Fleming: We're not.
>>: That's where I was going.
>> Elliott Fleming: Yeah, we're not capturing burstiness. So obviously, you
know, if the total program run time is something up here, then the rate may be
low, but you may have burstiness, and that might perturb your router
architecture, but I'm not trying to capture that at this time.
Okay. So anyway, basically, what we'll do is we'll take our heaviest loaded
lanes -- heaviest loaded channels and synthesize lanes for them, right. So
one, two, three for the three heaviest loaded channels. Then we'll allocate
those heavily loaded channels to the lanes.
Now, with the remaining channels, we'll try to load balance, allocating the
channel to the least loaded lane.
So now we put this one here and we'll put this one here. And so on. And what
we've got is basically load balancing. So on average, the total amount of
traffic across each lane is more or less equal.
Okay.
And it --
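A compact sketch of the longest job first heuristic as described (channel names and loads are made up, and the width effect raised in the question that follows is ignored for simplicity): channels are considered heaviest first, the first few get their own lanes, and the rest go to whichever lane is currently least loaded:

    def allocate_lanes(channel_loads, num_lanes):
        """channel_loads: {channel: measured traffic}. Returns (lane assignments, lane loads)."""
        lanes = [[] for _ in range(num_lanes)]
        lane_load = [0] * num_lanes

        # Longest job first: heaviest channels placed first, each onto the
        # least loaded lane (the first num_lanes channels get empty lanes).
        for name, load in sorted(channel_loads.items(), key=lambda kv: kv[1], reverse=True):
            target = min(range(num_lanes), key=lambda i: lane_load[i])
            lanes[target].append(name)
            lane_load[target] += load
        return lanes, lane_load

    loads = {"fetch": 90, "commit": 80, "mem": 70, "dbg": 10, "stats": 5, "ctrl": 3}
    lanes, lane_load = allocate_lanes(loads, num_lanes=3)
    print(lanes)      # [['fetch'], ['commit', 'stats'], ['mem', 'dbg', 'ctrl']]
    print(lane_load)  # [90, 85, 83] -- the heaviest lane load stays as small as the greedy allows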
>>: Question, are you taking into account the width as well in this
allocation?
>> Elliott Fleming: Yes, yes. Because although we're not doing it here, if we
put this fat channel on a narrow lane, then its traffic will change. So yes,
we actually do account for that. So when we make the choice, we change -- so
you can, of course, because you've kind of statically allocated the widths, you
can see how much traffic will be across each lane. Yeah?
>>: So this is a [indiscernible] simulation, right?
>> Elliott Fleming: It's [indiscernible]. That's right.
>>: And that depends on how you set it up?
>> Elliott Fleming: Yes. Yes, so it's workload dependent, absolutely.
>>: So it's a non-FPGA, simulated on not FPGA?
>> Elliott Fleming: Oh, no. You could simulate it on [indiscernible] and all
the channels for you and find all the loads.
>>: So you make it remapping?
>> Elliott Fleming: That's right. So basically, you can -- you're always free
to synthesize a crappy network, and ideally, the loads will not change very
much. Or you could do it in simulation. Of course, that's an infinite
capacity FPGA. Although generally speaking, for most of these designs, they're
of a sufficient size that simulation is not your most attractive option because
you can't really run a large enough workload.
>>: Are you going to talk about topological mapping of the FPGA networks?
>> Elliott Fleming: What do you mean by that? So this idea of perhaps adding
route-throughs to handle strange physical topology routes to logical
topologies?
>>: Strange is perhaps pejorative. You're assuming that your producers and
your consumers are adjacent FPGAs.
>> Elliott Fleming: No, not at all.
>>: Okay. So FPGA producers and consumers are not FPGA --
>> Elliott Fleming: The [indiscernible].
>>: The [indiscernible], and now you're talking about a routed network.
>> Elliott Fleming: That's right.
>>: And now your virtual channel approach is a little bit trickier.
>> Elliott Fleming: [indiscernible] because again --
>>: Let me finish the question with the observation if you have a shared link
with a direct communication and then a route-through, you've got to run it off
both sets of virtual channels.
>> Elliott Fleming: If you're trying to do something clever, I could imagine
that being the case. However what we do is at each inter-FPGA crossing, we
will give a new virtual channel. So basically, the virtual channels are only
handling deadlock-free on that single inter-FPGA link. So basically what would
happen is you bounce -- what's that?
>>: [indiscernible].
>> Elliott Fleming: Right.
>>: You in some sense turn it into a statically-routed network. It would be
better off if there are no dynamically routable paths.
>> Elliott Fleming: You could have a better implementation, perhaps, if you
had some capability to do dynamic load balancing. But, you know --
>>: It's like we talked about this this morning. If I have a failure and I
want to remap, you know, an FPGA's role to another FPGA, now all my routes
through the network change and the virtual allocation changes and it doesn't
sound to me like you've provisioned for that dynamically.
>> Elliott Fleming: No, no, no. But it's again the virtual channels are so
cheap that it wouldn't be beyond the realm of possibility to have spares.
Again, these things are very inexpensive. The cost of a new virtual channel is
the cost of adding extra space in an SRAM, and the SRAM has, you know, 64
kilobytes of space. Once you have one of them, right, you actually have a lot
of space.
>>: I think maybe for the problems you've allocated the packet sizes are
relatively small. But if you start having large packets running through the
virtual channels [indiscernible] get really high.
>> Elliott Fleming: Maybe.
>>: But the packet is four kilobytes, for example.
>> Elliott Fleming: Of course, you could break that packet up into chunks and
just do channel allocation on the chunks. You could certainly do that. Yeah.
And, in fact that's what we would do. We'd marshal it and do full control on
the marshals.
>>: The number of [indiscernible] channels, do you see the [indiscernible]
channel which one you send out, do you see the effect?
>> Elliott Fleming: That's right, it does go up, and, in fact, in the virtual
channels, you can choose different architectures. So we have several layers of
pipelining. So, of course, if you have a handful of virtual channels, you get
single cycle scheduling. Otherwise, you have to do one or two cycle
scheduling. But two cycle scheduling goes up to several hundred lanes. So
it's scalable. I mean, you could even add a third level of hierarchy there if
you wanted to, but it's just dropping it in, right, and the compiler could
choose based on the number of virtual channels, yeah.
Okay. So what happens when we do this optimization to a real program that is
HAsim? So here we have the naive implementation of HAsim, right. And again up
is good. So up is aggregate MIPs for the simulation. And so when we do this
longest job first algorithm, we do get some ten percent performance gain.
Here, what we've done is eliminated sort of collisions between packets. So in
HAsim, there were perhaps tokens being generated simultaneously and having more
lanes removed some of that effect.
I'll point out that HAsim is not communications bound. So it uses only about a
third of the bandwidth between FPGAs. So that's why we don't get some higher
through-put, because HAsim actually isn't stressing the network.
If, of course, you use some kernel which is, in fact, producing a large amount
of traffic, then you will get linear speed-up as you scale the number of lanes.
As you might expect.
Okay. So now we've talked about sort of how we synthesize the inter-FPGA
network and how we actually describe and implement designs that can be
partitioned across FPGAs. Now let's talk about a couple of examples.
So we'll look at two case studies: Airblue, the wireless transceiver, and
HAsim, a simulation framework for modeling multi-cores.
So the basic idea of Airblue is we want to implement wireless transceivers such
that we can operate on the air with commodity equipment to test out new
protocol ideas. So this works well for a protocol like 802.11g, but newer
protocols, particularly those with MIMO, of course, require much more area and
so they don't fit onto a single FPGA, and that also includes the need for
multiple antennas.
So what do we do? We just throw another FPGA at the problem so we go from one
FPGA on the front end to two FPGAs on the front end. It's that simple. Okay.
So the baseline 802.11g implementation looks like this. You've got a TX
pipeline and an RX pipeline. What we want to do is implement some new
algorithm, spinal codes. Spinal codes are a new error correction algorithm.
Okay. The problem is it's much larger than the existing -- it's new. So it
was at Sigcomm in August. It's actually quite good. It's actually better than
turbo in those respects.
>>: [indiscernible].
>> Elliott Fleming: Oh, yeah, you know, maybe talk about how the name
[indiscernible] came about. It wasn't my choice. I always think of spinal
tap.
So anyway, basically, the problem with this code is that as good as it might
be, it's much larger than the turbo code and so we exhaust the area of the
single FPGA.
>>: Does [indiscernible] mean anything until you start playing the rate of
frequency?
>> Elliott Fleming: That's right. So basically, it's the part of the wireless
transceiver between the RF and the MAC working on packets. So it's the thing
that's taking that baseband signal and turning it into packets, with error
correction and various other algorithms running. Okay?
So anyway, of course, as you might expect, we simply partition across two
FPGAs. These little FIFOs here are latency insensitive channels, and no source
code modification is required, right. So that same design that you would map
in simulation, you can map across two FPGAs and meet the high level protocol
timings. Again, the high level protocol timings being at the scale of tens of
microseconds. So the latency of the inter-FPGA interconnect is not a problem.
And, of course, also because this is a largely flow-through pipeline with a
tiny amount of feedback here, you would expect that we would have no problem
with feedback latency. Okay? So the second thing we're going to do with
Airblue is actually simulation. So often when we're evaluating protocols, we
care about operating points at bit error rates of one in a billion. Of course,
you know, if you want to test that operating point, you need to generate
billions and billions and billions of bits.
Which, of course, is a problem in software, because the software simulator is
running at kilobits. And, of course, the FPGA is running at megabits. So by
choosing FPGA, we run a thousand times faster. Of course, we can implement
this on one FPGA so we can simulate on one FPGA, the question is why would we
want two.
The reason you want two is because the tools can actually find better
implementations. So what we can do when we take a simulator and partition it
across two FPGAs, even though it fits on one FPGA, is get speedup. So here
what we show is speedup relative to a single FPGA implementation. Most of the
speedup comes from clock frequency improvements. So we just take the part
under test and we amp up its clock frequency as high as possible and this gives
us a faster simulator. Okay?
So in summary, basically, Airblue and wireless pipelines in general are these
deep pipelines with infrequent feedback and at the protocol level, we only care
about ten microsecond timings so this is an ideal solution to sort of take a
prototype wireless transceiver and actually get it to work on the air. Okay?
So now let's talk about something with a little bit more complicated
communications graph. That is, the processor simulator HAsim.
So what is HAsim? HAsim allows you to basically simulate complex multi-cores,
so full cache hierarchy, out of order, and cycle accurate. So one key point
about HAsim is that it's time multiplexed, which means that we don't -- say
we're simulating a 64 core processor. We don't lay out 64 cores. We lay out a
single compute pipeline and multiplex it among all of the cores. This is like
SMT.
Okay. And, of course, with that approach, it's very easy to parameterize the
design for scalability. Of course, HAsim can go anywhere from one core to ten
thousand cores. The question is whether or not you can actually implement it
on the FPGA.
>>: [indiscernible] you're time slicing architectural state on underlying
substrate. You're not dynamically provisioning [indiscernible].
>> Elliott Fleming: That's right.
>>: It's much more like [indiscernible].
>> Elliott Fleming: Okay. So anyway, it is multi-threaded, and, of course, it
has a complex communications graph and lots of feedback. So it's different
than the wireless pipeline in the sense that all of these ports are
communicating, and they're communicating almost constantly. Although the time
multiplexing is going to help us cover some of the latencies.
So what happens when you map HAsim to multiple FPGAs? Well, the first thing to
notice is that on one FPGA, we can map 16 cores, and then on two FPGAs, we can
map more than a hundred. Again, this is because in HAsim, this time
multiplexing means that we're not replicating the entire structure of the
processor to add another core. We're only adding some state. So there's a big
constant cost to building a core model. But the cost for adding a new core is
not so high. And that's why we get this highly nonlinear scaling.
>>: So if you look at a [indiscernible] microprocessor, most of the area is
devoted to micro architectural state, whether it's branch predictor,
[indiscernible] cables, buffer, caches. And very little of it, relatively, is
control state.
>> Elliott Fleming: Okay.
>>: So you have to multiplex all of that, it seems like, as you add threads or
logical cores, you're going to see a linear increase [indiscernible] buffer.
>> Elliott Fleming: That's right, and we do. So part of, I think, the savings
here is that these things are mapping to different structures. So it's not
quite linear. There is some room to scale. So, I mean, going from 16 cores to
a hundred, for some of the structures, means just stuffing more data into a
BRAM. And for many structures, that means actually that they don't increase in
size.
Only some of the structures are increasing.
>>: I guess [indiscernible] of course the network's fine. I'm just surprised
that that's the fact --
>> Elliott Fleming: So remember HAsim is a model, not an actual
implementation. So much of the cache state is stored in an interesting way, so
we actually synthesize a cache hierarchy, and this hierarchy actually goes out
to host virtual memory.
So if -- in some sense, that cache hierarchy is fixed for any choice of cores.
And the pressure on it, of course, changes, and its performance will change.
The more cores you have, the more misses you will take. But the size in terms
of FPGA area is not changing.
>>: Right.
>> Elliott Fleming: Okay. Yep?
>>: [indiscernible] meaning they are using DDR 3?
>> Elliott Fleming: Yes. So I'll actually talk about how the memory hierarchy
works in detail in a few slides. It's actually very interesting. But we'll
get there in a couple of slides. Looks like we've got plenty of time to do so.
This is actually the first talk that I've made it this far in this amount of
time.
>>: Before you move on, what are the different dual FPGA --
>> Elliott Fleming: Again, so we mentioned that as we add cores, we increase
the amount of implementation area, but that has impact on clock frequency. So
the more things you try to stuff on the FPGA, typically the worse the tools do.
Again, we're not -- we're just naive users of the tools. We're not trying to
floor plan everything. So we just take whatever frequency is given to us by
the tools.
So basically, what happens is let's take this bar, for example. So this is,
say, 36 cores. So either a 64, a maximum 64 implementation or a maximum 128
implementation can handle this model. It's just because the maximum 64
implementation is smaller, you get a higher clock frequency and so you get some
performance benefit as a result. Okay?
So one last thing to note is how much performance you lose going from one FPGA
to two FPGAs. So basically, it's these two bars here, right, so the gray bar
is a single FPGA implementation. When we go to two FPGAs, we lose at most
maybe half of our performance.
This is already much, much better than the traditional tools which would lose
maybe an order of magnitude or more in terms of performance. Okay? And, of
course, as we scale the number of cores we can cover more latency and so our
performance comes back up, right?
So in summary, single FPGA gets filled at 16 cores. But with multiple FPGAs,
we can go to 128 and actually we're trying to build a thousand core processor
on some -- Richard, yes?
>>: So you mentioned [indiscernible].
>> Elliott Fleming: We never attempted to run commercial tools, in part
because we think that that's going to require major surgery. So the commercial
tools are not quite so easy to use. They usually require that you do some
modification to your RTLs.
Anyway, so and also, of course, you have to buy a box that costs a lot of
money. The emulator boxes are not cheap.
>>: I'm assuming Intel might have helped you there.
>> Elliott Fleming: Yeah, we talked about it and we decided, you know, it
wasn't a productive exercise. Yes?
>>: So in some sense, your previous graph here, it's not necessarily when
people make certain [indiscernible] that first they have a through-put
requirement and then they build hardware [indiscernible].
>> Elliott Fleming: Sure.
>>: This is sort of clouding making that sort of design space a little cloudy
because of the fact that [indiscernible]. How would you see [indiscernible]?
>> Elliott Fleming: So I'm a big believer in getting a system to work and then
understanding its bottlenecks before trying to optimize. You have a
through-put target, but it's very hard to know where bottlenecks in the system
are, particularly a new system without having something actually working. I
view this tool as first and foremost enabling implementation. So it's entirely
possible that we'll get something to work here and we'll discover that there's
a bottleneck. Where the bottleneck might be -- I mean, my feeling is that
probably, you know, the compiler is not going to produce the bottleneck. That
there will be some intrinsic bottleneck either in the inter-FPGA through-put
or maybe in memory or something like that.
And then we go solve that, right. That's just my approach to problems in
general. Get something to work first and then debug later. So I also ask the
question if we have this requirement of running all these channels between
FPGAs, would the architecture that you hand code be substantially different
from the ones the compiler's producing automatically for you.
So I think that's another way to look at the problem, right. And I think if
you consider it that way, the answer is probably not, that at the end of the
day, you're going to be building this router infrastructure anyway. It's just
you're going to have to go through the pain of debugging it by hand. Moreover
if you make any slight perturbation to it, you'll have to rework the whole
system.
So it may be the case you can do the kind of longest job first optimization
that I'm advocating, but I'd hate to have to write that code myself. Something
like that.
>>: Yeah, but how much [indiscernible] what you call the driver? Do you use that or do you use another machine, how much work --
>> Elliott Fleming: So it's actually very simple, right so we abstract that
layer as just being a FIFO. So if you look at, for example -- so we used these
high speed inter-FPGA transceivers, right, so basically all we have to do is
get the core code, test it out, make sure it runs and abstract it as a FIFO and
feed it into the compiler. So actually, it's quite straightforward.
If you look at something like PCI Express, going between host and FPGA, that's a little more complicated. But at the end of the day, it's still a FIFO as well. And it's just multiplexing on top of that FIFO.
I mean, at the end of the day, if you look at the drivers that you're writing, this is what they look like. And it's not clear to me that you're going to do better than this. And, of course, if you are doing better, there's probably a way to generalize what you're doing and feed it in here, right. I mean, the generated router is just a phase of the compiler, right, so you could easily come up with a new router architecture and test it out, right.
Again, that's the advantage of the compiler. It makes things like that easier, anyway.
So now let's talk about resources in multiple FPGAs. So again, I mentioned this in the very beginning, if you remember all the way back: what we get when we get more than one FPGA is access to more -- most obviously more slices, but we also get more access to memory.
And there's this analogy to multi-processing here, right, where if we have two
cores and two threads, right, both threads get a full cache hierarchy, at least
parts of the cache hierarchy so they run faster. What we need in the FPGA is
an abstraction to sort of allow our FPGA programs, our HDLs to exploit these
resources. So we need an abstraction layer between us and the physical
devices.
So what I've shown you to this point is an abstraction for communication, right, these channels abstract the communication between FPGAs. Now I'm going to talk about abstracting memory in FPGAs.
So basically, what we've got is this very simple interface, and this is how we'll do memory in our designs. So just like a BRAM, you've got read request, read response and write. So it's a very simple interface, right. Is that clear to everybody?
So a few points about it. One, we have an unlimited address space, right. So this will permit us to specify any size that we'd like, even if that size doesn't fit on an FPGA. So if you want to specify 32 gig of space, you probably don't have 32 gig of DRAM, but you can still write that down. And we'll provide a virtualization infrastructure to back that storage space.
>>: The freedom to run as slow as you want.
>> Elliott Fleming: Sure. Okay. You also have arbitrary data size, of
course, so again it's a parametric interface. If you want 64 bit words or 36
bit words, we'll generate the marshalling logic for it. And then finally,
again, as I've said before, it's latency insensitive. You aren't writing a program assuming that there's some fixed latency between the read request and the read response, right. So you actually write your program in a way that basically decouples the read request and the read response, okay?
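As a minimal sketch of the interface being described (read request, read response, write, with a parametric word size and decoupled responses), here is a Python model. The names and the dictionary backing store are hypothetical; the real interface is an HDL module backed by the cache hierarchy.

```python
# Illustrative Python sketch of the scratchpad-style interface: read request,
# read response, and write, with no fixed latency assumed between request and response.
from collections import deque

class Scratchpad:
    def __init__(self, data_bits=64):
        self.data_bits = data_bits          # parametric word size (e.g. 36 or 64 bits)
        self._store = {}                    # stands in for the virtualized backing storage
        self._responses = deque()           # decoupled response queue (latency-insensitive)
    def write(self, addr, data):
        self._store[addr] = data & ((1 << self.data_bits) - 1)
    def read_req(self, addr):
        # The response may arrive after an arbitrary delay; clients must not assume one.
        self._responses.append(self._store.get(addr, 0))
    def read_rsp(self):
        return self._responses.popleft() if self._responses else None

sp = Scratchpad(data_bits=36)
sp.write(0x10, 0xABC)
sp.read_req(0x10)
assert sp.read_rsp() == 0xABC
```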
So how does this look on a single FPGA? What we'll do is each one of these is a memory client. We'll aggregate them all together on a ring and feed them into the on-board memory. So the first thing you'll have is an L1 cache here.
If you miss out of the L1 cache, you'll go to the on board memory, which will
be DRAM or SRAM, depending on your board. Finally, if you miss out of that
cache, you'll go to host memory, okay? So host memory is what will take care
of this arbitrarily large address space in the case that you need it.
>>: Host memory as the PC --
>> Elliott Fleming: We'll assume that there's a server attached.
>>: So it's not local DRAM attached?
>> Elliott Fleming: So we use the local DRAM as an L2 cache.
>>: I see.
>> Elliott Fleming: So basically, the flow will be something like this. You
make a request to your local BRAM cache. You may miss. If you do miss, you'll
scurry off to the board level resource, which will be shared among all the
boards. If you miss there, then you'll go back to host virtual memory. And
again, it depends on your address space and how big it is and how much data
you're accessing.
But again, the point here is if you need the large address space, we give you the ability to describe that. If you don't need it, say, you know, you say I need an aggregate of a gig of memory, you will never miss in this cache, for example. Okay?
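As a rough illustration of that miss flow (not the actual LEAP implementation; the capacities, eviction policy, and contents below are made up), here is a Python sketch of the private L1, the shared board-level cache, and host memory as the final backing store.

```python
# Toy model of the hierarchy: L1 BRAM cache -> board-level cache -> host memory.
class Level:
    def __init__(self, name, capacity, backing):
        self.name, self.capacity, self.backing = name, capacity, backing
        self.data = {}
    def read(self, addr):
        if addr in self.data:
            return self.data[addr], self.name            # hit at this level
        value, where = self.backing.read(addr)           # miss: go one level down
        if len(self.data) >= self.capacity:
            self.data.pop(next(iter(self.data)))         # trivial eviction, for illustration
        self.data[addr] = value
        return value, where

class HostMemory:
    """Host virtual memory: effectively unbounded, so it always services the request."""
    def read(self, addr):
        return addr * 2, "host"                          # fabricated contents

host = HostMemory()
board_cache = Level("board L2", capacity=1024, backing=host)
l1 = Level("L1 BRAM", capacity=64, backing=board_cache)
print(l1.read(5))   # served from host memory on the first access
print(l1.read(5))   # hits in the L1 BRAM on the second
```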
>>: What is the use case, the motivating case that would be going down to the --
>> Elliott Fleming: I think the most obvious use case is portability, so
that's the first thing. I've given you an abstract interface and I can build
you a memory hierarchy on any board you want to implement on, including a board
that doesn't even have memory, right.
The problem comes in hardware designs frequently when you bake in assumptions about the underlying infrastructure to which you're mapping, and suddenly that infrastructure gets pulled out from under you, either because you build the next generation or maybe because you have to do something like move between boards.
At that point, if you baked in some timing assumption that isn't true anymore, you've got to rework all your code. And that's a big problem.
Now, again, you're trading something for the abstraction, perhaps, right. We're introducing all of these layers, so maybe you add a little bit of latency. That's certainly true. But the latency and whatever performance loss you take buys what I view as something very important, you know: abstraction and portability.
So I can frame a design in terms of these caches, and as long as the platform -- whatever FPGA it is, it doesn't matter if it's [indiscernible] or Altera or what generation it is -- I can run that design on any board, and that's pretty powerful.
>>: [inaudible] going through the effort of making [indiscernible]. I would hope -- first of all --
>> Elliott Fleming: Again --
>>: Trying to develop this system.
>> Elliott Fleming: Again, performance is critical here, right. And I don't
know that we're trading a lot in terms of performance. Of course, I haven't
ever done the study. Again, I can tell you, what would it look like if you
were doing this yourself in hardware. Would you have an L1 cache here? Maybe.
We also have a way of eliminating the L1 cache so if you want to go directly to
the DRAM, you can certainly do that. We give that as an option.
>>: [indiscernible].
>> Elliott Fleming: No, not having read those things, I can't.
>>: Okay, you can synthesize in cache [indiscernible] DRAMs.
>> Elliott Fleming: Right.
>>: And distributing around. I don't remember if he [indiscernible].
>> Elliott Fleming: I don't know. So anyway, I mean, whatever technology you have to generate L1s is certainly useful here. I'll say that much. But again, the idea is that we're providing an abstraction layer. And that's going to be important, again, right: unlimited address space, fast local caches. And what happens when we map a design across multiple FPGAs, right?
So here are two things that are happening, right. Again, when we have multiple FPGAs, the boards may be homogeneous, and again maybe they're not, so we want some portability of design, right. I mean, intrinsically, we've already said we're going to have multiple FPGAs. So we expect asymmetry, right.
And, in fact, we may not even know what pieces we're mapping to what boards. So it doesn't make a lot of sense. The more you fix a piece of a design to a board, the less of this automation can actually happen. So what are some cases that can happen here?
One, we automatically route clients to the nearest cache, even if it's on a
different board. So here, we have clients that are sitting on a board that
doesn't have an L2. And they will simply route to the local L2, the closest
one, even if it's on another board.
>>: I have a question.
>> Elliott Fleming: Yeah.
>>: So you have all these [indiscernible] FPGA [indiscernible] why do you want
to build a central cache as opposed to a distributed cache?
>> Elliott Fleming: Right, so remember that each of these clients has its own
BRAM L1 cache. And those things can soak up all the resources on the board.
In fact, we're working on an algorithm now wherein we do area estimates for placement. So we place the design on the board, we look and see how much BRAM is left over after the user design, and we just scale up all the caches to soak it up.
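A tiny Python sketch of that sizing idea follows. The function name, numbers, and even-split policy are invented for illustration; the real algorithm works from area estimates produced during placement.

```python
# Sketch of area-driven cache sizing: after accounting for the user design,
# distribute whatever BRAM is left over among the private L1 caches.
def scale_caches(total_bram_kb, user_design_kb, num_clients, min_per_cache_kb=4):
    leftover = max(0, total_bram_kb - user_design_kb)
    per_cache = max(min_per_cache_kb, leftover // max(1, num_clients))
    return [per_cache] * num_clients

# e.g. 2 MB of BRAM, 1.2 MB consumed by the user design, 8 memory clients:
print(scale_caches(2048, 1228, 8))   # each L1 gets about 102 KB
```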
>>: Didn't you say each client has a cache, is that the single cache?
>> Elliott Fleming: Yes, a private L1 cache. Each one.
>>: But still, I'm trying to understand why do you want to have -- so in a
processor, there's a single cache because you want to have a limited number, of
course. But if on an FPGA, there is always these BRAMs with all these ports,
don't you think that having a single cache, as opposed to multiple ones that
you can access in parallel is --
>> Elliott Fleming: That might be a fine architecture. So if I understand you correctly, you're asking why is it that I don't just give each one of these guys a place in a shared, chip-level BRAM cache. In fact, that's one option. We do have a BRAM central cache that you can use. Of course, the scalability and clock frequency issues there are pretty obvious, right. That is, once you have a resource that's being used by a bunch of guys, the multiplexing logic can be problematic.
But that's a perfectly valid implementation. We'll lay out a half meg or a meg of BRAM cache and you can use that as your shared L2 if you'd like, right. But again, you'll never get the same clock frequency to that as you can get to the local caches, right. I mean, if you have to run wires all over the chip, you have to run wires all over the chip, and it's going to be slow.
So there is that trade-off to be considered too. And as I mentioned, we do give an option where you can disable the L1 caches if for some reason you don't
want to pay the latency. For example, if you have a streaming workload or you
know that you're never going to hit an L1 for some other reason, you can just
eliminate that cache.
>>: Is it safe to say you punted on coherence?
>> Elliott Fleming: At this point we have, so these are independent address spaces. Although what we're working on now is -- so ideally, let me tell you how coherence works in my mind, if I can just get the junior grad student to work on it, right. So basically, what you would do is when you say scratch pad, you would also specify a coherence domain. You would say I want this set of scratch pads to be coherent, and you would synthesize some directory-based protocol on top of those specific scratch pads. Again, the approach is automatic synthesis, right. But it would be some kind of directory-based protocol sharing the space.
Again, the applications that we're considering are primarily these DSP algorithms, although HAsim is starting to run into the coherence problem now, which is why we're spec'ing it out. What we want to do is basically slice that thing across 16 FPGAs and, of course, then suddenly the functional memory has a coherence problem, right. So we will be synthesizing coherence algorithms soon, I expect.
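Since this coherence support is described as future work, the following is only a toy Python illustration of the idea of grouping scratch pads into a named coherence domain backed by a directory of sharers; every name and structure here is invented.

```python
# Toy directory for a declared coherence domain over a set of scratch pads.
class CoherenceDomain:
    def __init__(self, name):
        self.name = name
        self.directory = {}      # addr -> set of sharer ids
        self.members = []
    def join(self, scratchpad_id):
        self.members.append(scratchpad_id)
    def record_read(self, scratchpad_id, addr):
        self.directory.setdefault(addr, set()).add(scratchpad_id)
    def invalidate_on_write(self, writer_id, addr):
        # A directory-based protocol would send invalidations to all other sharers.
        sharers = self.directory.get(addr, set())
        to_invalidate = sharers - {writer_id}
        self.directory[addr] = {writer_id}
        return to_invalidate

dom = CoherenceDomain("functional-memory")
dom.join("fpga0"); dom.join("fpga1")
dom.record_read("fpga0", 0x100); dom.record_read("fpga1", 0x100)
print(dom.invalidate_on_write("fpga0", 0x100))   # {'fpga1'}
```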
>>: So the design space [indiscernible] that you mentioned before, that's a highly manual process. Can you see, is there some sort of plan that you can see to keep [indiscernible], or is that something that --
>> Elliott Fleming: So in some sense, what I just described with inflating the cache sizes is that design space exploration, constrained to the memory subsystem. So I think if you could express parameterizations and maybe their relationships, you could look to maybe have some machine assistance. So if the compiler throws down an algorithm, and that algorithm has some parameter by which it could be scaled, and it turns out that there's area on the FPGA, we could easily scale it up.
So generally speaking, I think modeling this as some kind of linear system -- or I don't know if it would work as a linear system, but that's how I'd approximate it -- something like Pecora, a thing out of your colleagues down south, is an interesting approach to this idea of allocating area to different pieces.
But yeah, I mean, if a compiler knows how to scale, then certainly you could
maybe have some assistance in that.
So anyway, what do scratch pads do? Basically, like in a processor architecture, we've all seen this diagram before for general purpose processors. We have a bunch of plateaus, so we have some plateau at L1 where we get lots of hits.
Oh, so this is stride versus working set size. And up is bandwidth. So up is good. These values are hitting in the L1 cache, and then we have some region where we're sitting in the central cache and some region where we miss out [indiscernible] and our performance sucks. But it still works, right?
So just like in a processor, if you have to page, then your performance will be horrific. And that's just the way it is. So write programs with good locality, I guess.
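For reference, a plot of this shape comes from a sweep over stride and working-set size. The sketch below only illustrates the methodology, with a Python dictionary standing in for the scratchpad; the measured software numbers are of course not the hardware bandwidths being discussed.

```python
# Sweep working-set size and stride against a memory interface and report bandwidth.
import time

def measure_bandwidth(read_word, working_set_words, stride):
    start = time.perf_counter()
    touched = 0
    for addr in range(0, working_set_words, stride):
        read_word(addr)
        touched += 1
    elapsed = time.perf_counter() - start
    return touched / elapsed if elapsed > 0 else float("inf")   # words per second

backing = {}
read_word = lambda a: backing.get(a, 0)   # stand-in for a scratchpad read
# Sweep sizes that would land in the L1, the central cache, and beyond.
for ws in (1 << 10, 1 << 16, 1 << 22):
    for stride in (1, 8, 64):
        bw = measure_bandwidth(read_word, ws, stride)
        print(f"ws={ws:>8} stride={stride:>3} bandwidth ~ {bw:,.0f} words/s")
```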
So what else can we do with memory?
>>: Why is this -- why is there [indiscernible] on that very last --
>> Elliott Fleming: Somebody else have a question? Yeah. This data here?
>>: All of those.
>> Elliott Fleming: So this data comes from the way in which we do our L1 caches. So we actually do skewed caches. Unfortunately, with a skewed cache, sometimes you get collisions. And some of these patterns produce pathological collisions in the skew.
>>: When you say skewed, you mean skew associative?
>> Elliott Fleming: No, these are direct mapped. So skewed, direct-mapped caches.
>>: Sounds like an oxymoron. So what is the skew?
>> Elliott Fleming: We hash the addresses into the cache. We just do a straight hash.
>>: Any cache is hashed.
>> Elliott Fleming: Yeah, yeah, yeah, but it's not a linear hash, right. We do some XORs. So there's some hash function perturbing the index of the cache. So it's a skewed cache.
>>: The literature benefit of skew-associative caches was reducing the probability of hot sets, so why would you put the skew function in versus a standard, you know, just [indiscernible] hash function?
>> Elliott Fleming: Well, for a simple reason, actually. So if you do that in
HAsim, wherein you're doing indexing based frequently on some processor
enumeration, it can be the case that you actually do get a hot set on a single
block in the cache, right.
So you still have a hot set in these things, right. It's just one entry. So if you have many addresses hitting the one entry, right, we just want to provide some randomization there. That's all. Although we are now working on doing set-associative L1s, but we'll still do skewing there too.
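Here is a small Python sketch of what such skewed indexing can look like. The XOR fold is illustrative only, not the actual hash used in the L1 caches.

```python
# Skewed indexing: hash the address into the index instead of using the low-order bits,
# so regular strides (e.g. a per-core enumeration in HAsim) don't all land on one set.
def skewed_index(addr, index_bits=10):
    mask = (1 << index_bits) - 1
    # XOR-fold higher address fields into the index; the exact hash is illustrative.
    return (addr ^ (addr >> index_bits) ^ (addr >> (2 * index_bits))) & mask

def naive_index(addr, index_bits=10):
    return addr & ((1 << index_bits) - 1)

# Addresses spaced exactly one cache's worth apart collide badly without skewing:
addrs = [i << 10 for i in range(16)]
print({naive_index(a) for a in addrs})    # all map to index 0
print({skewed_index(a) for a in addrs})   # spread across many indices
```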
Okay.
So anyway, that's how this works.
So now, you know, we mentioned that we get more memory so we can actually
increase the size of the cache. We can also introduce new algorithms. So one
of the things that we can do in terms of introducing new algorithms to a memory
hierarchy is adding prefetching. Again, we're optimizing under the memory
abstraction. So the user says I have this read request, read response, write
interface and the job of the compiler is to soak up FPGA resources to make that
thing as fast as possible.
So what we're going to do here is add prefetching. So how does this work? So if
you remember back to architecture class, you have some table, which you index
into based on PC. And then if you have sort of a stride, right, so a stride
pattern on a particular PC, then you'll try to prefetch out in front of that
stride to get a new value to cover some of the memory latency.
37
So we can do the same thing in an FPGA. Although there's one key difference. And what is that key difference, anybody? One of these fields is not right. There's no PC. So there's no PC in the FPGA, so we can't use that as a hint.
So it turns out that [indiscernible] programs don't have a PC, which might make prefetching a little bit more difficult. However, balancing this difficulty is the fact that hardware programs have a much cleaner access pattern. So if you think about how a software program works, right, you're passing data through the
stack. You're constantly inserting new accesses to memory, which really have
nothing to do with the data flow of the program and everything to do with sort
of like stack frames and function calls.
So that could screw up the prefetcher. But in hardware, of course, we don't
have that problem. It's just a clean access stream. So even without PC, we
can actually do a pretty good job of prefetching. So the idea here is if we
have extra resources on the FPGA, we can add more complexity at the compiler
under the abstraction and hopefully get more performance.
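A minimal Python sketch of a PC-less stride prefetcher of the kind being described follows, keyed purely on the client's address stream. The confirmation rule and prefetch distance are invented for illustration and are not the actual LEAP prefetcher.

```python
# PC-less stride prefetcher: watch the raw address stream of one scratchpad client
# and issue a prefetch once two consecutive strides agree.
class StridePrefetcher:
    def __init__(self, prefetch_distance=4):
        self.last_addr = None
        self.last_stride = None
        self.prefetch_distance = prefetch_distance
    def observe(self, addr):
        """Returns an address to prefetch, or None."""
        prefetch = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                prefetch = addr + stride * self.prefetch_distance
            self.last_stride = stride
        self.last_addr = addr
        return prefetch

pf = StridePrefetcher()
for a in (100, 108, 116, 124):
    hint = pf.observe(a)
    if hint is not None:
        print(f"access {a} -> prefetch {hint}")   # fires once the stride is confirmed
```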
So how does that work? So basically, here we have an implementation of matrix multiplication, and this is runtime normalized to a single -- I'm sorry, to an implementation without prefetching. So, of course, we do matrix multiplication, which has a very predictable access pattern of the kind the prefetcher should do a very good job with, and we find that, depending on the size of the matrix multiplication, size going this way, we actually get a lot of performance benefit with prefetching.
And again, this comes because the original hardware wasn't doing a very good job of handling the edge cases in the matrix multiplication algorithm. Of course, as the matrix gets larger, the edge conditions are less important. We spend more time running down the rows and so the benefit of prefetching is much less, although it is still measurable.
But for small matrices, 64 by 64, the performance gain is enormous, again because we hide the latency.
So that's great, so matrix multiplication, we should also be able to prefetch.
So what happens with H264? So H264 has a data dependent access pattern. Still
predictable, and we do pretty well here. Again, we get a maximum of about 20
percent performance gain with prefetching. And again, these are codes that
have existed for many years. We wrote papers about them and we're just using them as workloads.
Again, because we frame them in terms of the memory abstraction, we can
actually extract performance as we improve the algorithms in the cache
hierarchy.
Okay. So in conclusion, latency insensitive channels enable automatic multiple FPGA implementation with minimal user intervention. And when we do such an implementation, we can get higher performance, better algorithm scaling and, although I didn't talk about it, faster compilation. And the high level take-away here is that high level primitives can enable powerful automatic tools.
And as we move to more complex systems, these kinds of tools will be necessary, right. It can no longer be solely the purview of the hardware designer to produce these implementations.
Future work. So place and route. So as we know, those of us working with
FPGAs, place and route is taking a long time, and it continues to scale up as
the FPGAs get larger. However, the latency insensitive modules provide a way
of dealing with this in that you can place and route the latency insensitive
modules independently and then synthesize the network between them. The
general idea is that long wires with long latencies, which you would get with this sort of distributed approach, can be broken with register stages, much in the same way that we use buffer boxes in an ASIC process.
So additionally, right, if we only have to recompile part of the design, then
network resynthesis should be cheaper than full chip resynthesis. And then
finally, hardware/software communication, we already do this a little bit, but
I'd like to formalize it a bit more. So again, the problem with hardware and
software communicating is that hardware -- sorry, software is a
nondeterministic thing in terms of timing, but generally speaking, the latency
insensitive channels allow us to capture this. So basically, you could have
latency insensitive channels in software, latency insensitive channels in
hardware, some ensemble program composed of pieces of software, pieces of
hardware, and we would synthesize all of the communication between them. So
that's how that would work.
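As a toy illustration of that ensemble idea (invented names; a thread stands in for the FPGA side), both sides share the same latency-insensitive send/receive interface, and something like the compiler would wire them together over whatever transport exists.

```python
# Both the "software" and "hardware" sides use the same latency-insensitive channel
# interface: blocking send/receive with no timing assumptions.
import queue, threading

class LIChannel:
    def __init__(self):
        self._q = queue.Queue()
    def send(self, msg):
        self._q.put(msg)
    def recv(self):
        return self._q.get()

to_hw, to_sw = LIChannel(), LIChannel()

def software_side():
    to_hw.send({"op": "start", "arg": 7})
    print("software got:", to_sw.recv())

def hardware_model_side():
    req = to_hw.recv()
    to_sw.send({"result": req["arg"] * 3})   # stands in for the FPGA-side pipeline

t = threading.Thread(target=hardware_model_side)
t.start(); software_side(); t.join()
```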
With that, I'll take questions.
>> Doug Burger: Thank you very much. Since we're at time, we may have time for one question. One or two questions. And if not --
>>: You started off saying that FPGA use [indiscernible] and yet spent 99 percent of your time talking about processors.
>> Elliott Fleming: Processor simulation is an application for FPGA.
>>: No, because it doesn't [indiscernible].
>> Elliott Fleming: No, it absolutely does.
>>: It's something --
>> Elliott Fleming: No, no, no, no. Let me explain to you more of how HAsim works.
>>: I'm just kidding you [indiscernible] what you've been talking about
there's a disconnect and you're into architecture and we are more -- I don't
get the caches. Caches don't do nothing for me.
>> Elliott Fleming: Well, maybe they do and maybe they don't. It just depends on what your access pattern is, right?
>>: What I would like to see, for example, something that has [indiscernible]
going through and then you get into [indiscernible].
>> Elliott Fleming: All of these things have multiple [indiscernible], right.
All of these implementations are multiple [indiscernible]. In fact, what is
nice about this approach is actually once you have a latency insensitive
module, you can pick a clock for each module, right. If it's beneficial.
Again, you know, synchronizers change the timing behavior. So actually, this
model kind of supports intrinsically multiple clock domains.
>>: You spend all your time on timing insensitive, as-fast-as-possible applications.
>> Elliott Fleming: Sure. Well, I think Airblue is actually timing sensitive
at the high level rate, because you have to produce a result in ten or 20
microseconds, and then it's a question of whether or not the latency between
the FPGAs is tolerable.
Same is true in H264 encoder. I didn't talk about H264, mainly because I think
the results are not terribly enlightening. They're no different than the
wireless transceiver. I mean, the basic idea is yeah, we can partition it.
Yeah, the bandwidth between the chips is sufficient. And yeah, we meet the
millisecond or whatever timing, even with introducing this new latency
component.
But, you know, I think what is different here, let me address the processor simulation problem. So understand that the processor simulator is actually a
hybrid design because, of course, we're not modeling things like disk. We're
not modeling things like most of the operating system on the FPGA. Actually,
that stuff is running in the software simulator on top.
>>: [indiscernible]. I understand. The reflection is more like all that
work. Switch your mind to a certain set of problems and bandwidth, you
actually have an [indiscernible] with specific time requirements, et cetera,
you end up finding a whole different set of problems for the tools. You get
much closer to the tools. You get to understand [indiscernible] constraints
and your timing is quite [indiscernible] spend a lot of time and then the
difference between synthesis and simulation and the [indiscernible] becomes a
problem. So that's more [indiscernible].
You phrase it as this is good for applications. But in fact it's not about [indiscernible].
>> Elliott Fleming: Well, I mean, I only talk about ASIC synthesis in the
sense that this is what multiple FPGAs were used for in the past, right.
>>: I think more to the point is that HAsim represents a single -- a very
specific application.
>> Elliott Fleming: Sure.
>>: And many of their applications, with a certain set of constraints, a
certain set of characteristics and many other applications have different,
widely different constraints.
>> Elliott Fleming: Sure, absolutely.
>>: Absolutely. System level behavior, bandwidth level behavior, that sort of thing.
>> Elliott Fleming: Sure, I don't doubt it. I would think that this would accommodate those designs, though.
>>: I think as we move forward into more automation, I think it's really important to understand these different spaces and think carefully about -- that's an easy statement to knock off, but, you know, there's a lot of FPGA intuition in the room that's not lining up with the approach that you're taking for many problems.
So I think we should -- it will be important to understand those spaces.
>> Elliott Fleming: If you have to do things by hand, you have to do things by
hand. I mean, you know, sometimes we have to write assembly too. And there's
nothing in this that prevents that. It's just that if you can get away with
not doing that, then it's probably best that we stay at the high level. But,
of course, you know, you can't always live in that world.
>>: [indiscernible] it's really time consuming. So you spend more time
[indiscernible] for example as opposed to saying [indiscernible] see what
happens. We're using as much as possible -- do you see what I'm saying? It's
different kind of things [indiscernible] than if you're doing something else.
>> Elliott Fleming: Sure, I completely believe anytime you use the compiler,
the compiler may always be bad, but I think the [indiscernible] history is that
compilers inevitably get very good.
>>: I'm saying, I would like to see more of compilers report in what you've
done.
>> Elliott Fleming: Sure. I mean, I completely agree that there is more work
needed here. Particularly on the quality of service front. But I think it is
possible to model some of those things in the compiler and get good answers. I
think. I mean, at least I think so. But we'd have to look at particular
applications before we could get some conclusion.
>> Doug Burger: All right. Thank you very much.