>> Alessandro Forin: Good morning. Today it's my pleasure to introduce Frank
Vahid from UC Riverside. Frank has been involved in FPGAs, and that's what
he's going to talk to us about in architecture. But he's also teaching embedded
systems and has some work in eBlocks. So if anybody wants to talk to him,
there's still at least one slot open, either about this talk or other areas related to
it. Frank.
>> Frank Vahid: Okay. Thank you, Alessandro. And thanks for coming. I
wanted to thank Alessandro, first of all, for hosting Scott Sirowy who's doing an
internship here and for arranging for this visit. And I just received this last night,
and I wanted to thank him for this too. And I actually did receive this last night,
came from Microsoft Corporation. It looks like I won your long-awaited
sweepstakes results from your Foundation for Software Promotion. So $250,000.
And I assume you had something to do with that, Alessandro, so thanks. I
appreciate that, you sly dog, man.
So let me jump right into things here. So we have lots of transistors available
today. And what Intel's doing is putting more and more cores onto chips and
putting more and more things on there. Another thing we can do is put different
types of cores, so we can have big cores, small cores, and so on. So some really
neat things going on in architecture, trying to figure out what to do with all these
extra transistors that we have. Well, another direction that we're seeing
happening is that people are putting FPGAs onto chips. And I'll talk a little bit
more about what FPGAs are in a minute.
But what you're looking at in the upper right there are four PowerPC
processors -- that's the four yellow boxes in the middle -- surrounded by Field
Programmable Gate Array logic. On the far right we're looking at an ARM9
processor, which is pretty small all the way up in the upper-right corner there, and
lots and lots of FPGA. That's happening at the chip level. It's also happening at
the board level. So right here is a Cray XD1, which unfortunately isn't made
anymore, but it had a number of processors, along with FPGA. And SGI is
actually making supercomputers with FPGAs at the board level. And we actually
acquired one last year. So 64 Itanium processors and a number of FPGAs all with
equal access to memory.
AMD supports FPGAs now. Intel has announced that they're going to be
supporting FPGAs more and more. So FPGAs are starting to come into maturity,
and they're starting to be supported by more and more platforms. They've been
around about 20 years. And they're a very interesting way of doing computation.
You guys have some really cool research going on here about using them.
Let me give a quick background of what FPGAs are. So the fundamental concept
of FPGA is that you can implement a circuit, a very simple circuit using a
memory. That's the basic idea. So let's suppose you want to implement this
simple circuit here with A and B as inputs and F and G as outputs. I could use a
4-word-by-2-bit-wide memory, have A and B go in as the address lines, and just
program that memory with ones and zeros so that F and G implement this
NAND gate and this inverter.
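[For reference, a minimal C sketch of the lookup-table idea. The slide itself isn't
visible in the transcript, so the choice of F = NAND(A, B) and G = NOT A is an
assumption:]

    #include <stdio.h>

    /* A 4-word-by-2-bit memory: address = (a << 1) | b, and each stored
       word holds the two output bits F and G. Assumed circuit:
       F = NAND(a, b), G = NOT a. */
    static const unsigned char lut[4] = {
        3,  /* a=0, b=0: F=1, G=1 */
        3,  /* a=0, b=1: F=1, G=1 */
        2,  /* a=1, b=0: F=1, G=0 */
        0,  /* a=1, b=1: F=0, G=0 */
    };

    int main(void) {
        for (int a = 0; a <= 1; a++)
            for (int b = 0; b <= 1; b++) {
                unsigned word = lut[(a << 1) | b];
                int F = (word >> 1) & 1;   /* high bit: F */
                int G = word & 1;          /* low bit:  G */
                printf("a=%d b=%d -> F=%d G=%d\n", a, b, F, G);
            }
        return 0;
    }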
So that's the basic idea. You could have bigger circuits of course. And then I
would take -- I would have a bunch of these memories and I need to connect them
somehow and I want to be able to program those connections, because every
circuit's different. And so I can create these programmable switches, or what are
called switch matrices. So, for example, I have two inputs on the left, two outputs
on the right, and by programming the select lines of those two [inaudible], I can
have A go to X, I can have A go to Y, I can have B go to X, I can have B go to Y.
So I could have A go to both X and Y.
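[A switch matrix is likewise just configuration bits driving multiplexers. A sketch
of the 2-input, 2-output case described here, with invented names:]

    /* Two configuration bits, one per output: sel_x chooses whether X is
       driven by A or B, and sel_y does the same for Y. Setting both to 0
       routes A to both X and Y. */
    typedef struct { int sel_x, sel_y; } switch_cfg;

    void route(switch_cfg cfg, int a, int b, int *x, int *y) {
        *x = cfg.sel_x ? b : a;   /* 0: A -> X, 1: B -> X */
        *y = cfg.sel_y ? b : a;   /* 0: A -> Y, 1: B -> Y */
    }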
So just by programming bits, I can implement circuits and I can implement
programmable connections. And so the idea of an FPGA is just to take lots and
lots of these little memories, which are called lookup tables, and lots and lots of
these switch matrices, literally thousands of them, and create a very regular grid
of these things. And then just by programming zeros and ones I can put any
circuit onto that chip, any custom circuit onto the chip.
And that's what you're looking at down in the corner there, that very regular
structure that you're looking at, other than the four processors, is just a whole
bunch of those lookup tables and switch matrices.
By the way, I understand the tradition here would be questions anytime? Is that right?
Okay. So please do.
Okay. So that's the basic idea. And normally we don't have to actually figure out
what the zeros and ones are. That's what the CAD tools do for us.
FPGAs have a very strange name, Field Programmable Gate Arrays. The reason
why is that when they came out in the '80s, the most popular custom-made chip
was called a Gate Array chip. And so these were like Gate Array chips but you
could program them in the field. There's no gate array inside of them. But it's
just in contrast to what was the standard technology back then.
Okay. So why are FPGAs a big deal? Because they can implement circuits and
because circuits do computations and can do computations much faster than
microprocessors sometimes. And here's why: For example, suppose that you
want to just reverse 32 bits, you just want to take all the bits like this and flip
them this way. Do you like this example?
>>: [inaudible]
>> Frank Vahid: Okay. So, you know, you'd write some -- there is some very
efficient C code to do it. And it would compile down to these assembly
instructions and you'd require somewhere between 38 and 128 cycles. Yes.
>>: [inaudible] faster than the four table lookups?
>> Frank Vahid: What's that?
>>: Is this code up there faster than the four table lookups, one on each byte?
>> Frank Vahid: That would be another way to do it. So that would probably get
you close to that number of cycles.
So, on the other hand, as a circuit, it's just wires. So you could do it in one clock
cycle. That cycle might be a little longer than a regular microprocessor, but still
it's just one clock cycle.
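[One well-known C idiom for the bit-reversal example; the talk's slide may have
used a different variant. On a microprocessor this is still dozens of instructions,
while as a circuit it is literally just 32 crossed wires:]

    #include <stdint.h>

    /* Swap halves at successively smaller granularities: bits, pairs,
       nibbles, bytes, then the two 16-bit halves. */
    uint32_t reverse32(uint32_t x) {
        x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
        x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
        x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);
        x = ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);
        return (x << 16) | (x >> 16);
    }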
Here's another example: An FIR filter where you're doing 128 multiplications
and additions, and that would require thousands of instructions on a
microprocessor. But as a circuit, you could create this kind of interesting tree
where you have -- there are 128 multiplications right up front, and then you just
create an adder tree going down and you get your result.
And this could be done in one cycle if you wanted one big cycle, or you could
pipeline it and get it down to just seven cycles or so. And you could even get
very high throughput if it's pipelined. So you can see that you're going from
several thousand cycles down to just a few cycles.
So that's the basic idea why circuits work well on some types of computations.
Not all, but some.
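[A sketch of the FIR computation in C, assuming 128 taps as in the talk. On a
microprocessor the loop runs sequentially; in a circuit the 128 multiplies happen
at once and the sum becomes a log2(128) = 7-level adder tree, which is where the
roughly seven pipelined cycles come from:]

    #define TAPS 128

    long fir(const int x[TAPS], const int coeff[TAPS]) {
        long acc = 0;
        for (int i = 0; i < TAPS; i++)     /* thousands of instructions here */
            acc += (long)x[i] * coeff[i];
        return acc;
    }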
There's been lots of work over the last 10, 15 years showing that compared to
different microprocessor platforms ranging from a 200-megahertz ARM all the
way up to a 3-gigahertz Athlon or Xeon processor, that you can get pretty good
speedups on certain types of computations. So we're looking at 10x speedups, 40x --
we basically compiled this from a variety of papers. This comes from maybe a
dozen papers from embedded system conferences, from architecture conferences,
from supercomputing conferences.
500x speedups to do placement for CAD. Fourier transforms, simulated
[inaudible] and so on. So very big speedups we're talking about. Not 10
percent, 20 percent; we're talking 100x sometimes.
The thing to note is that even though these are circuits, these circuits are software.
It's just bits going into a prefabricated chip just like we download binaries into a
microprocessor. However, circuits and hardware are sort of synonymous, right?
When we say circuits, we think hardware, right? We say hardware-software
partitioning when we take things and put them on FPGAs.
That's something that I think we need to stop doing. Software and instructions are
not the same thing. They're not synonymous. Software just means bits. Those
bits can be either instructions or circuits. By the way, I just put up a quote there.
I was just kind of curious, so I found the first believed use of the word "software"
was in 1958. So that's the quote that came from a paper.
So it's important to realize that circuits are software. They are just -- they can be
software. They're just bits that are going into the lookup tables and the switch
matrices. You download them into a chip. Just the same way that you download
a sequence of instructions into a chip.
I'm trying to get the community to stop calling circuits hardware, so this was an
IEEE Computer article that came out a few months ago. I don't know if you guys
have it. Maybe you guys can help too by, in your normal conversations, always
using circuits instead of hardware.
Okay. On the left here is a chart of Xilinx revenues. Xilinx is one of the two big
FPGA manufacturers. You can see it's very much a growth industry. Altera is
about equal in terms of its revenues also.
So what we're seeing is that we have a pretty good steep increase here. They were
just invented in the late '80s. Multibillion dollar industry. You're starting to find
FPGAs in more and more products. We have some recent announcements,
especially by Intel, saying that they're -- that major computer makers are
supporting FPGAs. It's just in the last two years or so that these announcements
have been coming out and that actual products are coming out.
So I think we're at a point now where FPGAs are about to take off. It's, of course,
very hard to predict, but it looks like we're at a transition here where it's widely
supported that people are going to start using FPGAs to do computation far more
than they did before.
It's hard to predict. I like to put up this example of trying to predict technology.
This was -- it's a little story about Alexander Graham Bell when he tried to go to
Western Union with his telephone patent and see if they wanted to license it. And
they said, Why would we need a telephone, we have a telegraph. The
information's getting across, what does it matter whether it's actual voice or not.
So big mistake. Western Union is now just the fastest way to wire money as
opposed to something bigger.
So I think FPGAs, we might be at that point now where we're transitioning from
the telegraph to the telephone in terms of using FPGAs. I think they're going to
really take off in terms of the computation platform. But you never know. The
future is very hard to predict, right?
Okay. So let me give you a little bit of background on the actual project that I'm
going to talk about today, starting with what I was working on back when I was
doing my Ph.D. work at UC Irvine with Dan Gajski. From '89 to '94 we worked
on a tool called SpecSyn. And the idea was to take a high-level language and
synthesize it down to a microprocessor and circuits next to the microprocessor
that would speed up the most important things. So rather than just compiling
down to microprocessor, we were trying to compile down to two things, take the
noncritical code, put it on the microprocessor, take the critical code and create a
custom coprocessor just for that program.
Back then FPGAs had just been invented, 1986. Is that about right? '86, '87 was
when Xilinx came out? And they had very little capacity. So I hadn't even heard
of them back then.
Okay. Fast-forward a little bit to around the year 2000. And right around the year
2000 is when dynamic software optimization and translation was getting really
popular. So, for example, Hewlett-Packard had their Dynamo system where as a
binary was executing they would look for the hotspots and recompile those
hotspots into a more optimized piece of code and then replace the existing binary
by this reoptimized binary. And they could get 10, 15 percent improvements.
Around this time Java just-in-time compilers were coming out. Around this time
Transmeta Crusoe was talking about their code-morphing platform, where instead
of trying to implement an x86 binary in a somewhat more native way, they would
just create whatever architecture they wanted, in that case a VLIW architecture,
and on-the-fly just translate their x86 binaries to VLIW binaries. So underneath it
was all VLIW. But you could run x86 binaries. So -- and they got lower power
and pretty good performance.
So all of these techniques were getting 10, 20, 30
percent, maybe, improvements. But remember the slide I showed you about
FPGAs: they get 100x improvements sometimes, not 10, 20 percent. And
remember FPGAs are software, just like a VLIW is implementing software. So
we thought why don't we dynamically take this binary and via some process put it
on an FPGA, or at least the critical spots. And instead of getting 10, 20 percent we
may get 100x. Yes.
>>: Why would you start with a binary? People [inaudible] x86 binaries and that
was it. It was the worst possible source to compile from.
>> Frank Vahid: True. On the other hand, if you want to add an architectural
feature without having to change the entire industry, you have to work with
binaries, because that's the lingua franca of computation.
>>: There's two -- at least two products here that start with, say, Java class files
and compile those.
>> Frank Vahid: Oh, sure. That's what I was doing back in 1989 also. That's my
roots. Okay. So I come from that. But I said, look, if I can do this dynamically
from a binary, then I can put this into computers and people don't even have to
know it's there, they don't have to recompile, they don't have to do anything.
So that was the inspiration: to see if we could do a just-in-time compilation,
some sort of binary translation, so that the FPGA would be treated sort of like
a cache. It's just invisible. It just gives you invisible performance improvements.
So we started working on this project back in 2002 to dynamically translate
binaries to FPGAs. So here's how it works. Initially you would download a
standard binary onto a platform, and it would run on a microprocessor. And the
programmer, the compiler, nobody knows that there's an FPGA here. And you
would get some time and performance -- time and energy characteristics from
there. Meanwhile, a profiler would monitor the executing program, and it would
eventually detect some critical loops. And what it would do then is read those
critical loops while the program is still executing. It would read those into some
on-chip CAD program, which could be running on the same microprocessor or it
could be running on a separate one.
What we would then do is decompile the region into a control/data flow graph,
and I'm going to talk a little bit more about decompilation. And then we would
synthesize it down to a circuit, so there's like an adder tree from the FIR I showed
you earlier. And then we would take that circuit and map it into the small
memories and the switch matrices that are on an FPGA. So we'd maybe put the
adders on those two blocks there and then we'd run this wiring through those
switch matrices.
And once that's all done, we would modify the original binary so that it would
make use of the FPGA now for those critical regions. So the original code is gone
and it's replaced by jumps to this FPGA here. And all of a sudden you get maybe
10x improvement in time and energy for some programs; for others you won't get
any. But you might get 10x, maybe even 100x improvement. We call that
warping: all of a sudden your program just starts running faster and using
less energy. So that's the basic idea.
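[A hypothetical sketch of that flow as code; every type and function name below
is invented for illustration and is not the actual warp-processor API:]

    typedef struct binary  binary_t;   /* running program image    */
    typedef struct region  region_t;   /* a detected critical loop */
    typedef struct cdfg    cdfg_t;     /* control/data flow graph  */
    typedef struct circuit circuit_t;  /* synthesized circuit      */
    typedef struct fpga    fpga_t;     /* on-chip FPGA fabric      */

    extern region_t  *profile_hotspot(binary_t *);  /* watches the running binary */
    extern cdfg_t    *decompile(region_t *);        /* recover loops, arrays, functions */
    extern circuit_t *synthesize(cdfg_t *);         /* e.g. the FIR adder tree */
    extern void       place_and_route(circuit_t *, fpga_t *); /* map to LUTs and
                                                                  switch matrices */
    extern void       patch_binary(binary_t *, region_t *, fpga_t *); /* replace the
                                                      region with jumps to the FPGA */

    void warp(binary_t *bin, fpga_t *fpga) {
        region_t *hot = profile_hotspot(bin);   /* program keeps executing meanwhile */
        place_and_route(synthesize(decompile(hot)), fpga);
        patch_binary(bin, hot, fpga);           /* from here on, the region "warps" */
    }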
So when you look at that idea, you're immediately faced with two big problems.
The first problem is what you were addressing, right? Binaries are just a horrible
starting point. Right? I mean, you've lost your loops, you've lost your arrays,
you've lost your functions, you've lost all that high-level information that you
really want in order to synthesize a good circuit.
So the question is can we somehow recover those high-level constructs from
binaries. That's the first question. The second question is for people who know
what CAD tools do, how long they take to take a binary -- or, I'm sorry, to take
source code and synthesize it down to a circuit on an FPGA. You know that runs
for tens of minutes, sometimes hours, right? So frustrating. So that's a long time
for this thing to be trying to do some sort of optimization on the fly. So is there
some way to speed that up.
So those were the two big problems and those were the two things that we worked
on from 2002 up until about 2006 or so. So one was decompilation. This is what
happens if you don't decompile. If you just take the binary and feed it through a
synthesis tool, okay, I mean, convert it into a control/data flow graph and then feed
it through a synthesis tool, you want to be below this line on performance and energy.
And you can see that for these examples. Everything's worse, okay. The circuit
you get is much slower than a microprocessor. So we need to recover some
high-level information before we try to synthesize a circuit.
So here's an example of a for loop that does some accumulation. And this is what
it would get compiled to at the assembly level. So our first step is to get that
control/data flow graph. So it -- the control/data flow graph looks a lot like the
assembly. That's why synthesizing from it doesn't give us much improvement.
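[An illustrative loop of the kind described, not the actual slide's code. At the
binary level the loop structure, the array, and the function boundary all disappear
into compares, loads, adds, and branches, which is exactly what decompilation has
to recover:]

    long accumulate(const int a[], int n) {
        long sum = 0;
        for (int i = 0; i < n; i++)   /* becomes a load/add/branch pattern */
            sum += a[i];
        return sum;
    }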
But what we can then do is start applying decompilation techniques. And what
we did was we actually built on about 20 years of previous work on
decompilation, which was more intended for binary-to-binary translation. So we
looked through the literature and found a bunch of techniques that we said, okay,
this technique would work well for us, this one would also be very useful for
synthesis. These two are not so important.
So we went through and we found about seven or eight decompilation techniques.
We had to build some of our own techniques, like we re-roll loops, for example.
If you have an unrolled loop, we figure out what it was and re-roll it, okay. And
we go through and apply these techniques. And what starts happening is little by
little the code that would come out of that control/data flow graph starts to look
more and more like the original. You do data flow analysis, so you get rid of
some of the intermediary registers from the assembly. And then you do some
function recovery, so you can actually see where functions are in the code and
you can isolate those. Start to recover control structures, so we can detect loops,
detect IF statements, even switches sometimes.
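[An illustrative example of loop re-rolling, assuming an unroll factor of 4 and n
divisible by 4; the actual benchmarks would differ:]

    /* What the binary effectively contains (unrolled by the compiler): */
    void unrolled(int a[], int n) {
        for (int i = 0; i < n; i += 4) {
            a[i]     += 1;
            a[i + 1] += 1;
            a[i + 2] += 1;
            a[i + 3] += 1;
        }
    }

    /* What re-rolling recovers: */
    void rerolled(int a[], int n) {
        for (int i = 0; i < n; i++)
            a[i] += 1;
    }

Yes.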
>>: [inaudible] assembly some kind of RISC code. Can you do it for the x86
with this incredible 30 years of baggage?
>> Frank Vahid: So we've been doing it primarily for clean instruction sets.
We're doing it for ARM, MIPS, we even did it for the MicroBlaze, which is a
processor designed for FPGAs.
Most of the decompilation work has been done on x86 stuff. So the techniques
have been developed on x86. Most of our results, because we came from an
embedded world, were for ARM and MIPS. But actually my student, who's now
professor at University of Florida, he's as we speak doing it for -- he's porting his
tools to x86. So there's been plenty of decompilation work for x86.
What ends up happening, you can recover arrays. What ends up happening is you
can recover a large amount of the high-level information from binaries. So we
started doing some experiments to see how much could we recover. We looked at
in this case a bunch of embedded system benchmarks. We did synthesis from C
code. And we basically took the techniques that are used in various high-level
synthesis tools that will take C code and generate circuits. And we basically did
them manually to make sure that we weren't suffering -- that we weren't using a
tool that wasn't ideal. So this is sort of the best-case scenario for synthesis from
C.
And then we ran our -- the binaries of those same programs. We took the
binaries, ran them through our decompilation tools and then synthesized. And
what you end up seeing is that there's no time overhead, at least for these
examples, some area overhead, basically one example that was really bad there.
And just because synthesis and compilation involve a bunch of arbitrary factors,
one example from a binary was actually better. But that doesn't really mean
anything. That's just noise, really.
Okay. So we did this for actually dozens of examples. But we really wanted to
make sure -- and I'll tell you why. We got a lot of, to put it mildly, hostile
reaction to the idea of synthesizing from binaries. In fact, my student back in
2003, he submitted a paper and he got a rejection. And it was the most violent
rejection that we've ever experienced. And then so one guy was just blasting us
for being just morons, you know. And then another guy, his review was just one
sentence. It said: Synthesis from binaries is just wrong. That was the entire
review, right? Like we were committing some sort of moral transgression here.
>>: [inaudible]
[laughter]
>> Frank Vahid: So in addition to those studies that we did, we went to one of
our partner companies, Freescale. And we said, look, we really want to do this on
a serious benchmark that -- a real benchmark, not just the stuff that you get online
but a real piece of software, and we want to see if we can speed it up at the binary
level. So after several months of negotiations with them and lawyers involved,
we got access to their actual H.264 video decoder, the one that they actually put
on cell phones.
So it was a several-million-dollar piece of software that they spent several
man-years developing. Highly optimized, 10 times faster than the reference H.264
code that you can get. So we took that code as our starting point. We did our
analysis, we found the critical regions. And we started to say, okay, let's see how
we can speed these up. So this is the ideal curve. This is if we could take these
critical regions of code and just eliminate them, okay, execute them in zero time.
That would be the ideal speedup that you could get with an FPGA. So you can
see if you did about 40 of those critical regions, you'd be getting about 10x
speedup. So that's our upper bound. We can't do better than that.
If you do it from the C level, this is what you get. So this takes into account the
time that it takes to transfer data over to the FPGA for the FPGA's computation to
send the data back to the microprocessor. This is the speedup that an FPGA could
give you over a microprocessor. I think this is over a 200-megahertz ARM using a
similar FPGA, so similar technology.
So now the question is how do we do it from the binary. So we went through and
this was about a four-month study that we went through and did this, and that's
how we did with the binary. So barely off. We're maybe 3, 4 percent off. So
even with this highly optimized code, we were still able to be fairly competitive.
But I still wasn't satisfied, right? Because we really got very hostile reaction. So
we looked even further. We said let's try different architecture. So we tried the
MIPS, the ARM, and the MicroBlaze. And we said let's try different optimization
levels, because a common question we got was, well, if you don't do much
optimization, decompilation works fine, but what if you do a lot of optimization
in your compilation? Maybe it confuses the assembly so much that you can't
recover loops, you can't recover arrays and so on.
So I said, okay, let's try a different optimization level. For the MIPS, this is the
speedup we get using an FPGA if the code was originally compiled with the
-O1 optimization level. And this is the speedup we get if it was compiled
with -O3, with higher optimization. So about the same. So we didn't lose
anything there. With the ARM, that's the speedup we could get with -O1.
That's the speedup we get with -O3, actually better. Much to our surprise.
We weren't expecting that. And, likewise, for the MicroBlaze it actually gets
better with the higher optimization.
So we looked into this a little bit and we found that it's doing more constant
propagation, it's reducing memory accesses, and so it's doing things that are good
for the FPGA too. And we're not losing our ability to recover the high-level
information. So that was a very surprising finding.
Okay. Any questions on the decompilation work, the binary synthesis work?
Yeah.
>>: [inaudible] optimization you might actually get better results because
[inaudible] when you had the optimization up higher, does the resulting
decompiled code still resemble the original C code or does it look slightly different?
>> Frank Vahid: It's very similar to the original C code. So in almost all cases --
so Greg actually did a comparison. He looked at the various features and tried to
figure out what percentage of features we were able to recover. And the numbers
he got were like 95, 97 percent. So regardless of the optimization levels, we're
able to get -- we're able to recover the loops, the arrays, the functions, the IF
statements and so on.
>>: Another thing, what if whoever wrote the original C code just did something
that's completely erroneous, that doesn't make any sense, but the [inaudible] was
able to catch it and compile it down so they look better, and then you decompile
it, does it still look that way or --
>> Frank Vahid: No, no, no, no. No. Whatever the compiler does, we can't -- we
can't reverse engineer that type of stuff, right? But for straightforward code --
and like embedded code tends to be fairly straightforward, there's loops walking
through arrays and things, and we can get a lot of it back. Yeah?
>>: I assume this is coming up, but I'm curious of course about relative to the
initial time to execute the kernel how long does it take you to do this
decompilation and this FPGA synthesis.
>> Frank Vahid: It's coming up in a bit, but I'll just give you the quick answer. It
takes forever. So it's still an eternity, but it is practical in some cases. And that's
if you're using like regular CAD tools. What I'm going to show you in the next
two slides is how we can at least ameliorate some of that problem.
Okay. So the second big problem with warp processing was this, exactly this,
was how long does it take to do all this decompilation, the synthesis, the
placement and routing onto the FPGAs. And so I'm just going to show you the
results of that rather than diving into the details.
For a set of benchmarks that we looked at using the most popular FPGA, Xilinx,
and using their synthesis tools, we analyzed how long it takes to actually run the
entire sequence of taking a high-level piece of code and mapping it down to an
FPGA implementation. And these are the big contributors. Decompilation is
actually just tiny. It's negligible. So we don't even show decompilation here. But
once you've got that high-level representation running through register transfer
level synthesis, which is also trivial, so we don't show it, logic synthesis,
technology mapping, placement and routing, which is trying to figure out how to
connect all those wires of the circuit on the FPGA, that's how much time it takes
and it takes about 60 megabytes of memory.
So a different student, Roman Lysecky, who's now a professor at University of
Arizona and still working on this, his job was to figure out how to shrink this. So
he went and looked at each one of these and came up with techniques that were
not solely focused on getting the fastest circuit, which is almost all work in CAD
is how to optimize that circuit. And instead he said, well, how can I get a
reasonable circuit in a shorter period of time, and our goal was 10x improvement.
We wanted to get 10x faster flow here, so we wanted to get down to .9 seconds,
basically.
And just to make a long story short, this is what he came up with. So he was able
to take each of those and shrink them down like this so that overall we got .2
seconds of execution, a 30 percent slower circuit. So the tradeoff is that you have
a slower circuit, about 30 percent slower, but you have almost 20x now
improvement, 15x or so.
Just for fun, he took that -- his tool set, which we call the Riverside FPGA tools,
and he showed that you can actually run these on a 75-megahertz ARM. The idea
of running CAD on an ARM is -- if anybody knows CAD, you really would never
do that, right? And to run it on a 75-megahertz ARM7, I mean, that's really a
wimp. That's a wimpy, wimpy processor. He showed that his tools run on that
and they would only take 1.4 seconds and still of course use only 3.6 megabytes.
So that's kind of neat. If you can do CAD very efficiently, then warp processing
becomes a little bit more feasible.
These are some results we got using our entire tool flow, for a number of
embedded system benchmarks compared to a 200-megahertz ARM and using a
roughly equivalent FPGA. If 200 megahertz looks slow to you, we can use a
600-megahertz ARM, that's fine, but then I'm going to use a faster FPGA. Okay.
What we were talking about in the hallway this morning, right?
So we got -- for these benchmarks we got between 50, 200, sometimes 5, 10x
speedup. On average we got about 40x speedups using the FPGA as a
coprocessor. Yeah.
>>: [inaudible] geometric mean and the harmonic mean.
>> Frank Vahid: Greg usually does geometric mean. This was Roman's work. I
think he just took arithmetic mean. I don't know, but we can sort of visually see
it, right? We're not throwing anything way off, are we, with the 191 and the 130
there? So maybe 30x, just looking at -- just kind of eyeball it, and geometric
mean would be about 30.
Now, this is just for the kernels. And actually when you look at papers that are
doing speedups using FPGAs, they usually only show the kernels. That's a little
misleading, because you really want to -- you really care about the application.
So we also did this for the application. So looking at the overall application
speedup, about 7.4, which is still pretty good. Still getting close to an order of
magnitude speedup.
I have to point out that these examples are embedded benchmarks, they do get
sped up. There are lots of examples that don't. For example, we tried to do this
with SPEC, which is the normal desktop application suite. And we basically
couldn't get any speedup at all. So it just didn't work on those. There just weren't
loops that were amenable to being sped up by circuits.
Have you had any luck speeding up SPEC?
>>: [inaudible]
>> Frank Vahid: Yeah? The V-zip [phonetic]? Yeah. Okay.
>>: My problem is that they're all floating point.
>> Frank Vahid: Yeah, I know.
>>: [inaudible]
>> Frank Vahid: Yeah. That includes a lot of them. Although we do have --
now with the modern FPGAs the floating point can be done reasonably
efficiently, but the resources are somewhat limited.
Okay. So moving on to some more recent work, some more recent directions of
thread warping -- of warping, we looked at multicore platforms, which are much
more likely to be implementing multithreaded applications. So let's try to take
this warping concept and apply it to multithreaded applications.
So we looked at an approach where the binary would be loaded onto a
microprocessor, the operating system would start scheduling threads onto available
cores. And any threads that weren't executing would be waiting in a queue. And
so we looked at warp processing from the perspective of, well, let's look at that
queue and take those threads, run them through our on-chip CAD and actually put
those threads onto the FPGA as custom circuits. So rather than being limited by
the number of cores that we have on our platform, we can synthesize as many
threads as we want. We can synthesize 48 threads maybe, if you had 48 threads
waiting. And they would each be executing then in parallel, and then the
operating system would schedule the threads onto those circuits.
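[A hypothetical sketch of that scheduling idea; every name below is invented for
illustration:]

    typedef struct thread thread_t;
    typedef struct queue  queue_t;
    typedef struct fpga   fpga_t;

    extern thread_t *queue_pop(queue_t *);
    extern void      queue_push_front(queue_t *, thread_t *);
    extern int       core_available(void);
    extern int       fpga_space_left(fpga_t *);
    extern void      run_on_core(thread_t *);
    extern void      synthesize_and_run(thread_t *, fpga_t *); /* on-chip CAD */

    void schedule(queue_t *ready, fpga_t *fpga) {
        thread_t *t;
        while ((t = queue_pop(ready))) {
            if (core_available())
                run_on_core(t);               /* normal OS scheduling */
            else if (fpga_space_left(fpga))
                synthesize_and_run(t, fpga);  /* e.g. 48 waiting threads become
                                                 48 parallel custom circuits */
            else {
                queue_push_front(ready, t);   /* keep waiting */
                break;
            }
        }
    }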
And potentially you could get very large speedups. That's because you're looking at a
level of parallelism here that is even greater, down at the circuit level. You're
basically creating on-the-fly multicore systems, creating as many cores as you
want, but each core is specialized to execute just that thread.
It's a fairly complex framework. I'm not going to talk about all of it, except I do
want to sort of zoom into the one thing that was sort of the Achilles heel of this
thread warping and the solutions that we came up with.
If you know -- if you've been working with FPGAs and trying to get speedups, you
know that the memory bottleneck is often the Achilles heel, right? You can put
48 threads on here, sure, but they're all just sitting there waiting for some bus to
return data to them so they can actually compute, right? And so you don't actually get
any speedup at all. You probably had this experience, right? So you can have as
much concurrency here as you want, it's all moot if you don't have enough
bandwidth from the memory to feed the data to these things that they need to do
their computation.
A lot of publications, when they talk about speedups on FPGAs, ignore that
point actually. They just assume the data is in the FPGA already. So that's a little
bit cheating, so same for multicore -- yeah, that's right.
Okay. So we looked at a number of multithreaded benchmarks -- there's a bunch
out there -- and we noticed a couple of patterns that were common. Here's one
pattern, which is that you've got a main program generating your threads, so here's
one that's creating 10 threads. They're all calling -- they're based on function F.
They're all accessing array I but they're using some different constant here to
multiply it. So there's -- the basic pattern is that you've got a whole bunch of
identical threads all accessing the same array.
So the solution there is don't have each thread access the array itself, instead have
one thing access the array and feed that then to all of the threads simultaneously.
So we wrote some algorithms that would identify the groups of threads, identify
those constant arrays, those constant memory accesses, and then would actually
synthesize a separate custom DMA that will handle the access of that array.
So these threads, when they're activated, this DMA will actually on their behalf
fetch the data they need, push it into all of them, and then those guys all have their
data. And you didn't have to have all 10 of those functions trying to access the
RAM independently. So what we end up getting is the data being fetched and
then pushed into the threads, rather than the threads fetching. So before we did
that memory access synchronization, it would have been about 1,000 memory
accesses for this example, and after it was reduced to only 100. So it helps
alleviate the bottleneck.
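[A pthreads sketch of the pattern being described, with invented names and sizes
chosen to match the talk's numbers: 10 threads and a 100-element array, so about
1,000 accesses without the optimization. In thread warping, the synthesized smart
DMA would fetch a[] once and broadcast it to all 10 circuits:]

    #include <pthread.h>

    #define N_THREADS 10
    #define N 100

    static const int a[N];            /* the shared, read-only array */
    static long results[N_THREADS];

    static void *f(void *arg) {
        long id = (long)arg;          /* per-thread constant */
        long sum = 0;
        for (int i = 0; i < N; i++)
            sum += (long)a[i] * (id + 1);  /* every thread reads all of a[] */
        results[id] = sum;
        return 0;
    }

    int main(void) {
        pthread_t t[N_THREADS];
        for (long i = 0; i < N_THREADS; i++)
            pthread_create(&t[i], 0, f, (void *)i);
        for (int i = 0; i < N_THREADS; i++)
            pthread_join(t[i], 0);
        return 0;
    }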
The other pattern that we saw very frequently was this idea of windows. So you
may have a number of threads and they're operating on a slightly different region
of an array. So here they're operating on four array elements, and each thread is
operating on a window shifted by one, so one thread's operating on those
elements, another thread's operating on those elements, another thread's operating
on those.
So we can still use this memory access synchronization technique to reduce the
amount of data being fetched from RAM. Again, you have your smart DMA here
and you have a buffer here that you're using. And so when the threads get
activated, so you synchronize the access of -- the activation of the threads, and
then you read in a big chunk of data, put it into the buffer, and then push into each
thread the actual data that they need.
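[A sketch of the windowed pattern, with assumed sizes: each of n_threads threads
wants a 4-element window shifted by one. The buffer reads the n_threads + 3
underlying elements from RAM once, and every window is then served on-chip, which
is how roughly 400 accesses drop to about 100:]

    #define WINDOW 4

    void distribute_windows(const int a[], int n_threads,
                            int windows[][WINDOW]) {
        int buf[n_threads + WINDOW - 1];            /* the on-chip buffer */
        for (int i = 0; i < n_threads + WINDOW - 1; i++)
            buf[i] = a[i];                          /* single pass over RAM */
        for (int k = 0; k < n_threads; k++)         /* push each thread its window */
            for (int j = 0; j < WINDOW; j++)
                windows[k][j] = buf[k + j];
    }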
This is kind of a neat thing that you can do with FPGAs that's a little harder to do
with multicores. You can really synchronize the threads in this way. Yeah.
>>: [inaudible] a custom-made cache memory?
>> Frank Vahid: Yeah. That's what you're doing.
>>: And all your reading and writing [inaudible].
>> Frank Vahid: Well, you could extend this concept to the other side so that the
threads could be writing out to another type of buffer in a very intelligent way.
It's the neat thing that you can do here, is you can do this sort of analysis and
you're basically creating a custom multicore system with a custom memory
system. So all this customization enables you to get rid of a lot of the overhead
that occurs in more general-purpose systems. And if I understand correctly, part
of your project this summer was to find that overhead and show how FPGAs are
much more efficient because they can do things more directly rather than having
to have this very general way of doing things, which results in any specific thing
having a lot of overhead.
Okay. So in this example we've got 400 memory accesses reduced down to 100
or so.
Okay. So compared to a 4-core device, in this case 4 ARMs, and I think now
we're up to 400 megahertz -- as our experiments went on the frequency was going
up. So compared to a 4-ARM device, we took a bunch of examples that were
embarrassingly parallel. I mean, these are standard benchmarks, but these are --
they're highly parallelizable benchmarks, just to be honest about it.
Compared to a 4-ARM, what we got was using thread warping we can get
speedups of 130x. And here we are just -- when Greg shows the results, he does
the arithmetic and the geometric. So there's your geometric.
By the way, you can see why it's important to show the geometric, because look at
the arithmetic is 130 and the geometric is 38. But that's because of those big outliers
there, the 502 and the 308. We didn't have those huge outliers in the other data.
Okay. But then there's a very reasonable question. Sure, you're using FPGA, but
remember the pictures I showed you of FPGA, they're huge, right? So you have
the little ARM processor up in the corner and then the rest of the chip was FPGA,
right? So, well, gosh, what if I just put more and more cores on the chip, maybe I
would have gotten this sort of speedup also, so why do I really have to use
FPGAs.
So what we did was we also looked at multicore systems ranging up to 64
ARM11s. And so we actually wrote a very optimistic simulator, one that
doesn't -- assumes that memory is always accessible, there's no contention for the
bus, there's -- there's no cache coherency issues, nothing. So very optimistic
multicore simulator. And then we compared that to a more pessimistic thread
warping approach where we actually are paying attention to the memory
synchronization. So the speedups I'm going to show you now would actually be
greater. They would be better.
But even with that approach, that fairly pessimistic FPGA approach, we see that
an 8-core system doesn't get us that much. 16-core, you know, it's better than a
4-core, but it's not approaching what thread warping can do.
There's the 32-core system, which is using about the same amount of area as the
FPGA that we were using. So you can see it is giving you some speedup, but
nothing compared to what we're able to do with the FPGA. And that's the same
amount of area now. And we even went further. We said what if we even have
64-core, and, again, it's getting better, but your thread -- you can see that the
FPGA really buys you a lot.
Any questions on thread warping?
Okay.
>>: There you assume that there is no [inaudible] memory bus, more or less
everybody hits on the cache, that [inaudible] parallels [inaudible] no
synchronization [inaudible] on the multicore [inaudible] as good as it gets. Is that
right?
>> Frank Vahid: Well --
>>: Was there anything you could do [inaudible] multicore to make it better?
>> Frank Vahid: The assumption on the applications, we didn't make any
assumptions, okay, we just chose applications that had lots of parallelism, okay,
because we wanted to show the potential of thread warping.
In terms of the assumption that I was stating, the assumption was working against
us. I was giving a multicore system the absolute benefit of the doubt, assuming
that data was always available when it needed it; whereas when looking at the
FPGA that we were doing, which is we're trying to show the potential of that, that
was accurate. So when I said that I made assumptions in terms of the access of
data, that was working against thread warping.
>>: And the cores [inaudible] single-thread simplistic things, not [inaudible]?
>> Frank Vahid: ARM11s. Yeah. Which are good embedded processors, but
they're not -- nothing compared to desktop, of course. Yeah.
>>: So were you using a precomputed offline profile to find these hotspots that
you're accelerating the FPGAs or are you finding those on the fly as well?
>> Frank Vahid: So let me show you. So all the data that I showed you doesn't
take into account that time to do the actual CAD, right, which is what you're
getting at, right? Okay. So how can we then apply this technique for real, right?
When can we actually do this sort of warping. So we have this SGI Altix
machine that has the 64 Itaniums and the FPGAs. Some jobs I run on there run
for dozens of hours, if not multiple days. You're doing storm simulations and
N-body simulations, you know, physics-type calculations and so on.
So those are very long-running applications. Might be dozens of hours. Whereas
the CAD might be a couple hours, just using normal tools, not our faster, 10x
faster tools. Just using normal Xilinx tools, CAD might be a few hours. And
once the CAD is done, then you switch over. Instead of running it in its normal
software mode, at this point it switches over to using FPGA. And so then it runs
to completion in much less time. So you can get speedup that way. So for these
long-running applications, warp processing is immediately applicable.
The other scenario where it becomes applicable is in recurring applications, which
is common and embedded as well as desktop systems. And so here when the
application first runs on the microprocessor, we might fire off our on-chip CAD
tools, but the on-chip CAD tools might run 10 times longer than the actual
application. So the application may end and our CAD tools still haven't finished
yet, okay. But they can keep plugging away, just keep churning away; that
application may come and go several times and we may not be able to do
anything for it. But then at some point in the future when CAD is done and we
load up that application and we say, oh, we've got an accelerator for it, let's use it
now, and so then from that point on whenever that application runs, you can get
that sort of speedup. So does that sort of address your questions?
>>: Yeah, that's a very big part of it, just that I was also curious about before you
run the CAD and so on, you also -- presumably, you know, you're only building
specialized circuits for particular subkernels. And my question is, if you're using
profiling, et cetera, to find those, is that also included in this, what I'm seeing on
the screen, or is that also run magically offline in some space that we're not seeing
the [inaudible].
>> Frank Vahid: So the profiler was part of the whole warp -- warping process.
If you remember the initial slide I showed, the things were going by on the bus,
we have a profiler and the CAD tools. Those are both dynamic. So you don't
have to know beforehand what the kernels are. Is that what --
>>: And so it is finding it during this initial microprocessor --
>> Frank Vahid: Yes. The profiling is part of the -- yeah. Yeah.
>>: When you run the application a second time, how do you make sure you're
running exactly the same application, not one revision up?
>> Frank Vahid: Good question. Depends on the platform we're dealing with.
The complexity of solving that problem depends on the platform we're dealing
with.
>>: One thing that [inaudible] did back when they were introducing the Alpha
was to compile overnight large binaries [inaudible] the next day [inaudible]
deployed systems, you know, changing the binaries is actually a serious
[inaudible].
>> Frank Vahid: That's a good point. I should use that as an example of this
concept of offline optimization -- or runtime optimization -- having been used
before.
>>: The other thing [inaudible] one thing we can do with an algorithm
[inaudible] processor since you have as much time as you wish anyways, and that
application would run for months [inaudible].
>> Frank Vahid: Okay. The most common question that I usually get is why
don't you just do this statically, why are you going through all this trouble to try to
do it dynamically.
And this is the answer. First of all, the static approach definitely has a big role to
play: offline, figure out what part should go where, create a binary that consists
of the microprocessor binary and the FPGA binary, and distribute that. That has a
big role. But it has some limitations. It requires some specialized language in
many cases and definitely requires a specialized compiler. And people don't like
using specialized tools. The software industry is very big, FPGAs are still a
relatively small part. And we can't ask -- it's sort of like the tail wagging the dog.
We can't ask the whole software industry to start supporting FPGAs when we
know that it's going to be a smaller piece of -- component of people that are going
to use it.
So if we can do it dynamically, you can use any language, any compiler, you can
have object files that you're getting from third-party vendors and putting that all
together into a binary. You know, you don't have to have any knowledge of the
FPGA existing. It doesn't affect your tool flow at all. So that makes it available
to everybody, just like caches are available to everybody.
If you have warp processing, it enables this concept called expandable logic,
which I named that way to compare it to expandable RAM, expandable memory. So
think about expandable memory. You have your binaries, they're all downloaded
on your processor, you've installed them, you've been running them, and you
decide you need more power. Things are running too slow. So what do you do?
You add more RAM. And invisibly the OS detects it and uses it and you get
better performance.
So, likewise, we'd like to be able to support expandable logic, logic representing
the FPGA. So you might have no FPGA, you might have one FPGA originally
and you decide I'm not getting enough performance, and so what you'd like to be
able to do is just add in more FPGA and you get better performance and you just
don't have to reinstall anything, recompile anything.
This just shows how expandable logic can be added to improve things. Here's an
N-body simulation, which is a physics-type simulation. Adding some logic helps,
but at some point it just tapers off. On the other hand, here's an image
processing application where the more FPGA you added, the more performance
benefit you got. So up to 250x, and it may have even gone more if we could have
added more FPGA. So there is -- you know, for most examples there's someplace
where it tapers off, but for many you can keep adding and it keeps growing.
This is our most recent results, by the way. This is just a few months ago out of
University of Florida, and this is now comparing to a 3.2-gigahertz Intel Xeon.
And you can see we're still getting speedups of up to 8x compared to this very
high-end -- fairly high-end processor. And this is taking into account all the
communication over the PCI bus and so on. So I thought that was kind of neat
that he was able to get that running. That's Greg's work.
>>: [inaudible] multiple boards? Sounds like you plug multiple RAM chips
[inaudible].
>> Frank Vahid: The expandable logic concept? Yeah.
>>: Okay.
>> Frank Vahid: Yeah. This is some very recent work that actually Scott Sirowy
had done. So it depends a lot -- the question is can you take these circuits that
people have been putting on FPGAs and can you model them as C? So I've been
looking at it the opposite for most of this talk. I'm just looking at C code, and
people had no idea FPGAs existed, and seeing if we can speed them up. But
there's a whole community out there that's designing circuits for FPGAs. Can we
model their circuits in C and have them execute on microprocessors and then
speed them up with FPGAs?
And so what Scott did is he went through -- there were about 70 examples from a
conference called FCCM, Field-Programmable Custom Computing Machines.
And we found about 70 papers that talked about special circuits that were
designed just for FPGAs to speed up computation. And he looked at every other
one, basically half of them, and tried to see if he could write C code that he could
then synthesize back to the circuit. And it turns out in 82 percent of the cases he
could.
>>: I'm just curious [inaudible] deceleration?
>> Frank Vahid: Is that one of the examples?
>>: [inaudible]
>> Frank Vahid: Scott, remember do you remember what that is?
>> Scott Sirowy: I don't remember what that one was offhand, no.
>>: Strange title.
>> Scott Sirowy: Yeah.
>> Frank Vahid: Sounds easy, right? I can slow down software. Yeah. There
must be something to it. I don't know what it is.
Anyway, we found out that 82 percent of those circuits that people had designed
for FPGAs as circuits, we could model in C and synthesize back. So that's
very promising. That tells us that FPGAs used for computation can be used in an
environment where there's microprocessors and we just move things over to the
FPGA as needed, to the extent that the FPGAs are available.
Okay. And these were some of the results that Scott got where he showed that the
custom-designed circuit that the people had published in the papers had some
speedup. We normalized all those to 1. And then the speedup that we got by
writing C code and synthesizing it down to a circuit, you can see we get similar
speedups along the way. And in some cases we -- oh, these are -- going up is bad
here, right, in this case. Yeah. So in some cases we were slower, but in many
cases we were the same.
Okay. Have you guys heard the parable about the three blind men who walk up to
an elephant and one grabs the tail, one's holding the -- touching the side and one's
grabbing the trunk and saying -- and then they have an argument: One says an
elephant's like a rope; the other one says, no, an elephant's like a wall; and the
other one says, no, an elephant's like a tube, you guys don't know what you're
talking about.
Well, I like that parable in terms of FPGAs here. So the whole world has been
looking at software for many years now as microprocessor instructions. And I
think those of us in the FPGA community are starting to say, look, there's another
aspect of software, FPGA circuits. It's not just instructions anymore, this very
spatially oriented, highly parallel way of doing computation is just as much
software as instructions are. And we need to start thinking of computation in
terms of these spatial ways of doing things because we can get huge speedups in
execution if we do. I used to have this on the other side at the tail, and I got
comments about how that looked kind of bad, so I put it at the front there.
So warp processing brings -- potentially brings the FPGA speedups to all
computing because we make it invisible. So we got a patent back in 2007. It's
been licensed to these companies via SRC. Microsoft isn't part of SRC yet. So
you guys are starting to do more and more architecture research, FPGA research;
hopefully you guys will come into SRC soon. And we're doing extensive work on
this right now. Scott's working on it. We have a couple more students working
doing online CAD algorithms, architectures, and how to model things at the high
level in order to get them to work well on FPGAs. So any other questions? Yes.
>>: If you go back to the 82 percent slide, a lot of [inaudible] says requires
extensive modification to architecture. Let's see. Heavy modification of original
algorithm. It's not very common. It's a few of them.
>> Frank Vahid: Uh-huh. Heavy mod, okay.
>>: [inaudible] did you [inaudible] the FPGA or took the [inaudible] --
>> Frank Vahid: Well, when we say the original algorithm, sometimes what
these papers did was they took some common algorithm, let's say quicksort, and
they said how can we implement this as a circuit. And they went down and
implemented some crazy circuit that didn't look anything like quicksort anymore.
It was maybe mergesort of thing like that. And then we said, okay, can we take
that circuit and reverse engineer it back to something and see that will synthesize
back to that same circuit. So when we go back up to C, it doesn't look anything
like this very intuitive algorithm. It's some very spatially oriented thing. So
instead of just having a for loop that's walking through an array, maybe we have
12 functions and each one's connected by global variables or something like that.
So that's what we mean by heavy modification, right? But it will still execute on
a microprocessor correctly and it will still synthesize down to circuits.
>>: [inaudible] you were making the case for [inaudible] as opposed to
[inaudible].
>> Frank Vahid: We're saying C is surprisingly effective at modeling these
circuits, yeah. As long as you have a good high-level synthesis tool, you can
regenerate these circuits quite easily. So you don't really have to do register
transfer level modeling for a lot of these things.
Now, that's true for this domain because these are mostly computations; they're
not timing specific. And so, you know, it doesn't really matter that we lost the
timing, some of the clocking information. If there is some clocking information,
it's mostly in terms of coarse-grain pipelining, which we can capture at the C level
by different functions, and a good high-level synthesis tool will result in a good
pipeline for that.
Any other questions?
>>: It's another [inaudible] that you heard from [inaudible] --
>> Frank Vahid: Um-hmm.
>>: -- who was looking at decompilation from the point of view of [inaudible], so
take something, compile it down and then decompile it and see if the [inaudible].
>> Frank Vahid: Um-hmm.
>>: And I wondered is there something that you could think of adding into the
[inaudible] here, so not just beating out but adding other functionality into -- now
that we're paying the price for doing all this work, is there something else that we
can get out of it?
>> Frank Vahid: Yeah, that's a good idea. So you could -- what are some other
avenues you can get out? You could get some sort of verification ability, right, so
some sort of checking perhaps that you could regenerate a high-level model that
will execute reasonably fast and then you could check to see if the results that
your circuit is creating are consistent.
>>: [inaudible]
>> Frank Vahid: Hmm?
>>: Portability.
>> Frank Vahid: Portability, yeah. Yep. Reverse engineering to a high-level
spec that then people could manipulate if they needed to. Yeah. It's an interesting
point. Several avenues that could come from that, yeah. Yeah.
>>: So how do you use it today? Let's say I want to speed up my program today
and want to just use FPGAs, even for some custom scientific [inaudible] filtering,
anything where you're trying to do really intensive computation on very simple
operations but you want to even go very parallel. Do you plug in a board? I
mean, what do you use to do that?
>> Frank Vahid: Well, what we did was we showed the feasibility that this idea
of warping is possible. And so now I think it's up to people who have platforms
to consider. Right now everybody who has a platform pretty much is focusing on,
okay, how do we program this thing statically. And I think it's up to them now to
start saying, okay, well, it's feasible to do it dynamically. For our particular
platform, what does that mean, right? Do we need to put a CAD tool on a third
core here and start dealing with this or what. So, yeah, we'll see where it goes.
Okay?
>> Alessandro Forin: Questions?
>> Frank Vahid: All right. Thanks for your time.
[applause]