>> Alessandro Forin: Good morning. Today it's my pleasure to introduce Frank Vahid from UC Riverside. Frank has been involved in FPGAs, and that's what he's going to talk to us about in architecture. But he's also teaching embedded systems and has some work in eBlocks. So if anybody wants to talk to him, there's still at least one slot open, either about this talk or other areas related to it. Frank. >> Frank Vahid: Okay. Thank you, Alessandro. And thanks for coming. I wanted to thank Alessandro, first of all, for hosting Scott Sirowy, who's doing an internship here, and for arranging this visit. And I just received this last night, and I wanted to thank him for this too. And I actually did receive this last night; it came from Microsoft Corporation. It looks like I won your long-awaited sweepstakes results from your Foundation for Software Promotion. So $250,000. And I assume you had something to do with that, Alessandro, so thanks. I appreciate that, you sly dog, man. So let me jump right into things here. So we have lots of transistors available today. And what Intel's doing is putting more and more cores onto chips and putting more and more things on there. Another thing we can do is put different types of cores, so we can have big cores, small cores, and so on. So some really neat things are going on in architecture, trying to figure out what to do with all these extra transistors that we have. Well, another direction that we're seeing happening is that people are putting FPGAs onto chips. And I'll talk a little bit more about what FPGAs are in a minute. But what you're looking at in the upper right there are four PowerPC processors -- that's the four yellow boxes in the middle -- surrounded by Field Programmable Gate Array logic. On the far right we're looking at an ARM9 processor, which is pretty small, all the way up in the upper-right corner there, and lots and lots of FPGA. That's happening at the chip level. It's also happening at the board level. So right here is a Cray XD1, which unfortunately isn't made anymore, but it had a number of processors along with FPGAs. And SGI is actually making supercomputers with FPGAs at the board level. And we actually acquired one last year. So 64 Itanium processors and a number of FPGAs, all with equal access to memory. AMD supports FPGAs now. Intel has announced that they're going to be supporting FPGAs more and more. So FPGAs are starting to come into maturity, and they're starting to be supported by more and more platforms. They've been around about 20 years. And they're a very interesting way of doing computation. You guys have some really cool research going on here about using them. Let me give a quick background of what FPGAs are. So the fundamental concept of an FPGA is that you can implement a circuit, a very simple circuit, using a memory. That's the basic idea. So let's suppose you want to implement this simple circuit here with A and B as inputs and F and G as outputs. I could use a 4-word-by-2-bit-wide memory, have A and B go in as the address lines, and just program that memory with ones and zeros so that F and G implement this NAND gate and this inverter. So that's the basic idea. You could have bigger circuits of course.
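To make that concrete, here is a minimal C sketch of the memory-as-circuit idea; the talk only says "a NAND gate and an inverter," so the specific choice of F = NAND(A, B) and G = NOT(A) is an assumption for illustration.

    #include <stdio.h>

    /* A 4-word-by-2-bit "memory": address = (A<<1)|B; bit 1 = F, bit 0 = G.
       Programmed so F = NAND(A,B) and G = NOT(A) -- assumed gate choices. */
    static const unsigned char lut[4] = {
        /* A=0,B=0 */ (1 << 1) | 1,
        /* A=0,B=1 */ (1 << 1) | 1,
        /* A=1,B=0 */ (1 << 1) | 0,
        /* A=1,B=1 */ (0 << 1) | 0,
    };

    int main(void) {
        for (int a = 0; a < 2; a++)
            for (int b = 0; b < 2; b++) {
                unsigned char word = lut[(a << 1) | b];
                printf("A=%d B=%d -> F=%d G=%d\n",
                       a, b, (word >> 1) & 1, word & 1);
            }
        return 0;
    }

Reprogramming those four words gives any two-output function of A and B; that's all a lookup table is. And then I would take -- I would have a bunch of these memories and I need to connect them somehow, and I want to be able to program those connections, because every circuit's different. And so I can create these programmable switches, or what are called switch matrices.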
So, for example, I have two inputs on the left, two outputs on the right, and by programming the select lines of those two [inaudible], I can have A go to X, I can have A go to Y, I can have B go to X, I can have B go to Y. So I could have A go to both X and Y. So just by programming bits, I can implement circuits and I can implement programmable connections. And so the idea of an FPGA is just to take lots and lots of these little memories, which are called lookup tables, and lots and lots of these switch matrices, literally thousands of them, and create a very regular grid of these things. And then just by programming zeros and ones I can put any circuit onto that chip, any custom circuit onto the chip. And that's what you're looking at down in the corner there; that very regular structure that you're looking at, other than the four processors, is just a whole bunch of those lookup tables and switch matrices. By the way, I understand the tradition here is questions anytime? Is that right? Okay. So please do. Okay. So that's the basic idea. And normally we don't have to actually figure out what the zeros and ones are. That's what the CAD tools do for us. FPGAs have a very strange name, Field Programmable Gate Arrays. The reason why is that when they came out in the '80s, the most popular custom-made chip was called a gate array chip. And so these were like gate array chips but you could program them in the field. There's no gate array inside of them. But it's just in contrast to what was the standard technology back then. Okay. So why are FPGAs a big deal? Because they can implement circuits, and because circuits do computations and can do computations much faster than microprocessors sometimes. And here's why: For example, suppose that you want to just reverse 32 bits; you just want to take all the bits like this and flip them this way. Do you like this example? >>: [inaudible] >> Frank Vahid: Okay. So, you know, you'd write some -- there's some very efficient C code to do it. And it would compile down to these assembly instructions, and you'd require somewhere between 38 and 128 cycles. Yes. >>: [inaudible] faster than the four table lookups? >> Frank Vahid: What's that? >>: Is this code up there faster than the four table lookups, one on each byte? >> Frank Vahid: That would be another way to do it. So that would probably get you close to that number of cycles. So, on the other hand, as a circuit, it's just wires. So you could do it in one clock cycle. That cycle might be a little longer than a regular microprocessor's, but still, it's just one clock cycle. Here's another example: an FIR filter where you're doing 128 multiplications and additions, and that would require thousands of instructions on a microprocessor. But as a circuit, you could create this kind of interesting tree where you have -- there are 128 multiplications right up front, and then you just create an adder tree going down and you get your result. And this could be done in one cycle if you wanted one big cycle, or you could pipeline it and get it down to just seven cycles or so. And you could even get very high throughput if it's pipelined. So you can see that you're going from several thousand cycles down to just a few cycles. So that's the basic idea why circuits work well on some types of computations. Not all, but some.
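For reference, here is a representative C version of that 32-bit reversal, along with the per-byte table-lookup variant from the Q&A. The talk doesn't show its actual code, so treat both as illustrative sketches.

    #include <stdint.h>

    /* Reverse a 32-bit word by swapping progressively larger groups:
       adjacent bits, then pairs, nibbles, bytes, and finally halves. */
    uint32_t reverse32(uint32_t x) {
        x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
        x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
        x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);
        x = ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);
        return (x << 16) | (x >> 16);
    }

    /* The "four table lookups, one on each byte" variant: a 256-entry
       table of bit-reversed bytes, applied with the byte order swapped. */
    static uint8_t rev8[256];
    static void init_rev8(void) {
        for (int b = 0; b < 256; b++) {
            uint8_t r = 0;
            for (int i = 0; i < 8; i++)
                r |= (uint8_t)(((b >> i) & 1) << (7 - i));
            rev8[b] = r;
        }
    }
    uint32_t reverse32_lut(uint32_t x) {
        return ((uint32_t)rev8[x & 0xFF] << 24) |
               ((uint32_t)rev8[(x >> 8) & 0xFF] << 16) |
               ((uint32_t)rev8[(x >> 16) & 0xFF] << 8) |
               (uint32_t)rev8[(x >> 24) & 0xFF];
    }

Either way it's dozens of instructions on a microprocessor; as a circuit, the same function is literally 32 crossed wires, which is why it fits in a single cycle.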
There's been lots of work over the last 10, 15 years showing that, compared to different microprocessor platforms ranging from a 200-megahertz ARM all the way up to a 3-gigahertz Athlon or Xeon processor, you can get pretty good speedups on certain types of computations. So looking at 10x speedups, 40x -- we basically compiled this from a variety of papers. This comes from maybe a dozen papers from embedded system conferences, from architecture conferences, from supercomputing conferences. 500x speedups to do placement for CAD. Fourier transforms, simulated [inaudible] and so on. So very big speedups we're talking about. Not 10 percent, 20 percent; we're talking 100x sometimes. The thing to note is that even though these are circuits, these circuits are software. It's just bits going into a prefabricated chip, just like we download binaries into a microprocessor. However, circuits and hardware are sort of synonymous, right? When we say circuits, we think hardware, right? We say hardware-software partitioning when we take things and put them on FPGAs. That's something that I think we need to stop doing. Software and instructions are not the same thing. They're not synonymous. Software just means bits. Those bits can be either instructions or circuits. By the way, I just put up a quote there. I was just kind of curious, so I found the first believed use of the word "software" was in 1958. So that's the quote; it came from a paper. So it's important to realize that circuits are software. They are just -- they can be software. They're just bits that are going into the lookup tables and the switch matrices. You download them into a chip, just the same way that you download a sequence of instructions into a chip. I'm trying to get the community to stop calling circuits hardware, so this was an IEEE Computer article that came out a few months ago. I don't know if you guys have it. Maybe you guys can help too by, in your normal conversations, always using circuits instead of hardware. Okay. On the left here is a chart of Xilinx revenues. Xilinx is one of the two big FPGA manufacturers. You can see it's very much a growth industry. Altera is about equal in terms of its revenues also. So what we're seeing is that we have a pretty steep increase here. They were just invented in the late '80s. Multibillion-dollar industry. You're starting to find FPGAs in more and more products. We have some recent announcements, especially by Intel, saying that major computer makers are supporting FPGAs. It's just in the last two years or so that these announcements have been coming out and that actual products are coming out. So I think we're at a point now where FPGAs are about to take off. It's, of course, very hard to predict, but it looks like we're at a transition here where it's widely supported, where people are going to start using FPGAs to do computation far more than they did before. It's hard to predict. I like to put up this example of trying to predict technology. This is a little story about Alexander Graham Bell, when he tried to go to Western Union with his telephone patent and see if they wanted to license it. And they said, why would we need a telephone, we have a telegraph. The information's getting across; what does it matter whether it's actual voice or not. So big mistake. Western Union is now just the fastest way to wire money, as opposed to something bigger.
So I think FPGAs, we might be at that point now where we're transitioning from the telegraph to the telephone in terms of using FPGAs. I think they're going to really take off as a computation platform. But you never know. The future is very hard to predict, right? Okay. So let me give you a little bit of background on the actual project that I'm going to talk about today, starting with what I was working on back when I was doing my Ph.D. work at UC Irvine with Dan Gajski. From '89 to '94 we worked on a tool called SpecSyn. And the idea was to take a high-level language and synthesize it down to a microprocessor plus circuits next to the microprocessor that would speed up the most important things. So rather than just compiling down to a microprocessor, we were trying to compile down to two things: take the noncritical code, put it on the microprocessor; take the critical code and create a custom coprocessor just for that program. Back then FPGAs had just been invented, 1986. Is that about right? '86, '87 was when Xilinx came out? And they had very little capacity. So I hadn't even heard of them back then. Okay. Fast-forward a little bit to around the year 2000. And right around the year 2000 is when dynamic software optimization and translation was getting really popular. So, for example, Hewlett-Packard had their Dynamo system, where as a binary was executing they would look for the hotspots and recompile those hotspots into a more optimized piece of code and then replace the existing binary by this reoptimized binary. And they could get 10, 15 percent improvements. Around this time Java just-in-time compilers were coming out. Around this time Transmeta's Crusoe was talking about their code-morphing platform, where instead of trying to implement an x86 binary in a somewhat more native way, they would just create whatever architecture they wanted, in that case a VLIW architecture, and on the fly just translate the x86 binaries to VLIW binaries. So underneath it was all VLIW, but you could run x86 binaries. And they got lower power and pretty good performance. So all of these techniques were getting 10, 20, 30 percent, maybe, improvements. But remember the slide I showed you about FPGAs: they get 100x improvements sometimes, not 10, 20 percent. And remember FPGAs are software, just like a VLIW is implementing software. So we thought, why don't we dynamically take this binary and via some process put it on an FPGA -- or put the critical spots on. And instead of getting 10, 20 percent we may get 100x. Yes. >>: Why would you start with a binary? People [inaudible] x86 binaries and that was it. It was the worst possible source to compile from. >> Frank Vahid: True. On the other hand, if you want to add an architectural feature without having to change the entire industry, you have to work with binaries, because that's the lingua franca of computation. >>: There's two -- at least two products here that start with, say, Java class files and compile those. >> Frank Vahid: Oh, sure. That's what I was doing back in 1989 also. That's my roots. Okay. So I come from that. But I said, look, if I can do this dynamically from a binary, then I can put this into computers and people don't even have to know it's there; they don't have to recompile, they don't have to do anything. So that was the inspiration: to see if we could do a just-in-time compilation, some sort of binary translation, so that the FPGA would be treated sort of like a cache. It's just invisible.
It just gives you invisible performance improvements. So we started working on this project back in 2002 to dynamically translate binaries to FPGAs. So here's how it works. Initially you would download a standard binary onto a platform, and it would run on a microprocessor. And the programmer, the compiler, nobody knows that there's an FPGA here. And you would get some time and performance -- time and energy characteristics from there. Meanwhile, a profiler would monitor the executing program, and it would eventually detect some critical loops. And what it would do then is read those critical loops while the program is still executing. It would read those into some on-chip CAD program, which could be running on the same microprocessor or it could be running on a separate one. What we would then do is decompile the region into a control/data flow graph, and I'm going to talk a little bit more about decompilation. And then we would synthesize it down to a circuit, like the adder tree from the FIR I showed you earlier. And then we would take that circuit and map it into the small memories and the switch matrices that are on an FPGA. So we'd maybe put the adders on those two blocks there and then we'd run this wiring through those switch matrices. And once that's all done, we would modify the original binary so that it would make use of the FPGA now for those critical regions. So the original code is gone and it's replaced by jumps to this FPGA here. And all of a sudden you get maybe 10x improvement in time and energy for some programs; for others you won't get any. But you might get 10x, maybe even 100x improvement. We call that warping: all of a sudden your program just starts running faster and using less energy. So that's the basic idea. So when you look at that idea, you're immediately faced with two big problems. The first problem is what you were addressing, right? Binaries are just a horrible starting point. Right? I mean, you've lost your loops, you've lost your arrays, you've lost your functions, you've lost all that high-level information that you really want in order to synthesize a good circuit. So the question is, can we somehow recover those high-level constructs from binaries? That's the first question. The second question is, for people who know what CAD tools do, how long does it take to take a binary -- or, I'm sorry, to take source code and synthesize it down to a circuit on an FPGA? You know that runs for tens of minutes, sometimes hours, right? So frustrating. So that's a long time for this thing to be trying to do some sort of optimization on the fly. So is there some way to speed that up? So those were the two big problems, and those were the two things that we worked on from 2002 up until about 2006 or so. So one was decompilation. This is what happens if you don't decompile. If you just take the binary and feed it through a synthesis tool -- okay, I mean, turn it into a control/data flow graph and then feed it through a synthesis tool -- you want to be below this line on performance and energy. And you can see that for these examples, everything's worse, okay. The circuit you get is much slower than a microprocessor. So we need to recover some high-level information before we try to synthesize a circuit. So here's an example of a for loop that does some accumulation. And this is what it would get compiled to at the assembly level.
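The slide's actual code isn't in the transcript, so here is a representative accumulation kernel of that flavor, with a schematic of the MIPS-style assembly it compiles down to; both are illustrative sketches, not the slide's contents.

    /* Representative accumulation kernel (invented for illustration). */
    long accumulate(const int a[], int n) {
        long sum = 0;
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }

    /* Schematic MIPS-flavored assembly for the loop, as a comment:
         loop: slt   $t2, $t1, $a1     # i < n ?
               beq   $t2, $zero, done
               sll   $t3, $t1, 2       # i * 4
               addu  $t3, $a0, $t3     # address of a[i]
               lw    $t4, 0($t3)       # load a[i]
               addu  $t0, $t0, $t4     # sum += a[i]
               addiu $t1, $t1, 1       # i++
               j     loop
         done:                                                   */

So our first step is to get that control/data flow graph. And the control/data flow graph looks a lot like the assembly.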
That's why synthesizing from it doesn't give us much improvement. But what we can then do is start applying decompilation techniques. And what we did was we actually built on about 20 years of previous work on decompilation, which was more intended for binary-to-binary translation. So we looked through the literature and found a bunch of techniques where we said, okay, this technique would work well for us, this one would also be very useful for synthesis, these two are not so important. So we went through and we found about seven or eight decompilation techniques. We had to build some of our own techniques; we re-roll loops, for example. If you have an unrolled loop, we figure out what it was and re-roll it, okay. And we go through and apply these techniques. And what starts happening is that little by little the code that would come out of that control/data flow graph starts to look more and more like the original. You do data flow analysis, so you get rid of some of the intermediary registers from the assembly. And then you do some function recovery, so you can actually see where functions are in the code and you can isolate those. You start to recover control structures, so we can detect loops, detect IF statements, even switches sometimes. Yes. >>: [inaudible] assembly some kind of RISC code. Can you do it for the x86, with this incredible 30 years of baggage? >> Frank Vahid: So we've been doing it primarily for clean instruction sets. We're doing it for ARM, MIPS; we even did it for the MicroBlaze, which is a processor designed for FPGAs. Most of the decompilation work has been done on x86 stuff, so the techniques have been developed on x86. Most of our results, because we came from an embedded world, were for ARM and MIPS. But actually my student, who's now a professor at the University of Florida, is as we speak porting his tools to x86. So there's been plenty of decompilation work for x86. What ends up happening is you can recover arrays -- you can recover a large amount of the high-level information from binaries. So we started doing some experiments to see how much we could recover. We looked at, in this case, a bunch of embedded system benchmarks. We did synthesis from C code. And we basically took the techniques that are used in various high-level synthesis tools that will take C code and generate circuits. And we basically did them manually to make sure that we weren't suffering -- that we weren't using a tool that wasn't ideal. So this is sort of the best-case scenario for synthesis from C. And then we ran our -- the binaries of those same programs. We took the binaries, ran them through our decompilation tools and then synthesized. And what you end up seeing is that there's no time overhead, at least for these examples, and some area overhead -- basically one example that was really bad there. And just because synthesis and compilation involve a bunch of arbitrary factors, one example from a binary was actually better. But that doesn't really mean anything. That's just noise, really. Okay. So we did this for actually dozens of examples. But we really wanted to make sure -- and I'll tell you why. We got a lot of, to put it mildly, hostile reaction to the idea of synthesizing from binaries. In fact, my student back in 2003, he submitted a paper and he got a rejection. And it was the most violent rejection that we've ever experienced. So one guy was just blasting us for being just morons, you know.
And then another guy, his review was just one sentence. It said: Synthesis from binaries is just wrong. That was the entire review, right? Like we were committing some sort of moral transgression here. >>: [inaudible] [laughter] >> Frank Vahid: So in addition to those studies that we did, we went to one of our partner companies, Freescale. And we said, look, we really want to do this on a serious benchmark -- a real benchmark, not just the stuff that you get online but a real piece of software -- and we want to see if we can speed it up at the binary level. So after several months of negotiations with them, and lawyers involved, we got access to their actual H.264 video decoder, the one that they actually put on cell phones. So it was a several-million-dollar piece of software that they spent several man-years developing. Highly optimized, 10 times faster than the reference H.264 code that you can get. So we took that code as our starting point. We did our analysis, we found the critical regions. And we started to say, okay, let's see how we can speed these up. So this is the ideal curve. This is if we could take these critical regions of code and just eliminate them, okay, execute them in zero time. That would be the ideal speedup that you could get with an FPGA. So you can see if you did about 40 of those critical regions, you'd be getting about 10x speedup. So that's our upper bound. We can't do better than that. If you do it from the C level, this is what you get. So this takes into account the time that it takes to transfer data over to the FPGA, for the FPGA to do its computation, and to send the data back to the microprocessor. This is the speedup that an FPGA could give you over a microprocessor. I think this is over a 200-megahertz ARM using a similar FPGA, so similar technology. So now the question is, how do we do it from the binary? So we went through -- this was about a four-month study -- and that's how we did with the binary. So barely off. We're maybe 3, 4 percent off. So even with this highly optimized code, we were still able to be fairly competitive. But I still wasn't satisfied, right? Because we really got very hostile reaction. So we looked even further. We said, let's try different architectures. So we tried the MIPS, the ARM, and the MicroBlaze. And we said, let's try different optimization levels, because a common question we got was, well, if you don't do much optimization, decompilation works fine, but what if you do a lot of optimization in your compilation? Maybe it confuses the assembly so much that you can't recover loops, you can't recover arrays and so on. So I said, okay, let's try a different optimization level. For the MIPS, this is the speedup we get using an FPGA if the code was originally compiled with the -O1 optimization level. And this is the speedup we get if it was compiled with -O3, with higher optimization. So about the same. So we didn't lose anything there. With the ARM, that's the speedup we could get with -O1. That's the speedup we get with -O3 -- actually better, much to our surprise. We weren't expecting that. And, likewise, for the MicroBlaze it actually gets better with the higher optimization. So we looked into this a little bit and we found that it's doing more constant propagation, it's reducing memory accesses, and so it's doing things that are good for the FPGA too. And we're not losing our ability to recover the high-level information. So that was a very surprising finding. Okay.
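To make one of those decompilation techniques concrete, here is a hedged sketch of loop re-rolling on a tiny invented kernel (not one of the actual benchmarks):

    /* What an unrolled loop in the binary decompiles to, written back as C: */
    void scale4_unrolled(int a[4], int k) {
        a[0] = a[0] * k;
        a[1] = a[1] * k;
        a[2] = a[2] * k;
        a[3] = a[3] * k;
    }

    /* What the re-rolling step recovers: the loop structure and the
       array access pattern, which synthesis can then exploit. */
    void scale4_rerolled(int a[4], int k) {
        for (int i = 0; i < 4; i++)
            a[i] = a[i] * k;
    }

Recovering the loop matters because the synthesis tools can then choose how to unroll it in space, as parallel circuit elements, rather than being stuck with the binary's fixed sequential schedule.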
Any questions on the decompilation work, the binary synthesis work? Yeah. >>: [inaudible] optimization you might actually get better results because [inaudible] when you had the optimization up higher, does the resulting decompiled code still resemble the original C code or does it look slightly different? >> Frank Vahid: It's very similar to the original C code. So in almost all cases -- so Greg actually did a comparison. He looked at the various features and tried to figure out what percentage of features we were able to recover. And the numbers he got were like 95, 97 percent. So regardless of the optimization levels, we're able to recover the loops, the arrays, the functions, the IF statements and so on. >>: Another thing: what if whoever wrote the original C code just did something that's completely erroneous, that doesn't make any sense, but the [inaudible] was able to catch it and compile it down so it looks better, and then you decompile it, does it still look that way or -- >> Frank Vahid: No, no, no, no. No. Whatever the compiler does, we can't -- we can't reverse engineer that type of stuff, right? But for straightforward code -- and embedded code tends to be fairly straightforward, there's loops walking through arrays and things -- we can get a lot of it back. Yeah? >>: I assume this is coming up, but I'm curious of course, relative to the initial time to execute the kernel, how long does it take you to do this decompilation and this FPGA synthesis? >> Frank Vahid: It's coming up in a bit, but I'll just give you the quick answer. It takes forever. So it's still an eternity, but it is practical in some cases. And that's if you're using regular CAD tools. What I'm going to show you in the next two slides is how we can at least ameliorate some of that problem. Okay. So the second big problem with warp processing was exactly this: how long does it take to do all this decompilation, the synthesis, the placement and routing onto the FPGAs? And so I'm just going to show you the results of that rather than diving into the details. For a set of benchmarks that we looked at, using the most popular FPGA, Xilinx, and using their synthesis tools, we analyzed how long it takes to actually run the entire sequence of taking a high-level piece of code and mapping it down to an FPGA implementation. And these are the big contributors. Decompilation is actually just tiny -- it's negligible, so we don't even show decompilation here. But once you've got that high-level representation running through register transfer level synthesis, which is also trivial, so we don't show it, there's logic synthesis, technology mapping, and placement and routing, which is trying to figure out how to connect all those wires of the circuit on the FPGA. That's how much time it takes, and it takes about 60 megabytes of memory. So a different student, Roman Lysecky, who's now a professor at the University of Arizona and still working on this, his job was to figure out how to shrink this. So he went and looked at each one of these and came up with techniques that were not solely focused on getting the fastest circuit -- which is what almost all work in CAD is about, how to optimize that circuit. Instead he said, well, how can I get a reasonable circuit in a shorter period of time? And our goal was 10x improvement. We wanted a 10x faster flow here, so we wanted to get down to .9 seconds, basically. And just to make a long story short, this is what he came up with.
So he was able to take each of those and shrink them down like this, so that overall we got .2 seconds of execution and a 30 percent slower circuit. So the tradeoff is that you have a slower circuit, about 30 percent slower, but you have almost 20x improvement now -- 15x or so. Just for fun, he took his tool set, which we call the Riverside FPGA tools, and he showed that you can actually run these on a 75-megahertz ARM. The idea of running CAD on an ARM -- if anybody knows CAD, you really would never do that, right? And to run it on a 75-megahertz ARM7, I mean, that's really a wimp. That's a wimpy, wimpy processor. He showed that his tools run on that and they would only take 1.4 seconds and still of course use only 3.6 megabytes. So that's kind of neat. If you can do CAD very efficiently, then warp processing becomes a little bit more feasible. These are some results we got using our entire tool flow, for a number of embedded system benchmarks, compared to a 200-megahertz ARM and using a roughly equivalent FPGA. If 200 megahertz looks slow to you, we can use a 600-megahertz ARM, that's fine, but then I'm going to use a faster FPGA. Okay. That's what we were talking about in the hallway this morning, right? So for these benchmarks we got between 50 and 200x, sometimes 5 or 10x, speedup. On average we got about 40x speedups using the FPGA as a coprocessor. Yeah. >>: [inaudible] geometric mean and the harmonic mean. >> Frank Vahid: Greg usually does geometric mean. This was Roman's work. I think he just took the arithmetic mean. I don't know, but we can sort of visually see it, right? We're not throwing anything way off, are we, with the 191 and the 130 there? So maybe 30x -- just kind of eyeballing it, the geometric mean would be about 30. Now, this is just for the kernels. And actually, when you look at papers that are doing speedups using FPGAs, they usually only show the kernels. That's a little misleading, because you really care about the application. So we also did this for the application. So looking at the overall application speedup, it's about 7.4, which is still pretty good -- still getting close to an order of magnitude speedup. I have to point out that these examples are embedded benchmarks; they do get sped up. There are lots of examples that don't. For example, we tried to do this with SPEC, which is the normal desktop application suite. And we basically couldn't get any speedup at all. So it just didn't work on those. There just weren't loops that were amenable to being sped up by circuits. Have you had any luck speeding up SPEC? >>: [inaudible] >> Frank Vahid: Yeah? The V-zip [phonetic]? Yeah. Okay. >>: My problem is that they're all floating point. >> Frank Vahid: Yeah, I know. >>: [inaudible] >> Frank Vahid: Yeah. That includes a lot of them. Although now with the modern FPGAs the floating point can be done reasonably efficiently, but the resources are somewhat limited. Okay. So moving on to some more recent work, some more recent directions of thread warping -- of warping: we looked at multicore platforms, which are much more likely to be implementing multithreaded applications. So let's try to take this warping concept and apply it to multithreaded applications. So we looked at an approach where the binary would be loaded onto a microprocessor, the operating system would start scheduling threads onto available cores, and any threads that weren't executing would be waiting in a queue.
And so we looked at warp processing from the perspective of, well, let's look at that queue and take those threads, run them through our on-chip CAD, and actually put those threads onto the FPGA as custom circuits. So rather than being limited by the number of cores that we have on our platform, we can synthesize as many threads as we want. We can synthesize 48 threads, maybe, if you had 48 threads waiting. And they would each be executing then in parallel, and then the operating system would schedule the threads onto those circuits. And potentially you could get very large speedups, because you're exploiting parallelism at the thread level and then even more parallelism down at the circuit level. You're basically creating on-the-fly multicore systems, creating as many cores as you want, but each core is specialized to execute just that thread. It's a fairly complex framework. I'm not going to talk about all of it, except I do want to sort of zoom in on the one thing that was sort of the Achilles heel of this thread warping, and the solutions that we came up with. If you've been working with FPGAs and trying to get speedups, you know that the memory bottleneck is often the Achilles heel, right? You can put 48 threads on here, sure, but they're all just sitting there waiting for some bus to return data so they can actually compute, right? And so you don't actually get any speedup at all. You've probably had this experience, right? So you can have as much concurrency here as you want; it's all moot if you don't have enough bandwidth from the memory to feed these things the data they need to do their computation. A lot of publications, when they talk about speedups on FPGAs, actually ignore that point. They just assume the data is in the FPGA already. So that's a little bit cheating -- same for multicore -- yeah, that's right. Okay. So we looked at a number of multithreaded benchmarks -- there's a bunch out there -- and we noticed a couple of patterns that were common. Here's one pattern, which is that you've got a main program generating your threads; so here's one that's creating 10 threads. They're all calling -- they're based on function F. They're all accessing array I, but they're using some different constant here to multiply it. So the basic pattern is that you've got a whole bunch of identical threads all accessing the same array. So the solution there is, don't have each thread access the array itself; instead have one thing access the array and feed it to all of the threads simultaneously. So we wrote some algorithms that would identify the groups of threads, identify those constant arrays, those constant memory accesses, and then would actually synthesize a separate custom DMA that handles the access of that array. So these threads, when they're activated, this DMA will actually fetch the data they need on their behalf, push it into all of them, and then those guys all have their data. And you didn't have to have all 10 of those functions trying to access the RAM independently. So what we end up getting is the data being fetched and then pushed into the threads, rather than the threads fetching. So before we did that memory access synchronization, it would have been about 1,000 memory accesses for this example, and after, it was reduced to only 100. So it helps alleviate the bottleneck.
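Here is a hedged pthreads sketch of that first pattern; the thread count, array, and coefficient names are invented for illustration, since the talk doesn't show the benchmark code.

    #include <pthread.h>

    #define NUM_THREADS 10
    #define N 100

    int A[N];                        /* shared, read-only array */
    int coeff[NUM_THREADS];          /* per-thread constant multiplier */
    long result[NUM_THREADS];

    /* Ten identical threads: same function, same array, different
       constant -- the pattern the thread-warping tools detect. */
    static void *f(void *arg) {
        int id = (int)(long)arg;
        long s = 0;
        for (int i = 0; i < N; i++)          /* without the custom DMA,  */
            s += (long)A[i] * coeff[id];     /* each thread fetches A[i] */
        result[id] = s;                      /* itself: N fetches/thread */
        return 0;
    }

    int main(void) {
        pthread_t t[NUM_THREADS];
        for (long i = 0; i < NUM_THREADS; i++)
            pthread_create(&t[i], 0, f, (void *)i);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(t[i], 0);
        return 0;
    }

On the FPGA, the synthesized DMA would fetch each A[i] from RAM once and broadcast it to all ten thread circuits, which is roughly where the 1,000-to-100 reduction in memory accesses comes from. The other pattern that we saw very frequently was this idea of windows.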
So you may have a number of threads and they're operating on slightly different regions of an array. So here they're operating on four array elements, and each thread is operating on a window shifted by one: one thread's operating on those elements, another thread's operating on those elements, another thread's operating on those. So we can still use this memory access synchronization technique to reduce the amount of data being fetched from RAM. Again, you have your smart DMA here and you have a buffer here that you're using. And so when the threads get activated -- so you synchronize the access of -- the activation of the threads -- you read in a big chunk of data, put it into the buffer, and then push into each thread the actual data that it needs. This is kind of a neat thing that you can do with FPGAs that's a little harder to do with multicores. You can really synchronize the threads in this way. Yeah. >>: [inaudible] a custom-made cache memory? >> Frank Vahid: Yeah. That's what you're doing. >>: And all your reading and writing [inaudible]. >> Frank Vahid: Well, you could extend this concept to the other side, so that the threads could be writing out to another type of buffer in a very intelligent way. That's the neat thing that you can do here: you can do this sort of analysis and you're basically creating a custom multicore system with a custom memory system. So all this customization enables you to get rid of a lot of the overhead that occurs in more general-purpose systems. And if I understand correctly, part of your project this summer was to find that overhead and show how FPGAs are much more efficient because they can do things more directly, rather than having this very general way of doing things, which results in any specific thing having a lot of overhead. Okay. So in this example we've got 400 memory accesses reduced down to 100 or so. Okay. So we compared to a 4-core device, in this case 4 ARMs, and I think now we're up to 400 megahertz -- as our experiments went on, the frequency was going up. So compared to a 4-ARM device, we took a bunch of examples that were embarrassingly parallel. I mean, these are standard benchmarks, but they're highly parallelizable benchmarks, just to be honest about it. Compared to a 4-ARM device, what we got using thread warping was speedups of 130x. And here we are just -- when Greg shows the results, he does the arithmetic and the geometric. So there's your geometric. By the way, you can see why it's important to show the geometric, because look: the arithmetic is 130 and the geometric is 38. But that's because of those big outliers there, the 502 and the 308. We didn't have those huge outliers in the other data. Okay. But then there's a very reasonable question. Sure, you're using an FPGA, but remember the pictures I showed you of FPGAs -- they're huge, right? You have the little ARM processor up in the corner and then the rest of the chip was FPGA, right? So, well, gosh, what if I just put more and more cores on the chip? Maybe I would have gotten this sort of speedup also, so why do I really have to use FPGAs? So what we did was we also looked at multicore systems ranging up to 64 ARM11s. And we actually wrote a very optimistic simulator, one that assumes that memory is always accessible, there's no contention for the bus, there's no cache coherency issues, nothing. So a very optimistic multicore simulator.
And then we compared that to a more pessimistic thread warping approach where we actually are paying attention to the memory synchronization. So the speedups I'm going to show you now would actually be greater; they would be better. But even with that approach, that fairly pessimistic FPGA approach, we see that an 8-core system doesn't get us that much. 16-core, you know, it's better than a 4-core, but it's not approaching what thread warping can do. There's the 32-core system, which is using about the same amount of area as the FPGA that we were using. So you can see it is giving you some speedup, but nothing compared to what we're able to do with the FPGA. And that's the same amount of area now. And we even went further. We said, what if we even have 64 cores? And, again, it's getting better, but you can see that the FPGA really buys you a lot. Any questions on thread warping? Okay. >>: There you assume that there is no [inaudible] memory bus, more or less everybody hits on the cache, that [inaudible] parallels [inaudible] no synchronization [inaudible] on the multicore [inaudible] as good as it gets. Is that right? >> Frank Vahid: Well -- >>: Was there anything you could do [inaudible] multicore to make it better? >> Frank Vahid: On the applications, we didn't make any assumptions, okay; we just chose applications that had lots of parallelism, because we wanted to show the potential of thread warping. In terms of the assumption that I was stating, the assumption was working against us. I was giving the multicore system the absolute benefit of the doubt, assuming that data was always available when it needed it; whereas for the FPGA numbers, where we're trying to show the potential of that, we were accurate. So when I said that I made assumptions in terms of the access of data, those were working against thread warping. >>: And the cores [inaudible] single-thread simplistic things, not [inaudible]? >> Frank Vahid: ARM11s, yeah. Which are good embedded processors, but they're nothing compared to desktop, of course. Yeah. >>: So were you using a precomputed offline profile to find these hotspots that you're accelerating on the FPGAs, or are you finding those on the fly as well? >> Frank Vahid: So let me show you. All the data that I showed you doesn't take into account the time to do the actual CAD, right, which is what you're getting at, right? Okay. So how can we then apply this technique for real? When can we actually do this sort of warping? So we have this SGI Altix machine that has the 64 Itaniums and the FPGAs. Some jobs I run on there run for dozens of hours, if not multiple days. You're doing storm simulations and N-body simulations, you know, physics-type calculations and so on. So those are very long-running applications -- might be dozens of hours. Whereas the CAD might be a couple hours, just using normal tools, not our 10x faster tools. Just using normal Xilinx tools, the CAD might be a few hours. And once the CAD is done, then you switch over. Instead of running in its normal software mode, at this point it switches over to using the FPGA. And so then it runs to completion in much less time. So you can get speedup that way. So for these long-running applications, warp processing is immediately applicable. The other scenario where it becomes applicable is in recurring applications, which is common in embedded as well as desktop systems.
And so here, when the application first runs on the microprocessor, we might fire off our on-chip CAD tools, but the on-chip CAD tools might run 10 times longer than the actual application. So the application may end and our CAD tools still haven't finished yet, okay. But they can keep plugging away, just keep churning away; that application may come and go several times and we may not be able to do anything for it. But then at some point in the future, when the CAD is done and we load up that application, we say, oh, we've got an accelerator for it, let's use it now. And so from that point on, whenever that application runs, you can get that sort of speedup. So does that sort of address your questions? >>: Yeah, that's a very big part of it, just that I was also curious about, before you run the CAD and so on, you also -- presumably, you know, you're only building specialized circuits for particular subkernels. And my question is, if you're using profiling, et cetera, to find those, is that also included in this, what I'm seeing on the screen, or is that also run magically offline in some space that we're not seeing the [inaudible]. >> Frank Vahid: So the profiler was part of the whole warping process. If you remember the initial slide I showed, the things that were going by on the bus, we have a profiler and the CAD tools. Those are both dynamic. So you don't have to know beforehand what the kernels are. Is that what -- >>: And so it is finding it during this initial microprocessor -- >> Frank Vahid: Yes. The profiling is part of the -- yeah. Yeah. >>: When you run the application a second time, how do you make sure you're running exactly the same application, not one revision up? >> Frank Vahid: Good question. It depends on the platform we're dealing with. The complexity of solving that problem depends on the platform we're dealing with. >>: One thing that [inaudible] did back when they were introducing the Alpha was to compile overnight large binaries [inaudible] the next day [inaudible] deployed systems, you know, changing the binaries is actually a serious [inaudible]. >> Frank Vahid: That's a good point. I should use that as an example of this concept of offline -- or runtime -- optimization having been used before. >>: The other thing [inaudible] one thing we can do with an algorithm [inaudible] processor since you have as much time as you wish anyways, and that application would run for months [inaudible]. >> Frank Vahid: Okay. The most common question that I usually get is, why don't you just do this statically? Why are you going through all this trouble to try to do it dynamically? And this is the answer. First of all, the static approach definitely has a big role to play: offline, figure out what part should go where, create a binary that consists of the microprocessor binary and the FPGA binary, and distribute that. That has a big role. But it has some limitations. It requires a specialized language in many cases, and it definitely requires a specialized compiler. And people don't like using specialized tools. The software industry is very big; FPGAs are still a relatively small part. And we can't ask -- it's sort of like the tail wagging the dog. We can't ask the whole software industry to start supporting FPGAs when we know that it's going to be a smaller component of people that are going to use it.
So if we can do it dynamically, you can use any language, any compiler; you can have object files that you're getting from third-party vendors and put that all together into a binary. You know, you don't have to have any knowledge of the FPGA existing. It doesn't affect your tool flow at all. So that makes it available to everybody, just like caches are available to everybody. If you have warp processing, it enables this concept called expandable logic, which I named that way to compare it to expandable RAM, expandable memory. So think about expandable memory. You have your binaries, they're all downloaded on your processor, you've installed them, you've been running them, and you decide you need more power. Things are running too slow. So what do you do? You add more RAM. And invisibly the OS detects it and uses it and you get better performance. So, likewise, we'd like to be able to support expandable logic, logic meaning the FPGA. So you might have no FPGA, or you might have one FPGA originally, and you decide, I'm not getting enough performance. So what you'd like to be able to do is just add in more FPGA and get better performance, without having to reinstall anything or recompile anything. This just shows how expandable logic can be added to improve things. Here's an N-body simulation, which is a physics-type simulation. Adding some logic helps, but at some point it just tapers off. On the other hand, here's an image processing application where the more FPGA you added, the more performance benefit you got. So up to 250x, and it may have gone even higher if we could have added more FPGA. So, you know, for most examples there's someplace where it tapers off, but for many you can keep adding and it keeps growing. These are our most recent results, by the way. This is just a few months ago, out of the University of Florida, and this is now comparing to a 3.2-gigahertz Intel Xeon. And you can see we're still getting speedups of up to 8x compared to this very high-end -- fairly high-end processor. And this is taking into account all the communication over the PCI bus and so on. So I thought that was kind of neat that he was able to get that running. That's Greg's work. >>: [inaudible] multiple boards? Sounds like you plug multiple RAM chips [inaudible]. >> Frank Vahid: The expandable logic concept? Yeah. >>: Okay. >> Frank Vahid: Yeah. This is some very recent work that actually Scott Sirowy had done. So it depends a lot -- the question is, can you take these circuits that people have been putting on FPGAs and can you model them in C? So I've been looking at the opposite for most of this talk. I'm just looking at C code, where people had no idea FPGAs existed, and seeing if we can speed them up. But there's a whole community out there that's designing circuits for FPGAs. Can we model their circuits in C, have them execute on microprocessors, and then speed them up with FPGAs? And so what Scott did is he went through -- there were about 70 examples from a conference called FCCM, Field-Programmable Custom Computing Machines. We found about 70 papers that talked about special circuits that were designed just for FPGAs to speed up computation. And he looked at every other one, basically half of them, and tried to see if he could write C code that he could then synthesize back to the circuit. And it turns out in 82 percent of the cases he could. >>: I'm just curious [inaudible] deceleration? >> Frank Vahid: Is that one of the examples?
>>: [inaudible] >> Frank Vahid: Scott, do you remember what that is? >> Scott Sirowy: I don't remember what that one was offhand, no. >>: Strange title. >> Scott Sirowy: Yeah. >> Frank Vahid: Sounds easy, right? I can slow down software. Yeah. There must be something to it. I don't know what it is. Anyway, we found out that 82 percent of those circuits that people had designed for FPGAs as circuits, we could model in C and synthesize back. So that's very promising. That tells us that FPGAs used for computation can be used in an environment where there are microprocessors, and we just move things over to the FPGA as needed, to the extent that the FPGAs are available. Okay. And these were some of the results that Scott got, where he showed that the custom-designed circuits that the people had published in the papers had some speedup. We normalized all those to 1. And then there's the speedup that we got by writing C code and synthesizing it down to a circuit; you can see we get similar speedups along the way. And in some cases we -- oh, these are -- going up is bad here, right, in this case. Yeah. So in some cases we were slower, but in many cases we were the same. Okay. Have you guys heard the parable about the three blind men who walk up to an elephant? One grabs the tail, one's holding the -- touching the side, and one's grabbing the trunk, and then they have an argument: One says an elephant's like a rope; the other one says, no, an elephant's like a wall; and the other one says, no, an elephant's like a tube, you guys don't know what you're talking about. Well, I like that parable in terms of FPGAs here. So the whole world has been looking at software for many years now as microprocessor instructions. And I think those of us in the FPGA community are starting to say, look, there's another aspect of software: FPGA circuits. It's not just instructions anymore; this very spatially oriented, highly parallel way of doing computation is just as much software as instructions are. And we need to start thinking of computation in terms of these spatial ways of doing things, because we can get huge speedups in execution if we do. I used to have this on the other side, at the tail, and I got comments about how that looked kind of bad, so I put it at the front there. So warp processing brings -- potentially brings the FPGA speedups to all computing, because we make it invisible. So we got a patent back in 2007. It's been licensed to these companies via SRC. Microsoft isn't part of SRC yet. So you guys are starting to do more and more architecture research, FPGA research; hopefully you guys will come into SRC soon. And we're doing extensive work on this right now. Scott's working on it. We have a couple more students working on online CAD algorithms, architectures, and how to model things at the high level in order to get them to work well on FPGAs. So any other questions? Yes. >>: If you go back to the 82 percent slide, a lot of [inaudible] says requires extensive modification to architecture. Let's see. Heavy modification of original algorithm. It's not very common. It's a few of them. >> Frank Vahid: Uh-huh. Heavy mod, okay. >>: [inaudible] did you [inaudible] the FPGA or took the [inaudible] -- >> Frank Vahid: Well, when we say the original algorithm, sometimes what these papers did was they took some common algorithm, let's say quicksort, and they said, how can we implement this as a circuit?
And they went down and implemented some crazy circuit that didn't look anything like quicksort anymore. It was maybe mergesort or something like that. And then we said, okay, can we take that circuit and reverse engineer it back to something in C that will synthesize back to that same circuit? So when we go back up to C, it doesn't look anything like the very intuitive algorithm. It's some very spatially oriented thing. So instead of just having a for loop that's walking through an array, maybe we have 12 functions and each one's connected by global variables or something like that. So that's what we mean by heavy modification, right? But it will still execute on a microprocessor correctly and it will still synthesize down to circuits. >>: [inaudible] you were making the case for [inaudible] as opposed to [inaudible]. >> Frank Vahid: We're saying C is surprisingly effective at modeling these circuits, yeah. As long as you have a good high-level synthesis tool, you can regenerate these circuits quite easily. So you don't really have to do register transfer level modeling for a lot of these things. Now, that's true for this domain because these are mostly computations; they're not timing specific. And so, you know, it doesn't really matter that we lost the timing, some of the clocking information. If there is some clocking information, it's mostly in terms of coarse-grain pipelining, which we can capture at the C level by different functions, and a good high-level synthesis tool will result in a good pipeline for that. Any other questions? >>: It's another [inaudible] that you heard from [inaudible] -- >> Frank Vahid: Um-hmm. >>: -- who was looking at decompilation from the point of view of [inaudible], so take something, compile it down and then decompile it and see if the [inaudible]. >> Frank Vahid: Um-hmm. >>: And I wondered, is there something that you could think of adding into the [inaudible] here, so not just beating out but adding other functionality into -- now that we're paying the price for doing all this work, is there something else that we can get out of it? >> Frank Vahid: Yeah, that's a good idea. So what are some other avenues you could get out of it? You could get some sort of verification ability, right -- some sort of checking, perhaps, where you regenerate a high-level model that will execute reasonably fast, and then you could check to see if the results that your circuit is creating are consistent. >>: [inaudible] >> Frank Vahid: Hmm? >>: Portability. >> Frank Vahid: Portability, yeah. Yep. Reverse engineering to a high-level spec that people could then manipulate if they needed to. Yeah. It's an interesting point. Several avenues could come from that, yeah. Yeah. >>: So how do you use it today? Let's say I want to speed up my program today and want to just use FPGAs, even for some custom scientific [inaudible] filtering, anything where you're trying to do really intensive computation on very simple operations but you want to go very parallel. Do you plug in a board? I mean, what do you use to do that? >> Frank Vahid: Well, what we did was we showed the feasibility of this idea of warping. And so now I think it's up to people who have platforms to consider it. Right now, everybody who has a platform is pretty much focusing on, okay, how do we program this thing statically. And I think it's up to them now to start saying, okay, well, it's feasible to do it dynamically; for our particular platform, what does that mean, right?
Do we need to put a CAD tool on a third core here and start dealing with this, or what? So, yeah, we'll see where it goes. Okay? >> Alessandro Forin: Questions? >> Frank Vahid: All right. Thanks for your time. [applause]