>> Doug Burger: Good morning. It's my delight today to introduce Kermin E. Fleming, who goes by Elliott. The E is for Elliott, and he's visiting us from MIT as an FTE hiring candidate who is, I think, very intellectually aligned with a lot of the work going on here at Microsoft, and he's done some really tremendous work as part of his dissertation, both on systems, FPGAs and compilation to FPGAs. So really excited to hear your talk. And thank you for visiting us.

>> Elliott Fleming: Pleasure to be here. So my name is Elliott, and today I'm going to talk about how we can scale programs to multiple FPGAs. So before I get started, I'd like to thank everybody that I've worked with. So basically, my advisors were Arvind and Joel Emer, and all of these folks here were involved in the papers covered in this talk. And here's a bibliography. So the compiler work is categorized under LEAP, Airblue is a wireless transceiver project, and there are a few other designs presented.

So let's get started. So basically, FPGAs have traditionally been used as ASIC prototyping tools. They're drop-in replacements for ASICs. However, recently, FPGAs have gotten quite large and also much easier to integrate into systems, with PCIe, ethernet, various IOs, and so now we can talk about designing big systems with FPGAs in them as first-order compute, with the goal of accelerating some algorithm. So we have some algorithm that we were running in software. The software is not fast enough or maybe burns too much power, and so we want to run it on an FPGA.

Okay. The goal here is time to answer. So time to answer has two components. One is, of course, accelerating the algorithm so it runs faster. But the other is also to reduce the amount of time that it takes an engineer to build an implementation. Okay?

The second goal is functional correctness. So here, unlike in traditional FPGA flows, where we cared really about preserving the behavior of the ASIC that we're going to produce and we have to make sure it works right, otherwise we've wasted a lot of money on a mask set, here we only care about functional correctness. That is, that whatever answer we wanted to compute was computed correctly. And, of course, as fast as possible.

Okay. So here are a couple of examples of this kind of program. One is HAsim, which is a simulator for processors, and the other is Airblue, which is a framework for building wireless transceivers, which are actually compatible with commodity hardware, so you can talk to, you know, your [indiscernible] base station. Okay. And again, the goal here is functional correctness for both of these codes; as long as we produce the correct answer within some high order time bound, we're good.

So now that we're writing programs on the FPGA, we can ask the question, what happens if our program is too large? Remember, FPGAs, of course, are structural things, so we can, unlike on a general purpose processor, express a program that is too big to fit onto the substrate. So here, we're laying down CPUs and eventually, we have too many to fit on the single FPGA, so what are you going to do? Right?

So one thing we can do is optimize, so we can try to make our design smaller. That works to first order. Or we can go out and buy the biggest FPGA we can. Again, you know, these are patches. But at some point, we have to use multiple FPGAs. So what does that entail? One, we're going to have to partition our design. So here, you know, there's quite an obvious partition, right. We just put the CPU on the other FPGA.
Two, we're going to need to map this partitioned design down onto multiple FPGAs. And finally, we'll have to synthesize some network between them.

And, of course, we can always do this manually. So we can take our engineers and have them implement this whole thing, and they'll run it and it will probably run quite fast. However, this can be tedious and error prone, particularly if one is exploring a design space and needs to change the implementation. And the question is, can we do this automatically? So the remainder of this talk is going to discuss how we can achieve this goal automatically.

>>: Automatic is non-tedious and error prone?

>> Elliott Fleming: Yes, non-tedious and not error prone. Right. Probably error prone is the most important part there.

So before we get started on how the compiler actually works, let's talk about what we should expect when we map a design to multiple FPGAs. So more FPGAs mean more resources, and just like in a software program when we throw another core or a better cache hierarchy at a problem, we should expect more performance. So one thing in FPGAs, one metric of performance is the problem size that we can implement, and what I'm going to show is that one of the examples actually can be ten times larger when it fits on multiple FPGAs. So on a single FPGA, we can fit a 16 core model, and on multiple FPGAs, two to be precise, we can fit a model that can model up to 121 cores. So that's about a 10X problem scaling for this particular problem.

Also, when we give more resources to a problem, just like in software, we should expect it to run faster. This can happen for a number of reasons. For example, you get more DRAM banks on multiple FPGAs, but also, since you're asking the tools to solve a simpler problem when you partition a design, sometimes they can come up with frequency scaling as well. What I'll show you is that one of our examples can actually achieve a super linear speed-up when mapped to multiple FPGAs. So this is performance normalized to a single FPGA. And up is good.

Okay. So in summary, what can we expect? Design scaling, so we can get bigger designs when we have more resources, more FPGAs. We can get faster run times. And then finally, although I'm not going to discuss this, we can also get nonlinear reductions in compile times, because again we're asking the tools to solve simpler problems.

Okay. So the good news is, so again our goal is to sort of produce these implementations automatically, and the good news is that multi-FPGA compilers exist commercially. So if they were good, we could stop, right? And they operate on arbitrary RTL, which is also good. The problem, though, is that they have to preserve cycle accuracy. So what is cycle accuracy? Cycle accuracy is a bijective mapping between the original RTL description, which was clocked, right, and whatever implementation we put on the FPGA. So in the FPGA implementation, there's a precise way to resolve the original behavior of the RTL. And so kind of what you can see here is that the model clock, which represents the original behavior of the RTL, is ticked infrequently. The FPGA clock, of course, is running very fast, and then between model clocks, we're doing some kind of communication between the chips in order to preserve this idea of cycle accuracy. Oh, and feel free to stop me at any time if you have questions. Okay.
So of course, you can see how this would be useful in ASIC verification, because we want to preserve the behavior of the RTL, because if we make any mistake in that translation or if our RTL is in any way buggy, we could break our chip. The problem, though, of course, is that cycle accuracy gives us low performance. So what you can see here is that the FPGA wants to be fast. It wants to run fast. But because we're having to preserve this cycle accuracy, we're actually going at a very low speed relative to what we could get out of the FPGA.

And again, this comes from the need for distributed coordination. It also comes from the fact that there are very poor semantics here. Here, in maintaining cycle accuracy, we have to transport every bit just in case some logic might behave funny depending on whether a bit, even an invalid one, was transported, right. So you can imagine here that if this data is invalid, so the control here is invalid, right, but we still have to transport all the data in case some point in our circuit might misbehave. So we have -- we could send some random data vector here, right. But that might cause a bug in our design. We don't know. And if our objective is verification, of course, we need to preserve that behavior so we can fix it. Yeah?

>>: Sorry if this sounds like a moderately hostile question, but it seems to me like you're setting up a little bit of a straw man here. You know, you're saying, well, I want to synthesize RTL to a large logical FPGA, but then I'm going to partition it to multiple ones and I've got, you know, comparatively [indiscernible] slow communication and low bandwidth between them, and so that won't work well unless I partition the design.

>> Elliott Fleming: Well, so --

>>: And it really does seem like a straw man, because there's no hope of getting to that magical point where you can partition any design and have it run at your FPGA or your --

>> Elliott Fleming: Precisely, and that's why we're not going to partition an arbitrary design. We're, in fact, going to restrict designs in a way that leads to good partitioning. So basically, we're going to give programmers a new primitive, which we'll talk about in the next couple slides, that will enable them to describe designs in a way that we can easily map.

>>: Okay.

>> Elliott Fleming: So we'll see how that works. Okay. So again, here we're preserving cycle accuracy, but remember what I said. The goal of this new use case for the FPGA is functional correctness. As long as we get the right answer, we're happy. So the question is, do we actually need to preserve all of this cycle accuracy? The answer, of course, is no.

So what I'm going to advocate is this new style of design called latency insensitive design. The basic idea here is that inter-module communication occurs only over latency insensitive channels. The idea is to decouple the behavior of different pieces of the design from one another so that we can change their implementation. Okay? Changing the timing behavior of a module then does not affect the functional correctness of the design, right; as long as the data flows between the modules and that data flow is preserved, then the behavior of the design will be the same. The functional behavior, right. Of course, the timing behavior will definitely change.

Many hardware designs already use this methodology, so most hardware designs are described in terms of these FIFOs explicitly, for the obvious reason that there are many unpredictable latencies in hardware designs.
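To make the contract concrete, here is a minimal sketch of a guarded, latency insensitive channel, written in Python as a model rather than in an HDL (the class and method names are illustrative assumptions, not the actual LEAP primitives): the producer may only enqueue when there is room, the consumer may only dequeue when data is present, and nothing about depth or latency is promised.

    from collections import deque

    class LatencyInsensitiveChannel:
        """A guarded FIFO: enqueue only when not full, dequeue only when not empty.
        Capacity and latency are implementation choices the compiler may change."""
        def __init__(self, capacity):
            self.capacity = capacity   # unspecified, from the program's point of view
            self.buf = deque()

        def can_enq(self):
            return len(self.buf) < self.capacity

        def enq(self, data):
            assert self.can_enq(), "producer must respect back pressure"
            self.buf.append(data)

        def can_deq(self):
            return len(self.buf) > 0

        def deq(self):
            assert self.can_deq(), "consumer must wait for data"
            return self.buf.popleft()

    # A design written only against can_enq/enq/can_deq/deq stays functionally
    # correct whether this is a one-entry FIFO or a deep inter-FPGA link.
    ch = LatencyInsensitiveChannel(capacity=2)
    if ch.can_enq():
        ch.enq("token")
    if ch.can_deq():
        print(ch.deq())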
Of course, you know, hardware designers also want to do design space exploration. Why use this methodology? Again, to improve modularity, to improve design space exploration. And today, what we do is we simply insert FIFOs, guarded FIFOs, between the modules in the design, and we don't enqueue data into the FIFO unless there's room, we don't dequeue unless there's actually data in the FIFO, and we express our design in those terms. This is a very simple model, okay?

So let's think about that a little bit. So what I said was we could change the behavior inside of any module in any way we wanted to while preserving the functional correctness. But what this implies is we can also change the behavior of the channels themselves. So if I can change the behavior of the channel and I have a design described this way on an FPGA, mapping to two FPGAs is straightforward: I simply stretch the channels between the boundaries. And, of course, logically, these are still FIFOs, okay?

But there's a problem. I can have lots of FIFOs in the design and not all of them can have this property, because remember that a compiler, an RTL compiler, sees only wires and registers. It can't even tell, probably, that there's a FIFO here. So semantically, it may see some wires and registers with some logic, but it's very difficult to even determine that there's a FIFO. And additionally, of course, as we discussed, reasoning about cycle accuracy is difficult. So it's very hard in these things even to decide whether or not it's safe to add an extra pipeline stage in a FIFO. But the programmer knows about this property, this latency insensitive property. He expressed his design this way to sort of get the benefit of modularity, right. So what are we going to do? We'll just give the programmer a syntax to describe these kinds of latency insensitive FIFOs. Yeah?

>>: So can you give me a more precise description or semantics of what you mean by latency insensitive?

>> Elliott Fleming: So what I mean is -- first, let me clarify that latency insensitive does not mean that we don't care about latency. So this is a common confusion when we use the term latency insensitive, right. What it means simply is that we're free to change the behavior of the FIFO. This will come up. So we'll get to it in a couple clicks, but basically, we're free to change the behavior of the FIFO, and the programmer is asserting that they've described the rest of their design in a way that permits us to make this change. So, for example, they won't try to enqueue data into the FIFO if the FIFO is full. So they're leveraging the back pressure on both ends of the FIFO. Again, this is not the only way to write a FIFO in your design. You're always free to use the register and wire FIFO.

>>: So is latency insensitive then defined in terms of this particular implementation technology of FIFOs? Is that the only way to characterize it?

>> Elliott Fleming: I don't think so, but it's hard for me to imagine any other way of characterizing it.

>>: Asynchronous logic?

>> Elliott Fleming: What's that?

>>: Asynchronous logic?

>> Elliott Fleming: Yeah, you could think of it that way, perhaps. So that's fair. You could think of it as asynchronous logic and that whole field. That's sort of what we're doing here. Again, compute is happening on data flow tokens and we're decoupling the notion of clock from compute. I mean, that's the fundamental difficulty, right: in cycle accuracy, you know, clock is the first order thing.
Here we're trying to remove the notion of clock so that we can perturb the design in ways that are beneficial to the programmer. Yeah?

>>: So indeed, if you do have these enqueue commands on one end and the data valid [indiscernible], I think you were saying, how would you have a latency insensitive channel that you can't -- that you can't stretch?

>> Elliott Fleming: So the question is why aren't all FIFOs latency insensitive?

>>: Exactly.

>> Elliott Fleming: The simple reason is because you may make assumptions about the particular implementation of a FIFO. For example, you may make the assumption that this FIFO has a depth of one. That is, when I enqueue something into it, it will be full, and that will maybe -- you'll use that control logic to determine some other things in your pipeline. So, for example, you may say if this thing is full, I will issue some other request.

>>: So FIFOs don't have proper flow control?

>> Elliott Fleming: No, the FIFO may have proper flow control, but you may make some assumption about the buffering, for example. You could also make an assumption about the latency. But I think the more common case, at least in the designs that I've worked on, is you make some assumption about the depth of the FIFO. For example, that it has one or two buffer slots, and you write other logic to expect that. And so actually adding more buffer slots than perhaps one or two would break your design, because that assumption that you baked into the logic is no longer true, right. I mean, you could imagine having a single entry FIFO with flow control, you know, not full, not empty, right? And then using that assumption that it's got a single buffer slot in it to actually implement some other logic. I mean, people do that. The basic issue is if you allow people to leverage that assumption, then it makes it very difficult to make the kinds of changes that I'm going to propose in the next set of slides.

>>: Seems to me that the thing that you want is to find large regions of code that have no recurrences.

>> Elliott Fleming: Large regions of code that have no recurrences.

>>: In other words, if -- think of it like a pipeline, you know. You're partitioning, and the degree of latency insensitivity is the amount of compute you can do decoupled through a FIFO before you have to go back and close the loop.

>> Elliott Fleming: That's right, so feedback. That's right.

>>: And we've known this -- pipelining -- for decades, and so if your tools can analyze your design, find the partitions, feed that back to the regions [indiscernible] partitioning, that's how you can actually map this. Are you taking an approach like that?

>> Elliott Fleming: So we haven't studied mapping, right. So mapping hasn't become a problem for us yet. I'll talk a little bit about how we do mapping, but it's quite naive. But you can imagine some approach like that being necessary as a refinement to this. However, I'll also point out that generally speaking, even if there is feedback in a pipeline, very often hardware pipelines will have a pipeline depth that is sufficient to cover the latency of inter-FPGA communication. This is certainly true in DSP algorithms. I think it will be true in others. So, for example, HAsim, because of the way it's implemented, kind of as a time multiplexed pipeline, will actually have enormous potential to hide the latency of communications with useful work.

>>: I'm just thinking about arbitrary applications.

>> Elliott Fleming: Sure.

>>: And a general underlying approach.
>> Elliott Fleming: Yeah, so generally speaking, yes, you would want to try not to partition across feedback paths too often, I think. But we don't have any way of doing that automatically now.

Okay. So anyway, here's the syntax. Basically, one frames one's design in terms of sends and receives. And at compile time, the compiler will choose an implementation. One example of an implementation is just the vanilla FIFO that you could have written anyway. On the other hand, the compiler may choose to synthesize a complicated network, again, depending on placement and other design goals. And again, here we have an explicit programmer contract. When the programmer writes down the send/receive channel, right, he's willing to accept unspecified buffering and unspecified latency. And he's guaranteeing that he's written his design in a way that will admit that choice by the compiler. So, of course, you know, the programmer can write a buggy design and the compiler will happily generate a buggy implementation. It's more of a programming tool in that sense, okay?

However, generally speaking, we found that this primitive is pretty easy to use. Often, you can take FIFOs in your design and just substitute them out. And it can be a simple substitution, and this has been our experience for most FIFOs, except the ones which I was describing in answer to the question in the back, wherein you're using the FIFO for control -- the depth of the FIFO for control.

Okay. So now we've kind of talked about a syntax for describing latency insensitive designs, and we've seen that latency insensitive designs can at least in theory be mapped to multiple FPGAs. So now let's talk about a compilation flow to do that. Afterwards, we'll talk about how we synthesize networks between FPGAs, okay?

So what we're going to do is we're going to start out with an arbitrary RTL augmented with latency insensitive channels. So what we've got here are little state machines that can be any RTL, or they could be software, for that matter -- it doesn't necessarily even have to be RTL -- connected by latency insensitive channels, shown by the dotted lines. So we just have some graph, okay? Again, I mentioned that to produce a multiple FPGA implementation, we need three phases. First, we have to build a graphical representation of the design which we can partition. Then we're going to have to take that partitioning and map it down onto some network of FPGAs. Finally, we're going to have to synthesize a communications network carrying the -- yeah, sorry.

>>: Who's responsible for establishing that the RTL behavior doesn't change with the latency of the channels?

>> Elliott Fleming: The RTL behavior may very well change, absolutely. The point is that you're asserting that those behavioral changes are not going to impact functional correctness, okay. Again, the RTL behavior will absolutely change, in much the same way that your RTL behavior would change if you interposed a level of cache hierarchy, right. It will change. But it's the programmer's job to ensure that these changes don't perturb the functional behavior of their design. In practice, this is not a very difficult thing to do.

>>: When you say RTL behavior, are you talking about timing events?

>> Elliott Fleming: Timing, yes.

>>: It's correctness [indiscernible].

>> Elliott Fleming: Oh, yeah, correctness could be --
If you wrote a bad design or at least a bad design in terms of, you know, this property, right, you could very well get an incorrect implementation. Although, of course, I'll ask the question if our design was too big to fit to begin with, how would we have implemented a transform to preserve that correctness? We would have just -- we can do it, right. There are tools that's do it. But you pay out your performance. So you lose order of magnitude of performance to preserve the property. So this is the trade-off that we're making here is instead of preserving that exact timing correctness of the original RTL, we're giving freedom to the designer to express points at which that behavior may safely be changed, and we're going to leverage that. >>: [indiscernible]. >> Elliott Fleming: This is a [indiscernible]. So for full disclosure, we actually implemented [indiscernible] for a number of design choices, mainly because [indiscernible] is easier to augment with compiler-like features just like Haskell, its predecessor. But you can imagine these RTLs being [indiscernible] also. So you can think of this as just putting these send and receive points into [indiscernible], that's certainly admissible. Okay. And then finally, what we're going to do is given this sort of implementation, we'll produce an RTL for each FPGA and you can run it through the back end tools to produce implementation. Okay? All right. So first thing we do is we're going to construct a graphical representation that we can partition. Of course, remember that the only thing we know how to modify are these latency insensitive channels and that's going to kind of give us this graph structure over here, where we have blobs of RTL connected by latency insensitive channels. We call these latency insensitive modules. Although I've shown RTL here, again, it is possible to put whatever kind of computation you'd like in there, including software. As long as it ascribes to the latency insensitive channel communication model. And then what we're going to do is chop up the design in this way and map it down on to a set of FPGAs. And again, the vertices [indiscernible] and the edges are latency insensitive channels. And here's an example of the syntax here, so we have some channel A, and it induces the edge here. 12 So now that we've got that representation, our next objective is to place it down on to a network of FPGAs. So the first thing we need to know is actually what the topology of -- yeah? >>: So you mentioned previously [indiscernible] this borders between devices is that I'll have one cycle data out to data in. >> Elliott Fleming: No. Not at all. >>: Every cycle, I can produce data in such that the bandwidth between each of the devices is full then. >> Elliott Fleming: No, not necessarily. Again, these are just queues, right? So if you don't put anything into the queue, there's no communication at all. >>: Fine if I push something into the queue every cycle, then the interfaces between, say, the block A and the block B is insufficient to run at that speed. So, for example, the [indiscernible] that you described very early in one of your very early slides is [indiscernible] because the bandwidth externally is so much smaller than the bandwidth internally. >> Elliott Fleming: That's true, and we'll do [indiscernible] in this approach too, but we'll also have back pressure, okay. So if the bandwidth -- oops, sorry. I hit the wrong button. If the bandwidth on C is insufficient to carry all the traffic between A and B, then A will stall. 
A will get back pressure and A will stall. The hope, of course, is that as things scale up, you'll get more and more bandwidth. But it is a problem. If C is enormous, if C is 10,000 bits wide, then yeah, yeah.

>>: So how do you know how many pins to allocate?

>> Elliott Fleming: We'll talk a little bit about that when we talk about compiler optimization. But generally speaking, first of all, the conception that these are pins carrying traffic between FPGAs is a little bit mistaken. They actually turn into these high speed transceivers, right. So actually, there aren't pins at all in some sense. But we will talk about how exactly we allocate the bandwidth of the transceiver, you know, in a sort of intelligent manner, perhaps in ten slides, or maybe less than that. Okay, did that answer your question? We'll talk about bandwidth allocation at some point in the future.

All right. So anyway, we need to know what the physical topology of the FPGAs is. This is a little syntax for describing that. So basically, we have two FPGAs, FPGA zero and FPGA one, and they're connected by bidirectional channels. And, of course, you could scale this up to have whatever system topology you would like, even though I'm only showing a short example here. And so this is the physical system, where we have two FPGAs and they're connected by some high speed transceivers, okay?

And then what we'll do next is we'll map the modules, based on area. So, of course, you have to have a feasible implementation, but also you want to minimize communication between the FPGAs; at least, ideally we would have some algorithm that did this automatically. So you get some mapping like this, where A and B are on one FPGA and C and D are on another, but currently this requires user input. The user's going to have to tell us which module goes where. Now, of course, an important piece of future work is doing that automatically. I'd like to point out at this time that this configuration file is the only thing in the design that differentiates a single FPGA implementation from a multiple FPGA implementation, or even a three or four or however many FPGA implementation. This is the only part of the input to the compiler which is changing. The program itself is fixed, which is, of course, an important property.

All right. So now that we've done this mapping, we have to synthesize the network. So basically, this entails choosing an implementation for each of the channels. Local communications, of course, just turn back into vanilla FIFOs, so they're just this sort of ideal high bandwidth interconnect. However, remote communications will actually go through some kind of network hierarchy, which we will synthesize based on the program. We'll talk about that in the next set of slides. So all of those channels will be tied down to some router, which will manage the FPGA interconnect. Okay. And, of course, this link will appear as a FIFO, but the routers themselves will be quite complicated.

So we've seen a flow of how we can get from RTL to a multiple FPGA implementation, and now we'll talk about specifically how we build the routers. So here's a cartoon of the network architecture, right. So basically, the program is seeing FIFOs with back pressure, okay. And these FIFOs are going to be multiplexed onto the router infrastructure, okay. So this programming model is quite simple. The hardware to support it is actually quite sophisticated.
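As a rough sketch of what the platform description and module placement just discussed might look like, expressed here as Python data structures rather than the actual LEAP configuration syntax (all of the field and module names are illustrative assumptions): the topology lists the FPGAs and their bidirectional links, the mapping assigns each latency insensitive module to an FPGA, and only this configuration changes between single and multi-FPGA builds.

    # Hypothetical platform description: two FPGAs joined by bidirectional
    # high speed transceiver links (names are illustrative, not LEAP's syntax).
    platform = {
        "fpgas": ["FPGA0", "FPGA1"],
        "links": [("FPGA0", "FPGA1"), ("FPGA1", "FPGA0")],
    }

    # User-supplied mapping: which latency insensitive module goes where.
    mapping = {"A": "FPGA0", "B": "FPGA0", "C": "FPGA1", "D": "FPGA1"}

    # Channels between co-located modules become vanilla FIFOs; channels that
    # cross the mapping are handed to the synthesized inter-FPGA network.
    channels = [("A", "B"), ("B", "C"), ("C", "D")]
    for src, dst in channels:
        kind = "local FIFO" if mapping[src] == mapping[dst] else "routed link"
        print(src, "->", dst, ":", kind)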
So basically, we have this automatically synthesized layer of network hierarchy. So the first layer is marshalling: we have to be able to handle some wide links and convert them into a fixed packet size. Then we have to have some virtual channel buffering to ensure that channels don't block each other. And then finally, to improve the parallelism, we'll actually run multiple lanes across the link in order to try to soak up as much inter-FPGA bandwidth as we can. Is this clear?

So the first thing we do is channel marshalling. So in the original user program, of course, the channels can carry whatever data types the user would like between FPGAs. But the network width is fixed, so we have to introduce some layer to packetize the data types. So for very wide data types, of course, you just do the shift register. But for narrow data types, we'll actually just pack everything into a single network word. And we will do this based on the links, so this will be automatically chosen by the compiler. And this is actually important, because remember, we're working on hardware designs. And, of course, hardware designers are always trying to economize on bits, and it turns out in many hardware designs, the width of the channels is actually quite narrow. This is an example from HAsim, and what we see here is basically that the overwhelming majority of channels are narrow. Okay?

So the next layer is channel multiplexing, right. So the good news is most channels actually don't have a lot of activity, and remember, we're only carrying data between FPGAs when data is explicitly enqueued, so if there's no activity, then there's no bandwidth consumed. The bad news is we don't control message creation and consumption, and this can lead to deadlocks, because we have a shared network infrastructure. Let's see how that can happen with an example. So we need both A and B to do the star operation. A sends a value and, of course, A is going to send again, right. And you know how this works. So now B is going to send something that we actually need in order to proceed, and, of course, it's got head-of-line blocking, so we're deadlocked.

How do we solve head-of-line blocking? Well, one option is we could try to compute the dependencies and do something intelligent with virtual channels. But in reality, we'll just give every channel its own virtual channel. Okay? How do virtual channels work? Oh, so this is going to be deadlock-free via the Dally and Seitz theorem, because, of course, we've broken all the channel dependencies; since each channel has its own virtual channel, we can't have a deadlock. So how does this work? Well, now A sends, but A doesn't have any more flow control credits, so it can't send again. So B will send and, of course, it's now out of flow control credits. Now the operation can proceed, and we'll send flow control credits back, and A can proceed again. This is very simple. It's kind of how flow control works in general.

Now, of course, we have some options in implementing this. One option, of course, is to use very small buffers. Small buffers are inexpensive. Of course, there's a problem, because inter-FPGA latencies can be quite long, and so if we have small buffers, then if there's a hot path between the FPGAs, we can stall. On the other hand, large buffers are expensive. So if we just give every channel a large buffer, say eight registers per channel, then we end up using most of the area of the FPGA.
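Here is a small Python model of the per-channel credit flow control being described (an illustrative sketch under simplifying assumptions; buffer depths and the credit return path are abstracted away): each logical channel gets its own virtual channel with private credits, so one blocked channel cannot stop another from making progress across the shared link.

    class VirtualChannel:
        """One logical channel's slice of the shared inter-FPGA link.
        Sender-side credits track free buffer slots at the receiver."""
        def __init__(self, depth):
            self.credits = depth        # free slots at the receiver
            self.rx_buffer = []         # receiver-side buffering

        def can_send(self):
            return self.credits > 0

        def send(self, flit):
            assert self.can_send()
            self.credits -= 1           # one credit consumed per flit sent
            self.rx_buffer.append(flit) # stands in for the physical transfer

        def receive(self):
            flit = self.rx_buffer.pop(0)
            self.credits += 1           # credit returned when the slot frees up
            return flit

    # Channel A running out of credits stalls only A; channel B still proceeds,
    # which is what breaks the head-of-line blocking in the deadlock example.
    a, b = VirtualChannel(depth=2), VirtualChannel(depth=2)
    a.send("a0"); a.send("a1")
    assert not a.can_send() and b.can_send()
    b.send("b0")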
And, of course, spending all of that area on buffering is problematic, because what we want is for the user program to have most of the area of the FPGA for its own implementation. So what are we going to do? Well, observe that the channel connecting the FPGAs is actually serial. So what that means is that we're basically getting one data word per cycle. What that implies is that the store for all of our virtual channels also only needs to produce data at one word per cycle, and it will satisfy the full throughput, via Little's law. What this means in practice is we can use a serial structure, specifically BRAM, to store all of these virtual channel buffers. And because, you know, BRAMs are quite dense, we can actually have an enormous buffer for each virtual channel and we'll still be deadlock-free, because the virtual channels don't block each other in the shared structure. Okay? No, yes, maybe? So basically, right, because the link is serial and the BRAM is serial, we won't lose any throughput, but we will have a very deep buffer per channel, which will also cover the latency of the FPGA links.

>>: Are you saying you use a [indiscernible]?

>> Elliott Fleming: Yes. Single write port, that's right. Single write port, single read port, so you get full throughput. A little bit of latency, maybe, but we get full throughput.

So this is what the multiplexor micro architecture looks like. So basically, like I said, we have all the virtual channels mapped down onto the BRAM, and then we have some bookkeeping bits out to the side, also mapped into [indiscernible]. What's that? Oh, okay. So data comes in, it's stored in the BRAM, and we have some arbiter that selects which virtual channel we're reading out of based on the bookkeeping bits. And the great news here is that even if we give enormously deep buffers, more than 100 buffer slots per channel, we use only a small percentage of the FPGA for typical designs. And this allows us to scale the size of our implementations, so we can actually, you know -- we can have connections between several different FPGA devices without overwhelming our area usage.

>>: [indiscernible] using this architecture, previous architecture based on how -- allocation.

>> Elliott Fleming: What do you mean?

>>: This has, of course, the issue of you are kind of doing [indiscernible].

>> Elliott Fleming: That's right.

>>: How do you know when you can afford a [indiscernible] architecture versus --

>> Elliott Fleming: We'll get to that in a couple slides, I think. Okay, so the last level -- actually, we'll get to it in this slide. So at this point in time, we have a fully functional router, so we could lay this down and we'd have a fully working multi-FPGA implementation. The question is, can we do better, and the answer, as you alluded to, is yes, we can do better.

So in order to do better, let's kind of look at the properties of user designs -- specifically, what the widths of the channels look like and what their traffic looks like. What we see here is, of course, as I already mentioned, that channels are narrow, and also that these narrow channels can have some high occupancy, right. So ideally, what we want is to sort of service these channels as best we can. So user designs have pretty low clock frequencies and narrow channels, whereas the inter-FPGA physical layer is very fast and, as a result, hundreds of bits wide. So, of course, you have to do this clock frequency gear boxing, right.
So if the user design is running at 50 megahertz, and the inter-FPGA link is running at hundreds of megahertz, then we have to sort of multiply up its width. And you end up with a few hundred bits per cycle of data that you need to stuff into the PHY in order to get full bandwidth. Okay. And what this is telling us is basically that in the presence of all these narrow channels, sending a single channel at a time is very wasteful. So if we just do a naive time multiplexed approach, we're going to waste a lot of bandwidth, okay?

So how do we do better? Well, we'll have multiple lanes, okay, and they will share the bandwidth. So how does this work? Basically, what we'll do is we'll instantiate several multiplexors on top of the wide PHY, okay, forming lanes. So here we have one multiplexor, two multiplexors, three multiplexors, all on top of the same wide physical layer. And these can all go in parallel, so we can recover some of the parallelism of the system. So we still have some time multiplexing -- of course, these remain time multiplexed -- but they can all transmit data in parallel.

So now that we have the capability of adding these lanes, we have to ask the question: how many lanes should we have, and how do we allocate channels to lanes, right? So these are free parameters in the router architecture. So we could look at the dynamic behavior. Of course, ideally, what you would not do is allocate two channels that are constantly being enqueued at the same time to the same lane, because then, of course, they're fighting each other for bandwidth. We can't really reason about that behavior at this point in time, although maybe with some better analysis techniques, we could. But what we can do is observe aggregate channel loads. So what we can do is instrument the design, look at the traffic across each of the channels in the design, and try to do something with that. The idea being that we'll minimize the maximum load on a given lane. So we take that maximum load as kind of a measure of how fast our program is running, assuming that it's communication bound, and we'll try to make that as small as possible. Okay? Unfortunately, this is a processor scheduling problem, or at least it turns into a processor scheduling problem, that's NP complete, but there is a good heuristic: longest job first.

So how does longest job first work? Okay. So what we have here is a set of channels in a program that we're going to route between two FPGAs. The height of the bars represents the loading, that is, the absolute amount of traffic across the channel, and the width represents the physical width of the channel. So you may have some channels which are wide and some channels which are narrow, and they produce more or less traffic. So the first thing we'll do is sort according to load, and we're going to try to make the situation the best for these heavily loaded channels, okay? Because, of course -- what's that?

>>: Is the right way to think about this that the width represents the packet size and the height represents rate?

>> Elliott Fleming: The height represents total traffic, right, which could represent rate, although, you know --

>>: If you've already taken the packet size into account, isn't total traffic [indiscernible] packet size the same as rate, packet rate?

>> Elliott Fleming: Yes, across the run of the program, right. But I guess what I'm trying to say is there needs to be a distinction drawn between the aggregate behavior across an entire run and dynamic behavior, right.
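As an illustration of the longest job first heuristic just described, here is a sketch with made-up channel loads (this models the allocation policy, not the compiler's actual code; the real flow also folds channel width into the load estimate): synthesize one lane per heaviest channel, then assign each remaining channel to the currently least loaded lane, minimizing the maximum lane load.

    def allocate_lanes(channel_loads, num_lanes):
        """channel_loads: {channel_name: aggregate traffic from profiling}.
        Returns ({lane_index: [channels]}, max_lane_load)."""
        lanes = {i: [] for i in range(num_lanes)}
        lane_load = {i: 0 for i in range(num_lanes)}
        # Heaviest channels first; the first num_lanes each land on an empty lane.
        for name, load in sorted(channel_loads.items(), key=lambda kv: -kv[1]):
            target = min(lane_load, key=lane_load.get)   # least loaded lane
            lanes[target].append(name)
            lane_load[target] += load
        return lanes, max(lane_load.values())  # max lane load bounds performance

    # Hypothetical profiled loads for five channels spread over three lanes.
    loads = {"tokens": 90, "mem_req": 70, "mem_rsp": 65, "debug": 10, "stats": 5}
    print(allocate_lanes(loads, num_lanes=3))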
>>: Is that distinction just [indiscernible]?

>> Elliott Fleming: Yes, burstiness, right.

>>: So how are you capturing burstiness?

>> Elliott Fleming: We're not.

>>: That's where I was going.

>> Elliott Fleming: Yeah, we're not capturing burstiness. So obviously, you know, if the total program run time is something up here, then the rate may be low, but you may have burstiness, and that might perturb your router architecture, but I'm not trying to capture that at this time.

Okay. So anyway, basically, what we'll do is we'll take our heaviest loaded lanes -- heaviest loaded channels -- and synthesize lanes for them, right. So one, two, three for the three heaviest loaded channels. Then we'll allocate those heavily loaded channels to the lanes. Now, with the remaining channels, we'll try to load balance, allocating each channel to the least loaded lane. So now we put this one here and we'll put this one here, and so on. And what we've got is basically load balancing. So on average, the total amount of traffic across each lane is more or less equal. Okay. And it --

>>: Question: are you taking into account the width as well in this allocation?

>> Elliott Fleming: Yes, yes. Because although we're not doing it here, if we put this fat channel on a narrow lane, then its traffic will change. So yes, we actually do account for that. So when we make the choice, we change -- so you can, of course, because you've kind of statically allocated the widths, see how much traffic will be across each lane. Yeah?

>>: So this is a [indiscernible] simulation, right?

>> Elliott Fleming: It's [indiscernible]. That's right.

>>: And that depends on how you set it up?

>> Elliott Fleming: Yes. Yes, so it's workload dependent, absolutely.

>>: So is it simulated on a non-FPGA -- not on the FPGA?

>> Elliott Fleming: Oh, no. You could simulate it, or [indiscernible] will instrument all the channels for you and find all the loads.

>>: So you run it and then do a remapping?

>> Elliott Fleming: That's right. So basically, you can -- you're always free to synthesize a crappy network, and ideally, the loads will not change very much. Or you could do it in simulation; of course, that's an infinite capacity FPGA. Although generally speaking, most of these designs are of a sufficient size that simulation is not your most attractive option, because you can't really run a large enough workload.

>>: Are you going to talk about topological mapping of the FPGA networks?

>> Elliott Fleming: What do you mean by that? So this idea of perhaps adding route-throughs to handle strange physical topology routes to logical topologies?

>>: Strange is perhaps pejorative. You're assuming that your producers and your consumers are adjacent FPGAs.

>> Elliott Fleming: No, not at all.

>>: Okay. So producers and consumers are not adjacent FPGAs --
So basically what would happen is you bounce -- what's that? >>: [indiscernible]. >> Elliott Fleming: Right. >>: You in some sense turn it into a statically-routed network. better off if there are no dynamically routable paths. It would be >> Elliott Fleming: You could have a better implementation, perhaps, if you had some capability to dynamic load balancing. But, you know ->>: It's like we talked about this this morning. If I have a failure and I want to remap, you know, a FPGA's rolled to another FPGA, now all my routes through the network change and the virtual allocation changes and it doesn't sound to me like you've provisioned for that dynamically. >> Elliott Fleming: No, no, no. But it's again the virtual channels are so cheap that it wouldn't be beyond the realm of possibility to have spares. Again, these things are very inexpensive. The cost of a new virtual channel is the cost of adding extra space in an SRAM and the SRAM has, you know, 64 22 kilobytes of space. of space. Once you have one of them, right, you actually have a lot >>: I think maybe for the problems you've allocated the packet sizes are relatively small. But if you start having large packets running through the virtual channels [indiscernible] get really high. >> Elliott Fleming: >>: Maybe. But the packet is four kilobytes, for example. >> Elliott Fleming: Of course, you could break that packet up into chunks and just do channel allocation on the chunks. You could certainly do that. Yeah. And, in fact that's what we would do. We'd marshal it and do full control on the marshals. >>: The number of [indiscernible] channels, do you see the [indiscernible] channel which one you send out, do you see the effect? >> Elliott Fleming: That's right, it does go up, and, in fact, in the virtual channels, you can choose different architectures. So we have several layers of pipelining. So, of course, if you have a handful of virtual channel, you get single cycle scheduling. Otherwise, you have to do one or two cycle scheduling. But two cycle scheduling goes up to several hundred lanes. So it's scaleable. I mean, you could even if you wanted to add a third level of hierarchy there, but it's just dropping it in, right, and the compiler could choose based on the number of virtual channels, yeah. Okay. So what happens when we do this optimization to a real program that is HAsim? So here we have the naive implementation of HAsim, right. And again up is good. So up is aggregate MIPs for the simulation. And so when we do this longest job first algorithm, we do get some ten percent performance gain. Here, what we've done is eliminated sort of collisions between packets. So in HAsim, there were perhaps tokens being generated simultaneously and having more lanes removed some of that effect. I'll point out that HAsim is not communications bound. So it uses only about a third of the bandwidth between FPGAs. So that's why we don't get some higher through-put, because HAsim actually isn't stressing the network. 23 If, of course, you use some kernel which is, in fact, producing a large amount of traffic, then you will get linear speed-up issues scale the number of lanes. As you might expect. Okay. So now we've talked about sort of how we synthesize the inter-FPGA network and how we actually describe and implement zips that can be partitioned across FPGAs. Now let's talk about a couple of examples. So we'll get two case studies. Airblue, the wireless transceiver, and HAsim, a simulation framework for modeling multi-cores. 
So the basic idea of Airblue is we want to implement wireless transceivers such that we can operate on the air with commodity equipment to test out new protocol ideas. So this works well if the protocol is something simple like 802.11g, but newer protocols, particularly those with MIMO, of course, require much more area, and so they don't fit onto a single FPGA; that also comes with the need for multiple antennas. So what do we do? We just throw another FPGA at the problem, so we go from one FPGA on the front end to two FPGAs on the front end. It's that simple.

Okay. So the baseline 802.11g implementation looks like this. You've got a TX pipeline and an RX pipeline. What we want to do is implement some new algorithm, spinal codes. Spinal codes is a new error correction algorithm. Okay. The problem is it's much larger than the existing -- it's new. So it was at Sigcomm in August. It's actually quite good. It's actually better than turbo codes in those respects.

>>: [indiscernible].

>> Elliott Fleming: Oh, yeah, you know, maybe we can talk about how the name [indiscernible] came about. It wasn't my choice. I always think of Spinal Tap. So anyway, basically, the problem with this code is that, as good as it might be, it's much larger than the turbo code, and so we exhaust the area of the single FPGA.

>>: Does [indiscernible] mean anything until you start playing with the rate or frequency?

>> Elliott Fleming: That's right. So basically, it's the part of the wireless transceiver between the RF and the MAC, working on packets. So it's the thing that's taking that baseband signal and turning it into packets, with error correction and various other algorithms running. Okay?

So anyway, of course, as you might expect, we simply partition across two FPGAs. These little FIFOs here are latency insensitive channels, and no source code modification is required, right. So that same design that you would map in simulation, you can map across two FPGAs and meet the high level protocol timings -- again, the high level protocol timings being at the scale of tens of microseconds. So the latency of the inter-FPGA interconnect is not a problem. And, of course, also because this is a largely flow-through pipeline with a tiny amount of feedback here, you would expect that we would have no problem with feedback latency. Okay?

So the second thing we're going to do with Airblue is actually simulation. So often when we're evaluating protocols, we care about operating points at bit error rates of one in a billion. Of course, you know, if you want to test that operating point, you need to generate billions and billions and billions of bits. Which, of course, is a problem in software, because the software simulator is running at kilobits, and, of course, the FPGA is running at megabits. So by choosing the FPGA, we run a thousand times faster. Of course, we can implement this on one FPGA, so we can simulate on one FPGA; the question is why would we want two. The reason you want two is because the tools can actually find better implementations. So what we can do, when we take a simulator and partition it across two FPGAs, even though it fits on one FPGA, is get speedup. So here what we show is speedup relative to a single FPGA implementation. Most of the speedup comes from clock frequency improvements. So we just take the part under test and we amp up its clock frequency as high as possible, and this gives us a faster simulator. Okay?
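As a rough back-of-the-envelope for why the FPGA matters at these operating points (the rates and error count below are illustrative assumptions, not measurements from the talk): seeing on the order of a hundred errors at a bit error rate of one in a billion takes roughly 10^11 bits, which is months at software kilobit rates but only hours at FPGA megabit rates.

    target_ber = 1e-9        # operating point of interest
    errors_wanted = 100      # rough count for a stable estimate (assumption)
    bits_needed = errors_wanted / target_ber   # ~1e11 bits

    for name, rate_bps in [("software, ~10 kbit/s", 1e4),
                           ("FPGA, ~10 Mbit/s", 1e7)]:
        days = bits_needed / rate_bps / 86400
        print(f"{name}: about {days:.1f} days of simulation")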
So in summary, basically, Airblue and wireless pipelines in general are these deep pipelines with infrequent feedback, and at the protocol level, we only care about ten microsecond timings, so this is an ideal solution to sort of take a prototype wireless transceiver and actually get it to work on the air. Okay?

So now let's talk about something with a little bit more complicated communications graph. That is, the processor simulator HAsim. So what is HAsim? HAsim allows you to basically simulate complex multi-cores: full cache hierarchy, out of order, and cycle accurate. So one key point about HAsim is that it's time multiplexed, which means that we don't -- say we're simulating a 64 core processor. We don't lay out 64 cores. We lay out a single compute pipeline and multiplex it among all of the cores. This is like SMT. Okay. And, of course, with that approach, it's very easy to parameterize the design for scalability. Of course, HAsim can go anywhere from one core to ten thousand cores. The question is whether or not you can actually implement it on the FPGA.

>>: [indiscernible] you're time slicing architectural state on the underlying substrate. You're not dynamically provisioning [indiscernible].

>> Elliott Fleming: That's right.

>>: It's much more like [indiscernible].

>> Elliott Fleming: Okay. So anyway, it is multi-threaded, and, of course, it has a complex communications graph and lots of feedback. So it's different than the wireless pipeline in the sense that all of these ports are communicating, and they're communicating almost constantly, although the time multiplexing is going to help us cover some of the latencies.

So what happens when you map HAsim to multiple FPGAs? Well, the first thing to notice is that on one FPGA, we can map 16 cores, and then on two FPGAs, we can map more than a hundred. Again, this is because in HAsim, this time multiplexing means that we're not replicating the entire structure of the processor to add another core. We're only adding some state. So there's a big constant cost to building a core model, but the cost of adding a new core is not so high. And that's why we get this highly nonlinear scaling.

>>: So if you look at a [indiscernible] microprocessor, most of the area is devoted to micro architectural state, whether it's branch predictor, [indiscernible] tables, buffers, caches. And very little of it, relatively, is control state.
The more cores you have, the more misses you will take. But the size in terms of FPGA area is not changing. >>: Right. >> Elliott Fleming: >>: Okay. Yep? [indiscernible] meaning they are using DDR 3? >> Elliott Fleming: Yes. So I'll actually talk about how the memory hierarchy works in detail in a few slides. It's actually very interesting. But we'll get there in a couple of slides. Looks like we've got plenty of time to do so. This is actually the first talk that I've made it this far in this amount of time. 27 >>: Before you move on, what are the different dual FPGA -- >> Elliott Fleming: Again, so we mentioned that as we add cores, we increase the amount of implementation area, but that has impact on clock frequency. So the more things you try to stuff on the FPGA, typically the worse that the tools do. Again, we're not -- we're just naive users of the tools. We're not trying to floor plan everything. So we just take whatever frequency is given to us by the as to. So basically, what happens is let's take this bar, for example. So this is, say, 36 cores. So either a 64, a maximum 64 implementation or a maximum 128 implementation can handle this model. It's just because the maximum 64 implementation is smaller, you get a higher clock frequency and so you get some performance benefit as a result. Okay? So one last thing to note is how much performance you lose going from one FPGA to two FPGAs. So basically, it's these two bars here, right, so the gray bar is a single FPGA implementation. When we go to two FPGAs, we lose at most maybe half of our performance. This is already much, much better than the traditional tools which would lose maybe an order of magnitude or more in terms of performance. Okay? And, of course, as we scale the number of cores we can cover more latency and so our performance comes back up, right? So in summary, single FPGA gets filled at 16 cores. But with multiple FPGAs, we can go to 128 and actually we're trying to build a thousand core processor on some -- Richard, yes? >>: So you mentioned [indiscernible]. >> Elliott Fleming: We never attempted to run commercial tools, in part because we think that that's going to require major surgery. So the commercial tools are not quite to easy to use. They usually require that you do some modification to your RTLs. Anyway, so and also, of course, you have to buy a box that costs a lot of money. The emulator boxes are not cheap. >>: I'm assuming Intel might have helped you there. 28 >> Elliott Fleming: Yeah, we talked about it and we decided, you know, it wasn't a productive exercise. Yes? >>: So in some sense, your previous graph here, it's not necessarily when people make certain [indiscernible] that first they have a through-put requirement and then they build hardware [indiscernible]. >> Elliott Fleming: Sure. >>: This is sort of clouding making that sort of design space a little cloudy because of the fact that [indiscernible]. How would you see [indiscernible]? >> Elliott Fleming: So I'm a big believer in getting a system to work and then understanding its bottlenecks before trying to optimize. You have a through-put target, but it's very hard to know where bottlenecks in the system are, particularly a new system without having something actually working. I view this tool as first and foremost enabling implementation. So it's entirely possible that we'll get something to work here and we'll discover that there's a bottle neck. 
Where the bottle neck might be could, I mean, my feeling is probably, you know, the compiler is not going to produce the bottle neck. That there will be some intrinsic bottle neck either in the inter-FPGA through-put or maybe in memory or something like that. And then we go solve that, right. That's just my approach to problems in general. Get something to work first and then debug later. So I also ask the question if we have this requirement of running all these channels between FPGAs, would the architecture that you hand code be substantially different from the ones the compiler's producing automatically for you. So I think that's another way to look at the problem, right. And I think if you consider it that way, the answer is probably not, that at the end of the day, you're going to be building this router infrastructure anyway. It's just you're going to have to go through the pain of debugging it by hand. Moreover if you make any slight perturbation to it, you'll have to rework the whole system. So it may be the case you can do the kind of longest job first optimization that I'm advocating, but I'd hate to have to write that code myself. Something like that. 29 >>: Yeah, but how much [indiscernible] what you call the driver? that or do you use another machine, how much work -- Do you use >> Elliott Fleming: So it's actually very simple, right so we abstract that layer as just being a FIFO. So if you look at, for example -- so we used these high speed inter-FPGA transceivers, right, so basically all we have to do is get the core code, test it out, make sure it runs and abstract it as a FIFO and feed it into the compiler. So actually, it's quite straightforward. If you look at something like PCI, respect, going between host and FPGA, that's a little more complicated. But the end of the day, it's still a FIFO as well. And it's just multiplexing on top of that FIFO. I mean, at the end of the day, if you look at the drivers that you're writing, this is what they look like. And it's not clear to me that you're going to do better than this. Age, of course, if you are doing better, there's probably a way to generalize what you're doing, feed it in here, right. I mean, generated router is just a phase of the compiler, right, so you could easily come up with a new router architecture and test it out, right. Again, that's the advantage of the compilers. easier, anyway. It makes things like that So now let's talk about resources in multiple FPGAs. So again, I mentioned this in the very beginning, if you remember all the way back, what we get when we get more than one FPGA is access to more -- most obviously more slice, but we also get more access to memory. And there's this analogy to multi-processing here, right, where if we have two cores and two threads, right, both threads get a full cache hierarchy, at least parts of the cache hierarchy so they run faster. What we need in the FPGA is an abstraction to sort of allow our FPGA programs, our HDLs to exploit these resources. So we need an abstraction layer between us and the physical devices. So what I've shown you to this point is abstraction for communications, right, these channels extract the communication between FPGAs. Nows I'm going to talk about abstracting memory in FPGAs. 30 So basically, what we've got is this very simple interface and this is how we'll do memory in our designs so just like a BRAM, you've got read request, read response and write. So very simple interface, right. Is that clear to everybody? So a few points about it. 
So now let's talk about resources in multiple FPGAs. So again, I mentioned this in the very beginning, if you remember all the way back: what we get when we have more than one FPGA is access to more -- most obviously more slices, but we also get more access to memory. And there's this analogy to multiprocessing here, right, where if we have two cores and two threads, right, both threads get a full cache hierarchy, or at least parts of the cache hierarchy, so they run faster. What we need in the FPGA is an abstraction to sort of allow our FPGA programs, our HDL, to exploit these resources. So we need an abstraction layer between us and the physical devices. What I've shown you to this point is an abstraction for communications, right; these channels abstract the communication between FPGAs. Now I'm going to talk about abstracting memory in FPGAs.

So basically, what we've got is this very simple interface, and this is how we'll do memory in our designs: just like a BRAM, you've got read request, read response and write. So very simple interface, right. Is that clear to everybody? So a few points about it.

One, we have an unlimited address space, right. This will permit us to specify any size that we'd like, even if that size doesn't fit on an FPGA. So if you want to specify 32 gig of space, you probably don't have 32 gig of DRAM, but you can still write that down. And we'll provide a virtualization infrastructure to back that storage space.

>>: The freedom to run as slow as you want.

>> Elliott Fleming: Sure. Okay. You also have arbitrary data size, of course, so again it's a parametric interface. If you want 64 bit words or 36 bit words, we'll generate the marshalling logic for it. And then finally, again, as I've said before, it's latency insensitive. You aren't writing a program assuming that there's going to be some fixed latency between the read request and the read response, right. So you actually write your program in a way that basically decouples the read request and read response, okay?

So how does this look on a single FPGA? Each one of these is a memory client. We'll aggregate them all together on a ring and feed them into the on-board memory. So the first thing you'll do is have an L1 cache here. If you miss out of the L1 cache, you'll go to the on-board memory, which will be DRAM or SRAM, depending on your board. Finally, if you miss out of that cache, you'll go to host memory, okay? So host memory is what will take care of this arbitrarily large address space in the case that you need it.

>>: Host memory as in the PC --

>> Elliott Fleming: We'll assume that there's a server attached.

>>: I see. So it's not local DRAM attached?

>> Elliott Fleming: So we use the local DRAM as an L2 cache. So basically, the flow will be something like this. You make a request to your local BRAM cache. You may miss. If you do miss, you'll scurry off to the board-level resource, which will be shared among all the clients. If you miss there, then you'll go back to host virtual memory. And again, it depends on your address space, how big it is and how much data you're accessing. But again, the point here is that if you need the large address space, we give you the ability to describe that. If you don't need it, say, you know, you say I need, in aggregate, a gig of memory, you will never miss in this cache, for example. Okay?

>>: What is the use case, the motivating case, that would be going down to the --

>> Elliott Fleming: I think the most obvious use case is portability, so that's the first thing. I've given you an abstract interface and I can build you a memory hierarchy on any board you want to implement on, including a board that doesn't even have memory, right. The problem comes in hardware designs frequently when you bake in assumptions about the underlying infrastructure to which you're mapping, and suddenly that infrastructure gets pulled out from under you, either because you build the next generation or maybe because you have to do something like move between boards. At that point, if you baked in some timing assumption that isn't true anymore, you've got to rework all your code. And that's a big problem.

Now, again, you're trading something for the abstraction, perhaps, right. We're introducing all of these layers, so maybe you add a little bit of latency. That's certainly true. But the latency and whatever performance loss buy what I view as something very important, you know: abstraction and portability. So I can frame a design in terms of these caches, and whatever the platform, whatever FPGA it is, it doesn't matter if it's [indiscernible] or Altera or what generation it is, I can run that design on any board, and that's pretty powerful.
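As a rough software model of the read-request / read-response / write interface just described, the sketch below (C++, with hypothetical names readReq, readRsp, and write) captures the decoupled, latency-insensitive contract. It collapses the whole L1 BRAM / board DRAM / host memory hierarchy into a single map, so it is only an illustration of the interface, not LEAP's actual API or caching behavior.

```cpp
// Hypothetical sketch, not LEAP's API: a software model of the scratchpad
// interface from the talk -- read request, read response, write -- with the
// whole L1 / board DRAM / host-memory hierarchy collapsed into one map.
#include <cstdint>
#include <deque>
#include <iostream>
#include <unordered_map>

template <typename Addr, typename Data>
class Scratchpad {
public:
    // Issue a read; the response arrives later, with no fixed latency promised.
    void readReq(Addr a) { pending_.push_back(a); }

    // Pull the next response. The client is written so that it never assumes
    // how many cycles separate the request from this response.
    Data readRsp() {
        Addr a = pending_.front();
        pending_.pop_front();
        return store_[a];  // untouched addresses read back a default Data
    }

    void write(Addr a, Data d) { store_[a] = d; }

private:
    std::deque<Addr> pending_;              // outstanding read requests, in order
    std::unordered_map<Addr, Data> store_;  // stands in for the whole hierarchy
};

int main() {
    // "Unlimited" address space: declare 64-bit addresses even if no board
    // actually has that much memory behind them.
    Scratchpad<uint64_t, uint64_t> mem;
    mem.write(0x123456789ULL, 42);
    mem.readReq(0x123456789ULL);
    mem.readReq(0xdeadbeefULL);  // never written: reads back 0
    std::cout << mem.readRsp() << " " << mem.readRsp() << "\n";  // prints 42 0
}
```

The client sees an arbitrarily large, parametrically sized memory; where the data actually lives at any moment is the compiler's and the hierarchy's concern.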
>>: [inaudible] going through the effort of making [indiscernible]. I would hope -- first of all --

>> Elliott Fleming: Again --

>>: Trying to develop this system.

>> Elliott Fleming: Again, performance is critical here, right. And I don't know that we're trading a lot in terms of performance. Of course, I haven't ever done the study. Again, I can tell you what it would look like if you were doing this yourself in hardware. Would you have an L1 cache here? Maybe. We also have a way of eliminating the L1 cache, so if you want to go directly to the DRAM, you can certainly do that. We give that as an option.

>>: [indiscernible].

>> Elliott Fleming: No, not having read those things, I can't.

>>: Okay, you can synthesize in cache [indiscernible] DRAMs.

>> Elliott Fleming: Right.

>>: And distributing around. I don't remember if he [indiscernible].

>> Elliott Fleming: I don't know. So anyway, I mean, whatever technology you have to generate L1s is certainly useful here. I'll say that much. But again, the idea is that we're providing an abstraction layer. And that's going to be important, again, right: unlimited address space, fast local caches.

And what happens when we map a design across multiple FPGAs, right? So here are two things that are happening, right. Again, when we have multiple FPGAs, the boards may be homogeneous, and again maybe they're not, so we want some portability of design, right. I mean, intrinsically, we've already said we're going to have multiple FPGAs. So we expect asymmetry, right. And, in fact, we may not even know what pieces we're mapping to what boards. So it doesn't make a lot of sense to fix things: the more you fix a piece of a design to a board, the less of this automation can actually happen. So what are some cases that can happen here?

One, we automatically route clients to the nearest cache, even if it's on a different board. So here, we have clients that are sitting on a board that doesn't have an L2, and they will simply route to the closest L2, even if it's on another board.

>>: I have a question.

>> Elliott Fleming: Yeah.

>>: So you have all these [indiscernible] FPGA [indiscernible] why do you want to build a central cache as opposed to a distributed cache?

>> Elliott Fleming: Right, so remember that each of these clients has its own BRAM L1 cache. And those things can soak up all the resources on the board. In fact, we're working on an algorithm now wherein we do area estimates for placement. So we place the design on the board, we look and see how much BRAM is left over after the user design, and we just scale up all the caches to soak it up.

>>: Didn't you say each client has a cache, is that a single cache?

>> Elliott Fleming: Yes, a private L1 cache. Each one.

>>: But still, I'm trying to understand why you want to have -- so in a processor, there's a single cache because you want to have a limited number, of course. But on an FPGA, there are always these BRAMs with all these ports. Don't you think that having a single cache, as opposed to multiple ones that you can access in parallel, is --

>> Elliott Fleming: That might be a fine architecture. So if I understand you correctly, you're asking why it is that I don't just give each one of these guys a place in a shared chip-level BRAM cache.
In fact, that's one option. We do have a BRAM central cache that you can use. Of course, the scalability and clock frequency issues there are pretty obvious, right. That is, once you have a resource that's being used by a bunch of guys, the multiplexing logic can be problematic. But that's a perfectly valid implementation. We'll lay out a half meg or a meg of BRAM cache and you can use that as your shared L2 if you'd like, right. But again, you'll never get the clock frequency with that that you can get with the local caches, right. I mean, if you have to run wires all over the chip, you have to run wires all over the chip and it's going to be slow. So there is that trade-off to be considered too. And as I mentioned, we do give an option where you can disable the L1 caches if for some reason you don't want to pay the latency. For example, if you have a streaming workload, or you know that you're never going to hit in an L1 for some other reason, you can just eliminate that cache.

>>: Is it safe to say you punted on coherence?

>> Elliott Fleming: At this point we have, so these are independent address spaces. Although what we're working on now is -- so ideally, let me tell you how coherence works in my mind, if I can just get a junior grad student to work on it, right. So basically, what you would do is, when you declare a scratchpad, you would also specify a coherence domain. You would say I want this set of scratchpads to be coherent, and you would synthesize some directory-based protocol on top of those specific scratchpads. Again, the approach is automatic synthesis, right, but it would be some kind of directory-based protocol sharing the space. Again, the applications that we're considering are primarily these DSP algorithms, although HAsim is starting to run into the coherence problem now, which is why we're spec'ing it out: what we want to do is basically slice that thing across 16 FPGAs, and, of course, then suddenly the functional memory has a coherence problem, right. So we will be synthesizing coherence algorithms soon, I expect.

>>: So the design space [indiscernible] that you mentioned before, that's a highly manual process. Can you see, is there some sort of plan that you can see to keep [indiscernible], or is that something that --

>> Elliott Fleming: So in some sense what I just described to you, with inflating the cache sizes, is that design space exploration constrained to the memory subsystem. So I think if you could express parameterizations and maybe their relationships, you could look to maybe have some machine assistance. So if the compiler throws down an algorithm, and that algorithm has some parameter by which it could be scaled, and it turns out that there's area left on the FPGA, we could easily scale it up. So generally speaking, I think modeling this as some kind of linear system -- or I don't know if it would work as a linear system, but that's how I'd approximate it -- something like Pecora, a thing from your colleagues down south, is an interesting approach to this idea of allocating area to different pieces. But yeah, I mean, if a compiler knows how to scale, then certainly you could maybe have some assistance in that.

So anyway, what do scratchpads do? Basically, like a processor architecture; we've all seen this diagram before in general purpose processors. We have a bunch of plateaus, so we have some plateau at L1 where we get lots of hits. Oh, so this is stride versus working set size, and up is bandwidth. So up is good.
These values are hitting in the L1 cache, and then we kind of start -- we have some region where we're sitting in the central cache and some region where we miss out [indiscernible] and our performance sucks. But it still works, right? So just like in a processor, if you have to go to paging, then your performance will be horrific. And that's just the way it is. So write programs with good locality, I guess. So what else can we do with memory? Somebody else have a question? Yeah.

>>: Why is this -- why is there [indiscernible] on that very last --

>> Elliott Fleming: This data here?

>>: All of those.

>> Elliott Fleming: So this data comes from the way in which we do our L1 caches. So we actually do skew caches. Unfortunately, with a skew cache, sometimes you get collisions, and some of these patterns produce pathological collisions in the skew.

>>: When you say skew, you mean skew associative?

>> Elliott Fleming: No, these are direct mapped. So skewed direct-mapped caches.

>>: Sounds like an oxymoron. So what is the skew?

>> Elliott Fleming: We hash the addresses into the cache. We just do a straight hash.

>>: Any cache is hashed.

>> Elliott Fleming: Yeah, yeah, yeah, but it's not a linear hash, right. We do some XORs. So there's some hash function perturbing the index of the cache. So it's a skew cache.

>>: The literature benefit of skew-associative caches was reducing the probability of hot sets, so why would you put the skew function in instead of a standard, you know, just [indiscernible] hash function?

>> Elliott Fleming: Well, for a simple reason, actually. If you do that in HAsim, wherein you're frequently doing indexing based on some processor enumeration, it can be the case that you actually do get a hot set on a single block in the cache, right. So you still have a hot set in these things, right; it's just one entry. So if you have many addresses hitting the one entry, right, we just want to provide some randomization there. That's all. Although we are now working on doing set-associative L1s, but we'll still do skewing there too. Okay. So anyway, that's how this works.
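A minimal sketch of the skewed direct-mapped indexing being described here, in C++: instead of taking the low-order line-address bits directly, higher address bits are XOR-folded into the index so that enumeration- or stride-based patterns don't pile onto one set. The 1024-entry size and the particular fold are illustrative assumptions, not the hash LEAP actually uses.

```cpp
// Hypothetical sketch of skewed direct-mapped cache indexing.
#include <cstdint>
#include <iostream>

constexpr uint32_t kIndexBits = 10;                 // 1024-entry direct-mapped cache
constexpr uint32_t kMask = (1u << kIndexBits) - 1;

// Plain direct-mapped index: low bits of the line address.
uint32_t plainIndex(uint64_t lineAddr) { return lineAddr & kMask; }

// Skewed index: XOR a couple of higher bit fields into the low bits.
// Any cheap mixing function works; this particular fold is just an example.
uint32_t skewIndex(uint64_t lineAddr) {
    return (lineAddr ^ (lineAddr >> kIndexBits) ^ (lineAddr >> (2 * kIndexBits))) & kMask;
}

int main() {
    // Addresses strided by exactly the cache size: a pathological hot set
    // for the plain index, spread out by the skewed one.
    for (int i = 0; i < 4; ++i) {
        uint64_t a = static_cast<uint64_t>(i) << kIndexBits;   // stride of 1024 lines
        std::cout << plainIndex(a) << " vs " << skewIndex(a) << "\n";
    }
    // plain prints 0 0 0 0; skew prints 0 1 2 3
}
```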
So now, you know, we mentioned that we get more memory, so we can actually increase the size of the cache. We can also introduce new algorithms. So one of the things that we can do in terms of introducing new algorithms to a memory hierarchy is adding prefetching. Again, we're optimizing under the memory abstraction. So the user says, I have this read request, read response, write interface, and the job of the compiler is to soak up FPGA resources to make that thing as fast as possible. So what we're going to do here is add prefetching.

So how does this work? If you remember back to architecture class, you have some table, which you index into based on PC. And then if you have sort of a stride, right, a stride pattern on a particular PC, then you'll try to prefetch out in front of that stride to get a new value to cover some of the memory latency. So we can do the same thing in the FPGA, although there's one key difference. And what is that key difference, anybody? One of these fields is not right. There's no PC. So there's no PC in the FPGA, so we can't use that as a hint. So it turns out that [indiscernible] programs don't have a PC, which might make prefetching a little bit more difficult.

However, balancing this difficulty is the fact that hardware programs have a much cleaner access pattern. If you think about how a software program works, right, you're passing data through the stack. You're constantly inserting new accesses to memory, which really have nothing to do with the data flow of the program and everything to do with, sort of, stack frames and function calls. So that could screw up the prefetcher. But in hardware, of course, we don't have that problem. It's just a clean access stream. So even without a PC, we can actually do a pretty good job of prefetching. So the idea here is that if we have extra resources on the FPGA, we can add more complexity in the compiler, under the abstraction, and hopefully get more performance.

So how does that work? So basically, here we have an implementation of matrix multiplication, and this is run time normalized to a single -- I'm sorry, to an implementation without prefetching. So, of course, we do matrix multiplication, which has a very predictable access pattern of the kind the prefetcher should do a very good job with, and we find that, depending on the size of the matrix multiplication, size going this way, we actually get a lot of performance benefit with prefetching. And again, this comes because the original hardware wasn't doing a very good job of handling the edge cases in the matrix multiplication algorithm. Of course, as the matrix gets larger, the edge conditions are less important. We spend more time running down the rows, and so the benefit of prefetching is much less, although it is still measurable. But for small matrices, 64 by 64, the performance gain is enormous, again because we hide the latency.

So that's great, but matrix multiplication is something we should certainly be able to prefetch. So what happens with H264? So H264 has a data dependent access pattern. Still predictable, and we do pretty well here. Again, we get a maximum of about a 20 percent performance gain with prefetching. And again, these are codes that have existed for many years. We wrote papers about them and we're just using them as workloads. Again, because we frame them in terms of the memory abstraction, we can actually extract performance as we improve the algorithms in the cache hierarchy.
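A minimal sketch of the PC-less stride detection just described, in C++: with no program counter to index a table, the prefetcher simply watches the single clean address stream from the client and, once the same stride repeats, issues a speculative read one stride ahead. The single-stream assumption, the confidence rule, and the prefetch distance are illustrative, not the actual prefetcher's parameters.

```cpp
// Hypothetical sketch of PC-less stride prefetching over one address stream.
#include <cstdint>
#include <iostream>
#include <optional>
#include <string>

class StridePrefetcher {
public:
    // Observe a demand read; optionally return an address worth prefetching.
    std::optional<uint64_t> onReadReq(uint64_t addr) {
        std::optional<uint64_t> prefetch;
        if (haveLast_) {
            int64_t stride = static_cast<int64_t>(addr) - static_cast<int64_t>(last_);
            if (haveStride_ && stride == stride_ && stride != 0) {
                prefetch = addr + stride;   // stride seen twice: fetch ahead of the stream
            }
            stride_ = stride;
            haveStride_ = true;
        }
        last_ = addr;
        haveLast_ = true;
        return prefetch;
    }

private:
    uint64_t last_ = 0;
    int64_t stride_ = 0;
    bool haveLast_ = false;
    bool haveStride_ = false;
};

int main() {
    StridePrefetcher pf;
    for (uint64_t a : {100, 108, 116, 124}) {           // stride-by-8 demand stream
        auto p = pf.onReadReq(a);
        std::cout << a << (p ? " -> prefetch " + std::to_string(*p) : "") << "\n";
    }
    // 116 and 124 each trigger a prefetch one stride (8 bytes) ahead.
}
```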
Okay. So in conclusion, latency insensitive channels enable automatic multiple-FPGA implementation with minimal user intervention. And when we do such an implementation, we can get higher performance, better algorithm scaling, and, although I didn't talk about it, faster compilation. And the high-level take-away here is that high-level abstractions enable powerful automatic tools. And as we move to more complex systems, these kinds of tools will be necessary, right. It can no longer be solely the purview of the hardware designer to produce these implementations.

Future work. So, place and route. As those of us working with FPGAs know, place and route is taking a long time, and it continues to scale up as the FPGAs get larger. However, the latency insensitive modules provide a way of dealing with this, in that you can place and route the latency insensitive modules independently and then synthesize the network between them. The general idea is that long wires with long latencies, which you would get with this sort of distributed approach, can be broken with register stages, much in the same way that we use buffer boxes in an ASIC process. Additionally, right, if we only have to recompile part of the design, then network resynthesis should be cheaper than full-chip resynthesis. And then finally, hardware/software communication: we already do this a little bit, but I'd like to formalize it a bit more.

So again, the problem with hardware and software communicating is that hardware -- sorry, software is a nondeterministic thing in terms of timing, but generally speaking, the latency insensitive channels allow us to capture this. So basically, you could have latency insensitive channels in software, latency insensitive channels in hardware, some ensemble program composed of pieces of software and pieces of hardware, and we would synthesize all of the communication between them. So that's how that would work. With that, I'll take questions.

>> Doug Burger: Thank you very much. Since we're at time, we may have time for one question. One or two questions. And if not --

>>: You started off saying that FPGA use [indiscernible] and yet spent 99 percent of your time talking about processors.

>> Elliott Fleming: Processor simulation is an application for FPGAs.

>>: No, because it doesn't [indiscernible].

>> Elliott Fleming: No, it absolutely does. It's something -- no, no, no, no. Let me explain to you more of how HAsim works.

>>: I'm just kidding you. [indiscernible] what you've been talking about, there's a disconnect, and you're into architecture and we are more -- I don't get the caches. Caches don't do anything for me.

>> Elliott Fleming: Well, maybe they do and maybe they don't. It just depends on what your access pattern is, right?

>>: What I would like to see, for example, is something that has [indiscernible] going through and then you get into [indiscernible].

>> Elliott Fleming: All of these things have multiple [indiscernible], right. All of these implementations are multiple [indiscernible]. In fact, what is nice about this approach is that once you have a latency insensitive module, you can pick a clock for each module, right, if it's beneficial. Again, you know, synchronizers change the timing behavior, but this model kind of intrinsically supports multiple clock domains.

>>: You spend all your time on timing insensitive, as-fast-as-possible applications.

>> Elliott Fleming: Sure. Well, I think Airblue is actually timing sensitive at the high level, right, because you have to produce a result in ten or 20 microseconds, and then it's a question of whether or not the latency between the FPGAs is tolerable. The same is true of the H264 encoder. I didn't talk about H264, mainly because I think the results are not terribly enlightening. They're no different than the wireless transceiver. I mean, the basic idea is, yeah, we can partition it; yeah, the bandwidth between the chips is sufficient; and yeah, we meet the millisecond or whatever timing, even with introducing this new latency component.

But, you know, I think what is different here -- let me address the processor simulation problem. So understand that the processor simulator is actually a hybrid design, because, of course, we're not modeling things like disk. We're not modeling things like most of the operating system on the FPGA. Actually, that stuff is running in the software simulator on top.

>>: [indiscernible]. I understand. The reflection is more like, all that work switches your mind to a certain set of problems and bandwidth. When you actually have an [indiscernible] with specific time requirements, et cetera, you end up finding a whole different set of problems for the tools. You get much closer to the tools.
You get to understand [indiscernible] constraints and your timing is quite [indiscernible]. You spend a lot of time, and then the difference between synthesis and simulation and the [indiscernible] becomes a problem. So that's more [indiscernible]. You phrase it as, this is good for application. [indiscernible]. But in fact it's not about --

>> Elliott Fleming: Well, I mean, I only talk about ASIC synthesis in the sense that this is what multiple FPGAs were used for in the past, right.

>>: I think more to the point is that HAsim represents a single -- a very specific application.

>> Elliott Fleming: Sure.

>>: An application with a certain set of constraints, a certain set of characteristics, and many other applications have different, widely different constraints.

>> Elliott Fleming: Sure, absolutely. Absolutely.

>>: System level behavior, bandwidth level behavior, that sort of thing.

>> Elliott Fleming: Sure, I don't doubt it. I would think that this would accommodate those designs, though.

>>: I think as we move forward into more automation, I think it's really important to understand these different spaces and think carefully about -- that's an easy statement to knock off, but, you know, there's a lot of FPGA intuition in the room that's not lining up with the approach that you're taking for many problems. So I think we should -- it will be important to understand those spaces.

>> Elliott Fleming: If you have to do things by hand, you have to do things by hand. I mean, you know, sometimes we have to write assembly too. And there's nothing in this that prevents that. It's just that if you can get away with not doing that, then it's probably best that we stay at the high level. But, of course, you know, you can't always live in that world.

>>: [indiscernible] it's really time consuming. So you spend more time [indiscernible], for example, as opposed to saying [indiscernible] see what happens. We're using as much as possible -- do you see what I'm saying? It's a different kind of thing [indiscernible] than if you're doing something else.

>> Elliott Fleming: Sure. I completely believe that any time you use a compiler, the compiler may well be bad, but I think the [indiscernible] history is that compilers inevitably get very good.

>>: I'm saying, I would like to see more of the compiler reported in what you've done.

>> Elliott Fleming: Sure. I mean, I completely agree that there is more work needed here, particularly on the quality of service front. But I think it is possible to model some of those things in the compiler and get good answers. I think. I mean, at least I think so. But we'd have to look at particular applications before we could reach some conclusion.

>> Doug Burger: All right. Thank you very much.