>> Ken Eguro: Good morning, I am Ken Eguro from the Embedded and Reconfigurable Computing Group here in Research Redmond. Today it is my pleasure to introduce Russ Tessier. Russ Tessier is an Associate Professor in Electrical and Computer Engineering at the University of Massachusetts, Amherst. He received his SM and PhD degrees in Electrical Engineering and Computer Science from MIT in 1992 and 1999, respectively. Russ has worked professionally at Altera and Mentor Graphics, among other companies. His interests include reconfigurable computing, embedded systems, and multicore processing. So, let’s welcome him. [applause] >> Russell Tessier: Thank you, Ken. Today I’m gonna talk about network virtualization using FPGAs. My background is primarily in the FPGA area, but about 3 or 4 years ago I started to get interested in networking and I started working with a faculty member, Lixin Gao, who is an expert in networking. We decided to start exploring sort of a new trend in networking: the idea of using routing hardware to serve different routing purposes at different times, so the same physical hardware can change its functionality over time to provide different network functionality. So today I’m gonna talk about how this goal can be achieved using Field Programmable Gate Arrays, okay. Before I start I’d like to acknowledge my colleagues; most of the work I’m gonna talk about today is the PhD work of my student, Deepak Unnikrishnan, another PhD student, Dong Yin, and Professor Lixin Gao from UMass. I’m gonna start off today with an overview of what network virtualization is. It’s a topic area that has gotten a lot of attention in the last 3 or 4 years. People have looked at different ways of making the Internet, and networking in general, more functional: being able to reuse the same hardware over and over for different purposes, and in some cases using the same hardware for different routing purposes at the same time.
Okay, so I’ll give some examples today and show how this can be done in hardware and also in software. One of the challenges of the FPGA community over the years has been identifying application areas that take advantage of the reconfigurability of FPGAs. It turns out that the application I’m gonna talk about today is particularly good for dynamic reconfiguration of the FPGA, the Field Programmable Gate Array device. I’ll show that for this work we used both complete device reconfiguration and partial reconfiguration at run time in order to serve some of our needs in virtual networking. I’ll describe a complete FPGA-based virtualization system, okay, that allows us to route packets at line rate for 1 gigabit per second Ethernet; we’re working on a faster technology currently. Finally, I’ll finish off with a discussion of how we went about making this particular implementation partially reconfigurable, and with some results. I’d also like to encourage everyone, if there are questions or comments that you’d like to make in the course of the talk, of course I’m used to giving lectures to students with lots of questions or clarifications, so feel free as I talk to ask questions if there are things that aren’t clear. Okay, virtual networking. Over the history of networking, for the most part the network routing hardware, the hardware that physically transmits the packets, has been fairly static, fairly fixed function. Okay, but as time progresses there’s a need for more diversity in the network. We’ve seen sort of an explosion of cloud computing, data centers, and a variety of other applications, with needs that present different bandwidth requirements, different security requirements, and different isolation requirements at different times.
Okay, so it’s desirable to have a system that can provide different functionalities, different sorts of quality of service, at different times without having to completely replace the hardware, and without having to come up with completely new protocols to be implemented at the system level. Over the last 3 or 4 years there’s been an explosion of interest in this idea of virtual networking: taking a physical piece of routing hardware and changing it over time to serve different functionalities. It doesn’t even necessarily need to be the hardware; it could also be the software which is being used to route packets in a system. In fact the system that I’m gonna talk about today includes both hardware and software as a platform for routing packets. We also view this as an opportunity for service providers: to make an investment in an infrastructure for networking and then go about changing and adapting the system over time. We think that can be an effective business model as people create different needs, different applications, and different protocols that can be implemented on the hardware, okay. So the solution to all of these issues, as I mentioned, is network virtualization: reusing the same hardware for different purposes at different times. The key question is, how can we go about providing the same quality of service, the same throughput, the same security, and the same isolation by using the same hardware, or potentially using the same hardware for multiple networks at the same time, without having them collide with each other? Okay, so the goal of this research is to investigate some of these issues. Another issue that we’re evaluating in this research is scalability.
Okay, how can I implement a large number of network routers, a large number of network dataplanes, on the same physical hardware and timeshare that hardware in a way that’s similar to how a microprocessor is shared across a variety of different applications on a standard PC? I’ll talk about some of the techniques that we’re using to do that today. So, network virtualization: we can think about a virtual network as a collection of nodes. Okay, so I have a picture of two virtual networks, a blue network and a red network. We can see here that some of the names of these nodes overlap and the topologies of these two networks are dramatically different. One is more of a point-to-point type connection between different nodes, while the other network has a little bit more connectivity, but in terms of providing the connectivity it’s possible to see that, in fact, both of these networks could be generated using the same hardware and the same physical links. So, for example, if I had a situation with a series of computers located down here and a series of routers represented by these boxes here, if I could come up with some way of sharing the routing resources, programming the resources in a way that forwards packets in a predictable or application-specific way, it would be possible to implement the system effectively. Now, about 3 or 4 years ago most folks implemented this type of system using software. I will talk about a software system that was developed at my school based on something called OpenVZ, which allows for kernel-level forwarding of packets and provides virtualization. When we looked at this 3 or 4 years ago a decision was made that, in fact, software is too slow for this type of application. So it’s desirable to put some hardware in the system but continue to allow it to be scalable.
Okay, allow it to handle not just one or two different virtual networks but a large number of virtual networks; in effect, close to an infinite number of virtual networks, as long as the various constraints of the networks are met. Okay, developing a system like this has a lot of benefits, obviously. There’s the goal of reducing costs: it’s not necessary to make investments in a large amount of routing hardware, since the same hardware can be used for multiple networks at the same time. Additionally, as protocols or techniques change it’s possible to update them so they can be used simultaneously in an independent fashion. Okay, so when I say virtual router what do I mean? Well, I’m gonna use the term virtual router in this talk, although perhaps a more accurate term would be virtual dataplane. The virtual router is really split into two parts: there’s a control plane, which defines how the packets are forwarded, and then there’s the dataplane, where the actual physical hardware determines where an incoming packet is sent. The focus of this talk is really on the dataplane, although I do use the word virtual router quite a bit. So we get this idea of a physical router which is being time multiplexed across multiple different virtual routers. For example, in this case the physical router has two physical interfaces, one for input and one for output, but the core hardware, whether processor, FPGA, or ASIC, is being time multiplexed at different times to perform different functionality. So for the example I had on the previous slide, where there were two networks, I might for example have two sets of hardware which are performing the routing for the two different networks.
The key parts of the router include some routing control and then also the forwarding table, the dataplane, which actually determines, when a packet arrives, where it’s sent as a destination, okay. In this particular picture I show two virtual networks; if this were implemented in software it’s common to have 10, 15, 20, or more virtual networks. In the FPGA work I’m gonna talk about today we were successful in implementing up to 5 virtual networks in a relatively small FPGA. In a much larger Virtex 5 FPGA it was possible to implement up to 30 different virtual networks, okay. So some of the challenges in implementing this, whether it’s in hardware or software, are making sure that there are sufficient resources in whatever the deployment device is for this particular technology. For example, in the FPGA the obvious constraint is the amount of hardware that’s available: if I run out of space in the forwarding table, if I run out of space in the logic, I can’t implement more virtual networks. There’s also an issue in terms of performance; I want to make sure that I can meet the constraints for all the networks at the same time, okay. If the hardware is being multiplexed there’s this idea of flexibility and also an idea of scalability, being able to change the network over time, okay. The place where the FPGA comes in, in this particular example, is that over time the different virtual networks will change. Some new virtual networks will be formed; other virtual networks will be taken away. The FPGA provides the capability to modify the physical hardware at different times to accommodate new virtual networks as they’re created or destroyed. We’ve developed a system that has the capability to do that. Okay, so for some of the traditional techniques for doing this there are a large variety of papers – yeah, question – >>: Can I ask a question on the previous slide? >> Russell Tessier: Yep.
>>: Is there one [indiscernible] number of [indiscernible]? >> Russell Tessier: For this particular work there is. We have not looked at cases where there are a different number of virtual networks and virtual routers. One could think of techniques where you could potentially have networks which had, you know, an individual network which had different parameters at different times, so that it wouldn’t necessarily be a one-to-one mapping, but – >>: Why, so why can’t you, for example [indiscernible] and one router? Why can’t the, so the packets from one of these networks, why can’t just the packets be different, right? Like here it says, this is a packet from network A and then you have one router that, you know, you can set [indiscernible] packets from this [indiscernible] service or whatever [indiscernible]. Why do we need two separate, you know, copies of the hardware? >> Russell Tessier: So that’s a good question. I think for this picture I sort of provided a logical viewpoint. In fact, in the actual hardware that we implement, a lot of the hardware is actually shared across the different networks. So for example in this case, I show here separate forwarding tables, but in fact the forwarding tables are implemented in the same physical memory and then it’s just a matter of indexing into the appropriate place. Okay, so most of the techniques that have been developed to implement virtual networking have been in software, with both kernel-mode and user-mode forwarding of packets. I will show a little bit later in the talk that in fact this is orders of magnitude slower than what an FPGA can do in terms of forwarding packets. There have been other works on FPGAs recently since we started our work, but in general most of the techniques involve either software or an ASIC approach.
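The shared-table idea in that answer, one physical memory holding every virtual network’s forwarding entries, with each network indexing into its own region, can be sketched as follows. This is an illustrative software model, not the actual hardware; the class and method names are invented for the example, and the longest-prefix match is done with a simple linear scan rather than the TCAM- or trie-style lookup real routers use.

```python
# Sketch (not the authors' actual hardware): one flat physical memory is
# logically partitioned so each virtual network indexes into its own region.
class SharedForwardingTable:
    def __init__(self, num_vnets, entries_per_vnet):
        self.entries_per_vnet = entries_per_vnet
        # Single shared memory holding all virtual networks' entries.
        self.memory = [None] * (num_vnets * entries_per_vnet)

    def _base(self, vnet_id):
        # Each virtual network owns a fixed slice of the shared memory.
        return vnet_id * self.entries_per_vnet

    def install(self, vnet_id, slot, prefix, next_hop):
        # prefix is a (network_address, prefix_length) pair for IPv4.
        self.memory[self._base(vnet_id) + slot] = (prefix, next_hop)

    def lookup(self, vnet_id, dest_ip):
        # Longest-prefix match restricted to this network's region only,
        # so networks sharing the memory cannot see each other's routes.
        base = self._base(vnet_id)
        best = None
        for entry in self.memory[base:base + self.entries_per_vnet]:
            if entry is None:
                continue
            (net, length), next_hop = entry
            if dest_ip >> (32 - length) == net >> (32 - length) and \
               (best is None or length > best[0]):
                best = (length, next_hop)
        return best[1] if best else None
```

Two networks can install the same prefix with different next hops and still stay isolated, which is the point of indexing into separate regions of one memory.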
In general, for obvious reasons, the software is limited both in terms of parallelism and in terms of some functionality due to hardware limitations, but it is obviously very flexible, so it’s possible to implement a large number of virtual networks in software. Another technique, which has become less popular over the last 4 or 5 years, is to develop ASICs which have the separate networks implemented with some sharing. Sort of the classic example of this is the Supercharging PlanetLab Platform; there are a few other ones as well. Of course an ASIC starts running into the same problems we had with fixed networks: it can’t be modified over time and has some limitations associated with it. Okay, so what are the three good reasons why FPGAs are such a good platform for implementing virtual networks? Like a lot of applications, FPGAs offer specialization. They have the capability to implement exactly the network hardware that’s needed, which provides a lot of benefit, okay. A second reason is that the FPGA doesn’t have to be by itself. The fact that we include an FPGA in the system doesn’t mean that the FPGA has to be the only thing that’s performing the packet routing. In fact, the word scalable in my talk here means that there are virtual networks both in the hardware inside the FPGA and also in software which is being run on an accompanying workstation, okay. One of the goals of the work is to allocate the high speed, critical networks to the FPGA and to allocate the lower speed, or lower quality-of-service, networks to the software, and to dynamically change the allocation over time as necessary. The third reason why FPGAs are desirable is because they provide the capability to do dynamic reconfiguration: a portion of the FPGA continues to run and forward packets while a small portion of the FPGA is changed.
Now, that’s not universally true for all FPGAs; until recently Altera had FPGAs in which you couldn’t do that. But in general, for a lot of the higher end FPGAs it is possible to do partial reconfiguration. As a member of the FPGA community over the years, you know, folks have had lots of tools and ideas for partial reconfiguration, but the application space where partial reconfiguration is useful has been somewhat limited; the more classic examples you see in the literature are encryption where a key is changed, or a filter where a parameter is changed, or some other things. This particular application really does benefit from partial reconfiguration on a large scale, not just a small one for a few parameters. Okay, so let me talk a little bit about our platform now to give you a sense of what this overall system looks like. We implemented our virtual networking system using a NetFPGA board from Stanford, a NetFPGA 1G board; in fairness this system is old, okay, it’s about 6-7 years old. My group is currently migrating, or has migrated, the technology to a newer board from Altera which I’ll say a few words about at the end of the presentation, but most of this talk is gonna focus on this NetFPGA product. The NetFPGA is a board which plugs into a standard PC. You can buy a cube off the shelf from a company that has all the appropriate software installed. The idea here is that we would have a Linux based PC which has a series of software virtual routers which are being implemented in kernel mode, and there’s also, connected through a PCI interface, a NetFPGA board which has a series of hardware virtual routers implemented inside this rather small Virtex 2 FPGA. And the idea would be to swap the configurations as necessary: if one of the virtual networks is no longer needed, one of the routers can be kicked out of the FPGA and a new one can be put in.
If it turns out there’s room in the FPGA it’s possible to migrate one of the networks from software into the hardware. There’s a total of four ports here on the 1G Ethernet interface and a relatively, by today’s standards, relatively small amount of SDRAM and SRAM that’s part of the system. So I’m gonna talk about two techniques, yeah sorry – >>: So the network that’s getting implemented is coming through the 1G on the NetFPGA or is it the NICs going to the Linux host? >> Russell Tessier: Okay, so good question; the answer to that is yes. So there’s two different modes, one called single receiver mode and one called multi receiver mode, which I’ll talk about in just a second. Yeah, okay, so basically you can do it either way depending on how you want to do the reconfiguration. Okay, so just a higher level block diagram of what’s happening. Here we can see, here’s the physical interfaces over here for the input, physical interfaces for the output. You’ll notice that the physical interfaces not only have a physical MAC coming in from the outside, they also have a way of transmitting information over the PCI bus from the software into these queues which are located in the NetFPGA. As I’ll mention in just a second there are actually two different modes we can run: one where all information is received at the NetFPGA card and then is parceled out to the appropriate place, and another where a switch is used to split up the traffic and transfer the data dynamically. It turns out that the choice of how the data is split up depends on what sort of reconfiguration mode you want to use for the FPGA. Okay, so again, I just sort of mentioned this: when we started this the goal was to route everything through the NetFPGA hardware, and there are some real limitations to being able to do that.
Initially when we started the project we didn’t use partial reconfiguration; we used a reconfiguration which involved modifying the entire FPGA. Okay, which makes it very difficult to use just one single location where packets are being received. So we moved to a technique, at least initially, called the multi-receiver approach. I’m gonna start off with a discussion of the multi-receiver approach first and describe some of the challenges involved in that. Then I will move on to the more advanced technique which we used later in the project, which involved having all the packets arrive at the NetFPGA board. Okay, so one of the goals of this work was to try to figure out: how can we receive information for the networks? How can we allow for changes in the hardware, okay, without upsetting the traffic which was not changing? In other words, if we implement multiple virtual routers in hardware and we change one of them, how can we do that in such a way that the other routers are not negatively impacted and they continue operating in an efficient fashion, okay? So as I mentioned we tried two different approaches. The first one I’m gonna talk about is the multiple receiver approach. Okay, so here’s a picture of the multi-receiver approach; we can see that there are switches that are located out here, outside the hardware. Okay, so here’s the source, here’s the destination; these switches basically make a determination in terms of whether a particular packet should be routed in the FPGA hardware or should be routed by using one of the software virtual routers. We can see here that, you know, this looks relatively straightforward in terms of implementing a routing path. The router which was implemented in the NetFPGA hardware was the reference router that was available from Stanford.
It has an arbiter in the front here which determines, as packets arrive, which particular router should be handling the packets. There’s a table over here which holds the virtual network identification used to select the router, whether things are being implemented in hardware or software. For this particular multi-receiver approach we’re only interested in the hardware implementations, at least on the NetFPGA. Then finally, inside the virtual router, there’s a lookup into the forwarding table to determine what the destination is, and then ultimately the packet is sent out through an output queue back out to a switch, to whatever the destination is, and onward to the next node. A similar path is taken for software. We can see here the packets arrive over here at the PC NIC and they go through a bridge, and effectively the same sort of process takes place: there’s a lookup in a forwarding table and the information is then sent back out to the switch. I’ve given this talk before and people have asked me, well, it sounds like you’re only doing, you know, modifications of the forwarding table, why do we need all this extra hardware? In fact, even though this talk doesn’t focus on it, there are other examples that we have where other changes beyond just the forwarding table lookup are different. In other words it’s possible to put in rate limiters on the traffic for some virtual networks but not others. For other protocols, the student has implemented something called a ROFUL protocol for some networks and a more standard routing procedure for other networks. So it is possible; these virtual routers can actually have significant differences beyond just performing forwarding table lookups. For this particular talk, though, I’m focusing on the forwarding table lookup portion of this.
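The classification step just described, an arriving packet is matched against a small table to decide which virtual router (a hardware slot on the FPGA or a software router on the host) should handle it, can be sketched like this. All names here are illustrative placeholders, not identifiers from the actual system.

```python
# Hypothetical sketch of the demultiplexing step: a table keyed by
# virtual-network identifier says where each packet should be handled.
def classify_packet(vnet_table, packet):
    """Return ('hw', slot), ('sw', None), or ('drop', None) for a packet."""
    entry = vnet_table.get(packet["vnet_id"])
    if entry is None:
        # Packet belongs to no configured virtual network.
        return ("drop", None)
    return entry

# Example mapping: one high-bandwidth network in FPGA slot 0,
# one low-bandwidth network handled by a software virtual router.
vnet_table = {
    1: ("hw", 0),
    2: ("sw", None),
}
```

In the real design this decision is made in hardware by the arbiter at line rate; the dictionary here just models the table it consults.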
Okay, so what happens if we use this dynamic approach – yeah, go ahead – >>: How would you decide which is gonna be in software and which is in hardware? >> Russell Tessier: So, we have a scheduling algorithm, which I’ll talk about at the end, that makes the determination based on required resources: which networks should be in the hardware and which should be in the software. >>: So you could route through either one for any of the networks, it’s just a performance difference, if you choose it? >> Russell Tessier: It is a performance difference. Some networks, if they have high enough requirements, would have to go on the hardware. So for example, let’s say that you worked at a data center and there was a need to route some packets, you know, one of the users wants to create a network; if the bandwidth or the requirements are high enough then maybe that particular network would only be suitable for the hardware. If it turns out there are too many that have to go on the hardware then there would have to be some idea of which ones would be most appropriate, okay. So let me say a few words about what happens if we perform dynamic reconfiguration of the FPGA for this multi-receiver approach, okay. As I mentioned, there are multiple virtual routers implemented in the hardware. If one of the virtual routers needs to be changed in this multi-receiver approach, the first technique that we tried was to reconfigure the entire FPGA. Okay, and as a result of that, the other virtual routers which were not being changed were actually not functional for a short period of time, which would seem to be a bad thing, okay. So a way to overcome that, if the entire FPGA is being modified, is to migrate the functionality for the virtual routers which are not being changed to software for a period of time, reconfigure the FPGA, and then migrate those routers back to the FPGA after the new router has been put in place, okay.
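The hardware-versus-software decision described in that exchange can be illustrated with a simple greedy allocator. The actual scheduling algorithm is only described at a high level in the talk, so this is an assumed simplification: rank networks by required bandwidth and fill the limited number of FPGA slots from the top, letting the rest fall back to software routers.

```python
# Illustrative greedy allocator (not the authors' actual scheduler):
# the highest-bandwidth networks claim the scarce FPGA slots.
def allocate(networks, hw_slots):
    """networks: list of (name, required_mbps) pairs.
    Returns (hardware_networks, software_networks)."""
    ranked = sorted(networks, key=lambda n: n[1], reverse=True)
    hw = ranked[:hw_slots]   # fit into the FPGA, up to the slot limit
    sw = ranked[hw_slots:]   # overflow is routed by software
    return hw, sw
```

A real scheduler would also weigh forwarding-table size and logic area, and would re-run as networks are created and destroyed, which is exactly when the partial reconfiguration discussed later kicks in.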
So in looking at this, we can see the steps involved: while the FPGA is being modified, moving all the networks which are not being changed to software, granted at a much slower bandwidth rate; reconfiguring the FPGA; and then moving the networks back to hardware. We call this a static FPGA approach, since effectively the FPGA is shut down for a period of time. This is not the most effective approach; I’ll talk about a more effective approach in a minute, but at least as a first step in studying this it did provide a way of doing initial testing. I’ll mention later that the total amount of time necessary to perform this reconfiguration is around 12 seconds. It doesn’t take 12 seconds to reconfigure the FPGA; it does take 12 seconds to remap the ports for these networks which are being migrated from hardware to software. Okay, so after the FPGA has been reconfigured the networks are moved back into hardware and then the routing can continue. Okay, so after we finished the multi-receiver approach we moved on to this idea of using a single receiver. The idea here would be to get rid of all the switches and all the, you know, having to reconfigure the entire FPGA every time. So the idea would be that all the network traffic is received over here at the input to the NetFPGA, and then there would be hardware inside the NetFPGA which would figure out if a particular packet should be routed by the FPGA hardware or by the software which is located on the PC. Okay, and this makes it a lot easier if we want to perform a dynamic reconfiguration: in this case here, if I want to go about modifying a particular router in the FPGA and the routers have been created properly, it’s possible to dynamically change only the router hardware which is affected, okay. Only the forwarding table information and the logic associated with the virtual network is modified via partial FPGA reconfiguration, okay.
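The static (full-reconfiguration) sequence just described can be written out as an explicit step list, which makes clear why the unaffected routers lose bandwidth during the swap. The function and step names below are hypothetical placeholders for the real migration, FPGA programming, and port-remapping operations.

```python
# Sketch of the static full-reconfiguration sequence described above.
# Every unaffected router makes a round trip through software.
def full_reconfigure(active_routers, changed, new_router):
    steps = []
    # 1. Move every router that is NOT changing down to software
    #    (it keeps forwarding, but at a much lower rate).
    for r in active_routers:
        if r != changed:
            steps.append(("migrate_to_software", r))
    # 2. Reprogram the whole FPGA with the new router set.
    steps.append(("reconfigure_full_fpga", new_router))
    # 3. Restore the untouched routers to hardware.
    for r in active_routers:
        if r != changed:
            steps.append(("restore_to_hardware", r))
    return steps
```

With partial reconfiguration, discussed next, steps 1 and 3 disappear for the unaffected routers, which is where most of the roughly 12 seconds of port remapping goes.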
So this is an example here of how a packet, let me go back a second, this is an example of a packet that comes in from the network and is identified as one that should be routed by software; the packet is forwarded from the NetFPGA board through the PCI bus to software, okay. The lookup takes place over here in the software virtual router, and then, now that its destination has been determined, the packet comes back out and is sent out via the physical interface of the FPGA hardware. So, you know, another benefit of this approach is we don’t need switches out in front of the board, which cuts down on some of the cost. Okay, so if we want to perform partial reconfiguration of the FPGA we need to take some steps, okay. Some of you, I know, are FPGA experts; some of you may not be. FPGAs provide the capability of partially reconfiguring the device, in other words, leaving a large portion of the device operational while a small portion is being changed. But in general, until recently, the FPGA companies haven’t spent a whole lot of time worrying about making it real easy for people to do that. Fortunately that’s why we have graduate students. It is possible to do it, although it’s a little bit painful. In allowing for the single receiver approach we had to construct the system so that most of the FPGA design, most of the reference router, is stable, okay; most of it does not change while the FPGA is being modified. But the small pieces of the FPGA, namely these partially reconfigurable regions, would be the parts that get modified when a particular network is being swapped out, okay. One of the limitations in the current FPGAs is that, in order to have a partially reconfigurable region, it needs to be isolated from the other logic in the device through a series of bus macros.
The regions basically have to be of the same physical size, typically either a portion of a column of FPGA logic or a whole column in the case of the Virtex 2. You know, there’s a little bit of work necessary in order to download the configuration, and in our case we use a JTAG interface to do the partial reconfiguration of the FPGA device. Okay, so here’s kind of a snapshot of the Virtex 2; here’s a couple of the macros that we made. You can see the bus macros which isolate the partially reconfigurable region from the remainder of the logic in the chip. It turned out that, in this case here, we were able to fit up to about two partially reconfigurable virtual routers in the Virtex 2. We did the same experiment with a Virtex 5 and it was possible to fit up to 20 of them. With the larger chips, you know, Virtex 6 and Virtex 7, you could literally implement, you know, 40-50 of these different virtual routers in the physical hardware without too much trouble. So this device is actually on the smaller side, but unfortunately the NetFPGA is still the one that’s used by most people for networking research. Okay, so we’re gonna move on to talk about some results now that we generated for this project. This system was implemented in the lab. We had several NetFPGA cubes we used to run the experiments. The idea was to get an understanding of the overhead of doing this reconfiguration, of the benefits of using this approach versus the software approach, and then finally to understand the ideas regarding scalability, okay. So as we add more virtual networks to the system, what happens? We can see in this case here we actually wound up using two boxes to perform these experiments. The NetFPGA project has a suite of tools. They have a reference router, which will actually route packets, that’s in Verilog; you can download it and run it on the board.
They have a packet generator which will generate packets at line rate and allow for experimentation with the packets. So we did some experiments here: we generated packets using the packet generator available from the NetFPGA folks and also iPerf, which is a software packet generator which is much simpler. Then finally we implemented the software virtual routers using a platform called OpenVZ that runs on top of Linux on the PC, and measured the overall results for the system. Okay, so some of the things that we ran into: as I mentioned, to do the full FPGA reconfiguration, the static reconfiguration, it took about 12 seconds to migrate virtual networks from hardware to software, reconfigure the device, and then migrate the networks from software back to hardware again. The reason it took so long was because it was necessary in the network to remap where the packets were being sent. Since there were switches in the network it was necessary to figure out where the packets would be sent. You know, for a short time, if we have a small number of virtual networks, it doesn’t seem like that bad of an approach, but if there are a lot of virtual networks, shutting off the entire FPGA probably doesn’t make a lot of sense. So there’s really a strong motivation here for partial reconfiguration. With partial reconfiguration the modification of just one of those little portions in the FPGA only took 0.6 seconds. So it’s a lot faster, and I’ll show you in the results in just a minute that in fact the throughput, the performance that you get, is substantially better than this other static approach. Okay, so the first thing we looked at is how well we do versus the software technique: if you implement these routers in software, what happens?
Some of this is based on the packet size; obviously the bigger the packet size the better, because you don't have to look at the headers to figure out where the packet is going quite as often. This is a log scale, so we can see here that in general, in all cases, the FPGA-based router, the hardware virtual router, was able to operate at line rate. This is, of course, the 1 gigabit per second Ethernet interface. In fact there was a lot of slack associated with this; we could have done a lot better than 1 gigabit per second based on the speed of the FPGA. In fact some of the faster FPGAs that are available, with some of the experiments that have been done recently, don't have any trouble getting to 10 gigabits per second or much higher than that. This is a consistent rate, whereas the software, especially for small packets, actually did pretty badly. So we can see it's at least two orders of magnitude worse. Most of that is due to the overhead of getting the packet into the software, doing context switches, et cetera. In general the performance was certainly not scalable or beneficial. In this case this was only for one virtual router, okay, so this didn't involve a lot of virtual routers with switching; even with just one the performance was not so good. Okay, so I'm gonna contrast two different techniques of using FPGA reconfiguration and what happens to the network. So let's take a look at a system here which includes two virtual routers, we'll say A and B, that are initially in the network. The idea would be that virtual router B is gonna be replaced in the FPGA with a second network called B Prime. Okay, so there's two networks, A and B, both of them implemented in the FPGA. There's a desire to shut off the entire FPGA for a period of time and reconfigure it so that instead of including A and B it will now include A and B Prime.
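The packet-size effect mentioned above comes down to header lookups per second: at a fixed line rate, smaller frames mean more frames, and each frame costs one forwarding decision. A small sketch, assuming standard Ethernet wire overhead (8-byte preamble plus 12-byte inter-frame gap per frame):

```python
# Maximum Ethernet frame rate at 1 Gb/s for a given frame size.
# Each frame carries 20 extra bytes on the wire (8 B preamble +
# 12 B inter-frame gap), which is why 64-byte frames give the
# classic ~1.488 million packets per second worst case.
LINE_RATE_BPS = 1_000_000_000
WIRE_OVERHEAD_BYTES = 20

def max_frames_per_second(frame_bytes: int) -> float:
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return LINE_RATE_BPS / bits_per_frame

for size in (64, 512, 1518):
    print(f"{size:5d}-byte frames: {max_frames_per_second(size):,.0f} pps")
```

The 64-byte case is roughly 18 times the lookup rate of the 1518-byte case, which is why the software router suffers most on small packets.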
So initially the throughput, we can see here, we have 100,000 packets per second when both of these networks are included in the FPGA. When the FPGA is shut off, network A is migrated from hardware, which has high throughput, down to software, which has low throughput. So for a period of time A is continuing to forward packets but at a substantially reduced rate, okay. So it's doing better than nothing but it's still taking a long time. Meanwhile B is shut off, so it's not forwarding anything. Twelve seconds later, okay, after all the port remapping and the changes in the physical hardware, we can see that when A is restored back to the FPGA along with B Prime the network regains its full bandwidth, okay. A couple of points to point out here: the first is that this particular graph does not show the amount of time necessary to do the place and route and the compile time for the FPGA, so that would be extra. Okay, so that's an even stronger motivation for partial reconfiguration, in that you can make a bunch of little modules that are basically the same size that can get swapped into the FPGA without having to recompile the entire device, as long as they have the same interfaces on the outside. Additionally, in general, it may not necessarily be possible to know all the different combinations of virtual networks that would be in the FPGA at the same time, okay. So this approach is fine for initial experimentation but is not really a long-term solution. Okay, and we contrast that with the partial FPGA reconfiguration. In this case here B is swapped out of the FPGA and replaced by B Prime. We can see here that A more or less continues generating sustained throughput. So A doesn't get changed at all; it stays in the FPGA and continues to route its packets happily.
B is shut off, and after a much shorter amount of time that the FPGA is down, because nothing has to be migrated anywhere, B Prime is swapped in via the JTAG interface and it continues along at high speed. Effectively we get a 20X speedup. It's also scalable in the sense that if I had a lot of virtual networks I could go ahead and basically individually swap them in without any problems, okay, based on a schedule. Okay, so the next issue is one of scalability. There's a desire, as I said, to have a lot of virtual networks. In the FPGA we're limited, okay. We can only have a total of up to 4 virtual routers in the FPGA, and if we're really constrained by partially reconfigurable regions we can only really have 2. So we did some experiments here with the 4, just to sort of take the best case. So here we can see that if we augment the 4 virtual routers which are implemented in the FPGA with software over time, what happens is that eventually the overall throughput, the average throughput, starts to go downward, because as we add more virtual routers they must go in software, and that has a negative effect on the overall throughput of the system. Okay, because some things need to be forwarded to the software. In general the software-only approach does not do so well, and we can see that as time goes on the overall rate more or less stays the same; then towards the end, as there are more and more virtual routers, the software has more context switching and its performance goes down. So, what's the takeaway from this particular graph? This approach is desirable if there are a lot of virtual networks and a big FPGA, okay, which can be partially reconfigured. Okay, what about latency associated with the system? Often in evaluating networks people look at throughput, but latency is also an important issue. So we evaluated the latency of this particular system here.
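The scalability curve just described can be illustrated with a toy model: the first four routers land in FPGA hardware at line rate, and any further routers spill into software at a rate roughly two orders of magnitude lower (per the earlier comparison), sharing one CPU. All the rates here are illustrative stand-ins, not the measured numbers:

```python
# Average per-router throughput as virtual routers spill from FPGA
# slots into software. Rates are hypothetical: hardware at a fixed
# high rate, software ~100x slower and split across the software
# routers, which share one host CPU and its context switches.
HW_SLOTS = 4
HW_PPS = 1_000_000          # assumed hardware rate per router
SW_PPS_TOTAL = 10_000       # assumed aggregate software capacity

def average_throughput(n_routers: int) -> float:
    hw = min(n_routers, HW_SLOTS)
    sw = n_routers - hw
    sw_total = SW_PPS_TOTAL if sw > 0 else 0.0
    return (hw * HW_PPS + sw_total) / n_routers

for n in (2, 4, 6, 10):
    print(f"{n:2d} routers -> {average_throughput(n):,.0f} pps average")
```

The average holds flat while everything fits in hardware, then declines as each additional router dilutes the total with slow software forwarding, which is the shape of the graph in the talk.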
We can see here that in general the software-only approach has a much longer latency. You can see that this is about 0.1 and that's about 0.7, so about seven times worse. Over time the latency here is more or less constant even as I add virtual routers, because again the hardware is being used in parallel. But then things quickly get worse as I start adding software routers to the system. Same thing: software in general starts off bad and gets worse as more routers are added. Another question I sometimes get asked is, why didn't you use an FPGA that has a hard microprocessor on it and use that for the routing? That's certainly an option; it would make things a lot easier because you don't have to transfer things over the PCI bus. We didn't really have an opportunity to do that. But longer term one could consider having just one device with a hard processor doing the software virtual routers and the FPGA fabric performing the hardware virtual routing. The Xilinx devices have what's called an ICAP interface which allows the device itself to reconfigure portions of its logic. So that might be another option in the longer term; basically the FPGA could be more or less a self-contained unit, could perform all the routing, and could configure itself without a lot of external effort. But we haven't gotten to that point yet. We did a comparison between partial reconfiguration and static reconfiguration, where static in this case is reconfiguring the entire device once again. We looked at using different numbers of ports on the FPGA. So you could think of using all 4 ports for receive and send on the NetFPGA board, or you could look at just using one port. The key difference here is a question of how often you have to reconfigure the virtual networks, okay? So, if you don't have to reconfigure that much, you know, it's sort of not pleasant to have to shut the entire FPGA down.
But it has less of an impact than if you have to reconfigure frequently, okay. So if we take a look at this, this is the throughput per dataplane, we can see that in general the partial reconfiguration does the best. Okay, but it's not that much different from shutting the entire FPGA down for a period of time if we don't have to reconfigure that often. The gap gets bigger if the reconfiguration takes place frequently. From talking to colleagues, reconfiguring the dataplanes every 180 seconds is pretty fast; I mean, usually these virtual routers stay up for long periods of time. But in general the partial approach is still a better approach for some of the reasons I gave a little bit earlier. Okay, what about the average throughput for the combined hardware/software approach? We don't have a board which has the Virtex 5 in it, but we can make some estimates; we were able to compile some of the partially reconfigurable regions into the Virtex 5 and see how fast they are. We can see here that in general the throughput starts dropping off for the partial reconfiguration versus the static reconfiguration. You can see that this sort of knee of the curve, this inflection point, takes place earlier for partial reconfiguration than for full reconfiguration. The reason for that is because, as I said, for partial reconfiguration the regions have to be sort of carefully crafted by hand; they have to be a certain size. It uses up a lot more FPGA resources, okay. So in some cases it may not necessarily be a good thing to have partial reconfiguration, because then I can't fit as much stuff inside the hardware, because everything needs to be built in a specific way. Okay, so let me say a little bit about power consumption. One of the concerns with FPGAs is their power consumption.
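The reconfiguration-frequency tradeoff above is a simple duty-cycle effect: while the device (or a region of it) is being reconfigured, that dataplane forwards nothing. A sketch using the downtimes quoted in the talk and the 180-second period mentioned as "pretty fast":

```python
# Effective throughput of a dataplane that must be reconfigured
# periodically: the dataplane is dead for `downtime_s` out of every
# `interval_s` seconds. Downtimes are the figures quoted in the talk;
# the line rate is the ~64-byte worst case at 1 Gb/s.
def effective_throughput(line_rate_pps: float, downtime_s: float,
                         interval_s: float) -> float:
    return line_rate_pps * (1.0 - downtime_s / interval_s)

LINE_RATE = 1_488_095
for name, downtime in (("partial", 0.6), ("full", 12.0)):
    pps = effective_throughput(LINE_RATE, downtime, 180.0)
    print(f"{name:7s}: {pps:,.0f} pps effective at a 180 s reconfig period")
```

At a 180-second period the full-reconfiguration penalty is about 6.7% of line rate versus 0.3% for partial, and the gap widens as the period shrinks, matching the shape of the graph.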
In general our power consumption here was pretty good, although in fairness this is a Virtex 2, which is fairly old technology, I think it's [indiscernible] nanometer, it's pretty old. But in general most of the power consumption here is dynamic power, as you'd expect with this old technology. In some of the more recent FPGAs the static power is more important. One of the comments that we got is, instead of dynamically reconfiguring stuff, if you don't need a network anymore why don't you just cut the clock to save power? We did that and we were able to save about 10% of the power. You can see up at the top that, more or less, if we don't use the PCI interface we can implement five virtual routers in a Virtex 2. If we implement the PCI interface but don't support partial reconfig we get four. If we support both partial reconfig and the interface we can support a total of two virtual routers in the FPGA. But again, we can support up to 20 in a Virtex 5 device. There's a number of interesting issues associated with this work on the software side; my student is continuing to work on this. One of them is the algorithms necessary to figure out which networks go in hardware and which networks go in software, and whether the FPGA should be reconfigured if a particular network is created or destroyed. So the student has done a little bit of work on that. You know, there's a question of: when there's space available in the FPGA, is it desirable to migrate a network from software to hardware to fill that gap? So the student's done some work on this to look at the improvement in terms of bandwidth if you go about making a migration. We can see here that in general the amount of time we were able to get improved bandwidth increases if we go about migrating a network from software to hardware, okay, since the migration path for partial reconfiguration is fairly straightforward. The algorithm takes a look at how much space is available in the FPGA.
It looks at what the required traffic is, and then it makes a decision about whether the particular upgrade should take place or not. In some cases it's not possible to make an upgrade because there's not enough space in the device. So we had a sample of real traffic in a network, and we found that about 20% of the time it is possible to make the migration and get an improvement. Okay, so let me finish up; I just want to say a few words about some other things I'm working on at UMass. I have a few slides. You've seen from the presentation here that we were successful in implementing these virtual routers on this Virtex 2 device. Altera recently provided me with some funding support to migrate the NetFPGA infrastructure to Altera devices. Over the past couple of years we've migrated pretty much the entire suite of tools from the NetFPGA to an Altera DE4. I have a link on my website which allows faculty and other researchers to go and download the infrastructure. So the packet generator, the reference router, and various other infrastructure are available there for free download. The only thing we don't have on the website is the Verilog code for the reference router. Stanford doesn't want that on a website because they actually assign that to their students in a class. So I was told that I could have that code and give it to people if they requested it, but I can't make it just downloadable for anybody. So if you're interested, it's on my faculty website, and I'd be happy to send anybody that's interested both the packet generator and the reference router for this board. We targeted the DE4 board, which is a board that was introduced a couple of years ago. It has a much bigger FPGA on it and a lot more DRAM. It still has four 1-gigabit ports.
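The space-and-traffic migration test described a moment ago might look like the following. This is a deliberate simplification of the student's algorithm; the slot count, rates, and thresholds are invented for illustration:

```python
# Simplified sketch of the software-to-hardware migration decision:
# upgrade a virtual network only if a partially reconfigurable slot
# is free and the network carries enough traffic to outweigh the
# ~0.6 s outage during the partial reconfiguration.
# All numeric thresholds here are hypothetical.
RECONFIG_COST_PKTS = 60_000   # packets forgone during the ~0.6 s swap

def should_migrate(free_pr_slots: int, demand_pps: float,
                   sw_capacity_pps: float, horizon_s: float = 60.0) -> bool:
    if free_pr_slots == 0:
        return False                      # no room left in the device
    gained = (demand_pps - sw_capacity_pps) * horizon_s
    return gained > RECONFIG_COST_PKTS    # benefit must beat the outage

print(should_migrate(free_pr_slots=1, demand_pps=50_000, sw_capacity_pps=10_000))
print(should_migrate(free_pr_slots=0, demand_pps=50_000, sw_capacity_pps=10_000))
```

With a real traffic trace driving `demand_pps`, a check like this firing about 20% of the time would match the result quoted above.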
Altera has recently announced, I think within the last couple of days actually, that they have what's called a DE5 board, which has a much bigger Stratix device, a Stratix V device, that they're shipping. They have also given my group some money to migrate this experimental platform onto the DE5, okay. Let me go to the next slide here to show you some of the resource differences. So here are some resource differences between the DE4 and the NetFPGA. So this is the board we're currently using, and here's the DE4. The number of LUTs is much bigger, and the amount of space taken up by the reference router is a much smaller percentage of the overall device, which gives us much more room to do more virtual dataplanes and other logic if necessary, sorting and various other applications in the network. This is for the DE4. The DE5, I think the Stratix V device on the DE5, is about 2 to 3 times bigger than the DE4. There are some limitations. One of the limitations is that Altera, at least for the DE4, does not support partial reconfiguration, okay. So that's a limitation we're dealing with, and for the DE5, I'm not sure if the DE5 has a device that supports it; I don't believe they do partial reconfiguration either, but I'm not 100% sure about that. So this is new technology we have. My student who is finishing his PhD is looking at ways of performing computing using these devices, basically embedding MapReduce-type algorithms in the virtual network to be able to speed up a system of computers running a synchronous version of MapReduce. Yes – >>: Are those supposed to be the same, the same design on those two different boards? >> Russell Tessier: It's roughly the same design. There's a little bit of difference; you can see that some of the numbers are a little bit different because we had to make some changes in terms of the implementation in going from Xilinx to Altera.
So in this particular case you'll see that, for example, the memory bits look much smaller for the NetFPGA board, primarily because the packets were being stored off the chip. For the DE4 implementation that we currently have, we're storing that information on the chip itself, so the numbers look a little bit bigger. But the student is currently working on a DRAM interface to alleviate that problem. >>: That's probably where, also where the IOs are [indiscernible]. >> Russell Tessier: Yeah, the IOs, well, yeah, there's a different network interface core that Altera has that we migrated over. We did some experiments with this virtual networking on the DE4 and basically found that we could again match the packets per second. There's no difference between what we could do with the NetFPGA and the DE4. That's probably not very surprising, since the DE4 is faster and bigger than the NetFPGA, but we just wanted to confirm that. So let me wrap up. This project looks at implementing virtual routers on FPGAs. Okay, it's novel in two ways, okay. When we initially introduced this, we were the first project that had an implementation of multiple different virtual routers on one FPGA, okay. We were also the first project that allowed for scalability of both hardware and software routers in the same system, where the hardware is implemented in the FPGA. We think this has a promising future; there's a lot of future applicability for this, because there's a lot of interest in being able to make networks big and small that can basically virtualize hardware at the same time. I'm sure I'll have a chance to talk to some of you today about that. We were able to demonstrate this in the lab using a NetFPGA board. We're currently migrating towards bigger boards to do more experimentation.
I think, for those of us in the FPGA area, the thing that kind of got me excited was the fact that this really was a great example of why partial reconfiguration is a good thing for FPGAs. Altera for many years kind of resisted doing partial reconfiguration. I think, not as a result of this work, I'm sure, but as a result of a broad range of applications using partial reconfiguration, they have decided to move into that area as well and provide tool support to be able to deal with that, right. One last thing too: there is a follow-on NetFPGA board that the Stanford folks have created in collaboration with Xilinx; it's called the NetFPGA 10G. I'm not sure of the status; that board has sort of been in creation for a long time, so I'm not sure of the current status of that board. But there is a competing board to the DE4 board, just to be fair to everyone. Okay, thank you very much. [applause] >> Russell Tessier: Yeah. >>: So, I'm probably gonna combine a bunch of different questions into one [indiscernible]. They all sort of surround [indiscernible]. >> Russell Tessier: Uh huh. >>: So what do you see with the board being able to support lots of ports, and how does that compare to [indiscernible] requirements? I mean, it looks like all this is basically beyond [indiscernible], but there are some scalability issues, especially when you get far out, where you're trying to support lots of different [indiscernible]. >> Russell Tessier: I think we haven't looked quite that far into the distance. I think there are issues; the place that I would see the issues would be sort of at the interfaces, trying to get data onto the FPGA, onto the hardware devices, without reaching a bottleneck in the input or output ports. I feel we didn't spend a whole lot of time optimizing the logic inside the device, simply because we could get the performance we needed without having to optimize a lot.
I think that if more effort was made, if there really was a bandwidth limitation, additional work could be done to optimize the way the FPGA application was created. So I think that is possible, but we haven't really looked at that yet. The student did one thing I didn't mention in this work, and we have a Transactions on Computers paper on my website that describes it; most of this work is in a TComp paper that the student had accepted a few months ago. There's a whole other area I didn't talk about today, which is how do you implement the forwarding tables, right? It's a tree-based approach that allows for various levels of lookup for virtual networks, which is very space efficient, right. So as the number of networks increases there are scalability issues not only in the actual hardware for the router but also in how to implement some of these things on the device. Yeah. >>: So often with these partially reconfigured applications, you're telling me it does it through that, and then somebody comes along and builds essentially a programmable superset – >> Russell Tessier: Right. >>: Would you say this is something where what's common is fixed, and for what's variable you've got a few options, and you download, just set a couple of registers and go? >> Russell Tessier: Right. >>: Is that possible to do with what you're doing? >> Russell Tessier: It is possible. So, you know, the question is: do you necessarily need partial reconfiguration, or can you do sort of a parameter-based setup for configuration? For the types of things we wanted to do, rate limiters and really changing protocols, you kind of have to reconfigure the hardware for the network. But I think you're right; in other cases it may be possible just to have a parameter setting. It's certainly possible; the NetFPGA router allows this, and the virtual router allows it too. If you just want to change the forwarding table there's nothing to reconfigure.
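For flavor, the tree-based per-virtual-network forwarding lookup mentioned above might be sketched as a longest-prefix-match binary trie keyed by virtual network. This is a guess at the general structure for illustration only; the actual space-efficient design is in the TComp paper:

```python
# Minimal binary trie for longest-prefix-match forwarding, with one
# table per virtual network -- a sketch of the general idea, not the
# space-efficient multi-level structure from the paper.
class Trie:
    def __init__(self):
        self.next_hop = None
        self.child = [None, None]

    def insert(self, prefix_bits: str, next_hop: str) -> None:
        node = self
        for b in prefix_bits:           # walk/extend one bit at a time
            i = int(b)
            if node.child[i] is None:
                node.child[i] = Trie()
            node = node.child[i]
        node.next_hop = next_hop

    def lookup(self, addr_bits: str):
        node, best = self, None
        for b in addr_bits:
            node = node.child[int(b)]
            if node is None:
                break
            if node.next_hop is not None:
                best = node.next_hop    # remember the longest match so far
        return best

tables = {"vnet_A": Trie()}             # one forwarding table per virtual net
tables["vnet_A"].insert("10", "port1")
tables["vnet_A"].insert("1011", "port2")
print(tables["vnet_A"].lookup("10110000"))  # longest match wins: port2
```

Keying the tables by virtual network is what creates the space pressure the paper addresses: a naive copy of the trie per network grows linearly with the number of networks.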
>>: [indiscernible] how much variation from virtual router to virtual router are you anticipating? I guess, how different is the logic gonna be for each instance? >> Russell Tessier: That's a good question. I guess it depends on the application space. So, you know, I would expect if somebody implemented this, let's say it was a company selling this as part of a data center to the outside world, there may be a large variety of different types of network configurations that might be desirable; if it was sort of within a company and sort of fixed, it may be possible to come up with just a very small [indiscernible]. Yeah. >>: Way back in the early section of the talk you mentioned the figure 12 seconds for a total reconfiguration time, approximately [indiscernible] software. How much of that was actually in reconfiguring the FPGA? >> Russell Tessier: It was less; the FPGA is reconfigured in less than a second, I think hundreds of milliseconds. >>: That kind of leads to a second question: given that there's very little overhead [indiscernible] configuration, if you don't have a partially reconfigurable FPGA why don't you just have two FPGAs? >> Russell Tessier: You could do that too, so – >>: Any problems with that? >> Russell Tessier: No, I mean you would have, you'd have to have the physical ports, the networking ports, going to both FPGAs. You'd have to have some way of trading off between the two. I mean, that's a possibility. >>: So from one of the application [indiscernible] – >> Russell Tessier: Yes. >>: And so one way that people get around sort of the scalability issues that you brought up is by temporarily [indiscernible] circuit switched network. Now – >> Russell Tessier: Right. >>: How do you see sort of how you might sort of integrate that type of [indiscernible]? >> Russell Tessier: We haven't really thought about that.
I would guess that instead of having partially reconfigurable regions with sort of logical routers, you could have partially reconfigurable regions with datapath switches. Although it might be easier, you know, Scott mentioned this, it might be easier for something like that if you just had a programmable crossbar that you could configure without having to reconfigure the FPGA. >>: Well, I mean I guess this only comes up when you start having performance problems. >> Russell Tessier: Yeah. >>: [indiscernible] in order to have really high bandwidth connections they have to be terribly, terribly wide – >> Russell Tessier: Right. >>: A ton of data coming in at one time. Being able to sort of set up that [indiscernible] to the point where in fact you can't reach the bandwidth limit of the device, because sort of one packet, two packets, one packet isn't even the width of the data path that [indiscernible]. You know, and so there, yeah, I am imagining that this is becoming, especially when the data rates are 40 times what you could be using [indiscernible]. I mean, that can happen really easily, so. >> Russell Tessier: I think there is, you know, there is a desire, I'll say for myself, I wish we had a board that had much higher network capacity. So, you know, one of the issues in being a researcher is that the NetFPGA is kind of the thing most people are using. I think everybody agrees it's not sufficient; you know, the NetFPGA 10G has been in planning for 3 years, and until recently I think there have been a lot of problems with the physical interface for it, for many years. I don't even know if it's available yet; that's kind of what got us interested in the DE4. But I would say that as these platforms for research become available, I mean, I would love to do something with 40G and get in there and figure out what the real challenges are on the FPGA side. Yeah.
>>: Do you have a story for mapping modules to any number of the PR [indiscernible] that you have in your device? You were talking about [indiscernible] fit 20 PR regions in [indiscernible] – >> Russell Tessier: Right. >>: To migrate the virtual router from software, moving it in and out, so – >> Russell Tessier: Right. >>: [indiscernible] allocating [indiscernible] or something. >> Russell Tessier: Right. >>: How would you, do you have a story for that right now? >> Russell Tessier: Yeah, you know, I talked about the allocation algorithm. One of the things about the Virtex 5 is that we didn't actually physically do that on it; we did completely reconfigure the device, and we partially reconfigured it for just one router. We didn't do it for, you know, 20 routers. So the short answer is we primarily just assumed that the partially reconfigurable regions were allocated in a column, you know, a series of columns. There was software on the computer that would sort of understand what was currently populated and what wasn't, and then we'd be able to swap things in and out. There probably is an interesting research question in terms of: can you cache these things and get them on the board beforehand and just swap them in as you need them? We were using JTAG; I think an ICAP interface with the processor that's on the board is a much smarter way of doing it, we just didn't do that yet. I think there's an interesting paper in the allocation algorithm for this that we didn't look at as well. Some of this too is, you know, coming up with realistic traces for the traffic so that we can actually show that it makes sense. Yes. >>: Just a general question. How much operating [indiscernible]? Is that [indiscernible] all in the DRAM or can you [indiscernible] FPGA [indiscernible] latency [indiscernible]?
>> Russell Tessier: Right, so in the experiments we did here, I believe the packets were stored in, excuse me, the packets were stored in the SRAM, and it wasn't a limitation in terms of the throughput. The student did another experiment where he moved the packet buffering from SRAM to DRAM and then moved the forwarding table into SRAM, because he needed more space for the forwarding tables. We didn't see that as a limitation on this particular system, maybe because the parallelism that's in the FPGA was able to route; in fact 1 gigabit per second is not, by today's standards, tremendously fast. I would guess that if you moved to higher speeds then the memory interface, especially as the packets get bigger, would be a concern. But compared to the Virtex 2, I mean, modern FPGAs have orders of magnitude more on-chip memory. >>: [indiscernible] you are using both DRAM and SRAM; do you think the general cost would be [indiscernible] because the SRAM chips seem to be very expensive? The FPGA is also – >> Russell Tessier: Yeah. >>: Do you feel like, you know, in the future FPGA prices are going down and this will be something that they can use everywhere, or – >> Russell Tessier: I think the rule of thumb I often think of for FPGAs is that the best price point is usually a middle-sized chip from one generation back, right. So a high-end chip from any generation is the most expensive. The cheap ones, the real low-end ones, are too small to do anything really useful except for maybe a few embedded dedicated applications. So the right place to be is sort of like one generation before, in the middle. I think that the prices in that area have consistently been in the few-hundred-dollar region for a while, and as soon as a new generation comes out, it's like old cars, right, the prices come down pretty dramatically.
So I guess the short answer is, I think of an FPGA that has suitable resources as being similar to a microprocessor, a few hundred dollars; so it's not thousands of dollars but it's not $20 either. All these things are sort of following, I tell this to my students, all these things are following Moore's law, right. So as you get bigger chips there are more things you want to do with them. So, I don't know, that's just my take on it. Yes. >>: How painful was the transition from the Xilinx to the Altera tools? >> Russell Tessier: So, I had a student work on that for a while; the tools themselves weren't bad because, you know, in my lab we have students that do both Xilinx and Altera. Actually getting the Ethernet core ported took 3 or 4 months of student time, to figure out how to deal with that. We're still working on the DRAM interface. So, I mean, the student has taken the DRAM interface from the NetFPGA and is trying to migrate it. But the technology is different, so you have to sort of look at the flow control and find out why. So just putting in a DRAM interface is easy, but then the packets get backed up and it's not clear why it's not working, things like that. So it's a work in progress, but, you know, the students often will do this for a period of time; it's a two-year Masters, they'll spend 6 months working on this and then they'll spend a year and a half using the board to do something useful. I have another project, funded by NSF, which is looking at network security using FPGAs. The student's developing monitors in the hardware to actually look at what's happening in the network, and that uses the DE4. We're still fairly early in that project. Thank you. [applause]