>> Ken Eguro: Good morning, I am Kenny Eguro from the Embedded and Reconfigurable Computing
Group here at Microsoft Research Redmond. Today it is my pleasure to introduce Russ Tessier. Russ Tessier is an
Associate Professor in Electrical and Computer Engineering at the University of Massachusetts Amherst. He
received his SM and PhD degrees in Electrical Engineering and Computer Science from MIT in 1992 and 1999,
respectively. Russ has worked professionally at Altera and Mentor Graphics, among other companies.
His interests include reconfigurable computing, embedded systems, and multicore processing.
So, let's welcome him.
[applause]
>> Russell Tessier: Thank you, Ken. Today I'm gonna talk about Network Virtualization using FPGAs. My
background is primarily in the FPGA area, but about 3 or 4 years ago I started to get interested in
networking and I started working with a faculty member, Lixin Gao, who is an expert in networking. We
decided to start exploring sort of a new trend in networking: the idea of using routing hardware to serve
different routing purposes at different times, so the same physical hardware can change its functionality
over time to provide different network functionality.
So today I'm gonna talk about how this goal can be achieved using Field Programmable Gate Arrays, okay. So
before I start I'd like to also acknowledge my colleagues; most of the work I'm gonna talk about today is
the PhD work of my student, Deepak Unnikrishnan, another PhD student, Dong Yin, and Professor Lixin
Gao from UMass.
So, I’m gonna start off today with an overview of what Network Virtualization is. It’s a topic area that’s
getting a lot of attention in the last 3 or 4 years. People have looked at different ways of making the
Internet and networking in general more functional. Okay, being able to reuse the same hardware over
and over for different purposes and in some cases using the same hardware for different routing
purposes at the same time. Okay, so I’ll give some examples today and show how this can be done in
hardware and also in software.
One of the challenges for the FPGA community over the years has been identifying application areas that take
advantage of the reconfigurability of FPGAs. Okay, and it turns out that this application that I'm gonna
talk about today is particularly good for dynamic reconfiguration of the FPGA, the Field Programmable Gate
Array device.
So, I'll show that for the work I'm gonna talk about today we used both complete device reconfiguration and
partial reconfiguration at run time in order to be able to serve some of our needs in virtual networking.
I'll describe a complete FPGA-based virtualization system, okay, that allows us to be able to route
packets at line rate for 1 gigabit per second Ethernet; we're working on a faster technology currently.
Finally, I’ll finish off with a discussion of how we went about making this particular implementation
partially reconfigurable and I’ll finish off with some results.
I'd also like to encourage everyone, if there are questions or comments that you'd like to make in the
course of the talk; you know, of course I'm used to giving lectures to students with lots of questions or
clarifications, so feel free as I talk to ask questions if there are things that aren't clear.
Okay, virtual networking. Over the history of networking, for the most part the network routing
hardware, the hardware that physically transmits the packets, has been fairly static, fairly fixed function.
Okay, but as time progresses there’s a need for more diversity in the network. Okay, we’ve seen sort of
an explosion of cloud computing, data centers, a variety of other sorts of applications, and a variety of
needs that present different bandwidth requirements, different security requirements, and different
isolation requirements at different times. Okay, so it's desirable to have a system that can provide
different functionalities, different sort of quality of service at different times without having to
completely replace the hardware, without having to come up with completely new protocols that can be
implemented at the system level.
For the most part, over the last 3 or 4 years there’s been an explosion of this interest in the idea of
virtual networking, taking a physical piece of hardware, routing hardware, and changing it over time to
be able to serve different functionalities. It doesn't even necessarily need to be the hardware; it also
could be the software which is being used to route packets in a system. In fact the system that I’m
gonna talk about today includes both hardware and software as a platform in order to be able to route
packets.
We also view this as an opportunity for service providers to make an
investment in an infrastructure for networking and then go about changing and adapting the system
over time. We think that can be an effective business model as people create different needs, and
different applications, and different protocols that can be implemented on the hardware, okay.
So the solution to all of these issues, as I mentioned, is network virtualization, reusing the same
hardware for different purposes at different times. The key question is, how can we go about providing
the same quality of service, the same throughput, the same security, and the same isolation for systems
by using the same hardware or potentially using the same hardware for multiple networks at the same
time without having them collide with each other? Okay, so the goal of this research is to investigate
some of these issues.
Another issue that we’re evaluating in this research is one of scalability. Okay, how can I implement a
large number of network routers, a large number of network dataplanes, on the same physical hardware
and timeshare that hardware in a way that’s similar to the use of a microprocessor and its use across a
variety of different applications on a standard PC? Okay, so I’ll talk about some of the techniques that
we’re using in order to be able to do that today.
So network virtualization, we can think about a virtual network as a collection of nodes. Okay, so I have
a picture of two virtual networks, a blue network and a red network. We can see here that some of the
names of these nodes overlap and the topologies of these two networks are dramatically different.
Okay, one is more of a point-to-point type connection between different nodes and the other
network has a little bit more connectivity, but in terms of providing the connectivity it's
possible to see that, in fact, both of these networks could be generated using the same hardware and
the same physical links.
So, for example, if I had a situation with a series of computers which are located down here and a series
of routers represented by these boxes here, if I could come up with some way of sharing the resources,
the routing resources, programming the resources in a way that forwards packets in a predictable or in
an application-specific way, it's possible to be able to implement the system effectively.
Now about 3 or 4 years ago most folks looked at this and implemented this type of system using
software. Okay and I will talk about a software system that was developed at my school based on
something called OpenVZ which allows for kernel level forwarding of packets and providing
virtualization. When we looked at this 3 or 4 years ago a decision was made that in fact software is too
slow for this type of application. So it’s desirable to put some hardware in the system but continue to
allow it to be scalable. Okay, allow it to handle not just one, not just two different virtual networks but
to allow it to handle a large number of virtual networks, in effect, you know, close to an
infinite number of virtual networks, as long as the various constraints of the networks were met.
Okay, developing a system like this has a lot of benefits obviously. There’s the goal of reducing costs, it’s
not necessary to make investments in a large amount of routing hardware since the same hardware can
be used for multiple networks at the same time. Additionally, as protocols change or as techniques
change it’s possible to update them to allow them to be able to be used simultaneously in an
independent fashion.
Okay, so when I say virtual router what do I mean? Well, I'm gonna use the term virtual router in this
talk, although perhaps a more accurate term would be virtual dataplane. Okay, so the virtual router is
really split into two parts: there's a control plane which defines how the packets are forwarded, and then
there's the dataplane where the actual physical hardware determines where an incoming packet is sent
to.
The focus of this talk really is on the dataplane, although I do use the word virtual router quite a bit in
this particular talk. This gives us the idea of a physical router which is being time multiplexed across
multiple different virtual routers. So for example in this case the physical router has two physical
interfaces, one for input and one for output, but the core hardware, processor, FPGA, or ASIC is being time
multiplexed at different times to perform different functionality.
So for example in this case, for the example I had on the previous slide, there were two networks; I might
for example have two sets of hardware which are performing the routing for the two different networks.
The key parts of the router include some routing control and then also the forwarding table, the
dataplane, which actually determines, when a packet arrives, where it's sent to as a destination, okay.
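To make the control plane/dataplane split concrete, here is a minimal sketch in Python of what a time-multiplexed virtual dataplane does logically; the names are made up and the real lookup is implemented in FPGA hardware, not software, so treat this purely as an illustration.

    # Illustrative sketch only: one physical dataplane serving several virtual networks.
    # The control plane installs routes; the dataplane picks the table for the packet's
    # virtual network and looks up the next hop.
    from typing import Dict

    class VirtualDataplane:
        def __init__(self) -> None:
            # one forwarding table per virtual network: destination prefix -> output port
            self.tables: Dict[int, Dict[str, int]] = {}

        def install_route(self, vnet: int, prefix: str, port: int) -> None:
            self.tables.setdefault(vnet, {})[prefix] = port

        def forward(self, vnet: int, prefix: str) -> int:
            # exact-match on the prefix string for simplicity; a real router does
            # longest-prefix match
            return self.tables[vnet][prefix]

    dp = VirtualDataplane()
    dp.install_route(vnet=0, prefix="10.0.1.0/24", port=2)
    dp.install_route(vnet=1, prefix="10.0.1.0/24", port=3)   # same prefix, different network
    assert dp.forward(0, "10.0.1.0/24") == 2
    assert dp.forward(1, "10.0.1.0/24") == 3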
In this particular picture I show two virtual networks; if this was implemented in
software it's common to have 10, 15, 20, or more virtual networks. In the FPGA work I'm gonna talk
about today we were successful in implementing up to 5 virtual networks in a relatively small FPGA. In a
much larger Virtex 5 FPGA it was possible to implement up to 30 different virtual networks, okay.
So some of the challenges in implementing this, whether it's in hardware or software, are making sure that
there are sufficient resources in whatever the deployment device is for this particular technology. So for
example in the FPGA the obvious constraint is the amount of hardware that's available. If I run out of
space in the forwarding table, if I run out of space in the logic, I can't implement more virtual networks.
There’s also an issue in terms of performance. I want to make sure that I can meet the constraints for all
the networks at the same time, okay. If the hardware is being multiplexed there’s this idea of flexibility
and also an idea of scalability being able to change the network over time, okay.
The place where the FPGA comes in, in this particular example, is that over time the different virtual
networks will change, okay. Some new virtual networks will be formed; other virtual networks will be
taken away, okay. The FPGA provides the capability to be able to modify the physical hardware at
different times to accommodate new virtual networks as they’re created or as they’re destroyed. We’ve
developed a system that has the capability to do that.
Okay, so for some of the traditional techniques that go about doing this there are a large variety of papers
– yeah, question –
>>: Can I ask a question on the previous slide?
>> Russell Tessier: Yep.
>>: Is there one [indiscernible] number of [indiscernible]?
>> Russell Tessier: For this particular work there is. We have not looked at cases where there are a
different number of virtual networks than virtual routers. One could think of techniques where
you could potentially have, you know, an individual network which had different
parameters at different times so that it wouldn't necessarily be a one-to-one mapping, but –
>>: Why, so why can’t you, for example [indiscernible] and one router? Why can’t the, so the packets
from one of these networks why can’t just the packets be different, right? Like here it says, this is a
packet from network A and then you have one router that, you know, you can set [indiscernible] packets
from this [indiscernible] service or whatever [indiscernible]. Why do we need two separate, you know,
copies of the hardware?
>> Russell Tessier: So that's a good question. I think for this picture I sort of provided a logical
viewpoint. In fact in the actual hardware that we implement a lot of the hardware is actually shared across
the different networks. So for example in this case I show here
separate forwarding tables, but in fact the forwarding tables are implemented in the same physical
memory and then it's just a matter of indexing into the appropriate place.
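As a rough illustration of that answer (a sketch only, with invented numbers, not the actual memory layout of the design): the forwarding entries for every virtual network can sit in one physical memory, and a small base-address table selects the region that belongs to a given network.

    # Sketch: several virtual forwarding tables packed into one shared physical memory.
    PHYS_SLOTS = 1024
    phys_mem = [None] * PHYS_SLOTS                 # the one shared memory
    base = {0: 0, 1: 256, 2: 512}                  # per-network base addresses (made up)

    def write_entry(vnet, slot, prefix, port):
        phys_mem[base[vnet] + slot] = (prefix, port)

    def read_entry(vnet, slot):
        return phys_mem[base[vnet] + slot]

    write_entry(0, 5, "10.0.0.0/8", 1)
    write_entry(1, 5, "10.0.0.0/8", 3)             # same slot index, different region
    assert read_entry(0, 5) != read_entry(1, 5)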
Okay, so most of the techniques that have been developed to implement virtual networking have been
in software, with both kernel-mode and user-mode forwarding of packets. I will show a little bit later in the
talk that in fact this is orders of magnitude slower than what an FPGA can do in terms of being able to
forward the packets.
There has been other work on FPGAs recently, since we started our work, but in general most of the
techniques involve either software or an ASIC approach. In general, for obvious reasons, the software is
limited both in terms of parallelism and in terms of some functionality due to hardware limitations, but it
is obviously very flexible, so it's possible to implement a large number of virtual networks in software for
the system.
Another technique, which has sort of become less popular over the last 4 or 5 years, is to develop ASICs
which have the separate networks implemented with some sharing. Sort of the classic example of this is
the Supercharged PlanetLab Platform; there are a few other ones as well. Of course an ASIC starts
running into the same problems we had with fixed networks: they can't be modified over time and
they have some limitations associated with them.
Okay, so what are the three good reasons why FPGAs are such a good platform for implementing virtual
networks? Like a lot of applications, FPGAs offer specialization. They have the capability to implement
exactly the network hardware that's needed in the FPGA hardware, which
provides a lot of benefit, okay. A second reason is that the FPGA doesn't have to be by itself. The fact
that we include an FPGA in the system doesn’t mean that the FPGA has to be the only thing that’s
performing the packet routing. In fact the word scalable in my talk here means that there are virtual
networks in both the hardware inside the FPGA and also in software which is being implemented on an
accompanying workstation, okay.
One of the goals of the work is to allocate the high speed networks, the critical networks to the FPGA
and to allocate the lower speed, or lower quality networks to the software and to dynamically change
the allocation over time as necessary.
The third reason why FPGAs are desirable is because they provide the capability to do dynamic
reconfiguration. A portion of the FPGA continues to run and forward packets while a small portion of
the FPGA is changed. Now that’s not universally true for all FPGAs, up until recently Altera had FPGAs
which you couldn’t do that. But in general for a lot of the higher end FPGAs it is possible to do partial
reconfiguration.
As a member of the FPGA community over the years, you know, folks have had lots of tools and ideas
for partial reconfiguration, but the application space where partial reconfiguration is useful has
been somewhat limited; the more classic examples you see in the literature are encryption where a
key is changed, or a filter where a parameter is changed, or some other things. This particular
application really does benefit from partial reconfiguration at a large scale, not just a small one for a few
parameters.
Okay, so let me talk a little bit about our platform now to give you a sense of what this overall system
looks like. We implemented our virtual networking system using a NetFPGA board from Stanford. So, a
NetFPGA 1G board; in fairness this system is old, okay, it's about 6-7 years old. My group is currently
migrating, or has migrated, the technology to a newer board from Altera, which I'll say a few words about
at the end of the presentation, but most of this talk is gonna focus on this NetFPGA product.
The NetFPGA is a board which plugs into a standard PC. You can buy a cube off the shelf from a
company that has all the appropriate software installed. The idea here is that we would have a Linux
based PC which has a series of software virtual routers which are being implemented in kernel mode
and there's also, accompanying it through a PCI interface, a NetFPGA board which has a series of hardware
virtual routers implemented inside this rather small Virtex 2 FPGA. And the idea would be to swap the
configurations as necessary: if one of the virtual networks is no longer needed, one of the routers can be
kicked out of the FPGA and a new one can be put in. If it turns out there's room in the FPGA it's possible to
migrate one of the networks from software into the hardware. There are a total of four ports here on the
1G Ethernet interface and, by today's standards, a relatively small amount of SDRAM and
SRAM that's part of the system.
So I’m gonna talk about two techniques, yeah sorry –
>>: So the network that’s getting implemented is coming through the 1G on the NetFPGA or is it the
NICs going to the Linux host?
>> Russell Tessier: Okay, so good question; the answer to that is yes. So there are two different
modes, one called single-receiver mode and one called multi-receiver mode, which I'll talk about
in just a second. Yeah, okay, so basically you can do it either way depending on how you want to do the
reconfiguration.
Okay, so here's just a higher-level block diagram of what's happening. Here we can see the physical
interfaces over here for the input, and the physical interfaces for the output. You'll notice that the physical
interfaces not only have the physical MACs coming in from the outside, they also have a way of transmitting
information over the PCI bus from the software into these queues which are located in the NetFPGA.
As I'll mention in just a second there are actually two different modes we can run: one where all
information is received at the NetFPGA card and then is parceled out to the appropriate place, and another
technique where a switch is used to be able to split up the traffic and transfer the data
dynamically. It turns out that the choice of how the data is split up depends on what sort
of reconfiguration mode you want to use for the FPGA.
Okay, so again, I just sort of mentioned this. When we started this the goal was to route everything
through the NetFPGA hardware, and there are some real limitations to being able to do that. Initially when
we started the project we didn't use partial reconfiguration; we used a reconfiguration which involved
modifying the entire FPGA. Okay, which makes it very difficult to be able to use just one single location
where packets are being received. So we moved to a technique, at least initially, called the multi-receiver
approach.
So I’m gonna start off with a discussion of the multi-receiver approach first and describe some of the
challenges involved in that. Then I will move on to the technique, the more advanced technique which
we used later in the project, which involved having all the packets arrive at the NetFPGA board.
Okay, so one of the goals of this work was to try to figure out how can we receive information for the
networks? How can we allow for changes in the hardware, okay, without upsetting the traffic which was
not changing? So in other words if we implement multiple virtual routers in hardware if we change one
of them how can we do that in such a way that the other routers are not negatively impacted and they
continue operating in an efficient fashion, okay?
So as I mentioned we tried two different approaches. The first one I'm gonna talk about is the multiple
receiver approach. Okay, so here's a picture of the multi-receiver approach; we can see here that there are
switches that are located out here, outside the hardware. Okay,
so here's the source, here's the destination; these switches basically make a determination in terms of
whether a particular packet should be routed in the hardware, in the FPGA hardware, or whether it should be
routed by using one of the software virtual routers.
We can see here that, you know, this looks relatively straightforward in terms of being able to
implement a routing path. The router which was implemented in the NetFPGA hardware was the
reference router that was available from Stanford. It has an arbiter in the front here which determines,
as packets arrive, which particular router should be handling the packets. There's a table over here
which indicates the virtual IP identification used to select the router, whether things are
being implemented in hardware or software. For this particular multi-receiver approach we're only
interested in the hardware implementations, at least on the NetFPGA.
Then finally, inside the virtual router there's a lookup into the forwarding table to determine what the
destination is, and then ultimately the packet is sent out through an output queue, back out to a switch,
to whatever the destination is, and onward to the next node.
A similar path is taken for software. We can see here the packets arrive over here at the PC NIC and
go through a bridge, and effectively the same sort of process takes place: there's a lookup in a
forwarding table and the information is then sent back out to the switch.
I've given this talk before and people have asked me, well, it sounds like you're only doing, you know,
modifications of the forwarding table, why do we need all this extra hardware? In fact,
even though this talk doesn't focus on it, there are other examples that we have where changes
beyond just the forwarding table lookup are different. So in other words it's possible to put in rate
limiters on some of the traffic for some virtual networks but not others. There are other protocols; the
student has implemented something called a ROFUL protocol for some networks and a more standard
routing procedure for other networks.
So it is possible; actually these virtual routers can have significant differences beyond just performing
forwarding table lookups. For this particular talk, though, I'm focusing on the forwarding table lookup
portion of this.
Okay, so what happens if we use this dynamic approach – yeah, go ahead –
>>: How would you decide which is gonna be in software and which is in hardware?
>> Russell Tessier: So, we have a scheduling algorithm I’ll talk about at the end that makes the
determination based on required resources, which resources should be in the hardware and which
should be in the software.
>>: So you could route through either one for any of the networks it’s just a performance difference, if
you choose it?
>> Russell Tessier: It is a performance difference. Some networks if they have high enough
requirements would have to go on the hardware.
So for example, let's say that you worked at a data center and there was a need to route some packets;
you know, one of the users wants to create a network. If the bandwidth, or the requirements, are high
enough then maybe that particular network would only be suitable for the hardware. If it turns out
there are too many that have to go on the hardware then there would have to be some sort of an idea of
which one would be most appropriate, okay.
So let me say a few words about what happens if we perform dynamic reconfiguration of the FPGA for
this multi-receiver approach, okay. So as I mentioned there are multiple virtual routers implemented in
the hardware. If one of the virtual routers needs to be changed in this multi-receiver approach the first
technique that we tried was to reconfigure the entire FPGA. Okay, and as a result of that the other
virtual routers which are not being changed were, for a short period of time, actually not functional,
which would seem to be a bad thing, okay. So if the entire
FPGA is being modified, one way to overcome that is to migrate the functionality for the virtual routers
which are not being changed to software for a period of time, reconfigure the FPGA, and then migrate
those routers back to the FPGA after the new router has been put in place, okay.
So in looking at this, we can see the steps involved: while the FPGA is being modified,
moving all the networks which are not being changed to software, which granted has a much slower
bandwidth rate, reconfiguring the FPGA, and then moving the networks back to hardware. So we call
this a static FPGA approach, so effectively the FPGA is shut down for a period of time. This is not the
most effective approach, and I'll talk about a more effective approach in a minute, but at least as a first
pass in terms of studying this it did provide a way of doing some initial testing.
I’ll mention later that the total amount of time necessary to perform this reconfiguration is around 12
seconds. It doesn’t take 12 seconds to reconfigure the FPGA. It does take 12 seconds to remap the
ports for these networks which are being migrated from hardware to software. Okay, so after the FPGA
has been reconfigured the networks are moved back into hardware and then the routing can continue.
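To summarize the flow just described, here is a hedged, pseudocode-style sketch in Python; the helper names and objects are invented for illustration and do not correspond to the project's actual control software.

    # Sketch of the full ("static") reconfiguration flow: the whole FPGA is rewritten,
    # so untouched routers are parked in software while the device is reprogrammed.

    def remap_ports(router, target):
        # stub: re-point the external switches so this router's traffic goes to `target`;
        # in our measurements this remapping dominated the ~12 second total
        print(f"remapping {router} -> {target}")

    def static_reconfigure(fpga, sw_host, new_bitstream, routers_to_keep):
        for r in routers_to_keep:                  # 1. keep them forwarding, slowly, in software
            sw_host.start_software_router(r)
            remap_ports(r, "software")
        fpga.program_full(new_bitstream)           # 2. reconfigure the entire device
        for r in routers_to_keep:                  # 3. bring the untouched routers back
            remap_ports(r, "hardware")
            sw_host.stop_software_router(r)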
Okay, so after we finished the multi-receiver approach we moved on to this idea of using a single receiver.
The idea here would be to get rid of all the switches and all the, you know, having to reconfigure the
entire FPGA every time. So the idea would be that everything, all the network traffic, is received over
here at the input to the NetFPGA and then there would be hardware inside the NetFPGA which would
figure out if a particular packet should be routed by the FPGA hardware or by the software which is
located on the PC.
Okay, and this makes it a lot easier if we want to perform a dynamic reconfiguration. In this case here, if I
want to go about modifying a particular router in the FPGA and the routers have been created properly,
it's possible to dynamically change only the router hardware which is affected, okay. Only the
forwarding table information and the logic associated with the virtual network is modified via partial
FPGA reconfiguration, okay.
So this is an example here of how a packet, let me go back a second, this is an example of a packet that
comes in from the network and is identified as one that should be routed by software; the packet is
forwarded from the NetFPGA board through the PCI bus to software, okay. A lookup takes place over
here in the software virtual router, and now that its destination has
been determined the packet comes back out and is then sent out via the physical interface of the FPGA
hardware. So, you know, another benefit of this approach is we don't need switches out in front of the
board, which cuts down on some of the cost.
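A minimal sketch of that single-receiver dispatch, under the assumption that each packet carries an identifiable virtual network ID; the sets and names below are illustrative, and in the real system this decision is made in FPGA logic with the software path crossing the PCI bus.

    # Sketch: every packet arrives at the FPGA board, which decides where it is routed.
    HW_NETS = {0, 1}        # virtual networks currently mapped to hardware (illustrative)
    SW_NETS = {2, 3, 4}     # virtual networks handled by software routers over PCI

    def dispatch(packet):
        vnet = packet["vnet_id"]
        if vnet in HW_NETS:
            return f"forwarded in the FPGA datapath for vnet {vnet}"
        if vnet in SW_NETS:
            # crosses the PCI bus, is routed by the software virtual router, then
            # returns to the FPGA's physical interface for transmission
            return f"sent over PCI to the software router for vnet {vnet}"
        return "dropped: unknown virtual network"

    print(dispatch({"vnet_id": 1}))
    print(dispatch({"vnet_id": 3}))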
Okay, so if we want to perform partial reconfiguration of the FPGA we need to take some steps, okay.
Some of you, I know, are FPGA experts; some of you may not be. FPGAs provide the capability
of partially reconfiguring the device, in other words, leaving a large portion of the device operational
while a small portion is being changed. But in general, until recently, the FPGA companies haven't spent a
whole lot of time worrying about making it real easy for people to be able to do that. Fortunately that's
why we have graduate students. It is possible to do it, although it's a little bit painful.
In allowing for the single-receiver approach we had to construct the system so that most of the FPGA
design, most of the reference router, is stable, okay; most of it does not change while the FPGA is being
modified. But small pieces of the FPGA, namely these partially reconfigurable regions, would be the
part that would get modified when a particular network is being swapped out, okay. One of the
limitations in the current FPGAs is that in order to be able to have a partially reconfigurable region it
needs to be isolated from the other logic in the device through a series of bus macros. The regions basically
have to be of the same size, the same physical size, typically either a portion of a column of FPGA logic or a
whole column in the case of the Virtex 2. You know, there's a little bit of work necessary in order to be
able to download the configuration, and in our case we use a JTAG interface to do the reconfiguration,
the partial reconfiguration of the FPGA device.
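For a sense of what the swap itself involves, here is a hedged sketch assuming a fixed set of equally sized reconfigurable regions and a JTAG download path, as described above; the object and method names are hypothetical.

    # Sketch: replace the virtual router in one partially reconfigurable region.
    # Only that region stops briefly; the static logic and the other regions keep running.
    regions = {0: "router_A", 1: "router_B"}        # PR region -> currently loaded router

    def swap_router(jtag, region_id, partial_bitstream, new_name):
        jtag.download_partial(region_id, partial_bitstream)   # ~0.6 s in our measurements
        regions[region_id] = new_name

    # e.g. replace router_B with router_B_prime without disturbing router_A:
    # swap_router(jtag, 1, bprime_bits, "router_B_prime")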
Okay, so here's kind of a snapshot of the Virtex 2; here are a couple of macros that we made. You can see
they're bus macros which isolate the partially reconfigurable region from the remainder of the logic in
the chip. It turned out that, in this case here, we were able to fit up to about two partially
reconfigurable virtual routers in the Virtex 2. We also did the same experiment with a Virtex 5 and it
was possible to fit up to 20 of them. With the larger chips, you know, Virtex 6 and Virtex 7, you could
literally implement, you know, 40-50 of these different virtual routers in the physical hardware
without too much trouble. So this device is actually on the smaller side, but unfortunately the
NetFPGA is still the one that's used by most people for networking research.
Okay, so we're gonna move on to talk about some results now that we generated for this project. This
system was implemented in the lab. We had several NetFPGA cubes we used to run some of the
experiments. The idea was to get an understanding of the overhead of doing this reconfiguration, of the
benefits of using this approach versus the software approach, and then finally to understand the ideas
regarding scalability, okay. So as we add more virtual networks to the system what happens?
So we can see in this case here we actually wound up using two boxes to perform these experiments.
The NetFPGA project has a suite of tools. So, they have a reference router, which will actually route
packets, that's in Verilog; you can download it and run it on the board. They have a packet generator
which will generate packets at line rate and allow for experimentation with the packets.
So we did some experiments here, we generated packets using the packet generator available from the
NetFPGA folks and also iPerf which is a software packet generator which is much simpler. Then finally
we implemented the software virtual routers using a platform called OpenVZ that was implemented on
top of Linux on the PC, and measured the overall results for the system.
Okay, so some of the things that we ran into: as I mentioned, to do the full FPGA reconfiguration, the
static reconfiguration, it took about 12 seconds to basically migrate virtual networks from hardware to
software, reconfigure the device, and then migrate the networks from software back to hardware again.
The reason it took so long to be able to do this was because it was necessary in the network to remap
where the packets were being sent. Since there were switches in the network it was necessary to figure
out where the packet would be sent.
You know, for a short time, if we have a small number of virtual networks it doesn't seem like that bad of
an approach, but if there are a lot of virtual networks shutting off the entire FPGA probably doesn't make a
lot of sense. So there's really a strong motivation here for partial reconfiguration.
With partial reconfiguration, the modification of just one of those little portions of the FPGA only took
0.6 seconds. So it's a lot faster, and I'll show you in the results in just a minute that in fact the
throughput, the performance that you get, is substantially better than with this other static approach.
Okay, so the first thing we looked at is how well we do versus the software technique: if you implement
these routers in software what happens? Some of this is based on the packet size; obviously the
bigger the packet size the better, because you don't have to look at the headers
to figure out where the packet is going quite as often. This is a log scale, so we can see here that in
general, in all cases, the FPGA-based router, the hardware virtual router, was able to operate at line rate.
This is, you know, of course the 1 gigabit per second Ethernet interface. In fact there was a lot of slack
associated with this. We could have done a lot better than 1 gigabit per second based on the speed
of the FPGA. In fact some of the faster FPGAs that are available, you know, with some of
the experiments that have been done recently, don't have any trouble getting to 10 gigabits per second
or much higher than that.
This is a consistent rate, whereas the software, especially for small packets, actually did pretty badly. So
we can see it's at least two orders of magnitude worse. Most of that is due to the overhead of being
able to get the packet into the software, doing context switches, et cetera. In general the performance
was certainly not scalable or beneficial. In this case this was only for one virtual router, okay, so this
didn't involve a lot of virtual routers with switching; even with just one the performance was not so
good.
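For reference, a back-of-the-envelope calculation (using standard Ethernet framing overheads, not numbers from our measurements) shows why small packets are so much harder: the packets per second a 1 Gb/s link can carry rises sharply as frames shrink.

    # Maximum packets/second on 1 Gb/s Ethernet for a given frame size.
    # Per-frame wire overhead: 8-byte preamble + 12-byte inter-frame gap (standard figures).
    LINK_BPS = 1_000_000_000
    OVERHEAD_BYTES = 8 + 12

    def max_pps(frame_bytes):
        return LINK_BPS / ((frame_bytes + OVERHEAD_BYTES) * 8)

    for size in (64, 512, 1518):
        print(f"{size:5d}-byte frames: {max_pps(size):,.0f} packets/s")
    # roughly 1.49 million pps at 64 bytes versus about 81 thousand pps at 1518 bytes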
Okay, so I'm gonna contrast two different techniques of using FPGA reconfiguration and what happens
to the network. So let's take a look at a system here which includes two virtual routers, we'll say A and
B, that are initially in the network. The idea is that virtual router B is gonna be replaced in the
FPGA with a second network called B Prime. Okay, so there are two networks, A and B, both of them
implemented in the FPGA. There's a desire to shut off the entire FPGA for a period of time and reconfigure
the FPGA so that instead of just including A and B it will now include A and B Prime.
So initially we can see here that the throughput is 100,000 packets per second when both of
these networks are included in the FPGA. When the FPGA is shut off, network A is migrated from
hardware, which has high throughput down here, to software, which has low throughput. So for a period
of time A is continuing to forward packets but at a substantially reduced rate, okay. So it's doing better
than nothing but it's still taking a long time. Meanwhile B is shut off, so it's not forwarding anything.
12 seconds later, okay, after all the port remapping and the changes in the physical hardware, when A is
restored back to the FPGA along with B Prime we can see
that the network regains its full bandwidth, okay.
A couple of points to point out here. The first point is that this particular graph does not show the
amount of time necessary to do the place and route and the compile time for the FPGA, so that would be
extra. Okay, so there's an even stronger motivation for partial reconfiguration in that
you can make a bunch of little modules that basically are the same size that can get swapped into the
FPGA without having to recompile the entire device, as long as they have the same interfaces on the
outside.
Additionally, in general, you know, it may not be necessarily possible to know all the different
combinations of virtual networks that would be in the FPGA at the same time, okay. So, in general this
approach is fine for initial experimentation but is not really a long term solution.
Okay, and we contrast that with the partial FPGA reconfiguration. In this case here B is
swapped out of the FPGA and replaced by B Prime. We can see here that A more or less continues
generating sustained throughput. So A doesn't get changed at all; that stays in the FPGA and it continues to
route its packets happily. B is shut off for a much shorter amount of time; the
FPGA is down only briefly cause nothing has to be migrated anywhere. Then B Prime is swapped in via the JTAG
interface and it continues along at high speed. Effectively we get a 20X speedup.
It’s also scalable in the sense that if I had a lot of virtual networks I could go ahead and basically
individually swap them in without any problems, okay, based on a schedule.
Okay, so the next issue is one of scalability. There’s a desire as I said to have a lot of virtual networks. In
the FPGA we’re limited, okay. We can only have a total of up to 4 virtual routers in the FPGA. If we’re
really constrained by partially reconfigurable regions we can only really have 2. So we did some
experiments here with the 4 just to see, you know, to sort of take the best case.
So here we can see what happens if we augment the 4 virtual routers which are implemented in the FPGA
with software routers over time: eventually the overall throughput, the
average throughput, starts to go downward because, as we add more virtual routers, they
must be in software and that has a negative effect on the overall throughput of the system. Okay, cause
some things need to be forwarded to the software.
In general the software-only approach does not do so well, and we can see that as time goes on the overall rate
more or less stays the same; then towards the end there are more and more virtual routers, the software
has more context switching, and its performance goes down.
So, what's the takeaway from this particular graph? This approach is desirable if there are a lot of
virtual networks and you have a big FPGA, okay, which can be partially reconfigured.
Okay, what about latency associated with the system? Often in evaluating networks people look at
throughput, but latency is also an important issue. So we evaluated the latency of this particular system
here. We can see that in general the software-only approach has a much longer latency. You can
see that this is about 0.1 and that's about 0.7, so about seven times worse. Over time the latency here is
more or less constant even as I add virtual routers, cause again the hardware is being used in parallel.
But then things quickly get worse as I start adding software routers to the system. Same thing: software
in general starts off bad and gets worse as more routers are added.
Another question I sometimes get asked is, why didn't you, you know, you can get an FPGA that has
a hard microprocessor on the FPGA, why don't you use that for the routing? That's certainly an option;
it would make things a lot easier cause you don't have to transfer things over the PCI bus. We didn't
really have an opportunity to do that. But longer term one could consider having just one device that
had a hard processor doing the software virtual routers and then the FPGA
performing the hardware virtual routing.
The Xilinx devices have what's called an ICAP interface which allows the device itself to reconfigure
portions of its logic. So that might be another option in the longer term; basically the FPGA could be
more or less a self-contained unit, could perform all the routing, and could configure itself without a lot
of external effort. But we haven't gotten to that point yet.
We did a comparison between partial reconfiguration and static reconfiguration, where static in this
case is reconfiguring the entire device once again. We looked at using different numbers of ports on the
FPGA. So you could think of using all 4 ports for receive and send on the NetFPGA board or you could
look at just using one port.
The key difference here is a question of how often you have to reconfigure the virtual networks,
okay? So, if you don't have to reconfigure that much, you know, it's sort of not pleasant to
have to shut the entire FPGA down, but it has less of an impact than if you have to reconfigure
frequently, okay.
So if we take a look at this we can see here that in general this is the throughput per dataplane. You can
see here in general the partial reconfiguration does the best. Okay, but it’s not that much different
between shutting the FPGA down, the entire FPGA down for a period of time if we don’t have to
reconfigure that often. The gap gets bigger if the reconfiguration takes place frequently.
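One simple way to think about that tradeoff is the little model below; it is my own simplification rather than the measurement methodology of the experiments, and it assumes a router is completely unusable for D seconds out of every reconfiguration period of T seconds.

    # Effective throughput per dataplane as a function of reconfiguration frequency.
    def effective_pps(line_rate_pps, downtime_s, period_s):
        return line_rate_pps * max(0.0, 1.0 - downtime_s / period_s)

    LINE_RATE = 100_000   # packets/s, as in the earlier graphs
    for period in (180, 60, 10):
        full = effective_pps(LINE_RATE, 12.0, period)     # full-device reconfiguration
        part = effective_pps(LINE_RATE, 0.6, period)      # partial reconfiguration
        print(f"every {period:3d}s: full ~{full:,.0f} pps, partial ~{part:,.0f} pps")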
From talking to colleagues, reconfiguring the dataplanes every, you know, 180 seconds is pretty fast;
I mean usually these virtual routers stay up for long periods of time. But in general the partial approach
is still a better approach for some of the reasons I said a little bit earlier.
Okay, what about the average throughput for the combined hardware/software approach? We don't
have a board which has the Virtex 5 in it, but we can make some estimates: we were able to
compile some of the partially reconfigurable regions for the Virtex 5 and see how fast they are.
We can see here that in general the throughput starts dropping off earlier for the partial reconfiguration versus
the static reconfiguration. You can see that the knee of the curve here, this inflection
point, takes place earlier for partial reconfiguration than for the full reconfiguration. The reason for that is
because, as I said, for partial reconfiguration the regions have to be sort of carefully
crafted by hand and they have to be a certain size. That uses up a lot more FPGA resources, okay.
So in some cases it may not necessarily be a good thing to have partial reconfiguration, because then I
can't fit as much stuff inside the hardware, cause everything needs to be built in a specific way.
Okay, so let me say a little bit about power consumption. One of the concerns with FPGAs is power
consumption. In general our power consumption here was pretty good, although in fairness this is a
Virtex 2 which is fairly old technology, I think it's [indiscernible] nanometer, it's pretty old. But in
general most of the power consumption here is dynamic power, as you'd expect with this old
technology. In some of the more recent FPGAs the static power is more important.
One of the comments that we got is, instead of dynamically reconfiguring stuff, if
you don't need a network anymore why don't you just cut the clock to save power? We did that and we
were able to save about 10% of the power.
You can see up at the top that, more or less, if we don't use the PCI interface we can implement five
virtual routers in a Virtex 2. If we implement the PCI interface but don’t support partial reconfig we get
four. If we support both partial reconfig and the interface we can support a total of two virtual routers
in the FPGA. But, again we can support up to 20 in a Virtex 5 device.
There are a number of interesting issues associated with this work on the software side. My student is
continuing to work on this. One of them is the algorithms necessary to figure out which networks go in
hardware and which networks go in software, and whether the FPGA should be reconfigured if a
particular network is created or destroyed. So the student has done a little bit of work on that.
You know, there's a question of: is it desirable, when there's space available in the FPGA, to migrate a
network from software to hardware to fill that gap? So the student's done some work on this to take a
look at what happens, the upgrade in terms of bandwidth, if you go about making a migration. We can
see here that in general the amount of time we were able to get improved bandwidth increases if we go
about migrating a network from software to hardware. Okay, since the migration path with partial
reconfiguration is fairly straightforward.
The algorithm takes a look at how much space is available in the FPGA and what the required traffic is,
and then it makes a decision on whether the particular upgrade should take place or not. In some
cases it's not possible to make an upgrade cause there's not enough space in the device. So we had a
sample of real traffic in a network and we found that about 20% of the
time it is possible to make the migration and get an improvement.
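As a hedged sketch of the kind of decision involved (greatly simplified, with invented inputs; the actual algorithm also accounts for FPGA area and other constraints as just described): rank networks by demand, put the most demanding ones in the available hardware slots, and promote a software network when a slot frees up.

    # Simplified allocation sketch: highest-demand networks get FPGA slots, the rest
    # run in software; promote from software when hardware space becomes available.

    def allocate(networks, hw_slots):
        """networks: list of (name, required_mbps); returns (hardware, software)."""
        ranked = sorted(networks, key=lambda n: n[1], reverse=True)
        return ranked[:hw_slots], ranked[hw_slots:]

    def promote(software, free_slots):
        """Pick the most demanding software networks to migrate into freed regions."""
        return sorted(software, key=lambda n: n[1], reverse=True)[:free_slots]

    hw, sw = allocate([("A", 900), ("B", 700), ("C", 120), ("D", 40)], hw_slots=2)
    print("hardware:", hw, "software:", sw)
    print("promote next:", promote(sw, free_slots=1))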
Okay, so let me finish up, I just want to say a few words about some other things I’m working on at
UMass. I have a few slides. You’ve seen from the presentation here that we were successful in
implementing these virtual routers on this Virtex 2 device. Altera recently has provided me with some
funding support to migrate the NetFPGA infrastructure to Altera devices.
Over the past couple years we've migrated pretty much the entire suite of tools from the NetFPGA to an
Altera DE4. I have a link on my website which allows faculty and other researchers to go and download
the infrastructure. So the packet generator, the reference router, and various other infrastructure are
available there for free download. The only thing we don't provide, or we don't have on the website, is
the Verilog code for the reference router. Stanford doesn't want that on a website cause they actually
assign it to their students in a class. So I was told that I could have that code and I could
give it to people if they requested, but I can't make it, you know, just a download for anybody. So if you're
interested, it's on my faculty website and I'd be happy to send anybody that's interested both the
packet generator and the reference router for this board.
We targeted the DE4 board, which is a board that was introduced a couple years ago. It has
a much bigger FPGA on it and a lot more DRAM. It still has four 1-gig ports. Altera has recently announced, I
think within the last couple of days actually, that they have what's called a DE5 board, which has a much
bigger Stratix device, a Stratix V device, that they're shipping. They have also given my group some money
to migrate this experimental platform onto the DE5, okay.
Let me go to the next slide here to show you some of the resource differences. So here are some
resource differences between the DE4 and the NetFPGA. So this is the board we're currently using, and
here's the DE4. The number of LUTs is much bigger and the amount of space taken up by the reference
router is a much smaller percentage of the overall device, which gives us much more room to be able
to do more virtual dataplanes and other logic if necessary, sorting and various other applications in the
network. This is for the DE4. I think the Stratix V device on the DE5 is about 2 to 3
times bigger than the one on the DE4.
There are some limitations. One of the limitations is that Altera, at least for the DE4, does not
support partial reconfiguration, okay. So that's a limitation we're dealing with. For the DE5, I'm not
sure; I don't believe that device does partial reconfiguration either, but I'm not
100% sure about that.
So this is new technology we have. My student who is finishing his PhD is looking at ways of performing
computing using these devices, basically embedding MapReduce-type algorithms in the virtual network
to be able to speed up a system of computers running a synchronous version of MapReduce. Yes –
>>: Are those supposed to be the same, the same design on those two different boards?
>> Russell Tessier: It's roughly the same design; there's a little bit of difference. You can see that some
of the numbers are a little bit different because we had to make some changes in terms of the
implementation in going from Xilinx to Altera. So, in this particular case you'll see that,
for example, the memory bits look much smaller for the NetFPGA board, primarily because the packets were
being stored off the chip. For the DE4 implementation that we currently have we're storing that
information on the chip itself, so the numbers look a little bit bigger. But the student is currently
working on a DRAM interface to alleviate that problem.
>>: That’s probably where, also where the IOs are [indiscernible].
>> Russell Tessier: Yeah, the IOs, well yeah we have a different, there’s a different network interface
core that Altera has that we migrated over.
We did some experiments with this virtual networking on the DE4 and basically found that we could
again match the packets per second. There's no difference between what we could do
with the NetFPGA and with the DE4. That's probably not very surprising since the DE4 is faster and
bigger than the NetFPGA, but we just wanted to confirm that.
So let me wrap up. This project looks at implementing virtual routers on FPGAs. Okay, it's novel in two
ways, okay. When we initially introduced this we were the first project to implement
multiple different virtual routers on one FPGA, okay. We also were the first project that allowed for
scalability with both hardware and software routers in the same system, where the hardware is
implemented in the FPGA.
We think, you know, this has a promising future; there's a lot of future applicability for
this, because there's a lot of interest in being able to make networks big and small that can basically
virtualize hardware at the same time. I'm sure I'll have a chance to talk to some of you today about
that.
We’re able to demonstrate this in the lab using a NetFPGA board. We’re currently migrating towards
bigger boards to do more experimentation. I think from, you know, for those of us in the FPGA area the
thing that kind of got me excited was the fact that this really was a great example of why partial
reconfiguration is a good thing for FPGAs.
Altera for many years kind of resisted doing partial reconfiguration. I think, not as a result of this work
I'm sure, but as a result of sort of a broad range of applications using partial reconfiguration, they have decided
to move into that area as well, and provide tool support to be able to deal with that, right.
One last thing too: there is a follow-on NetFPGA board that the Stanford folks have created in
collaboration with Xilinx; it's called the NetFPGA 10G. I'm not sure of the status; that board has sort of
been in creation for a long time, so I'm not sure of the current status of that board. But there is a
competing board to the DE4 board, just to be fair to everyone.
Okay, thank you very much.
[applause]
>> Russell Tessier: Yeah.
>>: So, probably gonna combine a bunch of different questions into what [indiscernible]. They’re all
sort of surround [indiscernible].
>> Russell Tessier: Uh huh.
>>: So what do you see with the board to be able to support [indiscernible] being able to support lots of
ports ask it in comparison to [indiscernible] requirements? I mean it looks like all this is basically beyond
[indiscernible] but there are some scalability especially when you get far out where if you’re trying to
support lots of different [indiscernible].
>> Russell Tessier: I think we haven't looked quite that far into the distance. I think there are issues, you
know; the place that I would see the issues would be sort of at the interfaces, to try to get data onto the
FPGA, onto the hardware devices, and not reach a bottleneck in the input or output ports. I feel we didn't
spend a whole lot of time optimizing the logic inside the device simply because we could get the
performance we needed without having to optimize a lot. I think that if more effort was made, so if
there really was a bandwidth limitation, I think additional work could be done to optimize the way the
FPGA application was created.
So I think that is possible, but we haven't really looked at that yet. The student did one thing I didn't
mention in this work, and we have a Transactions on Computers paper that's on my website that describes it;
most of this work is in that TComp paper that the student had accepted a few months ago.
There's a whole other area I didn't talk about today, which is how do you implement the forwarding
tables, right? So, yeah, it's this tree-based approach that allows for various levels of lookup for virtual
networks, which is very space efficient, right. So as the number of networks increases there are scalability
issues not only in the actual hardware for the router but also in how to implement some of these things on
the device.
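To give a flavor of that idea (a sketch only; the structure in the paper is more sophisticated and far more space-efficient than this), one can picture a single prefix trie shared by all virtual networks, where each node stores a next hop per virtual network ID so that common prefixes are stored once.

    # Sketch: one binary prefix trie shared across virtual networks; each node keeps a
    # next hop per virtual network, so prefixes common to many networks are stored once.
    class TrieNode:
        def __init__(self):
            self.children = {}      # '0'/'1' -> TrieNode
            self.next_hop = {}      # vnet id -> output port

    class SharedTrie:
        def __init__(self):
            self.root = TrieNode()

        def insert(self, prefix_bits, vnet, port):
            node = self.root
            for bit in prefix_bits:
                node = node.children.setdefault(bit, TrieNode())
            node.next_hop[vnet] = port

        def lookup(self, addr_bits, vnet):
            node, best = self.root, None
            for bit in addr_bits:
                if vnet in node.next_hop:
                    best = node.next_hop[vnet]    # longest match seen so far
                if bit not in node.children:
                    return best
                node = node.children[bit]
            return node.next_hop.get(vnet, best)

    t = SharedTrie()
    t.insert("1010", vnet=0, port=1)
    t.insert("1010", vnet=1, port=4)              # same prefix, two networks, one node
    assert t.lookup("10101111", 0) == 1 and t.lookup("10101111", 1) == 4

Yeah.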
>>: So often with these partially reconfigured applications you’re telling me it does it through that and
then somebody comes along and builds essentially a programmable super set –
>> Russell Tessier: Right.
>>: Would you say this is something where, what’s common is fixed and what’s variable you’ve got a
little options and you download, just set a couple registers and go?
>> Russell Tessier: Right.
>>: Is that possible to do with what you’re doing?
>> Russell Tessier: It is possible. So, you know, the question is: do you necessarily need to have
partial reconfiguration or can you do sort of a parameter-based setup for configuration? For the types
of things we wanted to do, for rate limiters and other things, and for really changing protocols, you
kind of have to reconfigure the hardware for the network. But I think you're right. I mean in other cases
it may be possible just to have a parameter setting. It's certainly possible; the NetFPGA router allows
this and the virtual router allows it too. If you just want to change the forwarding table there's nothing to
reconfigure.
>>: [indiscernible] how much variation from virtual router are you anticipating, I guess, how different is
the logic gonna be for this instance?
>> Russell Tessier: That's a good question. I guess it depends on the application space. So, you know, I
would expect, if somebody implemented this, let's say it was a company selling this as part of a data
center to the outside world, I mean there may be a large variety of different types of network
configurations that might be desirable. If it was sort of within a company and sort of fixed, it may be
possible to come up with just a very small [indiscernible]. Yeah.
>>: Way back in the early section of the talk you mentioned the figure 12 seconds as a total
reconfiguration time, approximately [indiscernible] software. How much of that was actually in
reconfiguring the FPGA?
>> Russell Tessier: It was less; the FPGA is reconfigured in less than a second, it was, I think, hundreds of
milliseconds.
>>: That kind of leads to a second question: given that there's very little overhead [indiscernible]
configuration, if you don't have a partially reconfigurable FPGA why don't you just have two FPGAs?
>> Russell Tessier: You could do that too, so –
>>: Any problems with that?
>> Russell Tessier: No, I mean you could; well, you'd have to have the
physical ports, the networking ports, going to both FPGAs. You'd have to have some way of trading off
between the two. I mean that's a possibility.
>>: So from one of the application [indiscernible] –
>> Russell Tessier: Yes.
>>: And so one way that people get around sort of the scalability issues that you brought up is by
temporarily [indiscernible] circuit switched network. Now –
>> Russell Tessier: Right.
>>: How do you see sort of how you might integrate that type of [indiscernible]?
>> Russell Tessier: We haven't really thought about that. I would guess that instead of having partially
reconfigurable regions with sort of logical routers you could have partially reconfigurable regions with
datapath switches. Although it might be easier, you know, Scott mentioned this, it might be easier for
something like that if you just had a programmable crossbar that you could configure without having to
reconfigure the FPGA.
>>: Well, I mean I guess this only comes up when you start having performance problems.
>> Russell Tessier: Yeah.
>>: [indiscernible] in order to have really high bandwidth connections they have to be terribly, terribly
wide –
>> Russell Tessier: Right.
>>: A ton of data coming in at one time. Being able to sort of set up that [indiscernible] to the point
where in fact you can’t reach the bandwidth limit of the device because sort of one packet is, two
packets, one packet isn’t even the width of the data path that [indiscernible]. You know and so there,
yeah, I am imagining that this is becoming, especially when the data rates are 40 times what you could
be using [indiscernible]. I mean that happen really easily, so.
>> Russell Tessier: I think there is, you know, there is a desire; I'll say for myself I wish we had a board
that had much higher network capacity. So, you know, one of the issues in being a researcher is, you
know, NetFPGA is kind of the thing most people are using. I think everybody agrees it's not sufficient;
you know, the NetFPGA 10G has been in planning for 3 years, and until recently I think there have been a lot
of problems with the physical interface for it. I don't even
know if it's available yet; that's kind of what got us interested in the DE4. But I would say that as these
platforms for research become available, I mean I would love to do something with 40 GigE and get in there
and figure out what the real challenges are on the FPGA side. Yeah.
>>: As a, do you have a story for mapping models to any number of the PR [indiscernible] that you have
in your device? You were talking about [indiscernible] fit 20 PR regions in [indiscernible] –
>> Russell Tessier: Right.
>>: To migrate the virtual router from software moving it in and out so –
>> Russell Tessier: Right.
>>: [indiscernible] allocating [indiscernible] or something.
>> Russell Tessier: Right.
>>: How would you, do you have a story for that right now?
>> Russell Tessier: Yeah, you know, I talked about the allocation algorithm. One of the
things about the Virtex 5 is that we didn't actually physically do that on it; we did
completely reconfigure the device and partially reconfigured it for just one router. We
didn't do it for, you know, 20 routers. So the short answer to that is we primarily just assumed that the
partially reconfigurable regions were allocated in a column, you know, a series of columns. There was
software on the computer that would sort of understand what was currently populated and what
wasn't. Then we'd sort of be able to swap things in and out.
There probably is an interesting research question in terms of: can you cache these things and get them
on the board beforehand and just swap them in as you need them? We were using JTAG; I
mean I think an ICAP interface with the processor that's on the board is a much smarter way of doing it,
we just didn't do that yet.
I think there’s an interesting paper in the allocation algorithm for this that we didn’t look at as well.
Some of this too is, you know, coming up with realistic traces for the traffic so that we can actually show
that it makes sense. Yes.
>>: Just a general question. How much operating [indiscernible]? Is that [indiscernible] all in the DRAM
or can you [indiscernible] FPGA [indiscernible] latency [indiscernible]?
>> Russell Tessier: Right, so for the experiments we did here I believe the packets were stored in
DRAM, excuse me, the packets were stored in the SRAM, but it wasn't a limitation in terms of the throughput.
The student did another experiment where he moved the packet buffering from SRAM to DRAM and then moved the forwarding
table into SRAM because he needed more space for the forwarding tables. We didn't see that as a
limitation on this particular system, maybe, you know, cause the parallelism that's in the FPGA was able to
route, and in fact 1 gigabit per second is not, you know, by today's standards, tremendously fast.
I would guess that if you moved to higher speeds then the memory interface, especially as the packets
get bigger, would be a concern. But compared to the Virtex 2, I mean, modern FPGAs have orders of
magnitude more on-chip memory.
>>: [indiscernible] you are using both DRAM and SRAM, do you think the general cost would be
[indiscernible] because the SRAM chips seem to be very expensive? FPGAs are also –
>> Russell Tessier: Yeah.
>>: Do you feel like, you know, in the future that FPGA prices are going down and this will be
something that they can use everywhere, or –
>> Russell Tessier: I think the rule of thumb I often think of for FPGAs is that the best price
point is usually a chip that is one generation old and in the middle of the range, right. So a high
end chip for any generation is the most expensive. The cheap ones, the real low-end ones, are too small
to do anything really useful except for maybe a few embedded dedicated applications. So the right
place to be is sort of one generation behind, in the middle. I think that, you know, the prices in that
area have consistently been in like the few hundred dollar region for a while and, you know, as soon
as a new generation comes out, it's like old cars, right, the prices come down pretty
dramatically.
So I guess the short answer is, I think of an FPGA that has suitable resources as being similar to a
microprocessor, a few hundred dollars, so it's not thousands of dollars but it's not $20 either. All these
things are, I tell this to my students, you know, all these things are following Moore's law,
right. So as you get bigger chips there are more things you want to do with them. So, I don't know, that's just
my take on it. Yes.
>>: How painful was the transition from the Xilinx to the Altera tools?
>> Russell Tessier: So, I had a student work on that. The tools themselves weren't bad
cause, you know, in my lab we have students that do both Xilinx and Altera. Actually getting the
Ethernet core ported took a few months, 3 or 4 months of student time, to figure out
how to deal with that.
We're still working on the DRAM interface. So, I mean the student has taken the DRAM interface from the
NetFPGA and is trying to migrate it. But the technology is different, so you have to sort of look at the
flow control and find out why. So just putting in a DRAM interface is easy, but then the packets get backed
up and it's not clear, like, why it's not working, and things like that.
So it's a work in progress, but, you know, the students often will do this for a period of time; you know,
it's a two-year Masters, so they'll spend 6 months working on this and then they'll spend a year and a half
using the board to do something useful.
I have another project which is funded by NSF which is looking at network security using FPGAs. So the
student’s developing monitors in the hardware to actually look at what’s happening in the network and
that uses the DE4. We’re still fairly early in that project.
Thank you.
[applause]