>> Jitu Padhye: It's a pleasure to welcome Sangjin. And this paper is going to appear in
SIGCOMM. It's about routers that use GPUs. Thanks.
>> Sanjin Han: Thank you. Hi. I'm Sangjin Han, and I'm going to present
PacketShader, our GPU-accelerated software router. This is joint work with Keon
Jang, KyoungSoo Park, and my advisor, Sue Moon.
Let me explain what PacketShader is all about. PacketShader is a software router that
exploits the GPU for high-performance packet processing. The PacketShader prototype shows
40 gigabits per second of forwarding performance on a single PC.
I'm going to give a general idea of software routers first. A software router is literally driven
by software, so it is inherently flexible and programmable. The role of a
software router is not limited to IP routing. If you want more functionality, or support for
new protocols, you can just do it by modifying software. Software routers are also
based on commodity hardware, which has two implications: commodity hardware is
cheap, and its evolution is much faster because it's backed by millions of
customers.
Now let's review the commodity hardware market. 10-gigabit Ethernet has become
popular, and this is a great opportunity for software routers.
The cost of 10-gigabit NICs has dropped quickly, and it's now below $300 per port. You can
build a multi-10-gigabit software router at a very affordable price.
However, high-speed links and low cost do not by themselves guarantee the
performance of software routers. The Achilles' heel of software routers has long been their
low performance. This table shows trends in the performance of software routers,
measured with 64-byte packets. Many works have tried to push the
performance to the limit, but it has not scaled beyond 10 gigabits per second.
This means that no matter how many 10-gigabit ports you have, your system will not
be able to handle even a single port at line rate.
The performance bottleneck is in the CPU, and I'm going to give you a little more detail.
Let's see how much computation power is needed and why a CPU-only approach is not
enough for 10-gigabit networks.
First of all, you need about 300 CPU cycles just to receive and transmit a packet through
the network interface card. Furthermore, you need additional computation power
depending on what you want to do on the software router.
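(A rough back-of-the-envelope, assuming 3 GHz cores, makes the gap concrete: a saturated
10-gigabit port carries about 14.9 million minimum-sized packets per second, since each
occupies 84 bytes on the wire including framing overhead. At 300 cycles per packet,
packet I/O alone consumes about 4.5 billion cycles per second -- more than an entire
3 GHz core per port, before any actual routing work is done.)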
The problem is that you don't have enough CPU cycles, so how can you solve this
problem?
>>: Those CPU cycle numbers must be for a specific architecture, right? A specific
instruction set -- is this x86, or?
>> Sanjin Han: Yeah, these are based on actually -- and unfortunately [indiscernible].
So our approach is two-fold. The first is to optimize packet I/O, which gives a
significant improvement, by a factor of six. I'm not going to address the details in this talk;
you can find our optimization techniques in the paper.
And our second approach is about how to handle these application-dependent
operations in an effective way. Our solution is to off-load those operations to GPU, and
this is the main topic of this talk.
I assume most of you are not familiar with GPUs, so let me introduce the GPU briefly.
The GPU is the central processor in a graphics card. In order to synthesize realistic 3D
graphics in real time, a GPU handles millions of polygons and billions of pixels per second
on its massively parallel architecture. Software routers have nothing to do with
graphics and rendering, so why am I introducing the GPU to you?
Recently, GPUs have become flexible and programmable enough for non-graphics
workloads -- general-purpose computation on GPUs is increasingly popular -- and
basically this is a novel application of GPUs to software routers.
GPU architecture is different from that of a general-purpose CPU. Today's CPU has a
small number of cores, and each core is quite smart. In contrast, a GPU has a large
number of cores, and each GPU core is not as smart as a CPU core. GPU cores don't have
some of the advanced features of CPUs, such as out-of-order execution or dynamic branch
prediction.
If you want to transport a few people quickly, then the CPU is the best option. But if you
want to transport many people -- in other words, if your workload is throughput-centric --
then you should take the GPU.
This difference comes from how transistors are budgeted in each processor. The CPU
dedicates most of its transistors to execution control logic and large caches to accelerate
single-threaded programs.
The GPU has many more transistors in general, and most of them are devoted to
arithmetic logic units, and these ALUs form the large array of GPU cores.
So if your workload has enough parallelism, you can expect significant throughput
improvement with a GPU.
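To make that throughput-oriented model concrete, here is a minimal CUDA sketch of the
one-thread-per-packet style; the kernel name and the toy per-packet operation are
illustrative assumptions, not PacketShader's actual code:

    // One lightweight GPU thread per packet; launched thousands at a time.
    __device__ unsigned int process_one(unsigned int x) {
        return x * 2654435761u;          // placeholder per-packet work
    }

    __global__ void per_packet_kernel(const unsigned int *in,
                                      unsigned int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                       // guard: the grid may overshoot n
            out[i] = process_one(in[i]);
    }

    // Host side: cover a batch of n packets with 256-thread blocks.
    // per_packet_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

When one thread stalls on memory, the hardware scheduler simply runs another, which is
exactly the latency-hiding behavior described in a moment.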
Now let's move on to how we can utilize the GPU for packet processing. For packet
processing, the GPU can be a quite effective solution. I'm going to explain the merits of the
GPU in terms of raw computation power, memory latency, and memory bandwidth.
Software routers need lots of computation power. Hashing, encryption, pattern
matching -- all these operations are computation hungry. The good thing is that the GPU
has much higher computation power, so you can adopt those compute-intensive
operations in your software router.
And the second aspect is memory latency. Software routers tend to do random memory
access, so the CPU experiences lots of cache misses, and the resulting stalls waste
computation power.
In contrast, the GPU can work around memory access penalties. When a GPU thread has a
cache miss, the GPU switches to another thread that is immediately executable.
This context switching is done by a hardware scheduler, so it doesn't incur any
extra overhead.
And the last thing I want to compare is memory bandwidth. The memory bandwidth of
top-of-the-line CPUs is a few tens of gigabytes per second today. And this number is
theoretical; depending on the memory access pattern, you will get much less effective
memory bandwidth.
And you can't even utilize that full number, because you have many mouths to feed. Basically,
reception and transmission of packets themselves take a large amount of memory
bandwidth. If your software router forwards traffic at a certain rate, then the memory
bandwidth consumption is more than four times that rate.
As a result, your budget for packet processing can be less than 10 gigabytes per second on
high-speed networks. Within that budget, you have to make a forwarding decision or perform
deep packet inspection for every packet. The problem is that this limited budget is
not quite enough to handle such memory-intensive operations. But the good thing is that the
GPU has order-of-magnitude higher memory bandwidth than the CPU.
So if you offload memory-intensive operations to the GPU, then you can secure enough
memory bandwidth.
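(To make the budget concrete, as a rough estimate: forwarding at 40 gigabits per second
moves 5 gigabytes of packet data per second, and since each packet is DMAed from the NIC
to memory, read by the CPU, written by the CPU, and DMAed back out, the memory traffic is
at least 4 x 5 = 20 gigabytes per second before any application logic touches memory.)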
So far I have explained why the GPU can be useful for packet processing. Now I'm
going to explain how to use the GPU for packet processing.
The basic idea of using the GPU in a software router is very simple. The CPU's role of
receiving and transmitting packets through the [indiscernible] remains as in a traditional
software router.
The difference is that we offload costly operations to the GPU. Such operations could be,
for example, forwarding table lookup, encryption, or anything that requires much
processing power.
One thing you have to consider, as I said, is that the GPU is basically a parallel processor.
For optimal performance, you must feed enough parallelism to the GPU; otherwise the GPU will
be underutilized, and you can't expect any performance advantage over the CPU.
Then where can you find such parallelism in packet processing? We can fetch multiple
packets from the RX queues in a batch, not one by one, and process them in
parallel.
The key insight behind this work is that packet processing is usually parallelizable by
definition. You don't need any coordination between packets for processing.
Some of you might worry that accumulating hundreds of packets may induce
unreasonable delay. I can answer that question: it's not true. For example, on a fully
saturated 10-gigabit link, you can get up to 1,000 packets in only 67 microseconds.
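(The arithmetic behind that figure: a minimum-sized packet occupies 84 bytes, or 672 bits,
on the wire including framing overhead, so 1,000 packets are 672,000 bits, and at
10 gigabits per second that takes 672,000 / 10^10 = 67.2 microseconds.)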
Now let's move on to the detailed architecture of PacketShader.
PacketShader's basic design is very simple. Packet processing is done in three
stages: pre-shader, shader, and post-shader. I'm going to explain how each stage works
with an example of IPv4 forwarding. First, packets are received in a batch. The pre-shader
performs some preprocessing of packets, such as checksum and TTL checks.
Some packets that need further processing go to the slow path of the
operating system. For the remaining packets, the pre-shader collects the destination IP
addresses of the packets and passes them to the shader stage.
In the shader stage, the collected IP addresses are copied to the GPU, and the GPU
performs forwarding table lookups in parallel. Finally, the resulting next-hop
information is delivered back to the CPU.
Finally, in the post-shader stage, the packets are updated and transmitted through the
output ports based on the forwarding results.
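To make the three stages concrete, here is a hedged CUDA sketch of the shader stage for
IPv4; the function names and the toy direct-indexed /16 table are assumptions for
illustration, not PacketShader's actual lookup algorithm:

    // Shader stage: copy the gathered destination addresses to the GPU,
    // look them up in parallel, and copy the next-hop results back.
    __global__ void lookup_kernel(const unsigned int *dst_ips,
                                  unsigned short *next_hops,
                                  const unsigned short *tbl, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            next_hops[i] = tbl[dst_ips[i] >> 16];  // toy /16 table lookup
    }

    void shader_stage(const unsigned int *h_ips, unsigned short *h_hops,
                      unsigned int *d_ips, unsigned short *d_hops,
                      const unsigned short *d_tbl, int n) {
        cudaMemcpy(d_ips, h_ips, n * sizeof(*h_ips),
                   cudaMemcpyHostToDevice);
        lookup_kernel<<<(n + 255) / 256, 256>>>(d_ips, d_hops, d_tbl, n);
        cudaMemcpy(h_hops, d_hops, n * sizeof(*h_hops),
                   cudaMemcpyDeviceToHost);
    }

Note that only the 4-byte addresses cross the PCIe bus here, not the packets themselves.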
For simplicity, I omitted the device driver in the previous slide, but of course you need a
device driver to communicate with the network interface cards, as in the figure. We
implemented our own device driver for optimal performance, and you can find the details
in the paper.
So far we have seen the basic design of PacketShader. But things
become a little more complicated because of some challenges in today's
commodity architecture.
The first challenge comes from the adoption of multi-core CPUs. CPUs have multiple cores
these days, and each core must run as independently as possible in order to
achieve linear scalability with the number of cores.
So we duplicated the packet-processing pipeline across multiple cores, as shown in the slide.
Another choice we made is to dedicate one CPU core as the master core, which
communicates with the GPU. When multiple cores contend for the same GPU, the
performance degrades significantly. This is the reason why we have only one master
core per GPU.
Second, for [indiscernible], multi-socket configurations have become increasingly popular. The
challenge with multiple sockets is that the cost of communication between separate CPUs is very
expensive. So we partitioned PacketShader in such a way that any
communication between the master and workers stays within the same CPU.
Now let's move on to how we evaluated the system and what performance numbers we got.
This is how we set up our PacketShader prototype. We used two quad-core CPUs, eight
10-gigabit ports [indiscernible], and two GPUs in the system.
Our motivation for using multiple components in the system was, first, to
prove that PacketShader is really scalable; and, second, to make our system
reflect the trend of top-of-the-line commodity servers.
For measurement, we connected a packet generator and PacketShader back to back. We
built the packet generator ourselves, based on our highly optimized packet I/O engine.
It can generate up to 80 gigabits per second with minimum-sized UDP packets.
And here are the results. We implemented four applications -- IPv4 and IPv6 forwarding,
OpenFlow switch, and IPsec tunneling -- on top of PacketShader. This graph
compares our CPU-only and CPU+GPU implementations, clearly showing the
effectiveness of the GPU for packet processing.
The speedup due to GPU acceleration depends on the application. For IPv4, which is simple
enough already, you can't expect much improvement with GPU acceleration. In
contrast, IPv6 shows a 4.8 times improvement over the CPU-only implementation, because
IPv6 forwarding is highly memory intensive, so the GPU can help a lot.
>>: How do you manage to get 38.2 gigabits per second? The max was under 10, and
the memory bandwidth was roughly 10.
>> Sanjin Han: Memory bandwidth is only partially used in this case. But I can say that the
memory bandwidth we need for 80 gigabits per second is about 10 gigabytes per second,
and the numbers I mentioned earlier were in gigabytes, not gigabits, per second.
>>: [indiscernible].
>> Sanjin Han: In this talk I'll give you more details of IPv6 forwarding and IPsec
offloading as examples.
The first example is IPv6 forwarding. IPv6 lookup requires more processing power than
IPv4, because an IPv6 address is 128 bits long, which is four times longer than an IPv4
address. So we offloaded the longest prefix matching operation to the GPU. The
algorithm we used in our work performs binary search on hash tables.
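Here is a hedged CUDA sketch of that binary search over prefix lengths; the hash-table
probe is a stub, and real implementations also need marker entries so that a miss can
safely rule out longer prefixes:

    // Stub hash-table probe: a real router checks a per-length hash
    // table here; this placeholder just makes the sketch self-contained.
    __device__ bool probe(unsigned long long key, int len,
                          unsigned short *hop) {
        (void)key; *hop = 1; return len <= 16;    // toy behavior
    }

    // One thread per packet: binary search on prefix length (1..64,
    // using the high 64 bits of the IPv6 address for simplicity).
    __global__ void lpm6_kernel(const unsigned long long *addrs,
                                unsigned short *hops, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        unsigned long long a = addrs[i];
        int lo = 1, hi = 64;
        unsigned short best = 0, h;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            if (probe(a >> (64 - mid), mid, &h)) {
                best = h; lo = mid + 1;   // hit: try longer prefixes
            } else {
                hi = mid - 1;             // miss: try shorter prefixes
            }
        }
        hops[i] = best;
    }

The binary search needs at most about seven probes instead of scanning every prefix
length, which is where the savings over a linear search come from.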
Here is the performance result of IPv6 forwarding. The performance improvement with the
GPU is clear when the packet size is small, and GPU acceleration guarantees
around 40 gigabits per second for any packet size.
I told you that the total capacity of our system is 80 gigabits per second, but for both the
CPU-only and CPU+GPU cases the throughput does not scale beyond 40 gigabits per second.
This is because of a hardware problem in our motherboard, and the details are again in the
paper.
We expect much higher performance once the hardware problem is fixed. And the
second example is IPsec tunneling. IPsec tunneling is done in two stages: in the first stage
the entire packet is encrypted with the AES algorithm, and in the second stage a hash-based
message authentication code is appended with the [indiscernible] algorithm. We offloaded
the AES and [indiscernible] algorithms to the GPU, because these cryptographic operations
are highly compute intensive. For IPsec, GPU acceleration improves the throughput by a
factor of 3.5. You can see that the performance improvement is stable for all packet sizes.
This is because IPsec operates on entire packets, not only on packet headers. For
minimum-sized packets, PacketShader reaches 10 gigabits per second, and its performance
goes up to 20 gigabits per second for larger packet sizes. And I can say this number is
comparable, or even superior, to commercial hardware-based IPsec tunneling appliances.
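Here is a hedged sketch of that two-stage offload in CUDA; the kernel names, the
fixed-size packet slots, and the empty bodies are all assumptions, since real AES-CBC and
HMAC-SHA1 kernels are far too long to show here:

    #define SLOT 2048   // one packet per fixed-size slot (an assumption)

    __global__ void aes_stage(unsigned char *pkts, const int *lens, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            unsigned char *p = pkts + (size_t)i * SLOT;
            (void)p; /* encrypt lens[i] bytes of packet i with AES (omitted) */
        }
    }

    __global__ void hmac_stage(unsigned char *pkts, const int *lens, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) { /* append the MAC over the encrypted packet (omitted) */ }
    }

    // Launch the two stages back to back; kernels on the same CUDA stream
    // run in order, so hashing always sees fully encrypted packets.
    void ipsec_shader(unsigned char *d_pkts, const int *d_lens, int n) {
        int blocks = (n + 127) / 128;
        aes_stage<<<blocks, 128>>>(d_pkts, d_lens, n);
        hmac_stage<<<blocks, 128>>>(d_pkts, d_lens, n);
    }

Because whole packets, not just headers, must cross the PCIe bus for IPsec, the payload
copy dominates, which is consistent with the speedup being stable across packet sizes.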
Let me compare our system with previous software routers. PacketShader is the first
multi-10-gigabit software router on a single machine. And with GPU acceleration, PacketShader
reaches a speed of 40 gigabits per second.
What's interesting here is that previous software routers were implemented in the kernel, not
in user space. The common belief has been that user-space packet processing,
despite its many other advantages, is much slower than kernel-level packet processing. In
contrast, we implemented PacketShader in user space and have shown that high-performance
packet processing is possible in user space.
>>: [indiscernible] for minimum-sized packets, right? 64 bytes.
>> Sanjin Han: Right.
>>: Otherwise they mimic your [indiscernible] as well?
>>: I guess that's why, for IPv6 with 1500-byte packets, it matches the CPU-only implementation.
>> Sanjin Han: For larger packet sizes, due to our IO bottleneck, it shows the same
performance.
But what I want to say is that the GPU implementation guarantees 40 gigabits per
second of performance for any combination of packet sizes.
I'm going to conclude this talk. I have shown that the GPU is a great opportunity for
software routers. We realized this idea in PacketShader. PacketShader is
a software router that exploits the GPU for packet processing acceleration on top of
our optimized packet I/O engine.
We implemented four applications -- IPv4 and IPv6 forwarding, OpenFlow, and IPsec -- on
PacketShader to prove the effectiveness of our system.
Here is our plan to extend PacketShader. The first item is to integrate a control
plane in PacketShader. Our current implementation only supports a static forwarding table
for IP routing, and we are planning to integrate [indiscernible] or [indiscernible] into our
system.
The second is about the programming environment. We are planning to support a
Click-like modular programming environment to make our system more extensible. And the
third is to support the concept of opportunistic offloading. If the volume of input
traffic is low enough, then we may use only the CPU, for low latency and low power
consumption. And if the traffic goes up, then we can use the GPU for high throughput. And
this is the end of my talk, and I would be glad -- happy to answer your questions. Thank
you.
>>: So you do batch processing, right? You put the packets in a buffer -- so that adds
some delay. How much delay do you add? I was expecting to see some sort of graph
here [indiscernible].
>> Sanjin Han: Usually we collect hundreds or thousands of packets, but the delay is
acceptable. This graph shows the latency of PacketShader, and you can
see that the CPU+GPU implementation shows latency between 200
microseconds and 400 microseconds.
This may not be trivial for some people, but I think it's acceptable. And if we
implement the opportunistic offloading idea, as I said, then the latency graph will look
like this.
>>: Is the bottleneck the PCI Express? It seems like you would be bound by
[indiscernible] transferring data to the graphics card and back.
>> Sanjin Han: The thing is, this is how our system looks. And there is
something called the IOH between the CPU and the GPU -- can you see? The IOH basically
bridges IO devices to the CPU. And on our motherboard we have two IOHs, because we
needed a large number of PCIe slots in our system.
But the problem is that with two IOHs on the motherboard, the IO throughput of
our system does not work as well as it's supposed to.
So I think I need your help.
>>: We actually played with a single IOH case, and with a single IOH plus three
network interface cards we actually got more than 50 gig. But when we
have dual IOHs on our system, then the [indiscernible] -- it's actually the
reception part, the receiver side, that's bottlenecked at 40 gig. So our packet generator can
spit out packets at 80 gig, but it cannot receive -- basically take the packets from the
[indiscernible] and touch memory -- at 80 gig; it's capped at 40 gig. So I think it's a dual-IOH
chipset problem, an Intel chipset problem, and it's been reported in other user
groups.
And we're just waiting for the next-generation IOH from Intel.
>>: Even if that's fixed, from the IOH to the GPU that communication is happening over
the PCI Express bus, right?
>>: Yeah.
>>: How much benefit do you think you'd get if you actually took the processors from the
GPU and put them on another core on the actual CPU die itself, so you didn't have
to make memory requests from the CPU to the GPU?
>> Sanjin Han: You mean, how about putting more CPUs instead of GPUs, right?
>>: No, I mean actually putting vector processors, like you have on the GPU, into the
actual CPU die. They'd take away an extra core.
>> Sanjin Han: Something like AMD. Yeah. Yeah. What I want to say is that the discrete GPU
in this work may be transitional, and maybe a CPU-plus-GPU integrated processor may be
the final destination, I think. But today the separate graphics card is the best way to
get a performance advantage over the CPU, because the GPU consumes a lot of power
and dissipates a lot of heat.
Maybe some day it will be realized that the CPU and GPU can be together.
>>: Yes, but my question was, how much is the memory bandwidth to the GPU
currently bottlenecking your system? Do you think that's actually a significant factor or not?
>> Sanjin Han: Maybe I don't understand.
>>: The PCIe bandwidth. So from the NIC it's first loaded into RAM, then from RAM to the
GPU, and then back to RAM. And so the GPU --
>>: So you have an extra back and forth?
>>: Yes.
>>: That you wouldn't need if you didn't have that.
>>: Which is why it matters when your packet size goes up.
>>: Yeah.
>> Sanjin Han: But the important thing is that you don't --
>>: In the first slide, where he had the different pieces eating up parts of it.
>>: Depending on the processing, you don't need to copy the entire packet. For the
forwarding, you only need the address.
>>: Right.
>>: So for IPsec you need the entire packet. So it depends on the application.
>>: For your CPU numbers, did you take advantage of SSE instructions?
>> Sanjin Han: What kind of instructions?
>>: Instructions on the --
>> Sanjin Han: Yeah. Basically -- yeah. We could have used SSE for our
implementation, because our implementation is done at user level, and at user level we
can use SSE instructions.
You can't do that in a kernel-level implementation. But basically we didn't use SSE
instructions, because it needs lots of work and some expertise in
the instruction set. And the same story applies to the GPU -- our GPU
implementation is just a straightforward port of the CPU program to the GPU.
So maybe SSE instructions could help, but the performance bottleneck of the CPU
implementation of IP forwarding is memory latency, not the
number of instructions. So I don't think SSE would be a [indiscernible].
>>: What did you use for programming the GPU, and what was the --
>> Sanjin Han: Basically, let's see -- this is C for the GPU. Basically it has some
extensions for parallelism, but --
>>: Was it OpenCL [indiscernible]?
>> Sanjin Han: We used CUDA.
>>: Okay.
>>: What happened -- there was a line of work coming out of Intel and other research
places making network processors. What happened to that? Are there still
network processors that are supposed to handle this kind of thing, processors
customized for packet forwarding and such things?
>> Sanjin Han: What I want to say is that the network processor is not commodity. You
can't buy a network processor at a cost of $500.
>>: What's the cost of a network processor?
>> Sanjin Han: I don't know.
>>: $100,000.
>>: But they're not -- they're also, I think, substantially less parallel. I think there's
something like 16 or 32 cores.
>>: Network processors are going out of fashion, as they often get [indiscernible]
processors doing well in ATCA chassis, but it's only coming together slowly -- you need to
design a board with a network processor. So it's not commodity, of course. It's not
readily available technology like a graphics card.
>>: Do you think things like GPUs will make network processors obsolete?
>>: I don't know. But programming a network processor is very hard. Failing to provide a
programming environment like the one the graphics processor companies have provided is
one major weakness of network processor marketing.
>>: So should I worry about reordering?
>> Sanjin Han: Reordering -- yeah, no, there is no reordering in this system. If you ask
why, then -- I should have prepared a backup slide --
>>: Today?
>>: What about the slide.
>>: The answer was -- basically, some reordering can happen, but packets in the same
flow will not be reordered.
>>: The packets --
>>: The possibility of packets being reordered takes place from the NICs to the queues in
memory, in the driver. From the NIC, the packets are sent to the queues by
the driver, and that's the only time the reordering can take place: packets one, two,
three, four arrive at the NIC; one goes to queue one, two goes to queue two, three
goes to queue three; and those four packets might come
out not in order but out of order. But when the NIC sends packets to its queues, there's
actually RSS support -- five-tuple support -- at the NIC.
So packets from the same flow go to the same queue, and packets from the same flow are
never reordered. Then all the GPU processing is done in order. So only at that part is there
room for reordering, but not for packets from the same flow.
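A hedged sketch of why this preserves per-flow order; the hash below is a toy stand-in for
the Toeplitz hash real RSS NICs use, and all the names are assumptions:

    // The NIC hashes the 5-tuple, and the hash alone picks the RX queue,
    // so every packet of a given flow lands in the same queue, in order.
    struct FiveTuple {
        unsigned int   src_ip, dst_ip;
        unsigned short src_port, dst_port;
        unsigned char  proto;
    };

    static unsigned int toy_hash(const struct FiveTuple *t) {
        unsigned int h = 2166136261u;             // FNV-1a-style mixing
        h = (h ^ t->src_ip)   * 16777619u;
        h = (h ^ t->dst_ip)   * 16777619u;
        h = (h ^ t->src_port) * 16777619u;
        h = (h ^ t->dst_port) * 16777619u;
        h = (h ^ t->proto)    * 16777619u;
        return h;
    }

    static int pick_rx_queue(const struct FiveTuple *t, int num_queues) {
        return toy_hash(t) % num_queues;  // same flow -> same queue, always
    }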
>>: The batch goes and comes back in the same order? That's how it's enforced?
>>: Yes. But the shading process -- the pre-shader, shader, and post-shader -- is all --
>>: What exactly did you code the GPU for? For forwarding, it seems like you always do it
in one shot -- like, for each packet there's one operation that happens on the GPU. But
it looks like some checking could also be parallelized. Then you might want to do two
different things on the batch, do some checking.
>> Sanjin Han: Basically, we need to do something like that for IPsec tunneling, because
we do AES and [indiscernible], and these are done in separate stages. So we can do two
things, not just one. Why not three?
>>: So it's general. Okay. That was my question. So can you go to the graph about
the IPv6 forwarding? This one. That's the right one. So is this assuming you're
doing a forwarding lookup per packet, or is this -- how many destinations?
The average router only has a few thousand active destinations in it, right? And you
could cache them. Does that mean that this is not happening except for, like, one in a
thousand packets, or is it the case that you're actually going to do a lookup every time?
What's going on?
>> Sanjin Han: This is basically the worst case of IPv6 forwarding. We have a large
number of prefixes in the system, and every packet sent to PacketShader has a different --
>>: Destination.
>> Sanjin Han: Destination IP address.
>>: Worst case from both the incoming stream and also the routing table, because
actually with an IPv6 routing table you would see fewer entries than IPv4, because you
have more consolidation, right?
>>: For IPv6 there's no realistic forwarding table, which is the challenge. It's randomly
generated, while the IPv4 table is more a reflection of today's routers in operation.
>>: Presumably a very high percentage of your lookups would be served from a cache,
rather than actually looking up the tables? Is that wrong?
>>: I mean, actually --
>>: I see, so if the lookup is expensive you'd only cache the 500 flows saying where each
destination would go.
>>: [indiscernible] you can't just deal with a different fast lookup and slow lookup for
packets. So you have to basically put the entire forwarding table onto the line card,
however you implement it.
>>: But yours is in a real router, right? Software router.
>>: You want it to be realistic, don't you?
>>: Why would you penalize yourself? Okay. Fair enough.
>> Sanjin Han: Thank you.
[applause]